NVIDIA Touts Up to 15x Faster Blackwell Inference With DFlash

June 23, 2026 at 13:06 EDT

NVIDIA said in a June 23 technical blog post that DFlash, an open-source lightweight block diffusion model used for speculative decoding, can boost LLM inference throughput by up to 15x on Blackwell GPUs while preserving user interactivity.

February 2026 · NVIDIA × Z Lab UC San Diego

DFlash: up to 15× faster LLM inference on Blackwell

An open-source lightweight block diffusion model that drafts a whole block of tokens in a single parallel pass — accelerating speculative decoding while keeping responsiveness intact.

15×: peak throughput gain vs autoregressive
8–16: tokens drafted per single forward pass
80–90%: draft acceptance rate (task dependent)
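
To see how block size and acceptance rate combine, here is a rough sketch that estimates how many tokens one verification pass yields. It rests on assumptions the article does not state: that the quoted acceptance rate is a per-token probability, that each drafted token is accepted independently, and that the target model always contributes one token of its own per pass. The function name is hypothetical.

```python
def expected_tokens_per_pass(acceptance_rate: float, block_size: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption (not from the article): each drafted token is accepted
    independently with probability `acceptance_rate`, acceptance stops at
    the first rejection, and the target model adds one token of its own
    (a correction or a bonus token) on every pass.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus the target's bonus token.
        return float(block_size + 1)
    # Geometric series: 1 + a + a^2 + ... + a^block_size
    return (1.0 - acceptance_rate ** (block_size + 1)) / (1.0 - acceptance_rate)


# The low and high ends of the ranges quoted above.
for rate, block in [(0.8, 8), (0.9, 16)]:
    tokens = expected_tokens_per_pass(rate, block)
    print(f"acceptance={rate:.0%}, block={block}: {tokens:.2f} tokens per pass")
```

Under these assumptions the two cases work out to roughly 4.3 and 8.3 tokens per pass. That is an upper bound on per-request speedup before drafting cost is counted, so it does not by itself explain the 15× peak figure, which is a throughput number measured on specific hardware.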

Throughput gain over autoregressive baseline

Relative speedup (baseline = 1×):

Autoregressive baseline: 1×
Qwen3-8B · B200 / SGLang: 5.1×
Gemma 4 31B · Blackwell Ultra / vLLM: 5.8×
gpt-oss-120b · DGX B300 8GPU / TRT-LLM: 15×

How block-diffusion drafting works

DFlash drafts a full block in one parallel pass → Target model verifies the block in a single batch → Multiple tokens accepted per pass
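
The draft, verify, accept loop above can be sketched with toy stand-ins. This is a minimal illustration of greedy speculative decoding in general, not DFlash's implementation: `toy_target_next` and `toy_draft_block` are hypothetical placeholder models, tokens are plain integers, and the verifier calls the target once per position for readability where a real engine would score the whole block in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int


def verify_block(target_next: Callable[[List[Token]], Token],
                 prefix: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Accept drafted tokens while they match the target's greedy choice."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target also supplies one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def speculative_decode(target_next: Callable[[List[Token]], Token],
                       draft_block: Callable[[List[Token], int], List[Token]],
                       prompt: List[Token],
                       max_new_tokens: int,
                       block_size: int) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)           # one parallel draft
        out.extend(verify_block(target_next, out, draft))
        passes += 1
    return out[:len(prompt) + max_new_tokens], passes


def toy_target_next(ctx: List[Token]) -> Token:
    """Hypothetical target model: always counts upward."""
    return (ctx[-1] + 1) % 100


def toy_draft_block(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: counts upward but errs on multiples of 5."""
    block: List[Token] = []
    last = ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 100
        if nxt % 5 == 0:
            nxt = (nxt + 7) % 100  # deliberate drafting mistake
        block.append(nxt)
        last = nxt
    return block


tokens, passes = speculative_decode(toy_target_next, toy_draft_block,
                                    prompt=[0], max_new_tokens=20, block_size=8)
print(tokens)
print(f"{passes} verification passes for 20 new tokens")
```

Because every emitted token is either confirmed or supplied by the target, the output matches what plain autoregressive decoding with the same target would produce. In this toy run the 20 new tokens take 4 verification passes instead of 20.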

Where it shines

1.5× faster than EAGLE-3 on gpt-oss-120b
High acceptance on math & code workloads
Drop-in for SGLang, vLLM, TensorRT-LLM, MLX
88–108 tok/s reported on Qwen3-Coder-Next

Task-dependent limits

Acceptance drops on agent benchmarks (AgentBench)
Speedup tapers as context grows (KV cache expansion)
Draft can bottleneck when paired with a quantized target
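
The last limit above can be illustrated with a back-of-the-envelope cost model. It is an assumption for illustration, not a formula from NVIDIA's post: one decoding step costs one target pass plus one draft pass, and `draft_cost_ratio` (a hypothetical parameter) is the draft pass cost divided by the target pass cost. Quantizing the target makes its pass cheaper, which raises that ratio if the drafter stays the same.

```python
def estimated_speedup(tokens_per_pass: float, draft_cost_ratio: float) -> float:
    """Rough speedup over one-token-per-pass autoregressive decoding.

    Assumption: each step costs one target pass plus one draft pass, where
    the draft pass costs `draft_cost_ratio` times a target pass.
    """
    if tokens_per_pass <= 0:
        raise ValueError("tokens_per_pass must be positive")
    if draft_cost_ratio < 0:
        raise ValueError("draft_cost_ratio must be non-negative")
    return tokens_per_pass / (1.0 + draft_cost_ratio)


# Same acceptance behaviour, increasingly expensive drafter relative to target.
for ratio in (0.05, 0.2, 0.5):
    print(f"draft cost ratio {ratio:.2f}: {estimated_speedup(5.0, ratio):.2f}x")
```

With tokens per pass held fixed, the estimated speedup falls as the drafter's relative cost rises, which is the bottleneck pattern the list describes.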
