Breaking

Gemma 4 26B-A4B Runs 16 Instances on a Single DGX Spark at 300 tokens/s

June 23, 2026 at 12:10 EDT

Foundation Models
Infra & Chips
Open Source

Google DeepMind's open-weight model Gemma 4 26B-A4B ran as 16 parallel instances on a single NVIDIA DGX Spark, reaching about 18 tok/s per instance and roughly 300 tok/s in aggregate. The demo video, released by the official Gemma team on June 23, 2026, also indicated that the same hardware could scale to as many as 32 parallel runs, underscoring the inference efficiency of the model's architecture.

April 2, 2026 · Google DeepMind

16 AI Models, One Desktop: Gemma 4 Runs in Parallel on a Single DGX Spark

Google ran its open MoE model Gemma 4 26B-A4B as 16 concurrent instances on one compact Grace Blackwell unit, at about 18 tokens/sec each and ~300 tok/s combined, with up to 32 parallel runs deemed possible.

16×: parallel instances on one desktop unit
300 tok/s: aggregate throughput in the demo
4B/25.2B: active vs. total params (MoE)
~16GB: model size after NVFP4 quantization
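As a quick sanity check on the headline figures, the sketch below recomputes them from the numbers quoted in the article. The function names are illustrative, and 1 GB is assumed to mean 10^9 bytes. It shows that 16 instances at about 18 tok/s come to 288 tok/s, which the article rounds to roughly 300, and that ~16GB for 25.2B parameters works out to a little over 5 bits per parameter. That is above a nominal 4 bits, which would be consistent with scale-factor overhead and any tensors kept at higher precision, though the article does not break this down.

```python
# Figures quoted in the article; everything derived from them is an estimate.
INSTANCES = 16
PER_INSTANCE_TOK_S = 18.0   # "about 18 tok/s per instance"
TOTAL_PARAMS_B = 25.2       # total parameters, in billions
ACTIVE_PARAMS_B = 4.0       # active parameters per token, in billions (approximate)
QUANTIZED_SIZE_GB = 16.0    # "~16GB" after NVFP4 quantization


def aggregate_tok_s(instances: int, per_instance_tok_s: float) -> float:
    """Combined throughput if every instance runs at the same rate."""
    return instances * per_instance_tok_s


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of the parameters used for each token in the MoE model."""
    return active_params_b / total_params_b


def bits_per_param(size_gb: float, params_b: float) -> float:
    """Average storage per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb / params_b * 8


print(f"aggregate: {aggregate_tok_s(INSTANCES, PER_INSTANCE_TOK_S):.0f} tok/s")
print(f"active share: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
print(f"bits per parameter: {bits_per_param(QUANTIZED_SIZE_GB, TOTAL_PARAMS_B):.2f}")
```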

Aggregate throughput scales with parallelism

[Chart: reported total tokens/sec across configurations; a taller column means more combined output. Column labels: 1 session (user); 4 parallel (user, ~90% eff.); 16 parallel (official demo); 8 parallel (user). Values shown: 246, 300, 403 tok/s.]
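The "~90% eff." label on the 4-parallel column presumably refers to parallel efficiency: aggregate throughput divided by what N sessions would deliver if each kept the full single-session speed. The article does not define the term, so the formula below is an assumption, and the numbers in the usage line are hypothetical rather than taken from the chart.

```python
def parallel_efficiency(aggregate_tok_s: float,
                        single_tok_s: float,
                        n_parallel: int) -> float:
    """Aggregate throughput relative to perfect linear scaling.

    1.0 means N sessions deliver exactly N times the single-session rate.
    """
    if single_tok_s <= 0 or n_parallel <= 0:
        raise ValueError("single_tok_s and n_parallel must be positive")
    return aggregate_tok_s / (single_tok_s * n_parallel)


# Hypothetical numbers, for illustration only: one session at 50 tok/s,
# four parallel sessions producing 180 tok/s combined.
print(f"{parallel_efficiency(180.0, 50.0, 4):.0%}")
```

Under this definition, an efficiency below 100% means each session slows down as more are added, even while the combined output keeps rising.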

Why it fits a single box

MoE design (only ~4B of 25.2B active) → NVFP4 quant (compressed to ~16GB) → 128GB unified memory (many concurrent runs)
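The memory side of this chain can be sketched as a simple budget. The 128GB and ~16GB figures come from the article; the 16GB reserved for the operating system and runtime is a hypothetical placeholder, and whether the demo loaded the weights once and shared them across sessions is not stated. The sketch treats that as a switch: under these assumptions, 16 separate copies of a ~16GB model would need 256GB and would not fit, while a single shared copy leaves several gigabytes per session for KV cache and activations.

```python
def per_session_budget_gb(total_gb: float = 128.0,
                          weights_gb: float = 16.0,
                          reserved_gb: float = 16.0,   # hypothetical OS/runtime reserve
                          sessions: int = 16,
                          share_weights: bool = True) -> float:
    """Memory left per session for KV cache and activations, in GB.

    Returns 0.0 when the weights alone do not fit in the remaining memory.
    """
    weight_cost = weights_gb if share_weights else weights_gb * sessions
    free_gb = total_gb - reserved_gb - weight_cost
    if free_gb <= 0:
        return 0.0
    return free_gb / sessions


print(per_session_budget_gb(sessions=16))                       # shared weights
print(per_session_budget_gb(sessions=32))                       # shared weights
print(per_session_budget_gb(sessions=16, share_weights=False))  # one copy per session
```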

Drawing attention

High concurrency on one compact unit, not a cloud cluster
256K context, multimodal input, native function calling
~200K-token retrieval worked; suited to many agents at once
Apache 2.0 weights; runs on vLLM, llama.cpp, Ollama
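For readers who want to try the many-agents pattern themselves, the sketch below is one way to measure combined throughput across concurrent requests. It is not the setup used in the demo. The `generate` argument is a caller-supplied coroutine (hypothetical here) that sends one prompt to whatever server is running the model and returns the number of tokens produced; `fake_generate` is a stand-in that sleeps instead of calling a real backend.

```python
import asyncio
import time
from typing import Awaitable, Callable, Sequence


async def run_agents(generate: Callable[[str], Awaitable[int]],
                     prompts: Sequence[str],
                     max_concurrency: int) -> tuple[int, float]:
    """Run all prompts with at most max_concurrency in flight.

    Returns (total tokens generated, aggregate tokens per second).
    """
    limiter = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> int:
        async with limiter:
            return await generate(prompt)

    start = time.perf_counter()
    counts = await asyncio.gather(*(one(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return total, (total / elapsed if elapsed > 0 else 0.0)


async def fake_generate(prompt: str) -> int:
    """Stand-in for a real request: waits briefly, reports 10 tokens."""
    await asyncio.sleep(0.01)
    return 10


total, rate = asyncio.run(run_agents(fake_generate, ["hello"] * 8, 4))
print(f"{total} tokens at {rate:.0f} tok/s aggregate")
```

Swapping `fake_generate` for a function that calls a real endpoint and varying `max_concurrency` would give the kind of aggregate figure the article reports.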

Caveats raised

FP4 kernels can fail on driver/FlashInfer version mismatches, forcing a slower fallback path
Assessments of FP8 versus NVFP4 are mixed
Parallel efficiency declines with very large prompts
Questions over the $3,999–$4,699 price and 273GB/s memory bandwidth
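The bandwidth question can be made concrete with a rough memory-bound estimate: during decoding, each new token requires reading roughly the active weights once, so a single stream cannot exceed bandwidth divided by bytes read per token. The sketch below uses the article's 273GB/s, ~4B active parameters, and the average bytes per parameter implied by ~16GB for 25.2B parameters. It ignores KV cache reads, activations, and compute limits, so it is an upper bound under stated assumptions, not a prediction.

```python
BANDWIDTH_GB_S = 273.0            # figure cited in the article
BYTES_PER_PARAM = 16.0 / 25.2     # ~16GB spread over 25.2B params (average)


def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         params_read_b: float,
                         bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed when memory-bound.

    params_read_b is the number of parameters (in billions) read per token.
    """
    gb_per_token = params_read_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


# MoE case: only the ~4B active parameters are read per token.
print(f"{decode_ceiling_tok_s(BANDWIDTH_GB_S, 4.0, BYTES_PER_PARAM):.0f} tok/s")
# Dense comparison: all 25.2B parameters read per token.
print(f"{decode_ceiling_tok_s(BANDWIDTH_GB_S, 25.2, BYTES_PER_PARAM):.0f} tok/s")
```

The gap between the two results illustrates why the MoE design matters on this hardware. Aggregate throughput across many sessions can exceed the single-stream bound because batched requests can share weight reads, although with an MoE model different tokens may route to different experts and reduce that sharing.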
