Gemma-4 on the DGX Spark: NVFP4 vs BF16
Nine identical benchmarks, two precisions. NVFP4 runs 22 to 92 percent faster per token, and peak-hour capacity grows 69 percent on the Spark.
In the BF16 baseline of Gemma-4 on the DGX Spark I ran nine benchmarks with Gemma-4-26B-A4B in BF16. Decode speed held up just fine, prefill decided when the wall came, and the system queued neatly under pressure instead of crashing. That story seemed done, until NVIDIA released an NVFP4-quantized version of that same model.
Same architecture and fine-tune, same server config, only the precision changes. From BF16 (16 bits per parameter) to NVFP4 (4 bits per parameter, NVIDIA’s take on FP4). Four times smaller per weight, and if the Blackwell kernels cooperate, also a lot faster on compute-heavy tasks.
On paper, nice. In practice: the official vLLM v0.20.1 release recognizes this checkpoint without any fuss, and the numbers were faster across the board than the BF16 baseline. Both tests fall under the guide running LLMs on the DGX Spark.
Why look into this at all
For an office with a local AI machine, memory budget is the most limiting thing after compute. A 26B model in BF16 takes ~48 GB of GPU memory for weights alone. On a Spark with 128 GB of unified memory, that leaves about 65 GB for KV-cache. Enough for the office scenario from the first blog, but not much room to run, say, 30+ users with large context side by side.
NVFP4 reduces that to ~18 GB for weights. Not four times smaller than BF16 (the vision encoder stays BF16, and scale factors cost space too), but about 2.7× smaller. That gives you toward 95 GB of KV-cache headroom, which in theory should support much higher concurrency. On top of that, less memory traffic is needed per forward pass, so by definition less bandwidth pressure, and that was already the bottleneck in BF16 under multi-user load. So the question was simple: how much of that theoretical gain survives in practice?
What NVFP4 actually is
NVFP4 is NVIDIA’s take on FP4: floating-point numbers with 4 bits per value. Four bits, not four bytes, so a factor of 4 less per parameter than BF16. By storing a scaling factor per group of weights, accuracy stays reasonably intact.
For Blackwell it works like this. NVIDIA’s datacenter cards (B100, B200, SM10.0) have tensor cores that can compute natively with 4-bit values, and that is much faster than the same calculation in FP16 or BF16. The DGX Spark, on the other hand, is desktop Blackwell (GB10, SM12.1) and that architecture has no native FP4 compute. On a datacenter B200 (SM10.0) you’d expect another 2 to 3× on top of this thanks to native FP4 tensor cores. The Spark lacks that hardware path, so all the gain comes from memory bandwidth, not from compute. What you get in that case is “weight-only” FP4: the weights are physically stored as 4-bit (hence the memory gain), but during compute they get decoded on the fly to FP16 for the matrix multiplications. A vLLM warning makes that explicit:
Your GPU does not have native support for FP4 computation but FP4 quantization
is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel.
This may degrade performance for compute-heavy workloads.
So you get the memory gain in full, the compute gain only partially. The Marlin INT4 GEMM kernel is optimized, but not as fast as native FP4 on SM10.0 would be. Worth factoring in when you look at the numbers further down.
The test setup
Server config identical to the first blog, only the model swaps:
docker run -d --name vllm-bench \
--gpus all --ipc=host \
-v appliance_hf-cache:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:v0.20.1 \
--model nvidia/Gemma-4-26B-A4B-NVFP4 \
--served-model-name gemma-4-26b-a4b-nvfp4 \
--max-model-len 131072 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--limit-mm-per-prompt '{"image":0,"audio":0}' \
--async-scheduling \
--no-enable-prefix-caching \
--host 0.0.0.0 \
--port 8000
Tests are one-to-one identical to the first blog: same commands, same concurrency levels, same datasets for the open-loop tests, same seed. That is on purpose, because if you want to measure the effect of an isolated variable (in this case the precision), everything around it has to stay the same. Exactly how I measure those concurrency levels, seeds and open-loop arrivals is described in the Arena measurement method.
| Comparison | BF16 | NVFP4 |
|---|---|---|
| Model | google/gemma-4-26B-A4B-it | nvidia/Gemma-4-26B-A4B-NVFP4 |
| Active params | 4B | 4B |
| Total params | 26B | 26B |
| Model memory | ~48 GB | ~18 GB |
| KV-cache headroom | ~65 GB | ~95 GB |
| MoE backend | (default) | MARLIN (forced) |
Three numbers sum up where this lands. Click through for the full run in the Arena, with all seeds, concurrency levels and commands:
An interactive version of all the numbers is on the Arena page for Gemma-4-26B-A4B-NVFP4, including commands and TTFT percentiles for all 9 tests.
Run A: context scaling from 4k to 25k
Decode per user as context grows, c=1/5/10:
| Context | Users | BF16 d/u | NVFP4 d/u | Gain |
|---|---|---|---|---|
| 4k | 1 | 24.08 | 29.80 | +24% |
| 4k | 5 | 12.55 | 22.01 | +75% |
| 4k | 10 | 9.48 | 16.94 | +79% |
| 8k | 1 | 23.69 | 29.31 | +24% |
| 8k | 5 | 11.48 | 19.28 | +68% |
| 8k | 10 | 8.52 | 14.35 | +68% |
| 16k | 1 | 23.34 | 28.55 | +22% |
| 16k | 5 | 10.05 | 15.67 | +56% |
| 16k | 10 | 6.79 | 10.06 | +48% |
| 25k | 1 | 22.75 | 27.70 | +22% |
| 25k | 5 | 8.46 | 12.46 | +47% |
| 25k | 10 | 5.40 | 7.55 | +40% |
At c=1 the gain is stable around +22-24% across all contexts. Memory bandwidth barely matters for single-user, so the gain here sits in the compute path itself. Marlin’s INT4 decode plus FP16 matmul is slightly faster than BF16’s direct FP16 matmul, even though it’s two steps.
At c=10 the difference scales much more strongly with workload type, from +40% at 25k context to +79% at 4k. That’s because under multi-user the memory bandwidth becomes the bottleneck, and NVFP4 reads fewer bytes per forward pass. The more concurrent, the more that counts, until you hit the KV-cache memory limits again (25k context with multiple users) and the gain flattens out.
TTFT (first token) is better too:
| Context | Users | BF16 TTFT | NVFP4 TTFT |
|---|---|---|---|
| 4k | 10 | 4.46s | 4.20s |
| 8k | 10 | 7.99s | 7.84s |
| 16k | 10 | 18.92s | 18.69s |
| 25k | 10 | 35.67s | 35.65s |
On TTFT the gain is small. That makes sense: prefill is compute-heavy, and on SM12.1 without native FP4 tensor cores Marlin has to decode the weights on the fly for the matmul. That gives back some of what the memory bandwidth gained. For decode, bandwidth counts more than compute; for prefill, the other way around.
Run B: 25k context, concurrency up to 20
The stress test from part one:
| Users | BF16 d/u | NVFP4 d/u | BF16 TTFT | NVFP4 TTFT |
|---|---|---|---|---|
| 5 | 8.51 t/s | 12.43 t/s | 19.86s | 19.72s |
| 10 | 5.37 t/s | 7.56 t/s | 35.44s | 35.51s |
| 20 | 3.16 t/s | 4.26 t/s | 67.37s | 67.40s |
The aggregate decode plateau shifts from 32 t/s to 36 t/s at c=20: a 12% higher ceiling at 25k context under maximum pressure. TTFT is practically identical between BF16 and NVFP4 because prefill is the wall here and that doesn’t get much faster on SM12.1. Decode per user is clearly better though: at twenty parallel 25k prompts you get 4.26 instead of 3.16 t/s, +35%. Still not chat speed, but a noticeable difference once the tokens start flowing.
Run C: 1k prompt, 1k output
The short-prompt + long-answer workload, close to agent flows and code generation:
| Users | BF16 d/u | NVFP4 d/u | Gain |
|---|---|---|---|
| 1 | 23.86 | 29.45 | +23% |
| 5 | 13.59 | 24.69 | +82% |
| 10 | 10.92 | 20.88 | +91% |
At c=10 per-user decode sits at well over 20 t/s, above reading speed and close to a comfortable streaming UI. Aggregate decode at c=10 hits 209 t/s instead of 86 t/s in BF16, almost a doubling.
Run E: multi-turn (depth 4)
Five consecutive turns per conversation, ten conversations in parallel: the most realistic office shape.
| Users | BF16 d/u | NVFP4 d/u | BF16 TTFT | NVFP4 TTFT |
|---|---|---|---|---|
| 1 | 23.97 | 29.61 | 0.53s | 0.33s |
| 5 | 13.07 | 23.98 | 1.32s | 1.11s |
| 10 | 10.43 | 19.51 | 2.13s | 1.94s |
For ten parallel 5-turn conversations: 1.94 seconds to first token, 19.51 t/s per user. That fits comfortably within what a reader experiences as chat, and is 87% faster per token than BF16 in the same test.
Run F: RAG mix (8k prompt)
| Users | BF16 d/u | NVFP4 d/u | BF16 TTFT | NVFP4 TTFT |
|---|---|---|---|---|
| 5 | 12.11 | 20.91 | 4.32s | 4.28s |
| 10 | 9.31 | 15.96 | 7.99s | 8.00s |
| 20 | 6.05 | 10.57 | 14.61s | 14.45s |
8k context is roughly what a RAG flow with four chunks of 2k tokens takes in. At ten users you wait 8 seconds to first token (almost the same as BF16, because of the compute bottleneck), then 16 t/s streaming. For “ask something about your documents” flows that’s plenty workable, and where the gain sits: in decode speed, not in TTFT.
Run G: short instruction, 4096 output tokens
The agent / code-generation shape:
| Users | BF16 d/u | NVFP4 d/u | BF16 TTFT | NVFP4 TTFT |
|---|---|---|---|---|
| 1 | 24.17 | 29.59 | 0.24s | 0.11s |
| 5 | 14.32 | 25.79 | 0.38s | 0.23s |
| 10 | 11.75 | 22.54 | 0.48s | 0.37s |
A TTFT of 110 milliseconds at single-user is very low, lower than most hosted APIs manage over the network. And 22.54 t/s per user at c=10 is plenty for agent streams. Aggregate decode at c=10 in this test comes out at 225 t/s versus 84 t/s in BF16, almost 2.7× as much. For a team running ten concurrent agents that each produce long structured output, this is the most important number.
Run H: open-loop, random 4k workload
The synthetic office baseline with Poisson arrivals:
| Metric | BF16 | NVFP4 |
|---|---|---|
| Achieved RPS | 0.27 | 0.29 |
| Peak concurrent | 36 | 16 |
| TTFT P50 | 1286 ms | 1006 ms |
| TTFT P99 | 3316 ms | 2893 ms |
| TPOT P50 | 182 ms | 64 ms |
| Total tok/s | 1215 | 1302 |
What stands out is that peak concurrent drops from 36 to 16 at an identical arrival rate (0.3 rps) and identical prompts. Because NVFP4 handles each request faster, the queue stays shorter, and that’s an important insight for capacity planning: NVFP4 gives you not only lower latency per request, but also less queue pressure at the same arrival rate. At the same time TPOT P50 drops from 182ms to 64ms. Median inter-token latency almost three times faster, then. For a chat UI that shows token streaming, that’s the difference between artificially waiting for an answer and just reading along.
Run I: ShareGPT replay (real conversations)
Real multi-turn conversation data:
| Metric | BF16 | NVFP4 |
|---|---|---|
| Peak concurrent | 17 | 10 |
| TTFT P50 | 353 ms | 152 ms |
| TTFT P99 | 637 ms | 265 ms |
| TPOT P50 | 95 ms | 39 ms |
A P99 TTFT of 265 milliseconds, for 99 percent of users. A TPOT of 39 ms works out to 25.6 t/s per user. You can safely call that realtime chat for 25 employees with realistic ShareGPT-style prompts.
Run J: Monday-morning peak
The heaviest scenario from part one: overloaded server, 1.5 rps target with max 25 concurrent requests.
| Metric | BF16 | NVFP4 |
|---|---|---|
| Configured RPS | 1.50 | 1.50 |
| Achieved RPS | 0.26 | 0.44 |
| TTFT P50 | 1132 ms | 920 ms |
| TTFT P99 | 6157 ms | 6054 ms |
| TPOT P50 | 187 ms | 108 ms |
| Total tok/s | 1173 | 1984 |
The most measurable number of the whole day is that achieved RPS goes from 0.26 to 0.44. Same target, same concurrency cap, same Poisson arrivals, and NVFP4 processes 69% more requests per second before the queue clogs up.
P99 TTFT shifts only marginally (6.16s to 6.05s). That fits the pattern: prefill is compute-bound on SM12.1, and NVFP4 isn’t much faster there. But TPOT P50 drops from 187ms to 108ms, and aggregate token throughput grows from 1173 to 1984 t/s. For a 25-person office at peak hours, that’s the difference between enough and a squeeze: more requests per second processed, with faster streaming for whoever’s up next.
What this means for on-prem AI
If you have a Spark and run Gemma-4-26B, NVFP4 is the upgrade. In all 9 tests NVFP4 is the winner, and it frees up 30 GB of memory for other purposes like more KV-cache, a second small model alongside it, or batch jobs. At Kamoo this NVFP4 config now sits next to the BF16 baseline in bench-spark/, and one command switches between the two.
For a 25-person office with realistic ShareGPT-like prompts you notice it right away. TPOT P50 drops from 95 ms to 39 ms, P99 TTFT from 637 ms to 265 ms. And when peak load comes, the system delivers 69% more requests per second before it fills up. For agent flows and code generation (Run G shape) the Spark in NVFP4 is at its strongest: ten parallel agents, each 4096 tokens of output, 22.5 t/s per user with TTFT under 400 ms.
For 25k context stress (Run B) it stays the wall. NVFP4 barely lowers it (TTFT differs by less than a second), because prefill stays prefill, and ten parallel 25k prompts wait 35 seconds for the first token. Quantization changes nothing about that on this hardware. Decode speed it does change: 7.56 t/s/user instead of 5.37, so once the tokens come, they run faster.
What this run doesn’t say
This is not NVFP4 on SM10.0 (datacenter Blackwell). There native FP4 compute would make the difference much bigger, with an expectation of a further 2-3× speedup on top of what we see here. On an H100 or B200 these numbers are therefore not representative; the Spark has a specific SM12.1 handicap (no native FP4) that doesn’t exist in the cloud.
This is also not a comparison with dense Gemma-4-31B in NVFP4. Dense goes through a different code path in vLLM’s loader. For a follow-up blog, dense NVFP4 with the same test suite would give a third data point.
And this is not a long-term accuracy comparison. NVFP4 quantization has potentially small accuracy effects. For the typical tasks in an office (summarization, ticket classification, RAG) rarely noticeable, for edge cases possibly yes.
What NVIDIA did publish is in the NVFP4 model card: on MMLU-Pro, GPQA-Diamond and LiveCodeBench, NVFP4 sits within 0.2 to 0.7 points of their own BF16 baseline. NVIDIA’s own BF16 baseline itself deviates from Google’s official Gemma-4 card numbers. Eval harnesses differ more than precision itself, so cross-comparing between vendors without an identical harness is shaky. That falls within run-to-run variance, no real degradation. What’s curious about that same table is that NVIDIA’s BF16 baseline in turn deviates from what Google publishes in the official Gemma-4 card: MMLU-Pro 85.0 vs 82.6, GPQA 80.3 vs 82.3, LiveCodeBench 80.5 vs 77.1. Not because quantization gets better than the original, but because the eval harness apparently matters more than the precision itself. Different prompts, different temperature, different stop criteria. Cross-comparisons between vendors are therefore hard to pin down without the same harness.
What sticks
Decode sells the benchmark, prefill decides the experience. That held in part one and it still holds. What NVFP4 adds is that decode gets faster in every workload, and most where it matters: at larger context and more users at once. TTFT stays roughly the same on SM12.1 because prefill is compute-bound and the Spark has no native FP4 tensor cores. For what the user feels once the tokens start flowing, NVFP4 on this hardware is a lot better than BF16, and it costs nothing in setup pain: one official vLLM image, one model flag, and it runs.