Gemma-4 on the DGX Spark: NVFP4 vs BF16

In the BF16 baseline of Gemma-4 on the DGX Spark I ran nine benchmarks with Gemma-4-26B-A4B in BF16. Decode speed held up just fine, prefill decided when the wall came, and the system queued neatly under pressure instead of crashing. That story seemed done, until NVIDIA released an NVFP4-quantized version of that same model.

Same architecture and fine-tune, same server config, only the precision changes. From BF16 (16 bits per parameter) to NVFP4 (4 bits per parameter, NVIDIA’s take on FP4). Four times smaller per weight, and if the Blackwell kernels cooperate, also a lot faster on compute-heavy tasks.

On paper, nice. In practice: the official vLLM v0.20.1 release recognizes this checkpoint without any fuss, and the numbers were faster across the board than the BF16 baseline. Both tests fall under the guide running LLMs on the DGX Spark.

Why look into this at all

For an office with a local AI machine, memory budget is the most limiting thing after compute. A 26B model in BF16 takes ~48 GB of GPU memory for weights alone. On a Spark with 128 GB of unified memory, that leaves about 65 GB for KV-cache. Enough for the office scenario from the first blog, but not much room to run, say, 30+ users with large context side by side.

NVFP4 reduces that to ~18 GB for weights. Not four times smaller than BF16 (the vision encoder stays BF16, and scale factors cost space too), but about 2.7× smaller. That gives you toward 95 GB of KV-cache headroom, which in theory should support much higher concurrency. On top of that, less memory traffic is needed per forward pass, so by definition less bandwidth pressure, and that was already the bottleneck in BF16 under multi-user load. So the question was simple: how much of that theoretical gain survives in practice?
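The memory arithmetic above can be sketched in a few lines. This is a back-of-the-envelope check, not a measurement: the 15 GB of non-weight, non-KV overhead is an assumption inferred from the figures in this post (128 − 48 − 65), and the function name is made up for illustration.

```python
# Back-of-the-envelope memory budget for the Spark, using the figures from this post.
TOTAL_GB = 128    # unified memory on the DGX Spark
OVERHEAD_GB = 15  # ASSUMPTION: inferred from 128 - 48 - 65, not measured


def kv_headroom_gb(weights_gb: float) -> float:
    """Memory left for KV-cache once weights and the assumed overhead are taken out."""
    return TOTAL_GB - OVERHEAD_GB - weights_gb


bf16_weights_gb = 48
nvfp4_weights_gb = 18

print("BF16 KV headroom (GB):", kv_headroom_gb(bf16_weights_gb))
print("NVFP4 KV headroom (GB):", kv_headroom_gb(nvfp4_weights_gb))
print("Weight shrink factor:", round(bf16_weights_gb / nvfp4_weights_gb, 1))
```

The same overhead figure reproduces both headroom numbers, which suggests the 65 GB and 95 GB estimates come from one consistent budget.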

What NVFP4 actually is

NVFP4 is NVIDIA’s take on FP4: floating-point numbers with 4 bits per value. Four bits, not four bytes, so a factor of 4 less per parameter than BF16. By storing a scaling factor per group of weights, accuracy stays reasonably intact.
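To make "a scaling factor per group of weights" concrete, here is a toy quantizer in plain Python. It is an illustration only, not vLLM's or NVIDIA's code. The block size of 16 and the 8-bit scale per block reflect my understanding of the NVFP4 format and should be read as assumptions; the scale is also kept as an ordinary float here instead of being rounded to FP8, and real kernels pack two 4-bit codes per byte.

```python
# Toy block-scaled FP4 quantizer. Illustration only.
# ASSUMPTIONS: block size 16 and an 8-bit scale per block; the scale stays a
# plain Python float here instead of being rounded to FP8.

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16
SCALE_BITS = 8


def quantize_block(block):
    """Return (scale, values) where each value is a signed E2M1 grid point."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid point.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    values = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        values.append(-nearest if x < 0 else nearest)
    return scale, values


def dequantize_block(scale, values):
    """Reconstruct approximate weights from the grid points and the block scale."""
    return [scale * v for v in values]


def bits_per_weight():
    """Storage cost per weight including the shared block scale."""
    return (BLOCK_SIZE * 4 + SCALE_BITS) / BLOCK_SIZE


block = [0.0, 0.6, -1.2, 3.0] + [0.1] * 12
scale, values = quantize_block(block)
restored = dequantize_block(scale, values)
print(scale, restored[:4], bits_per_weight())
```

Under these assumptions the storage cost comes out at 4.5 bits per weight rather than 4, which is one of the two reasons given above for why the checkpoint shrinks by about 2.7× and not 4×.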

For Blackwell it works like this. NVIDIA’s datacenter cards (B100, B200, SM10.0) have tensor cores that can compute natively with 4-bit values, and that is much faster than the same calculation in FP16 or BF16. The DGX Spark, on the other hand, is desktop Blackwell (GB10, SM12.1) and that architecture has no native FP4 compute. What you get in that case is “weight-only” FP4: the weights are physically stored as 4-bit (hence the memory gain), but during compute they get decoded on the fly to FP16 for the matrix multiplications. So on the Spark all the gain comes from memory bandwidth, not from compute; on a datacenter B200 you’d expect another 2 to 3× on top of this thanks to native FP4 tensor cores. A vLLM warning makes the fallback explicit:

Your GPU does not have native support for FP4 computation but FP4 quantization
is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel.
This may degrade performance for compute-heavy workloads.

So you get the memory gain in full, the compute gain only partially. The Marlin INT4 GEMM kernel is optimized, but not as fast as native FP4 on SM10.0 would be. Worth factoring in when you look at the numbers further down.

The test setup

Server config identical to the first blog, only the model swaps:

docker run -d --name vllm-bench \
  --gpus all --ipc=host \
  -v appliance_hf-cache:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.20.1 \
  --model nvidia/Gemma-4-26B-A4B-NVFP4 \
  --served-model-name gemma-4-26b-a4b-nvfp4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling \
  --no-enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

Tests are one-to-one identical to the first blog: same commands, same concurrency levels, same datasets for the open-loop tests, same seed. That is on purpose, because if you want to measure the effect of an isolated variable (in this case the precision), everything around it has to stay the same. Exactly how I measure those concurrency levels, seeds and open-loop arrivals is described in the Arena measurement method.

Comparison	BF16	NVFP4
Model	google/gemma-4-26B-A4B-it	nvidia/Gemma-4-26B-A4B-NVFP4
Active params	4B	4B
Total params	26B	26B
Model memory	~48 GB	~18 GB
KV-cache headroom	~65 GB	~95 GB
MoE backend	(default)	MARLIN (forced)

Three numbers sum up where this lands. Click through for the full run in the Arena, with all seeds, concurrency levels and commands:

Metric	NVFP4	vs BF16
Decode @ c=10 (256→4k)	22.5 tok/s	+92%
Monday-peak RPS	0.44	+69%
ShareGPT TPOT P50	39 ms	−59%

An interactive version of all the numbers is on the Arena page for Gemma-4-26B-A4B-NVFP4, including commands and TTFT percentiles for all 9 tests.

Run A: context scaling from 4k to 25k

Decode per user (d/u, in tokens per second) as context grows, at c=1/5/10 concurrent users:

Context	Users	BF16 d/u	NVFP4 d/u	Gain
4k	1	24.08	29.80	+24%
4k	5	12.55	22.01	+75%
4k	10	9.48	16.94	+79%
8k	1	23.69	29.31	+24%
8k	5	11.48	19.28	+68%
8k	10	8.52	14.35	+68%
16k	1	23.34	28.55	+22%
16k	5	10.05	15.67	+56%
16k	10	6.79	10.06	+48%
25k	1	22.75	27.70	+22%
25k	5	8.46	12.46	+47%
25k	10	5.40	7.55	+40%

At c=1 the gain is stable at +22% to +24% across all contexts. Memory bandwidth barely matters for single-user, so the gain here sits in the compute path itself. Marlin’s INT4 decode plus FP16 matmul is slightly faster than the direct 16-bit matmul of the BF16 baseline, even though it’s two steps.

At c=10 the difference scales much more strongly with workload type, from +40% at 25k context to +79% at 4k. That’s because under multi-user the memory bandwidth becomes the bottleneck, and NVFP4 reads fewer bytes per forward pass. The more concurrent, the more that counts, until you hit the KV-cache memory limits again (25k context with multiple users) and the gain flattens out.
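The Gain column is plain relative change in per-user decode speed. A minimal sketch that recomputes it from the c=10 rows of the table above (the function name is illustrative):

```python
def gain_pct(bf16: float, nvfp4: float) -> int:
    """Relative speedup of NVFP4 over BF16, rounded to whole percent."""
    return round((nvfp4 / bf16 - 1) * 100)


# (context, BF16 d/u, NVFP4 d/u) at c=10, copied from the Run A table.
rows_c10 = [
    ("4k", 9.48, 16.94),
    ("8k", 8.52, 14.35),
    ("16k", 6.79, 10.06),
    ("25k", 5.40, 7.55),
]

for context, bf16, nvfp4 in rows_c10:
    print(f"{context}: +{gain_pct(bf16, nvfp4)}%")
```

Running it over the other rows is a quick way to check the table for transcription slips.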

TTFT (time to first token) improves too, though only slightly:

Context	Users	BF16 TTFT	NVFP4 TTFT
4k	10	4.46s	4.20s
8k	10	7.99s	7.84s
16k	10	18.92s	18.69s
25k	10	35.67s	35.65s

On TTFT the gain is small. That makes sense: prefill is compute-heavy, and on SM12.1 without native FP4 tensor cores Marlin has to decode the weights on the fly for the matmul. That gives back some of what the memory bandwidth gained. For decode, bandwidth counts more than compute; for prefill, the other way around.

Run B: 25k context, concurrency up to 20

The stress test from part one:

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
5	8.51 t/s	12.43 t/s	19.86s	19.72s
10	5.37 t/s	7.56 t/s	35.44s	35.51s
20	3.16 t/s	4.26 t/s	67.37s	67.40s

The aggregate decode plateau shifts from 32 t/s to 36 t/s at c=20: a 12% higher ceiling at 25k context under maximum pressure. TTFT is practically identical between BF16 and NVFP4 because prefill is the wall here and that doesn’t get much faster on SM12.1. Decode per user is clearly better though: at twenty parallel 25k prompts you get 4.26 instead of 3.16 t/s, +35%. Still not chat speed, but a noticeable difference once the tokens start flowing.

Run C: 1k prompt, 1k output

The short-prompt + long-answer workload, close to agent flows and code generation:

Users	BF16 d/u	NVFP4 d/u	Gain
1	23.86	29.45	+23%
5	13.59	24.69	+82%
10	10.92	20.88	+91%

At c=10 per-user decode sits at well over 20 t/s, above reading speed and close to a comfortable streaming UI. Aggregate decode at c=10 hits 209 t/s instead of 86 t/s in BF16, roughly 2.4× as much.

Run E: multi-turn (depth 4)

Five consecutive turns per conversation, ten conversations in parallel: the most realistic office shape.

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
1	23.97	29.61	0.53s	0.33s
5	13.07	23.98	1.32s	1.11s
10	10.43	19.51	2.13s	1.94s

For ten parallel 5-turn conversations: 1.94 seconds to first token, 19.51 t/s per user. That fits comfortably within what a reader experiences as chat, and is 87% faster per token than BF16 in the same test.

Run F: RAG mix (8k prompt)

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
5	12.11	20.91	4.32s	4.28s
10	9.31	15.96	7.99s	8.00s
20	6.05	10.57	14.61s	14.45s

8k context is roughly what a RAG flow with four chunks of 2k tokens takes in. At ten users you wait 8 seconds to first token (almost the same as BF16, because of the compute bottleneck), then get 16 t/s of streaming. For “ask something about your documents” flows that’s plenty workable, and it shows where the gain sits: in decode speed, not in TTFT.

Run G: short instruction, 4096 output tokens

The agent / code-generation shape:

Users	BF16 d/u	NVFP4 d/u	BF16 TTFT	NVFP4 TTFT
1	24.17	29.59	0.24s	0.11s
5	14.32	25.79	0.38s	0.23s
10	11.75	22.54	0.48s	0.37s

A TTFT of 110 milliseconds at single-user is very low, lower than most hosted APIs manage over the network. And 22.54 t/s per user at c=10 is plenty for agent streams. Aggregate decode at c=10 in this test comes out at 225 t/s versus 84 t/s in BF16, almost 2.7× as much. For a team running ten concurrent agents that each produce long structured output, this is the most important number.

Run H: open-loop, random 4k workload

The synthetic office baseline with Poisson arrivals:

Metric	BF16	NVFP4
Achieved RPS	0.27	0.29
Peak concurrent	36	16
TTFT P50	1286 ms	1006 ms
TTFT P99	3316 ms	2893 ms
TPOT P50	182 ms	64 ms
Total tok/s	1215	1302

What stands out is that peak concurrent drops from 36 to 16 at an identical arrival rate (0.3 rps) and identical prompts. Because NVFP4 handles each request faster, the queue stays shorter. That’s an important insight for capacity planning: NVFP4 gives you not only lower latency per request, but also less queue pressure at the same arrival rate. At the same time TPOT P50 drops from 182 ms to 64 ms, so median inter-token latency is almost three times lower. For a chat UI that streams tokens, that’s the difference between visibly waiting on each word and just reading along.
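The link between faster requests and a shorter queue is Little’s law: the average number of requests in flight equals the arrival rate times the average time a request spends in the system. A minimal sketch follows, with two caveats: the law gives an average for a stable system while the table reports a peak, and the latencies below are made-up illustrative values, not measurements from this run.

```python
def avg_in_flight(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Little's law: L = lambda * W (average concurrency in a stable system)."""
    return arrival_rate_rps * avg_latency_s


ARRIVAL_RPS = 0.3  # arrival rate used in Run H

# HYPOTHETICAL end-to-end latencies, chosen only to show the proportionality.
for label, latency_s in [("slower backend", 60.0), ("faster backend", 30.0)]:
    print(label, avg_in_flight(ARRIVAL_RPS, latency_s))
```

Halving the time per request halves the average number of requests in flight at the same arrival rate, which is the direction the peak-concurrent figures move in.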

Run I: ShareGPT replay (real conversations)

Real multi-turn conversation data:

Metric	BF16	NVFP4
Peak concurrent	17	10
TTFT P50	353 ms	152 ms
TTFT P99	637 ms	265 ms
TPOT P50	95 ms	39 ms

A P99 TTFT of 265 milliseconds means 99 percent of requests see their first token within that time. A TPOT of 39 ms works out to 25.6 t/s per user. You can safely call that realtime chat for 25 employees with realistic ShareGPT-style prompts.
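For readers who want to reproduce the conversion, here is a small sketch of how TTFT and TPOT can be derived from token timestamps, and how TPOT maps to tokens per second. The definitions follow common benchmark practice (TPOT as the mean gap between tokens after the first); the Arena’s exact method may differ, and the timestamps are invented for illustration.

```python
def ttft_and_tpot_ms(request_sent_s, token_times_s):
    """Return (TTFT, TPOT) in milliseconds from wall-clock token timestamps.

    TTFT is the delay until the first token; TPOT is the mean gap between
    consecutive tokens after the first one.
    """
    ttft_s = token_times_s[0] - request_sent_s
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    tpot_s = sum(gaps) / len(gaps)
    return ttft_s * 1000, tpot_s * 1000


def tokens_per_second(tpot_ms):
    """Per-user streaming speed implied by a given TPOT."""
    return 1000 / tpot_ms


# HYPOTHETICAL timestamps for a single request, in seconds.
ttft_ms, tpot_ms = ttft_and_tpot_ms(0.0, [0.152, 0.191, 0.230, 0.269])
print(round(ttft_ms), round(tpot_ms))
print(round(tokens_per_second(39), 1))  # the Run I median TPOT of 39 ms
```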

Run J: Monday-morning peak

The heaviest scenario from part one: overloaded server, 1.5 rps target with max 25 concurrent requests.

Metric	BF16	NVFP4
Configured RPS	1.50	1.50
Achieved RPS	0.26	0.44
TTFT P50	1132 ms	920 ms
TTFT P99	6157 ms	6054 ms
TPOT P50	187 ms	108 ms
Total tok/s	1173	1984

The most telling number of the whole day is that achieved RPS goes from 0.26 to 0.44. Same target, same concurrency cap, same Poisson arrivals, and NVFP4 processes 69% more requests per second before the queue clogs up.

P99 TTFT shifts only marginally (6.16s to 6.05s). That fits the pattern: prefill is compute-bound on SM12.1, and NVFP4 isn’t much faster there. But TPOT P50 drops from 187ms to 108ms, and aggregate token throughput grows from 1173 to 1984 t/s. For a 25-person office at peak hours, that’s the difference between enough and a squeeze: more requests per second processed, with faster streaming for whoever’s up next.
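In an open-loop test, requests arrive on their own schedule whether or not earlier ones have finished, which is why the same 1.5 rps target can produce different achieved rates. A minimal sketch of seeded Poisson arrivals is shown here; the seed, the duration, and the function name are hypothetical and are not taken from the benchmark tooling used for these runs.

```python
import random


def poisson_arrivals(rate_rps, duration_s, seed):
    """Arrival times (seconds) of a Poisson process with the given mean rate.

    Gaps between arrivals are exponentially distributed; a fixed seed makes
    the schedule identical across runs, so two backends see the same load.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    while True:
        now += rng.expovariate(rate_rps)
        if now >= duration_s:
            return times
        times.append(now)


# HYPOTHETICAL parameters: 1.5 rps target for ten minutes, seed 42.
arrivals = poisson_arrivals(rate_rps=1.5, duration_s=600, seed=42)
print(len(arrivals), "requests scheduled")
```

Replaying the same schedule against BF16 and NVFP4 keeps the offered load fixed, so any difference in achieved RPS comes from the server side.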

What this means for on-prem AI

If you have a Spark and run Gemma-4-26B, NVFP4 is the upgrade. In all 9 tests NVFP4 is the winner, and it frees up 30 GB of memory for other purposes like more KV-cache, a second small model alongside it, or batch jobs. At Kamoo this NVFP4 config now sits next to the BF16 baseline in bench-spark/, and one command switches between the two.

For a 25-person office with realistic ShareGPT-like prompts you notice it right away. TPOT P50 drops from 95 ms to 39 ms, P99 TTFT from 637 ms to 265 ms. And when peak load comes, the system delivers 69% more requests per second before it fills up. For agent flows and code generation (Run G shape) the Spark in NVFP4 is at its strongest: ten parallel agents, each 4096 tokens of output, 22.5 t/s per user with TTFT under 400 ms.

For the 25k context stress test (Run B), prefill stays the wall. NVFP4 barely lowers it (TTFT differs by less than a second), because prefill stays prefill, and ten parallel 25k prompts wait 35 seconds for the first token. Quantization changes nothing about that on this hardware. What it does change is decode speed: 7.56 t/s/user instead of 5.37, so once the tokens come, they run faster.

What this run doesn’t say

This is not NVFP4 on SM10.0 (datacenter Blackwell). There, native FP4 compute would make the difference much bigger; I’d expect a further 2-3× speedup on top of what we see here. These numbers are therefore not representative of a B100 or B200; the Spark has a specific SM12.1 handicap (no native FP4) that doesn’t exist on datacenter Blackwell in the cloud.

This is also not a comparison with dense Gemma-4-31B in NVFP4. Dense goes through a different code path in vLLM’s loader. For a follow-up blog, dense NVFP4 with the same test suite would give a third data point.

And this is not a long-term accuracy comparison. NVFP4 quantization has potentially small accuracy effects. For the typical tasks in an office (summarization, ticket classification, RAG) rarely noticeable, for edge cases possibly yes.

What NVIDIA did publish is in the NVFP4 model card: on MMLU-Pro, GPQA-Diamond and LiveCodeBench, NVFP4 sits within 0.2 to 0.7 points of their own BF16 baseline. That falls within run-to-run variance, no real degradation. What’s curious about that same table is that NVIDIA’s BF16 baseline in turn deviates from what Google publishes in the official Gemma-4 card: MMLU-Pro 85.0 vs 82.6, GPQA 80.3 vs 82.3, LiveCodeBench 80.5 vs 77.1. Not because quantization gets better than the original, but because the eval harness apparently matters more than the precision itself. Different prompts, different temperature, different stop criteria. Cross-comparisons between vendors are therefore hard to pin down without the same harness.

What sticks

Decode sells the benchmark, prefill decides the experience. That held in part one and it still holds. What NVFP4 adds is that decode gets faster in every workload, and most where it matters: with many users at once. TTFT stays roughly the same on SM12.1 because prefill is compute-bound and the Spark has no native FP4 tensor cores. For what the user feels once the tokens start flowing, NVFP4 on this hardware is a lot better than BF16, and it costs nothing in setup pain: one official vLLM image, one model flag, and it runs.
