The three numbers behind a fast DGX Spark
Decode, prefill and queueing: three numbers decide whether a DGX Spark feels fast under a real workload, and those three are exactly what most reviews skip.
Can you seriously run large language models locally on a DGX Spark? Yes. That is the boring answer, and it is also the answer every review hands you: a model name, a number, tokens per second, done.
The useful answer is harder. A model that handles one demo prompt nicely tells you nothing about a Monday morning with ten people, big context, agent flows and someone pasting half a novel into a ticket. That is where it starts to chafe, or it doesn’t. And that depends not on the Spark but on your workload.
I have a Spark sitting in the lab and ran a stack of models on it, in BF16, FP8 and NVFP4. Nine workloads, two measurement methods, and a few runs redone because the first ones looked suspiciously good. What was left after all that measuring is not a scoreboard. It is a single way of looking at the machine that held up every time, and it is laid out below. The hard numbers per model are in the separate posts, and the complete guide with the setup, the cost and who it works for is at Running LLMs on the DGX Spark. This piece is about that one lens.
What the thing actually is
The DGX Spark is NVIDIA’s smallest Blackwell machine. A GB10 superchip, 128 GB unified memory, small enough for a server rack. There is no separate graphics card with its own memory pool, just one pool that the CPU and the GPU share. Remember that number, 128 GB. It is your entire budget, and everything that follows is a division problem inside that 128.
One thing you need to know up front, because it explains half the numbers later. The Spark runs on desktop Blackwell, SM12.1, and that chip cannot compute natively in 4-bit. The big datacenter Blackwell, the B200, can. The result: from 4-bit quantization you get the full memory gain on the Spark, but not the full compute gain. vLLM works around this by pulling 4-bit weights back up to higher precision during compute.
That works fine. But it is exactly why you should not blindly stick the pretty FP4 numbers from a B200 onto your own Spark.
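To make "pulling 4-bit weights back up" concrete, here is a minimal sketch of the idea in plain Python. It is not vLLM's kernel. It assumes the 4-bit E2M1 float layout (one sign bit, two exponent bits, one mantissa bit), a single scale per block, and low-nibble-first packing; the packing order and the function names are my own assumptions for illustration.

```python
# Magnitudes representable by a 4-bit E2M1 float, indexed by the low three bits.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code: int) -> float:
    """Decode one 4-bit code: the top bit is the sign, the low three bits pick the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_block(packed: bytes, scale: float) -> list:
    """Unpack two 4-bit codes per byte and rescale them to ordinary floats.

    Assumption: low nibble first. Real kernels define their own packing order.
    """
    values = []
    for byte in packed:
        values.append(decode_fp4(byte & 0x0F) * scale)
        values.append(decode_fp4(byte >> 4) * scale)
    return values


print(dequantize_block(bytes([0x21, 0xF7]), scale=0.5))
```

The weights stay small in memory, which is the memory gain. The multiplication afterwards happens on the unpacked, higher-precision values, which is why the compute gain does not come along for free.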
What fits in 128 GB
Short version: the weights go in first, the rest is KV-cache for all users together. Precision is therefore a design choice up front, not a knob afterward, and I wrote a separate post about it. The question is never whether a model fits, but what is left when it does. The full calculation is in the guide.
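As a back-of-the-envelope version of that budget, here is a sketch in Python. The bytes-per-parameter figures are nominal (the NVFP4 figure ignores the small overhead of scale factors), the 26B size is borrowed from the model mentioned further down, and `reserved_gb` is a hypothetical knob for whatever the OS and runtime claim. It is an estimate, not a measurement.

```python
# Nominal storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_budget_gb(params_billion: float, precision: str,
                 total_gb: float = 128.0, reserved_gb: float = 0.0) -> float:
    """What remains for KV-cache (and everything else) once the weights are loaded."""
    return total_gb - reserved_gb - weight_gb(params_billion, precision)


for precision in BYTES_PER_PARAM:
    weights = weight_gb(26, precision)
    left = kv_budget_gb(26, precision)
    print(f"{precision:>5}: weights ~{weights:.0f} GB, ~{left:.0f} GB left over")
```

The leftover is shared by every concurrent user, so halving the weights does more than make room for a bigger model: it buys context and concurrency.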
How fast it really is
This is where most DGX Spark reviews go wrong. They grab one prompt, measure tokens per second, and call that “the speed”. But speed on this machine is not a number. It is three things that feel different and behave differently. Pull them apart and the whole Spark falls into place.
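If you log when each streamed token arrives, pulling the numbers apart takes only a few lines. A minimal sketch, assuming you already have those timestamps; `split_timings` is a hypothetical helper, not part of any benchmark tool. Note that the time to first token measured this way contains both queue wait and prefill.

```python
def split_timings(request_sent, token_times):
    """Return (time to first token in seconds, decode rate in tokens per second).

    request_sent: timestamp at which the request was sent.
    token_times: arrival timestamps of the streamed tokens, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    decode_seconds = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_seconds <= 0:
        return ttft, None  # no measurable decode phase
    # Tokens after the first one, divided by the time they took to arrive.
    decode_rate = (len(token_times) - 1) / decode_seconds
    return ttft, decode_rate


ttft, rate = split_timings(0.0, [2.0, 2.5, 3.0, 3.5, 4.0])
print(f"first token after {ttft:.1f} s, then {rate:.1f} tokens/s")
```

A single tokens-per-second figure over the whole request blends these two, which is exactly how a long wait gets hidden behind a decent average.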
Decode is nearly free
Decode is the phase in which text streams in once the model is actually generating. On the Spark that is boringly stable, and boring is a compliment here. One user on a 26B model gets between 23 and 24 tokens per second in BF16, whether you feed it 4k or 25k context. Ten users at once: about 9 to 12 each, and that is where it stays. Decode therefore depends on how many people are busy at the same time, not on how long their prompt is.
And quantization lifts that whole line up. NVFP4 won on decode in all nine tests, by 22 to 92 percent depending on the workload. On a lighter MoE model like Nemotron-3, single-user decode even brushes up against 60 t/s. Decode, in short, is not the problem.
Prefill is the bill
Prefill is. Prefill is the silence before the first token, and that is what a user experiences as “slow”, not the tokens after it.
Prefill scales with your prompt size, and that hurts. A short prompt is processed within half a second, even with ten people at once. Throw 25k context at it with those same ten users and you wait 35 seconds for the first character. Same machine, same concurrency, just a longer prompt. Double the prompt, roughly double the wait.
And quantization? Barely helps here. Prefill is compute, and compute is exactly where that SM12.1 handicap sits. NVFP4 makes your decode faster. Your prefill stays prefill.
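The arithmetic behind that can be sketched as follows. The reference point (25k tokens, 35 seconds, ten users) and the decode rate of about 10 t/s come from the numbers above; the linear scaling is the rule of thumb "double the prompt, roughly double the wait", not a fitted model. The 500-token answer and the 50 percent decode gain are hypothetical values chosen for illustration.

```python
def estimate_ttft(prompt_tokens, ref_tokens=25_000, ref_seconds=35.0):
    """Linear extrapolation of time to first token from one measured point."""
    return ref_seconds * prompt_tokens / ref_tokens


def response_seconds(ttft, output_tokens, decode_tps):
    """Total wait: silence before the first token plus the time spent generating."""
    return ttft + output_tokens / decode_tps


OUTPUT_TOKENS = 500  # hypothetical answer length
ttft = estimate_ttft(25_000)
for label, decode_tps in [("baseline decode", 10.0), ("decode 50% faster", 15.0)]:
    generating = OUTPUT_TOKENS / decode_tps
    total = response_seconds(ttft, OUTPUT_TOKENS, decode_tps)
    print(f"{label}: {ttft:.0f} s waiting + {generating:.0f} s generating = {total:.0f} s")
```

A faster decode shortens only the second term. With a long prompt the first term dominates, and that is the term quantization barely touches.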
Under pressure it queues, it doesn’t crash
That leaves the question: what does it do when you simply throw too much at it? The answer is reassuringly boring. It does not fall over. It gets in line.
In the heaviest test I tried to push 1.5 requests per second through the machine. It sustained roughly a sixth of that. And yet not a single one of the 300 requests failed. Nor was the slowdown spread evenly. It landed in the tail: the average user noticed little, while the unlucky one percent waited six seconds for their first token.
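To see why the average hides the tail, it helps to look at percentiles instead of the mean. The sketch below uses a nearest-rank percentile on made-up time-to-first-token samples. The values are synthetic, shaped only to mimic "most fast, a few slow", and are not my benchmark output.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic samples in seconds (illustration only): 290 fast, 6 slower, 4 slow.
ttft = [0.4] * 290 + [2.0] * 6 + [6.0] * 4

print(f"mean {mean(ttft):.2f} s")
print(f"p50  {percentile(ttft, 50)} s")
print(f"p99  {percentile(ttft, 99)} s")
```

The mean and the median barely move while the 99th percentile carries the whole story, so that is the number to watch when deciding whether a queue is acceptable.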
For on-prem that is the best outcome you can hope for. A crash is a phone call. A queue is a bit of patience. An office lives with the second, not the first.
That is the whole model. Decode is nearly free, prefill is the bill, queueing is your safety net. The numbers underneath it, nine workloads per model and two measurement methods, are in the arena and in the separate posts: the BF16 baseline, NVFP4 against BF16 and Nemotron-3 in three precisions.
The rest is in the guide
Which engine I run (vLLM), what a Spark costs, and who this does or does not work for: that is the complete picture, and it belongs in the guide, not in this single-lens piece. The short version of “who for”: local only gets interesting once the data is not allowed to leave the building. If you don’t have that requirement and you just want the fastest, cheapest tokens, then a cloud API is the more honest answer.
Running local is not a principle. It is a division: what has to stay in, and what is allowed to go out.
Do it yourself
Everything underneath is open. The models are on Hugging Face, vLLM is open source, and the raw benchmark output plus the scripts are on GitHub. The methodology explains which nine workloads I run and why.
If you have a Spark yourself, you should be able to walk the same route and get roughly the same numbers. If that doesn’t work out, that is exactly what I want to know. Feel free to email.