Every model in the arena runs through the same 9 benchmarks on the same DGX Spark. 5 closed-loop for pure throughput, 4 open-loop for how it feels under real concurrency. Below: the why, the how, and every command one-to-one. Pairs with the guide "Running LLMs on the DGX Spark".
02 Why 9 benchmarks
One benchmark is no benchmark. A model that hits 120 t/s on an empty loop can collapse to 14 t/s per user under 32 concurrent requests. And a model with fast tokens/sec can still have a TTFT of 4 seconds: useless for chat, fine for batch.
So: measure both. Throughput in isolation (closed-loop) and behaviour under pressure (open-loop). Short prompts, long prompts, short outputs, long outputs. That's how you see what a model is really suited for, not just what the release notes claim.
9 benchmarks
3 runs · seed 42
±2% run-to-run
03 Two tools, two questions
A · closed-loop
llama-benchy
How fast can this model generate when nobody else is on the channel?
Closed-loop means: the moment a request finishes, I fire off the next one. No waiting between requests, no burst peaks, just pure decode throughput. Result: tokens/sec per stream, with a clean TTFT as reference.
Ideal for: a single user, batch processing, code completion. Not representative of: a chat app with many concurrent sessions.
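The closed-loop idea can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not how llama-benchy is implemented: `closed_loop` and `fake_request` are hypothetical names, and the stand-in request simply sleeps instead of calling a server.

```python
import time
from typing import Callable


def closed_loop(send_request: Callable[[], int], num_requests: int) -> float:
    """Issue requests back to back and return generated tokens per second.

    send_request is a hypothetical callable that blocks until the response
    is complete and returns the number of tokens that were generated.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(num_requests):
        # No pause here: the next request starts the moment the last one ends.
        total_tokens += send_request()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_request() -> int:
    """Stand-in for a real model call: pretend 10 tokens take 10 ms."""
    time.sleep(0.01)
    return 10


if __name__ == "__main__":
    print(f"{closed_loop(fake_request, 5):.0f} tokens/sec")
```

Because the loop never has more than one request in flight, the number it reports says nothing about queueing or contention, which is why the open-loop tests exist.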
B · open-loop
vllm bench serve
How does it hold up under X concurrent users who don't wait for each other?
Open-loop generates requests following a Poisson process, independent of what the server is already doing. At rate 0.3 req/s a new one arrives on average every 3.3 seconds, whether the previous answer is done or not. That's what production looks like.
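A Poisson arrival schedule is easy to sketch: the gaps between requests are drawn from an exponential distribution with mean 1/rate. The function name `poisson_arrivals` is made up for this illustration, and it uses Python's standard `random` module rather than whatever the benchmark tool does internally.

```python
import random


def poisson_arrivals(rate: float, num_requests: int, seed: int = 42) -> list[float]:
    """Return send times (in seconds) for an open-loop Poisson schedule.

    Gaps are exponential with mean 1/rate, so rate=0.3 gives one request
    every ~3.3 s on average, independent of how busy the server is.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_requests):
        now += rng.expovariate(rate)  # exponential inter-arrival gap
        times.append(now)
    return times


if __name__ == "__main__":
    schedule = poisson_arrivals(rate=0.3, num_requests=200)
    print(f"mean gap: {schedule[-1] / len(schedule):.2f} s")
```

The schedule is computed up front, before any request is sent. That is the defining property of open-loop load: a slow server does not slow the arrivals down.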
Result: p50/p95 TTFT and per-user tokens/sec under load. I do report TTFT numbers (above 2s your app starts to feel slow), but they don't drop models from the ranking.
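As a rough sketch of how p50 and p95 can be derived from a list of TTFT samples, here is a nearest-rank percentile in plain Python. The helper name and the sample values are hypothetical, and a benchmark tool may use an interpolating percentile instead, so small differences from its reported numbers are to be expected.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with >= p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical TTFT samples in seconds, purely illustrative.
    ttft = [0.21, 0.25, 0.30, 0.34, 0.41, 0.55, 0.80, 1.10, 1.90, 2.60]
    p50 = percentile(ttft, 50)
    p95 = percentile(ttft, 95)
    # 2 s is the "starts to feel slow" threshold from the text; it is
    # reported, not used to drop a model from the ranking.
    print(f"p50={p50:.2f}s p95={p95:.2f}s feels_slow={p95 > 2.0}")
```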
04 The 9 benchmarks
Each card below describes one benchmark: what it measures, with which parameters, and why those numbers are chosen that way. Expand "view command" for the exact CLI command I run.
01 · llama-benchy closed-loop
Chat
Short prompt, long answer. The shape that should feel like a normal chat; TTFT determines whether it feels "snappy".
The yardstick for "does it feel snappy?". A short prompt needs little prefill, a long output stresses decode. This is what an ordinary chat message looks like to the user.
At 25k tokens you see where the model uses up its prefill budget. TTFT often jumps from hundreds of milliseconds to seconds. Important to know where the wall is.
Random dataset · 4000 tokens in, 500 tokens out · request rate 0.3, burstiness 0.7. A quiet office.
dataset: random · rate: 0.30 req/s · burstiness: 0.7 · prompts: 200
A quiet office: a few people at once, average length. If this collapses, your problem is bigger than which model to pick. Baseline for all open-loop tests.
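The burstiness value deserves a short illustration. The sketch below assumes the behaviour described in vLLM's benchmark documentation: inter-arrival gaps follow a gamma distribution with burstiness as the shape parameter, 1.0 reduces to a Poisson process, and values below 1.0 make traffic clumpier at the same average rate. The function names are hypothetical and the code uses Python's standard library, not the tool's own implementation.

```python
import random
import statistics


def gamma_gaps(rate: float, burstiness: float, n: int, seed: int = 42) -> list[float]:
    """Draw n inter-arrival gaps with mean 1/rate and gamma shape = burstiness."""
    rng = random.Random(seed)
    scale = 1.0 / (rate * burstiness)  # keeps the mean gap at 1/rate
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


def coefficient_of_variation(gaps: list[float]) -> float:
    """Stddev divided by mean: higher means burstier arrivals."""
    return statistics.stdev(gaps) / statistics.mean(gaps)


if __name__ == "__main__":
    for burstiness in (0.7, 1.0):
        gaps = gamma_gaps(rate=0.3, burstiness=burstiness, n=50_000)
        print(
            f"burstiness={burstiness}: mean gap {statistics.mean(gaps):.2f} s, "
            f"CV {coefficient_of_variation(gaps):.2f}"
        )
```

Under that assumption, 0.7 keeps the average at one request every ~3.3 s but lets requests occasionally arrive in small clusters, which is closer to how a handful of people actually use a shared endpoint.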
Synthetic distributions are no substitute for real data. ShareGPT V3 has the natural spread of real conversations: short, long, code, prose.
Long chain-of-thought outputs · 1k in / 4k out · slow rate (0.2 req/s) because every request costs a lot of decode budget. Tests whether TTFT stays stable.
dataset: random · rate: 0.20 req/s · burstiness: 1.0 · prompts: 50
Long thinking traces eat decode budget. A slow rate (0.2 req/s) doesn't sound like much, but every request costs 4k tokens of output. Tests whether long-running requests block the TTFT of new requests.
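The arithmetic behind that claim is worth spelling out. The sketch below takes "4k" as 4000 tokens and uses Little's law to estimate how many requests are decoding at once; the per-stream speed of 40 t/s is a hypothetical value chosen only to make the example concrete, and prefill and queueing time are ignored.

```python
def offered_decode_load(rate_rps: float, output_tokens: int) -> float:
    """Tokens per second the server must generate just to keep up."""
    return rate_rps * output_tokens


def expected_in_flight(rate_rps: float, output_tokens: int, per_stream_tps: float) -> float:
    """Little's law (L = arrival rate x time in system), decode time only."""
    decode_seconds = output_tokens / per_stream_tps
    return rate_rps * decode_seconds


if __name__ == "__main__":
    # Long-output profile: 0.2 req/s, 4000 tokens out.
    print(f"long outputs: {offered_decode_load(0.2, 4000):.0f} t/s to sustain")
    # Baseline profile for comparison: 0.3 req/s, 500 tokens out.
    print(f"baseline:     {offered_decode_load(0.3, 500):.0f} t/s to sustain")
    # Hypothetical per-stream decode speed of 40 t/s.
    print(f"in flight:    {expected_in_flight(0.2, 4000, 40.0):.0f} requests")
```

So the "slow" profile asks for roughly 800 t/s of sustained decode against about 150 t/s for the baseline, even though fewer requests arrive per second.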
prefix caching off · async scheduling per profile env
Quant: BF16 · FP8 · NVFP4 (each model in available precisions)
OS: Ubuntu 24.04 (CUDA 13.0 · Docker container per profile)
Warmup: 3 runs / number (reported as mean ± stddev · seed 42)
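For readers who want to check their own runs the same way, here is a minimal sketch of the summary step. It assumes the sample standard deviation (n − 1) and defines run-to-run spread as the largest deviation from the mean; the text does not specify either choice. The function name and the three throughput values are hypothetical, not measured results.

```python
import statistics


def summarize(runs: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample stddev, worst relative deviation from the mean)."""
    mean = statistics.mean(runs)
    stddev = statistics.stdev(runs)  # sample stddev, divides by n - 1
    worst = max(abs(r - mean) for r in runs) / mean
    return mean, stddev, worst


if __name__ == "__main__":
    # Hypothetical tokens/sec from three runs, illustrative only.
    runs = [100.0, 101.0, 99.0]
    mean, stddev, worst = summarize(runs)
    print(f"{mean:.1f} ± {stddev:.1f} t/s, worst deviation {worst:.1%}")
    print("within ±2%" if worst <= 0.02 else "outside ±2%, rerun")
```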
06 Reproducibility
Every command above is copy-paste-able. No hidden flags, no dataset tweaks, no seed shopping.
To reproduce you need: a Spark (or a comparable Blackwell GPU with enough VRAM for the checkpoints), vLLM 0.19+, and the two benchmark tools. The prompt datasets are standard vLLM defaults: ShareGPT V3 for open-loop, synthetic random for the baselines.
Got a different result? That's interesting. Let me know. Especially if you use a different platform: the whole point is to see where Spark does and doesn't match other setups.
✓ Fixed seed (42) across all runs
✓ Identical vLLM version per model set
✓ 3 runs reported as mean ± stddev
✓ Numbers within ±2% run-to-run
✓ No prefix caching in the default profile
✓ No latency gate: every model stays visible, ranking on throughput + quality
A model you'd like to see in the suite?
Send a Hugging Face link or vLLM config. If it fits on Spark I'll run it and publish the numbers, plain, in the same table.