The workbook.
build logs and field notes
What I build, what I break, and what I learn along the way. On-prem AI on a DGX Spark, agents, MCP servers and evaluations. Not from scratch, but once you're building with it.
- 001 · 23-06-26 · On-prem AI · 7 min
Gemma-4 v23 on the DGX Spark
New vLLM v0.23.0 runs for Gemma-4 on the DGX Spark: BF16, NVFP4 and MTP compared across decode, TTFT, tails and practical local-agent limits.
- 002 · 22-05-26 · On-prem AI · 5 min
The three numbers behind a fast DGX Spark
Decode, prefill and queueing: three numbers decide whether a DGX Spark feels fast under a real workload, and those three are exactly what most reviews skip.
- 003 · 05-05-26 · Reflections · 7 min
Why this blog and arena exist
I looked for concrete numbers on local AI on the DGX Spark and never found them. So I measure them myself, building the blog and arena as an open workbench.
- 004 · 03-05-26 · On-prem AI · 14 min
Gemma-4 on the DGX Spark: NVFP4 vs BF16
Nine identical benchmarks, two precisions. NVFP4 runs 22 to 92 percent faster per token, and peak-hour capacity grows 69 percent on the Spark.
- 005 · 03-05-26 · On-prem AI · 18 min
Nemotron-3 on the DGX Spark: BF16 vs FP8 vs NVFP4
One model, three precisions, the same Spark. What memory budget, decode speed and tail latency do when you go from 16-bit to 8-bit to 4-bit.
- 006 · 01-05-26 · On-prem AI · 27 min
Gemma-4 on the DGX Spark: the price of context
Nine benchmarks of Gemma-4-26B-A4B-it on the DGX Spark with llama-benchy and vLLM. Decode holds up; prefill and queueing decide how it feels.
- 007 · 01-05-26 · On-prem AI · 8 min
I put a 24/7 assistant on a Raspberry Pi
A build log about OpenClaw on a Raspberry Pi 5: Slack as the interface, GPT-5.5 as the model, and the Pi as an always-on agent layer next to the DGX Spark.
- 008 · 01-05-26 · On-prem AI · 8 min
What quantization turned out to be
A practical look back at quantization on the DGX Spark: what BF16, FP8 and NVFP4 do to memory, speed and tail latency, after three rounds with vLLM.