Production AI in practice.
on-prem · ai agents · local ai models
Hi, I'm Django de Vreng, co-founder of Kamoo. I build agents, MCP servers and on-prem deployments on the DGX Spark, and share what I learn along the way.
- Posts published: 8
- Models in the arena: 19
- GPU hours on-prem: 1.4k
- DGX Spark in lab: 128GB
I'm Django.
I build the layer between language models and real work: agents, MCP servers, local models and on-prem AI. Not as a one-off demo, but as software that fetches context, uses tools, prepares decisions and fails cleanly when it has to.
On this blog I share the technical work in progress: benchmarks on the DGX Spark, build logs from agent projects, and field notes on what holds up in production. Most of it comes out of work at Kamoo, but the blog stays personal.
Read my whole story →
- 23-06-26 On-prem AI
Gemma-4 v23 on the DGX Spark
New vLLM v0.23.0 runs for Gemma-4 on the DGX Spark: BF16, NVFP4 and MTP compared across decode, TTFT, tails and practical local-agent limits.
- 22-05-26 On-prem AI
The three numbers behind a fast DGX Spark
Decode, prefill and queueing: three numbers decide whether a DGX Spark feels fast under a real workload, and those three are exactly what most reviews skip.
- 05-05-26 Reflection
Why this blog and arena exist
I looked for concrete numbers on local AI on the DGX Spark and never found them. So I measure them myself, building the blog and arena as an open workbench.
- 03-05-26 On-prem AI
Gemma-4 on the DGX Spark: NVFP4 vs BF16
Nine identical benchmarks, two precisions. NVFP4 runs 22 to 92 percent faster per token, and peak-hour capacity grows 69 percent on the Spark.
- 03-05-26 On-prem AI
Nemotron-3 on the DGX Spark: BF16 vs FP8 vs NVFP4
One model, three precisions, the same Spark. What memory budget, decode speed and tail latency do when you go from 16-bit to 8-bit to 4-bit.