Llama 3.2 1B
RTX Pro 6000 Blackwell · vLLM · 8,192 ctx
FastDMS is an implementation of DMS KV-cache compression. In benchmarks it decodes 1.5-2x faster than vLLM BF16/FP8 while using 5-8x less KV memory. Quality metrics (KL divergence, token match) are comparable to or better than vLLM's FP8/TurboQuant. Tested on Llama-3.2-1B and Qwen3-8B DMS checkpoints. Training took ~20 min on an RTX Pro 6000 Blackwell.
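
To put the 5-8x KV memory figure in concrete terms, here is a rough back-of-the-envelope sketch for one 8,192-token sequence. The model shape (16 layers, 8 KV heads, head dimension 64) is an assumption taken from the public Llama-3.2-1B config, not from the FastDMS benchmarks, and dividing the BF16 footprint by the stated ratio is a simplification that ignores any bookkeeping overhead DMS may add.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem):
    """Dense KV-cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed Llama-3.2-1B shape (from the public model config, not from the source).
LAYERS, KV_HEADS, HEAD_DIM = 16, 8, 64
CTX = 8192  # context length used in the benchmark setup
MIB = 1024 ** 2

bf16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, bytes_per_elem=1)

print(f"BF16 KV cache: {bf16_bytes / MIB:.1f} MiB")
print(f"FP8 KV cache:  {fp8_bytes / MIB:.1f} MiB")

# Apply the reported 5-8x reduction to the BF16 baseline (simplifying assumption).
for ratio in (5, 8):
    print(f"DMS at {ratio}x:     {bf16_bytes / ratio / MIB:.1f} MiB")
```

Under these assumptions the dense BF16 cache is 256 MiB per sequence, FP8 halves that to 128 MiB, and a 5-8x reduction would land between roughly 51 and 32 MiB.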