- throughput:
- 243.9 t/s gen · 13809.2 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q8
Benchmark of FWHT CUDA implementation for kv-cache quantization. Results show 1-2% pp boost and 7-9% tg boost on Gemma 4 26B.A4B Q4_K_M with -ctk q8_0 -ctv q8_0. pp2048 and tg128 values reported; highest t/s from cuda-fwt column.
- throughput:
- 3238.0 t/s gen
Benchmark of Qwen3.6-35B-A3B with LDLM diffusion model on RTX 5090 32GB. Throughput 3,238 tok/s at 10 diffusion steps, seq len 64, batch size 1. Also reports ~6,500 tok/s for 4 steps (extrapolated). Untrained weights. Also mentions Qwen3.6-27B at 745 tok/s (10 steps) and ~1,500 tok/s (4 steps).
summarizationtool-use
User is a vet building a dictation/SOAP scribe. Reports inconsistent output from local models (Gemma 4, Qwen 3.6 35B A3B) compared to frontier models. System prompt is a 25-30k token markdown file. Hardware: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox, AnythingLLM + Ollama (llama.cpp).
- throughput:
- 578.0 t/s gen
- quant:
- AWQ-4bit
DFlash speculative decoding with vLLM 0.19.2rc1. Baseline 228 t/s, best with DFlash 578 t/s (2.56x speedup). Draft model: z-lab/gemma-4-26B-A4B-it-DFlash. Input 256 tokens, output 1024 tokens.
- throughput:
- 73.6 t/s gen · 2883.0 t/s pp
- quant:
- NVFP4 (safetensors)
- kv:
- Q8
- flash attention:
- on
coding
MTP enabled with 3 speculative tokens. KV cache fp8_e4m3. Prefix caching tested. Stability pass at 200k: 10/10 runs. Generation speed varies 59-111 tok/s. Mean MTP acceptance length 2.28.
- throughput:
- 215.1 t/s gen
- quant:
- Q4 (gguf)
MTP grafted model; Q4 speed increase only 6% on 5090. Also tested Q8 on 5090+3090: 148.20 t/s without MTP, 152.02 t/s with MTP.
- quant:
- NVFP4 (safetensors)
nvidia/Gemma-4-26B-A4B-NVFP4 works on 5090 with 80% allocation (of 32GB) got around 50k context. Model size 18.8GB. Benchmarks provided: GPQA Diamond 80.30% (baseline) vs 79.90% (NVFP4), AIME 2025 88.95% vs 90.00%, MMLU Pro 85.00% vs 84.80%, LiveCodeBench (pass@1) 80.50% vs 79.80%, IFBench 77.77% vs 78.1%, IFEval 96.60% vs 96.40%.
- throughput:
- 106.5 t/s gen
- quant:
- INT4 (safetensors)
- kv:
- Q8
Qwen3.6-27B-INT4 via vllm 0.19 on 1x RTX 5090. Achieves 105-108 tps generation with 256k context. Uses fp8_e4m3 KV cache, flashinfer attention, MTP speculative decoding (3 tokens). Model from Lorbus quant (AutoRound).
- throughput:
- 80.0 t/s gen
- quant:
- NVFP4 (safetensors)
Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19.1rc1. Uses NVFP4 quantization.
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- rating:
- 5/5
codingtool-use
User reports Qwen 3.6 27B is excellent for pyspark/python and data transformation debugging. Running on ASUS ROG Strix SCAR 18 with RTX 5090 laptop (24GB VRAM) and 64GB DDR5 RAM. Using llama.cpp with IQ4_XS quant at 200k context with Q8_0 KV cache. Initially tried q4_k_m at q4_0. Cancelling cloud subscriptions due to local performance. No tokens/sec reported.