- throughput:
- 243.9 t/s gen · 13809.2 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q8
Benchmark of an FWHT (fast Walsh-Hadamard transform) CUDA implementation for KV-cache quantization. Results show a 1-2% prompt-processing (pp) boost and a 7-9% token-generation (tg) boost on Gemma 4 26B-A4B Q4_K_M with -ctk q8_0 -ctv q8_0. Values are reported for pp2048 and tg128; the highest t/s figures come from the cuda-fwt column.
- throughput:
- 125.3 t/s gen
coding · math
Benchmark of Gemma 4 31B dense with MTP and DFlash speculative decoding; Gemma 4 26B-A4B MoE (25.2B total, 3.8B active) was also tested. 31B dense at concurrency 1: baseline 40.3 tok/s, MTP 125.3 tok/s (3.11x faster), DFlash 122.1 tok/s (3.03x faster). At concurrency 16: baseline 375 tok/s, MTP 953 tok/s, DFlash 725 tok/s. MoE at concurrency 1: baseline 177.1 tok/s, MTP 264.2 tok/s, DFlash 306.4 tok/s. At concurrency 16: baseline 975 tok/s, MTP 1808 tok/s, DFlash 1957 tok/s. Coding, math, STEM, and reasoning tasks benefited more.
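The speedup ratios in this entry can be rechecked from the raw tok/s figures. The sketch below is only an illustration: the `results` layout and the `speedups` helper are hypothetical names, and the numbers are copied from the entry as written.

```python
# Throughput figures (tok/s) copied from the entry above.
# Keys are (model, concurrency); this layout is an illustrative choice.
results = {
    ("31B dense", 1): {"baseline": 40.3, "MTP": 125.3, "DFlash": 122.1},
    ("31B dense", 16): {"baseline": 375.0, "MTP": 953.0, "DFlash": 725.0},
    ("26B-A4B MoE", 1): {"baseline": 177.1, "MTP": 264.2, "DFlash": 306.4},
    ("26B-A4B MoE", 16): {"baseline": 975.0, "MTP": 1808.0, "DFlash": 1957.0},
}


def speedups(row):
    """Return each method's throughput divided by the baseline throughput."""
    base = row["baseline"]
    return {name: tps / base for name, tps in row.items() if name != "baseline"}


for (model, concurrency), row in results.items():
    ratios = ", ".join(f"{name} {r:.2f}x" for name, r in speedups(row).items())
    print(f"{model} @ concurrency {concurrency}: {ratios}")
```

On these figures, MTP leads on the dense model at both concurrency levels, while DFlash leads on the MoE model.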
- throughput:
- 578.0 t/s gen
- quant:
- AWQ-4bit
DFlash speculative decoding with vLLM 0.19.2rc1. Baseline 228 t/s, best with DFlash 578 t/s (2.56x speedup). Draft model: z-lab/gemma-4-26B-A4B-it-DFlash. Input 256 tokens, output 1024 tokens.
- throughput:
- 97.0 t/s gen
coding
Multi-Token Prediction (MTP) implementation yields a 40% speedup (138 t/s with MTP, versus a 97 t/s baseline).
- throughput:
- 7.5 t/s gen
- flash attention:
- on
User reports poor performance with Gemma 4 (7.5 tok/s) and Qwen3.6-27B (locking up), while Qwen3.6-35B-A3B is fast. Suspects a bug with dense models.
- throughput:
- 7.5 t/s gen
- flash attention:
- on
User reports poor performance with dense models (Gemma4-31B ~7.5 t/s, Qwen3.6-27B locking up) on M5 Max 128GB, while Qwen3.6-35B-A3B MoE is fast. Mentions using DFLASH (possibly flash attention; DFlash also names a speculative decoding method).
- throughput:
- 40.0 t/s gen
- quant:
- Q4_K_M (gguf)
text-generation
~35-45 tok/s on RX 7900 XTX. Gemma 4 E4B Q4_K_M via Ollama. Best consumer AMD option. Source: gemma4-ai.com AMD GPU guide
- throughput:
- 58.0 t/s gen · 83.0 t/s pp
- quant:
- FP16 (safetensors)
text-generation
Gemma 4 E4B on RX 7900 XTX via vLLM + ROCm. Default path: 57.96 gen tok/s, 82.96 prompt tok/s. Source: flexinfer.ai
- throughput:
- 7.5 t/s gen
- quant:
- Q4 (gguf)
text-generation
~5-10 tok/s on CPU. E2B is usable CPU-only. Source: gemma4-ai.com hardware guide
- throughput:
- 55.0 t/s gen
- quant:
- Q4_K_M (gguf)
text-generation
~55 tok/s on RTX 4070 12GB. Ada Lovelace efficiency. Source: estimated from compute-market tiers
- throughput:
- 60.0 t/s gen
- quant:
- Q4_K_M (gguf)
text-generation
~60 tok/s on RTX 3060 12GB. E2B runs effortlessly. Source: estimated from compute-market tiers
- throughput:
- 45.0 t/s gen
- quant:
- Q4_K_M (gguf)
text-generation
~45 tok/s on RTX 3060 12GB. E4B fits easily. Source: compute-market.com
- quant:
- NVFP4 (safetensors)
nvidia/Gemma-4-26B-A4B-NVFP4 runs on a 5090 with 80% allocation (of 32GB) and reaches around 50k context. Model size: 18.8GB. Benchmarks provided (baseline vs NVFP4): GPQA Diamond 80.30% vs 79.90%, AIME 2025 88.95% vs 90.00%, MMLU Pro 85.00% vs 84.80%, LiveCodeBench (pass@1) 80.50% vs 79.80%, IFBench 77.77% vs 78.1%, IFEval 96.60% vs 96.40%.
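A rough sketch of the arithmetic behind this entry, assuming the 80% allocation is a cap on the card's full 32GB and ignoring runtime overhead such as activations and framework buffers. The variable names are illustrative; the scores are copied from the entry.

```python
# Memory budget, assuming "80% allocation" caps usage of the full 32GB.
TOTAL_VRAM_GB = 32.0
ALLOCATION_FRACTION = 0.80
MODEL_SIZE_GB = 18.8

budget_gb = TOTAL_VRAM_GB * ALLOCATION_FRACTION   # memory the runtime may use
headroom_gb = budget_gb - MODEL_SIZE_GB           # left for KV cache and overhead

# (baseline %, NVFP4 %) pairs copied from the entry.
scores = {
    "GPQA Diamond": (80.30, 79.90),
    "AIME 2025": (88.95, 90.00),
    "MMLU Pro": (85.00, 84.80),
    "LiveCodeBench (pass@1)": (80.50, 79.80),
    "IFBench": (77.77, 78.1),
    "IFEval": (96.60, 96.40),
}

# Positive delta means the NVFP4 checkpoint scored higher than the baseline.
deltas = {name: round(nvfp4 - base, 2) for name, (base, nvfp4) in scores.items()}

print(f"budget {budget_gb:.1f} GB, headroom after weights {headroom_gb:.1f} GB")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```

Under that assumption, about 6.8GB remains after the weights for the roughly 50k-token context, and every benchmark delta stays within about one percentage point in either direction.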
- throughput:
- 149.6 t/s gen · 15.6 t/s pp
- quant:
- Q4_K_M (gguf)
text-generation
~150 tok/s generation. Star performer. Source: n1n.ai
- throughput:
- 16.9 t/s gen
- quant:
- bf16 (safetensors)
text-generation
bf16 no quantization. 10.25GB VRAM, 61ms TTFT. Source: dev.to Gaurav Vij
- throughput:
- 17.5 t/s gen
- quant:
- Q4 (gguf)
text-generation
15-20 tok/s range, usable for simple tasks. Source: gemma4-ai.com
- throughput:
- 68.8 t/s gen · 204.1 t/s pp
User reports very fast performance on M5 Pro with 48GB RAM. Prompt eval rate: 204.07 t/s, generation rate: 68.76 t/s.