llamaperf

Gemma 4

Google DeepMind · 18 reports

By engine

Engine      Avg t/s   Range (t/s)   N
Ollama      61.2      18–150        6
llama.cpp   40.5      8–97          3
vLLM        318.0     58–578        2
throughput:
243.9 t/s gen · 13809.2 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q8

Benchmark of an FWHT (fast Walsh-Hadamard transform) CUDA implementation for KV-cache quantization. Results show a 1-2% prompt-processing (pp) boost and a 7-9% token-generation (tg) boost on Gemma 4 26B-A4B Q4_K_M with -ctk q8_0 -ctv q8_0. Reported values are pp2048 and tg128; the highest t/s is taken from the cuda-fwt column.
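For readers unfamiliar with the transform named in this report, the following is a minimal pure-Python sketch of an unnormalized fast Walsh-Hadamard transform. It is not the benchmarked CUDA kernel, and the report does not describe how that kernel applies the transform to the KV cache. The function name `fwht` is illustrative only.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (illustrative sketch).

    The input length must be a power of two. Runs in O(n log n) using
    in-place butterfly steps on a copy of the input.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs of elements that are h apart within blocks of 2*h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 0, 1, 0, 0, 1, 1, 0]
    print(fwht(x))
```

Applying the unnormalized transform twice returns the input scaled by its length, so the same routine serves as its own inverse up to a factor of n.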

Gemma 4 31B

H100 80GB · vLLM · 32,768 ctx

throughput:
125.3 t/s gen
coding · math

Benchmark of Gemma 4 31B (dense) with MTP and DFlash speculative decoding; the Gemma 4 26B-A4B MoE (25.2B total, 3.8B active parameters) was also tested. 31B dense at concurrency 1: baseline 40.3 tok/s, MTP 125.3 tok/s (3.11x faster), DFlash 122.1 tok/s (3.03x faster). At concurrency 16: baseline 375 tok/s, MTP 953 tok/s, DFlash 725 tok/s. 26B-A4B MoE at concurrency 1: baseline 177.1 tok/s, MTP 264.2 tok/s, DFlash 306.4 tok/s. At concurrency 16: baseline 975 tok/s, MTP 1808 tok/s, DFlash 1957 tok/s. Coding, math, STEM, and reasoning tasks benefited more.
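The speedup factors in this report are throughput ratios against the baseline. The sketch below recomputes them from the reported tok/s figures so the remaining configurations can be compared the same way. The dictionary layout and the helper name `speedups` are illustrative, not part of the original report.

```python
# Throughput figures (tok/s) copied from the report above.
# Keys are (model, concurrency); the layout is an illustrative choice.
REPORTED = {
    ("31B dense", 1): {"baseline": 40.3, "MTP": 125.3, "DFlash": 122.1},
    ("31B dense", 16): {"baseline": 375.0, "MTP": 953.0, "DFlash": 725.0},
    ("26B-A4B MoE", 1): {"baseline": 177.1, "MTP": 264.2, "DFlash": 306.4},
    ("26B-A4B MoE", 16): {"baseline": 975.0, "MTP": 1808.0, "DFlash": 1957.0},
}


def speedups(rates):
    """Return each method's throughput divided by the baseline throughput."""
    base = rates["baseline"]
    return {name: rate / base for name, rate in rates.items() if name != "baseline"}


if __name__ == "__main__":
    for (model, concurrency), rates in REPORTED.items():
        ratios = ", ".join(f"{k} {v:.2f}x" for k, v in speedups(rates).items())
        print(f"{model} @ concurrency {concurrency}: {ratios}")
```

For the dense model at concurrency 1 this reproduces the 3.11x (MTP) and 3.03x (DFlash) figures. For the MoE at concurrency 1 the same arithmetic gives roughly 1.49x for MTP and 1.73x for DFlash.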

Tone: negative
throughput:
7.5 t/s gen
flash attention:
on

User reports poor performance with Gemma 4 (7.5 tok/s) and Qwen3.6-27B (locking up), while Qwen3.6-35B-A3B is fast. Suspects a bug with dense models.

Tone: mixed
throughput:
7.5 t/s gen
flash attention:
on

User reports poor performance with dense models (Gemma4-31B ~7.5 t/s, Qwen3.6-27B locking up) on M5 Max 128GB, while Qwen3.6-35B-A3B MoE is fast. Mentions using DFLASH (likely flash attention).

Tone: positive
throughput:
40.0 t/s gen
quant:
Q4_K_M (gguf)
text-generation

~35-45 tok/s on RX 7900 XTX. Gemma 4 E4B Q4_K_M via Ollama. Best consumer AMD option. Source: gemma4-ai.com AMD GPU guide

throughput:
58.0 t/s gen · 83.0 t/s pp
quant:
FP16 (safetensors)
text-generation

Gemma 4 E4B on RX 7900 XTX via vLLM + ROCm. Default path: 57.96 gen tok/s, 82.96 prompt tok/s. Source: flexinfer.ai

Tone: negative
throughput:
7.5 t/s gen
flash attention:
on

User reports poor performance with Gemma4-31B (7.5 tok/s) and Qwen3.6-27B (locking up) on M5 Max 128GB, while Qwen3.6-35B-A3B is fast. Mentions using DFLASH.

Tone: positive
throughput:
55.0 t/s gen
quant:
Q4_K_M (gguf)
text-generation

~55 tok/s on RTX 4070 12GB. Ada Lovelace efficiency. Source: estimated from compute-market tiers

Tone: positive
quant:
NVFP4 (safetensors)

nvidia/Gemma-4-26B-A4B-NVFP4 works on a 5090 with 80% memory allocation (of 32GB), giving around 50k context. Model size: 18.8GB. Benchmarks provided (baseline vs NVFP4): GPQA Diamond 80.30% vs 79.90%, AIME 2025 88.95% vs 90.00%, MMLU Pro 85.00% vs 84.80%, LiveCodeBench (pass@1) 80.50% vs 79.80%, IFBench 77.77% vs 78.1%, IFEval 96.60% vs 96.40%.
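A rough sketch of the arithmetic behind this report: how much of the 80% allocation remains once the model weights are loaded, and how far each NVFP4 score moves from the baseline. It assumes the remainder is what is available for KV cache, activations, and runtime overhead, and it treats all GB figures as the same unit. Neither assumption is stated in the report, and the helper names are illustrative.

```python
GPU_MEMORY_GB = 32.0        # 5090 memory, as stated in the report
ALLOCATION_FRACTION = 0.80  # "80% allocation" from the report
MODEL_SIZE_GB = 18.8        # reported NVFP4 model size

# (baseline %, NVFP4 %) pairs copied from the report.
SCORES = {
    "GPQA Diamond": (80.30, 79.90),
    "AIME 2025": (88.95, 90.00),
    "MMLU Pro": (85.00, 84.80),
    "LiveCodeBench (pass@1)": (80.50, 79.80),
    "IFBench": (77.77, 78.1),
    "IFEval": (96.60, 96.40),
}


def headroom_gb(total_gb, fraction, model_gb):
    """Memory left in the allocation after the weights (assumed budget for KV cache etc.)."""
    return total_gb * fraction - model_gb


def deltas(scores):
    """NVFP4 score minus baseline score, in percentage points."""
    return {name: round(nvfp4 - base, 2) for name, (base, nvfp4) in scores.items()}


if __name__ == "__main__":
    print(f"headroom: {headroom_gb(GPU_MEMORY_GB, ALLOCATION_FRACTION, MODEL_SIZE_GB):.1f} GB")
    for name, delta in deltas(SCORES).items():
        print(f"{name}: {delta:+.2f} pts")
```

Under these assumptions about 6.8GB remains after the weights, and every reported benchmark delta stays within about one percentage point of the baseline in either direction.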

Gemma 4 26B

M5 Pro (18-core CPU, 20-core GPU) · Ollama

Tone: positive
throughput:
68.8 t/s gen · 204.1 t/s pp

User reports very fast performance on M5 Pro with 48GB RAM. Prompt eval rate: 204.07 t/s, generation rate: 68.76 t/s.
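To put the two rates in context, a request's time can be approximated as prompt tokens divided by the prompt eval rate plus output tokens divided by the generation rate. The sketch below uses the rates from this report with a hypothetical 2048-token prompt and 128-token reply. The workload sizes and the function name are assumptions, and the estimate ignores model load time, warm-up, and any slowdown at longer contexts.

```python
PROMPT_EVAL_RATE = 204.07  # t/s, from the report
GENERATION_RATE = 68.76    # t/s, from the report


def estimate_seconds(prompt_tokens, output_tokens, pp_rate, gen_rate):
    """Approximate request time as prompt-processing time plus generation time."""
    if pp_rate <= 0 or gen_rate <= 0:
        raise ValueError("rates must be positive")
    return prompt_tokens / pp_rate + output_tokens / gen_rate


if __name__ == "__main__":
    # Hypothetical workload: 2048 prompt tokens, 128 generated tokens.
    seconds = estimate_seconds(2048, 128, PROMPT_EVAL_RATE, GENERATION_RATE)
    print(f"estimated time: {seconds:.1f} s")
```

With these inputs, prompt processing dominates the estimate, which is why the prompt eval rate matters as much as the generation rate for long-context use.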