Gemma 4

Google DeepMind · 18 reports

By engine

Engine	Avg t/s	Range (t/s)	N
Ollama	61.2	18–150	6
llama.cpp	40.5	8–97	3
vLLM	318.0	58–578	2
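A minimal sketch of how a per-engine row like those above could be derived from individual reports. It assumes the vLLM row counts exactly the two reports at 58.0 and 578.0 t/s that appear on this page; the `summarize` helper and the report list are illustrative, not the site's actual aggregation code.

```python
from collections import defaultdict
from statistics import mean

# (engine, generation t/s) pairs. ASSUMPTION: these two vLLM reports are the
# ones the summary row counts; the page does not say which reports it includes.
reports = [
    ("vLLM", 58.0),
    ("vLLM", 578.0),
]

def summarize(reports):
    """Group throughput values by engine and return avg, min, max, and count."""
    by_engine = defaultdict(list)
    for engine, tps in reports:
        by_engine[engine].append(tps)
    return {
        engine: {
            "avg": round(mean(values), 1),
            "min": min(values),
            "max": max(values),
            "n": len(values),
        }
        for engine, values in by_engine.items()
    }

summary = summarize(reports)
print(summary["vLLM"])
```

With only two reports this far apart, the average says little about any single setup, so the range column is the more informative figure.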

Gemma 4 26B (4B active)

RTX 5090 · llama.cpp

throughput:: 243.9 t/s gen · 13809.2 t/s pp
quant:: Q4_K_M (gguf)
kv:: Q8

Benchmark of an FWHT (fast Walsh-Hadamard transform) CUDA implementation for KV-cache quantization. Results show a 1-2% prompt-processing (pp) boost and a 7-9% token-generation (tg) boost on Gemma 4 26B.A4B Q4_K_M with -ctk q8_0 -ctv q8_0. pp2048 and tg128 values reported; the throughput above is the highest t/s from the cuda-fwt column.

Gemma 4 31B

H100 80GB · vLLM · 32,768 ctx

throughput:: 125.3 t/s gen

coding · math

Benchmark of Gemma 4 31B dense with MTP and DFlash speculative decoding; Gemma 4 26B-A4B MoE (25.2B total, 3.8B active) was also tested. On the dense model at concurrency 1, MTP is 3.11x and DFlash 3.03x faster than baseline. Coding, math, STEM, and reasoning prompts benefited more than other categories.

Model	Concurrency	Baseline tok/s	MTP tok/s	DFlash tok/s
31B dense	1	40.3	125.3	122.1
31B dense	16	375	953	725
26B-A4B MoE	1	177.1	264.2	306.4
26B-A4B MoE	16	975	1808	1957
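The speedup factors follow directly from the reported throughputs. The sketch below recomputes them for every configuration in this report; the dictionary layout and the `speedups` helper are illustrative names, and the numbers are copied from the report as written.

```python
# Throughput in tok/s as reported, keyed by (model, concurrency).
results = {
    ("31B dense", 1): {"baseline": 40.3, "MTP": 125.3, "DFlash": 122.1},
    ("31B dense", 16): {"baseline": 375.0, "MTP": 953.0, "DFlash": 725.0},
    ("26B-A4B MoE", 1): {"baseline": 177.1, "MTP": 264.2, "DFlash": 306.4},
    ("26B-A4B MoE", 16): {"baseline": 975.0, "MTP": 1808.0, "DFlash": 1957.0},
}

def speedups(row):
    """Return each method's throughput divided by the baseline throughput."""
    base = row["baseline"]
    return {
        method: round(tps / base, 2)
        for method, tps in row.items()
        if method != "baseline"
    }

speedup_table = {key: speedups(row) for key, row in results.items()}

for (model, concurrency), factors in speedup_table.items():
    print(f"{model} @ concurrency {concurrency}: {factors}")
```

On these figures MTP leads on the dense model at both concurrency levels, while DFlash leads on the MoE, and the MoE gains less from either method (1.49x to 2.01x) than the dense model does (1.93x to 3.11x).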

Gemma 4 26B (4B active) cyankiwi/gemma-4 it-AWQ-4bit

RTX 5090 · vLLM

throughput:: 578.0 t/s gen
quant:: AWQ-4bit

DFlash speculative decoding with vLLM 0.19.2rc1. Baseline 228 t/s, best with DFlash 578 t/s (2.56x speedup). Draft model: z-lab/gemma-4-26B-A4B-it-DFlash. Input 256 tokens, output 1024 tokens.
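To put the throughput figures in wall-clock terms, here is a rough estimate of decode time for the 1024-token output. It assumes the reported t/s are single-stream generation rates and ignores prefill time for the 256 input tokens; neither detail is stated in the report.

```python
OUTPUT_TOKENS = 1024   # output length stated in the report
BASELINE_TPS = 228.0   # reported baseline generation rate
DFLASH_TPS = 578.0     # reported best rate with DFlash

def decode_seconds(tokens, tokens_per_second):
    """Time to emit `tokens` at a constant generation rate (prefill ignored)."""
    return tokens / tokens_per_second

baseline_s = decode_seconds(OUTPUT_TOKENS, BASELINE_TPS)
dflash_s = decode_seconds(OUTPUT_TOKENS, DFLASH_TPS)
ratio = DFLASH_TPS / BASELINE_TPS

print(f"baseline: {baseline_s:.2f} s, DFlash: {dflash_s:.2f} s, ratio: {ratio:.2f}x")
```

The ratio of the rounded figures comes to about 2.54x, slightly below the reported 2.56x, presumably because the report computed its speedup from unrounded throughput values.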

Gemma 4 26B assistant

M5 Max 64GB · llama.cpp

throughput:: 97.0 t/s gen

coding

Multi-Token Prediction (MTP) implementation yields a roughly 40% speedup: 97 t/s without MTP, 138 t/s with MTP.

Gemma 4 31B

M5 Max 128GB

throughput:: 7.5 t/s gen
flash attention:: on

User reports poor performance with Gemma 4 31B (7.5 tok/s) and Qwen3.6-27B (locks up), while Qwen3.6-35B-A3B is fast. Suspects a bug affecting dense models.

Gemma 4 31B

M5 Max 128GB

throughput:: 7.5 t/s gen
flash attention:: on

User reports poor performance with dense models on M5 Max 128GB (Gemma 4 31B at ~7.5 t/s, Qwen3.6-27B locking up), while the Qwen3.6-35B-A3B MoE is fast. Mentions using DFLASH, read here as flash attention, although elsewhere on this page DFlash denotes a speculative decoding method.

Gemma 4 8B E4B Instruct

RX 7900 XTX · Ollama

throughput:: 40.0 t/s gen
quant:: Q4_K_M (gguf)

text-generation

~35-45 tok/s on RX 7900 XTX. Gemma 4 E4B Q4_K_M via Ollama. Best consumer AMD option. Source: gemma4-ai.com AMD GPU guide

Gemma 4 8B E4B Instruct

RX 7900 XTX · vLLM

throughput:: 58.0 t/s gen · 83.0 t/s pp
quant:: FP16 (safetensors)

text-generation

Gemma 4 E4B on RX 7900 XTX via vLLM + ROCm. Default path: 57.96 gen tok/s, 82.96 prompt tok/s. Source: flexinfer.ai

Gemma 4 31B

M5 Max 128GB

throughput:: 7.5 t/s gen
flash attention:: on

User reports poor performance with Gemma 4 31B (7.5 tok/s) and Qwen3.6-27B (locks up) on M5 Max 128GB, while Qwen3.6-35B-A3B is fast. Mentions using DFLASH.

Gemma 4 5.1B E2B Instruct

AMD Threadripper 256GB · llama.cpp

throughput:: 7.5 t/s gen
quant:: Q4 (gguf)

text-generation

~5-10 tok/s on CPU. E2B is usable CPU-only. Source: gemma4-ai.com hardware guide

Gemma 4 8B E4B Instruct

RTX 4070 · Ollama

throughput:: 55.0 t/s gen
quant:: Q4_K_M (gguf)

text-generation

~55 tok/s on RTX 4070 12GB. Ada Lovelace efficiency. Source: estimated from compute-market tiers

Gemma 4 5.1B E2B Instruct

RTX 3060 12GB · Ollama

throughput:: 60.0 t/s gen
quant:: Q4_K_M (gguf)

text-generation

~60 tok/s on RTX 3060 12GB. E2B runs effortlessly. Source: estimated from compute-market tiers

Gemma 4 8B E4B Instruct

RTX 3060 12GB · Ollama

throughput:: 45.0 t/s gen
quant:: Q4_K_M (gguf)

text-generation

~45 tok/s on RTX 3060 12GB. E4B fits easily. Source: compute-market.com

Gemma 4 26B (4B active)

RTX 5090 · 50,000 ctx

quant:: NVFP4 (safetensors)

nvidia/Gemma-4-26B-A4B-NVFP4 runs on an RTX 5090 with an 80% memory allocation (of 32GB), reaching around 50k context. Model size is 18.8GB. Benchmarks provided, baseline vs NVFP4:

Benchmark	Baseline	NVFP4
GPQA Diamond	80.30%	79.90%
AIME 2025	88.95%	90.00%
MMLU Pro	85.00%	84.80%
LiveCodeBench (pass@1)	80.50%	79.80%
IFBench	77.77%	78.1%
IFEval	96.60%	96.40%
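A back-of-the-envelope memory budget for this setup, as a sketch. It assumes the 80% allocation is a fraction of the card's full 32GB, that GB means decimal gigabytes, and that everything left after the weights is available to the KV cache; in practice activations and runtime overhead take a share, so the per-token figure is an upper bound.

```python
TOTAL_VRAM_GB = 32.0      # RTX 5090 memory, as stated in the report
ALLOCATION = 0.80         # fraction of VRAM given to the runtime
MODEL_GB = 18.8           # reported model size
CONTEXT_TOKENS = 50_000   # approximate context reached

# Memory the runtime may use, and what is left once the weights are loaded.
budget_gb = TOTAL_VRAM_GB * ALLOCATION
remaining_gb = budget_gb - MODEL_GB

# ASSUMPTION: decimal units, 1 GB = 1e6 KB. Upper bound on KV cache per token.
per_token_kb = remaining_gb * 1e6 / CONTEXT_TOKENS

print(f"budget: {budget_gb:.1f} GB")
print(f"left after weights: {remaining_gb:.1f} GB")
print(f"at most {per_token_kb:.0f} KB per token of context")
```

Raising the allocation fraction is the most direct way to extend context in this estimate, since each extra GB adds room for several thousand more tokens at the same per-token cost.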