llamaperf

RTX 5090

NVIDIA · 32GB · 10 reports

throughput:
243.9 t/s gen · 13809.2 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q8

Benchmark of an FWHT CUDA implementation for KV-cache quantization. Results show a 1-2% pp boost and a 7-9% tg boost on Gemma 4 26B-A4B Q4_K_M with -ctk q8_0 -ctv q8_0. Reported values are pp2048 and tg128; the throughput shown is the highest t/s from the cuda-fwt column.
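For context, FWHT is the fast Walsh-Hadamard transform. The report does not include the CUDA kernel, so the following is only a minimal pure-Python sketch of the unnormalized transform, not the benchmarked implementation. The function name `fwht` is illustrative.

```python
def fwht(values):
    """Return the unnormalized fast Walsh-Hadamard transform of `values`.

    The input length must be a power of two. This is a reference sketch,
    not the CUDA kernel from the report.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 0, 1, 0, 0, 1, 1, 0]
    y = fwht(x)
    # Applying the transform twice returns the input scaled by its length.
    assert fwht(y) == [len(x) * v for v in x]
    print(y)
```

The transform costs O(n log n) additions and subtractions and needs no multiplications, which is the usual reason it is cheap enough to apply before quantizing a cache.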

throughput:
3238.0 t/s gen

Benchmark of Qwen3.6-35B-A3B with an LDLM diffusion model on an RTX 5090 32GB, using untrained weights. Throughput is 3,238 tok/s at 10 diffusion steps, sequence length 64, batch size 1. The report also gives an extrapolated ~6,500 tok/s at 4 steps, and mentions Qwen3.6-27B at 745 tok/s (10 steps) and ~1,500 tok/s (4 steps).

Tone: mixed
summarization · tool-use

User is a vet building a dictation/SOAP scribe and reports inconsistent output from local models (Gemma 4, Qwen 3.6 35B A3B) compared to frontier models. The system prompt is a 25-30k token markdown file. Hardware and stack: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox, AnythingLLM + Ollama (llama.cpp).

Qwen3.6 27B

RTX 5090 · vLLM · 200,000 ctx

Tone: positive
throughput:
73.6 t/s gen · 2883.0 t/s pp
quant:
NVFP4 (safetensors)
kv:
Q8
flash attention:
on
coding

MTP enabled with 3 speculative tokens. KV cache fp8_e4m3. Prefix caching tested. Stability pass at 200k: 10/10 runs. Generation speed varies 59-111 tok/s. Mean MTP acceptance length 2.28.
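The mean acceptance length can be related to the 3 speculative tokens with a little arithmetic. The sketch below assumes the metric counts one always-emitted token plus the accepted draft tokens per decoding step; the report does not state the definition, and the function name `draft_acceptance` is hypothetical.

```python
def draft_acceptance(mean_acceptance_length, num_speculative_tokens):
    """Split a mean acceptance length into accepted drafts and an acceptance rate.

    Assumption: acceptance length = 1 verified token + accepted draft tokens.
    """
    accepted_per_step = mean_acceptance_length - 1.0
    rate = accepted_per_step / num_speculative_tokens
    return accepted_per_step, rate


# Figures from the report: mean acceptance length 2.28 with 3 speculative tokens.
accepted, rate = draft_acceptance(2.28, 3)
print(f"accepted draft tokens per step: {accepted:.2f}")
print(f"draft acceptance rate: {rate:.1%}")
```

Under that assumption, a little under half of the drafted tokens are accepted on average, which would be consistent with generation speed varying across runs.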

Tone: mixed
throughput:
215.1 t/s gen
quant:
Q4 (gguf)

MTP-grafted model; at Q4 the speed increase is only 6% on the 5090. Also tested Q8 on 5090+3090: 148.20 t/s without MTP, 152.02 t/s with MTP.
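The Q8 figures can be turned into a relative gain for comparison with the 6% quoted for Q4. This is a small sketch using only the two numbers from the report; the helper name `speedup_percent` is illustrative.

```python
def speedup_percent(baseline_tps, new_tps):
    """Return the relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


# Q8 on 5090+3090, figures from the report.
q8_gain = speedup_percent(148.20, 152.02)
print(f"Q8 MTP gain: {q8_gain:.1f}%")
```

By this calculation the Q8 gain from MTP on the two-GPU setup is smaller than the 6% reported for Q4 on the single 5090.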

Tone: positive
quant:
NVFP4 (safetensors)

nvidia/Gemma-4-26B-A4B-NVFP4 works on the 5090: with 80% allocation (of 32GB), the user got around 50k context. Model size is 18.8GB. Benchmarks provided (baseline vs NVFP4): GPQA Diamond 80.30% vs 79.90%, AIME 2025 88.95% vs 90.00%, MMLU Pro 85.00% vs 84.80%, LiveCodeBench (pass@1) 80.50% vs 79.80%, IFBench 77.77% vs 78.1%, IFEval 96.60% vs 96.40%.
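The memory budget and the benchmark differences in this report reduce to simple arithmetic, sketched below. It treats the reported sizes as the same unit (GB) and ignores runtime overhead, both of which are simplifying assumptions; the variable names are illustrative.

```python
TOTAL_VRAM_GB = 32.0      # card capacity from the report
ALLOCATION = 0.80         # fraction given to the server
MODEL_SIZE_GB = 18.8      # reported model size

budget_gb = TOTAL_VRAM_GB * ALLOCATION
headroom_gb = budget_gb - MODEL_SIZE_GB  # left for KV cache, activations, overhead

# (baseline %, NVFP4 %) pairs as listed in the report.
scores = {
    "GPQA Diamond": (80.30, 79.90),
    "AIME 2025": (88.95, 90.00),
    "MMLU Pro": (85.00, 84.80),
    "LiveCodeBench (pass@1)": (80.50, 79.80),
    "IFBench": (77.77, 78.1),
    "IFEval": (96.60, 96.40),
}
deltas = {name: round(nvfp4 - base, 2) for name, (base, nvfp4) in scores.items()}

print(f"budget {budget_gb:.1f} GB, headroom {headroom_gb:.1f} GB")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```

Under these assumptions, the space left after loading the weights is what has to hold the roughly 50k-token context, and every benchmark difference stays within about one percentage point in either direction.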

Qwen3.6 27B

RTX 5090 · vLLM · 262,144 ctx

Tone: positive
throughput:
106.5 t/s gen
quant:
INT4 (safetensors)
kv:
Q8

Qwen3.6-27B-INT4 via vLLM 0.19 on 1x RTX 5090. Achieves 105-108 t/s generation with 256k context. Uses fp8_e4m3 KV cache, flashinfer attention, and MTP speculative decoding (3 tokens). Model is the Lorbus quant (AutoRound).

Qwen3.6 27B

RTX 5090 · vLLM · 218,000 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
NVFP4 (safetensors)

Qwen3.6-27B at ~80 t/s with a 218k context window on 1x RTX 5090, served by vLLM 0.19.1rc1. Uses NVFP4 quantization.

Qwen3.6 27B

RTX 5090 · llama.cpp · 200,000 ctx

Tone: positive
quant:
IQ4_XS (gguf)
kv:
Q8
rating:
5/5
coding · tool-use

User reports Qwen 3.6 27B is excellent for pyspark/python and data transformation debugging. Running on an ASUS ROG Strix SCAR 18 with an RTX 5090 laptop GPU (24GB VRAM) and 64GB DDR5 RAM. Using llama.cpp with the IQ4_XS quant at 200k context with Q8_0 KV cache. Initially tried q4_k_m at q4_0. The user is cancelling cloud subscriptions because of the local performance. No tokens/sec reported.