RTX 5090

NVIDIA · 32GB · 10 reports

Gemma 4 26B (4B active)

RTX 5090 · llama.cpp

throughput:: 243.9 t/s gen · 13809.2 t/s pp
quant:: Q4_K_M (gguf)
kv:: Q8

Benchmark of FWHT CUDA implementation for kv-cache quantization. Results show 1-2% pp boost and 7-9% tg boost on Gemma 4 26B.A4B Q4_K_M with -ctk q8_0 -ctv q8_0. pp2048 and tg128 values reported; highest t/s from cuda-fwt column.

Qwen3.6 35B (3B active)

RTX 5090 · 64 ctx

throughput:: 3238.0 t/s gen

Benchmark of Qwen3.6-35B-A3B with LDLM diffusion model on RTX 5090 32GB, using untrained weights. Throughput is 3,238 tok/s at 10 diffusion steps, seq len 64, batch size 1; ~6,500 tok/s is extrapolated for 4 steps. Also mentions Qwen3.6-27B at 745 tok/s (10 steps) and ~1,500 tok/s (4 steps).
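The throughput figures above can be read as per-sequence latency, which is often easier to compare across step counts. The sketch below is only a unit conversion under the stated settings (seq len 64, batch size 1). The helper name `sequence_latency_ms` is hypothetical, and the per-step figure assumes the whole latency is spent in diffusion steps, which the report does not confirm.

```python
def sequence_latency_ms(tokens_per_s: float, seq_len: int, batch_size: int = 1) -> float:
    """Time to produce one batch of sequences, in milliseconds (hypothetical helper)."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return 1000.0 * seq_len * batch_size / tokens_per_s


# Figures taken from the report: 3,238 tok/s at 10 steps, ~6,500 tok/s extrapolated at 4 steps.
lat_10_steps = sequence_latency_ms(3238.0, seq_len=64)
lat_4_steps = sequence_latency_ms(6500.0, seq_len=64)

# Assumption: latency is dominated by the diffusion steps themselves.
per_step_ms = lat_10_steps / 10

print(f"10 steps: {lat_10_steps:.2f} ms per sequence ({per_step_ms:.2f} ms per step)")
print(f"4 steps (extrapolated): {lat_4_steps:.2f} ms per sequence")
```

Cutting the steps from 10 to 4 roughly halves the latency rather than reducing it by 2.5x, which would be consistent with some fixed per-sequence overhead, though the report does not break this down.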

Qwen3.6 35B (3B active)

RTX 5090 · Ollama

summarization · tool-use

User is a vet building a dictation/SOAP scribe. Reports inconsistent output from local models (Gemma 4, Qwen 3.6 35B A3B) compared to frontier models. System prompt is a 25-30k token markdown file. Hardware: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox, AnythingLLM + Ollama (llama.cpp).

Gemma 4 26B (4B active) cyankiwi/gemma-4 it-AWQ-4bit

RTX 5090 · vLLM

throughput:: 578.0 t/s gen
quant:: AWQ-4bit

DFlash speculative decoding with vLLM 0.19.2rc1. Baseline 228 t/s, best with DFlash 578 t/s (2.56x speedup). Draft model: z-lab/gemma-4-26B-A4B-it-DFlash. Input 256 tokens, output 1024 tokens.
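As a quick check of the speedup, the ratio can be recomputed from the two rounded throughput figures. This is a minimal sketch; the function name `speedup` is illustrative, and the time estimate ignores prefill of the 256 input tokens.

```python
def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Ratio of accelerated to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return accelerated_tps / baseline_tps


BASELINE_TPS = 228.0   # reported baseline
DFLASH_TPS = 578.0     # reported best with DFlash
OUTPUT_TOKENS = 1024   # reported output length

ratio = speedup(BASELINE_TPS, DFLASH_TPS)

# Generation time only; prefill time is not included (assumption).
baseline_seconds = OUTPUT_TOKENS / BASELINE_TPS
dflash_seconds = OUTPUT_TOKENS / DFLASH_TPS

print(f"speedup from rounded figures: {ratio:.2f}x")
print(f"1024 tokens: {baseline_seconds:.2f} s baseline vs {dflash_seconds:.2f} s with DFlash")
```

The rounded figures give a ratio slightly below the reported 2.56x, so the reported value presumably comes from unrounded measurements.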

Qwen3.6 27B

RTX 5090 · vLLM · 200,000 ctx

throughput:: 73.6 t/s gen · 2883.0 t/s pp
quant:: NVFP4 (safetensors)
kv:: Q8
flash attention:: on

coding

MTP enabled with 3 speculative tokens. KV cache fp8_e4m3. Prefix caching tested. Stability pass at 200k: 10/10 runs. Generation speed varies 59-111 tok/s. Mean MTP acceptance length 2.28.
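The acceptance figure can be turned into an approximate per-draft acceptance rate. The sketch below assumes that "mean acceptance length" counts the one token the target model always emits per step plus the accepted draft tokens; if the report uses a different definition, the numbers change accordingly.

```python
NUM_SPECULATIVE_TOKENS = 3     # reported MTP setting
MEAN_ACCEPTANCE_LENGTH = 2.28  # reported mean acceptance length

# Assumption: acceptance length = 1 guaranteed token + accepted draft tokens.
accepted_drafts_per_step = MEAN_ACCEPTANCE_LENGTH - 1.0
draft_acceptance_rate = accepted_drafts_per_step / NUM_SPECULATIVE_TOKENS

print(f"accepted draft tokens per step: {accepted_drafts_per_step:.2f}")
print(f"approximate draft acceptance rate: {draft_acceptance_rate:.1%}")
```

Under that assumption, each target-model step yields about 2.28 tokens instead of 1, which caps the ideal gain at roughly 2.28x before the cost of drafting and verification is subtracted.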

Qwen3.6 35B (3B active)

RTX 5090 · llama.cpp

throughput:: 215.1 t/s gen
quant:: Q4 (gguf)

MTP grafted model; Q4 speed increase only 6% on 5090. Also tested Q8 on 5090+3090: 148.20 t/s without MTP, 152.02 t/s with MTP.

Gemma 4 26B (4B active)

RTX 5090 · 50,000 ctx

quant:: NVFP4 (safetensors)

nvidia/Gemma-4-26B-A4B-NVFP4 works on the 5090: with 80% of the 32GB allocated, it reaches around 50k context. Model size is 18.8GB. Benchmarks provided (baseline vs NVFP4): GPQA Diamond 80.30% vs 79.90%, AIME 2025 88.95% vs 90.00%, MMLU Pro 85.00% vs 84.80%, LiveCodeBench (pass@1) 80.50% vs 79.80%, IFBench 77.77% vs 78.1%, IFEval 96.60% vs 96.40%.
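The memory budget and the quantization deltas follow from the reported numbers by simple arithmetic. This sketch treats "GB" uniformly and assumes the model weights sit entirely inside the 80% allocation; how the remainder splits between KV cache, activations, and runtime overhead is not stated in the report.

```python
TOTAL_VRAM_GB = 32.0
ALLOCATION_FRACTION = 0.80  # reported allocation
MODEL_SIZE_GB = 18.8        # reported model size

budget_gb = TOTAL_VRAM_GB * ALLOCATION_FRACTION
# Remainder for KV cache, activations, and overhead (the split is an assumption).
headroom_gb = budget_gb - MODEL_SIZE_GB

# (baseline, NVFP4) scores in percent, copied from the report.
scores = {
    "GPQA Diamond": (80.30, 79.90),
    "AIME 2025": (88.95, 90.00),
    "MMLU Pro": (85.00, 84.80),
    "LiveCodeBench (pass@1)": (80.50, 79.80),
    "IFBench": (77.77, 78.1),
    "IFEval": (96.60, 96.40),
}

# Positive delta means NVFP4 scored higher than the baseline.
deltas = {name: round(nvfp4 - base, 2) for name, (base, nvfp4) in scores.items()}

print(f"budget: {budget_gb:.1f} GB, headroom after weights: {headroom_gb:.1f} GB")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```

Every delta stays within about one percentage point in either direction, so the reported scores show no consistent loss from NVFP4 on these benchmarks.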

Qwen3.6 27B

RTX 5090 · vLLM · 262,144 ctx

throughput:: 106.5 t/s gen
quant:: INT4 (safetensors)
kv:: Q8

Qwen3.6-27B-INT4 via vllm 0.19 on 1x RTX 5090. Achieves 105-108 tps generation with 256k context. Uses fp8_e4m3 KV cache, flashinfer attention, MTP speculative decoding (3 tokens). Model from Lorbus quant (AutoRound).

Qwen3.6 27B

RTX 5090 · vLLM · 218,000 ctx

throughput:: 80.0 t/s gen
quant:: NVFP4 (safetensors)

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19.1rc1. Uses NVFP4 quantization.

Qwen3.6 27B

RTX 5090 · llama.cpp · 200,000 ctx

quant:: IQ4_XS (gguf)
kv:: Q8
rating:: 5/5

coding · tool-use

User reports Qwen 3.6 27B is excellent for pyspark/python and data transformation debugging. Running on ASUS ROG Strix SCAR 18 with RTX 5090 laptop (24GB VRAM) and 64GB DDR5 RAM. Using llama.cpp with IQ4_XS quant at 200k context with Q8_0 KV cache. Initially tried q4_k_m at q4_0. Cancelling cloud subscriptions due to local performance. No tokens/sec reported.