llamaperf

RTX 3090

NVIDIA · 24GB · 7 reports

Qwen3.6 27B

RTX 3090 · BeeLlama · 128,000 ctx

quant:
Q5_K_S (gguf)
kv:
Q8
long-context

Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

Qwen3.6 27B

RTX 3090 · ik_llama.cpp · 156,000 ctx

Tone: positive
throughput:
72.9 t/s gen · 1261.0 t/s pp
quant:
IQ4_KS (gguf)
kv:
Q8
flash attention:
on
mtp (multi-token prediction):
on
coding

Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.
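
To put those throughput figures in wall-clock terms, here is a minimal sketch that converts them to seconds. It assumes both rates stay constant across the whole context, which is an idealization since the report does not say how speed changes with depth. The 1,000-token reply length is a hypothetical workload, not something from the report.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate (an idealization)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Figures taken from the report above.
PREFILL_TPS = 1261.0  # prompt processing (pp)
DECODE_TPS = 72.9     # generation (gen)
CONTEXT = 156_000     # configured context, in tokens

# Hypothetical reply length, chosen only for illustration.
REPLY_TOKENS = 1_000

prefill_s = seconds_for(CONTEXT, PREFILL_TPS)
decode_s = seconds_for(REPLY_TOKENS, DECODE_TPS)

print(f"prefill of {CONTEXT} tokens: {prefill_s:.0f} s")
print(f"decode of {REPLY_TOKENS} tokens: {decode_s:.1f} s")
```

Under these assumptions, filling the full 156k context takes roughly two minutes of prefill before the first generated token.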

Tone: positive
throughput:
29.0 t/s gen · 239.0 t/s pp

Benchmark of 4x RTX 3090 with Qwen3.6-27B FP16 using vLLM TP=4. Power limit sweep from 200W to unrestricted. Peak efficiency at 220W. User expresses satisfaction with Qwen 3.6 27B as daily driver.

Qwen3.6 27B

RTX 3090 · llama.cpp · 100,000 ctx

Tone: positive
throughput:
50.0 t/s gen
quant:
Q4_K_M (gguf)
kv:
Q4
flash attention:
on
coding

Uses MTP (Multi-Token Prediction) GGUF with am17an commit for faster inference. KV cache quantized to Q4_0. Speculative draft set to 2. At 100k context, VRAM usage ~19GB. Performance degrades above 90k context.

Qwen3.6 27B

RTX 3090 · vLLM · 218,000 ctx

Tone: positive
throughput:
66.0 t/s gen
tool-use
coding

~218K context at ~50 TPS (narrative) / ~66 TPS (code), text only. Tool calls with ~25K-token outputs now complete without OOM after fixing the Genesis patch (PN12). Lower TPS than the earlier config, but higher context and better stability.

Qwen2.5 32B Coder

RTX 3090 · llama.cpp · 32,768 ctx

Tone: mixed
throughput:
28.0 t/s gen · 450.0 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q8
coding

Solid for autocomplete, occasionally hallucinates imports in multi-file refactors. Build b4400.

Qwen3.6 27B

RTX 3090 · Luce DFlash (custom C++/CUDA stack on ggml) · 256,000 ctx

Tone: positive
throughput:
913.0 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q4
flash attention:
on
coding
math

Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.
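
The report mentions a fixed-slot ring buffer combined with a sliding attention window. As a generic illustration of that idea only, here is a minimal sketch of a ring buffer in which token position p is stored in slot p % capacity, so memory stays bounded while positions keep growing. This is not Luce DFlash's code: the class name `RingKV` and its methods are hypothetical, and the report does not describe how the real stack lays out or addresses its slots.

```python
class RingKV:
    """Fixed-capacity ring buffer: position p lives in slot p % capacity."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next_pos = 0  # absolute position of the next entry

    def append(self, entry) -> None:
        """Store an entry, overwriting the oldest one once the ring is full."""
        self.slots[self.next_pos % self.capacity] = entry
        self.next_pos += 1

    def window(self, size: int) -> list:
        """Return the entries for the last `size` positions, oldest first."""
        size = max(0, min(size, self.capacity, self.next_pos))
        start = self.next_pos - size
        return [self.slots[p % self.capacity] for p in range(start, self.next_pos)]


# Slot count and window size as quoted in the report; entries are placeholders.
cache = RingKV(capacity=4096)
for pos in range(10_000):
    cache.append(pos)

recent = cache.window(2048)
print(len(recent), recent[0], recent[-1])
```

The design point is that a decode step reading only the most recent window never needs more than `capacity` live slots, regardless of how many tokens have been processed.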