llamaperf

Qwen3.6

Alibaba · 46 reports

Qwen3.6 27B

RTX 3060 12GB · llama.cpp · 64,000 ctx

Tone: positive
throughput:
43.3 t/s gen · 456.1 t/s pp
quant:
Q4_K_S (gguf)

Dual RTX 3060 setup with tensor parallel. MTP enabled. Context 64k. Prefill 456 t/s, generation 43.26 t/s at 12k context. Without MTP, context 96k, generation 31 t/s. User praises value and stability of CUDA.

throughput:
3500.0 t/s gen · 30000.0 t/s pp

Two benchmarks: Qwen3.6 27B BF16 and Qwen3.6 35B BF16. For 35B, best gen tps 3500 at 128 concurrency with MTP off, prompt tps 30000. Also tested 27B with MTP on/off.

Qwen3.6 27B

RX 9070 · llama.cpp · 131,072 ctx

Tone: positive
throughput:
46.9 t/s gen · 398.4 t/s pp
quant:
UD-Q5_K_XL (gguf)
flash attention:
on
mtp (multi-token prediction):
on
codingagentic

User runs two RX 9070 XTs with ROCm, uses MTP (spec-type = draft-mtp, spec-draft-n-max = 2). Prompt t/s varies; generation t/s around 45-52. Draft acceptance rate ~0.8-0.99. User praises speed, smarts, steerability for agentic coding tasks. Quant is UD-Q5_K_XL (unsloth GGUF).

Qwen3.6 35B (3B active)

RTX 4070 Ti Super · ik_llama.cpp · 131,072 ctx

Tone: positive
throughput:
110.2 t/s gen
quant:
IQ4_XS-4.19bpw (gguf)
kv:
Q8
mtp (multi-token prediction):
on
codingsummarizationmath

Benchmark comparing llama.cpp (89.76 t/s) vs ik_llama.cpp (110.24 t/s) with MTP on Qwen3.6-35B-A3B IQ4_XS quant. 23% speed increase. CPU: Ryzen 7 9700X, OS: CachyOS. GPU used as secondary with iGPU for display.

Qwen3.6 35B (3B active)

RTX 5080 · llama.cpp · 131,072 ctx

throughput:
56.0 t/s gen · 1584.0 t/s pp
quant:
Q4_K_XL (gguf)
kv:
Q8
flash attention:
on
mtp (multi-token prediction):
off
codingagentic

Best config for 35B Q4_K_XL at 128k context: no MTP, --fit-target 1536. MTP doesn't help at 128k. 27B IQ3 fits fully on GPU and benefits from MTP (73 tok/s).

Qwen3.6 27B

M2 Max 96GB · llama.cpp · 256,000 ctx

Tone: positive
throughput:
8.0 t/s gen
quant:
F16 (gguf)
mtp (multi-token prediction):
on
codingagentic

User benchmarked Qwen 3.6 27b F16 on M2 Max 96GB using llama.cpp with MTP speculative decoding. Generation speed varied 8-18 tok/s depending on task; without MTP got 6.6 tok/s. Used for agentic coding to create a Pacman game. Also tested Q8 quant but results were worse. Context up to 150k+ tokens usable. Chat template fixes were critical.

Qwen3.6 27B

RTX 3090 · BeeLlama · 128,000 ctx

quant:
Q5_K_S (gguf)
kv:
Q8
long-context

Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

Qwen3.6 27B

RTX 3090 · ik_llama.cpp · 156,000 ctx

Tone: positive
throughput:
72.9 t/s gen · 1261.0 t/s pp
quant:
IQ4_KS (gguf)
kv:
Q8
flash attention:
on
mtp (multi-token prediction):
on
coding

Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.

Qwen3.6 27B

AMD MI50 32GB · llama.cpp

quant:
Q8_0 (gguf)
kv:
Q8
mtp (multi-token prediction):
on

Benchmark comparing MTP KV cache quantization (Q8_0) vs no quantization on Qwen3.6-27B-Q8_0 with llama.cpp. Results show negligible wall time difference (~0.14s) with quantized draft KV cache. Also tested with tensor parallelism on 2xMI50 32GB. Aggregate accept rate ~0.735-0.741.

throughput:
21.2 t/s gen
quant:
Q4_K_M (gguf)
mtp (multi-token prediction):
on

MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).

Qwen3.6 27B

AMD Strix Halo 128GB · llama.cpp · 128,000 ctx

Tone: mixed
quant:
Q8_0 (gguf)
flash attention:
on
long-context

Benchmark compares MTP vs non-MTP for 27B and 35B-A3B models. 27B-MTP shows significant speedup in generation and overall wall time for long-context chat; 35B-MTP shows mixed results with faster generation but slower end-to-end due to prefill overhead.

throughput:
21.1 t/s gen
quant:
Q4_K_M (gguf)
coding

Strix Halo 128GB, ROCm backend, Q4_K_M quant, chat workload. Also tested RTX 3090 and RTX 5070. Multiple models and quants reported.

Tone: positive
throughput:
29.0 t/s gen · 239.0 t/s pp

Benchmark of 4x RTX 3090 with Qwen3.6-27B FP16 using vLLM TP=4. Power limit sweep from 200W to unrestricted. Peak efficiency at 220W. User expresses satisfaction with Qwen 3.6 27B as daily driver.

throughput:
3238.0 t/s gen

Benchmark of Qwen3.6-35B-A3B with LDLM diffusion model on RTX 5090 32GB. Throughput 3,238 tok/s at 10 diffusion steps, seq len 64, batch size 1. Also reports ~6,500 tok/s for 4 steps (extrapolated). Untrained weights. Also mentions Qwen3.6-27B at 745 tok/s (10 steps) and ~1,500 tok/s (4 steps).

Qwen3.6 27B unsloth

RTX 3060 12GB · llama.cpp · 32,000 ctx

Tone: positive
throughput:
70.0 t/s gen · 780.0 t/s pp
quant:
Q2-XS (gguf)
kv:
Q8
flash attention:
off
rating:
4/5
codingcreative-writingtool-usesummarizationvisionagenticmultilingual

defnitly want to try an higher qwant. I vé took that one beacause gguf sise 11,9 go, ans barely offload this dense modèle to cpu

Tone: mixed
summarizationtool-use

User is a vet building a dictation/SOAP scribe. Reports inconsistent output from local models (Gemma 4, Qwen 3.6 35B A3B) compared to frontier models. System prompt is a 25-30k token markdown file. Hardware: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox, AnythingLLM + Ollama (llama.cpp).

Qwen3.6 27B Heretic-v2

RTX 4090 · llama.cpp · 262,000 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
Q4_K_M (gguf)
kv:
Q4

MTP draft acceptance ~73%, TBQ4_0 KV cache, MTP draft 3. Fork: https://github.com/Indras-Mirror/llama.cpp-mtp

Qwen3.6 35B (3B active)

RTX 3060 12GB · llama.cpp · 32,768 ctx

Tone: positive
throughput:
46.8 t/s gen · 914.0 t/s pp
quant:
IQ4_XS (gguf)
kv:
Q8
flash attention:
on
coding

Best plain llama-bench: pp512 ~914 t/s, tg128 ~46.8 t/s. Practical coding profile: 32k context, ~43.4 t/s generation. MTP gave ~47.7 t/s (2% improvement).

Qwen3.6 27B

M2 Max 96GB · llama.cpp · 262,144 ctx

Tone: positive
throughput:
28.0 t/s gen
quant:
Q5_K_M (gguf)
kv:
Q4
codingagentic

MTP speculative decoding gives 2.5x speedup. Tested on M2 Max 96GB with Q5_K_M quant and q4_0 KV cache. Also provides hardware recommendations for various Apple Silicon and NVIDIA GPUs.

Qwen3.6 27B

RTX 5090 · vLLM · 200,000 ctx

Tone: positive
throughput:
73.6 t/s gen · 2883.0 t/s pp
quant:
NVFP4 (safetensors)
kv:
Q8
flash attention:
on
coding

MTP enabled with 3 speculative tokens. KV cache fp8_e4m3. Prefix caching tested. Stability pass at 200k: 10/10 runs. Generation speed varies 59-111 tok/s. Mean MTP acceptance length 2.28.

Tone: mixed
throughput:
215.1 t/s gen
quant:
Q4 (gguf)

MTP grafted model; Q4 speed increase only 6% on 5090. Also tested Q8 on 5090+3090: 148.20 t/s without MTP, 152.02 t/s with MTP.

Qwen3.6 27B

RTX 3090 · llama.cpp · 100,000 ctx

Tone: positive
throughput:
50.0 t/s gen
quant:
Q4_K_M (gguf)
kv:
Q4
flash attention:
on
coding

Uses MTP (Multi-Token Prediction) GGUF with am17an commit for faster inference. KV cache quantized to Q4_0. Speculative draft set to 2. At 100k context, VRAM usage ~19GB. Performance degrades above 90k context.

Qwen3.6 27B

V100 32GB · llama.cpp · 200,000 ctx

Tone: positive
throughput:
54.5 t/s gen
kv:
Q8
codingtool-use

MTP branch of llama.cpp by am17an. 29-30 t/s without MTP, 54-55 t/s with MTP at 150W power limit. Falls to 40-45 t/s after 50k tokens. Used as vscode copilot.

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 75,000 ctx

Tone: positive
throughput:
22.0 t/s gen · 760.0 t/s pp
quant:
IQ4_XS (gguf)
kv:
Q8
flash attention:
on

User tested Qwen3.6 27B IQ4_XS on RTX 5060 Ti 16GB with llama.cpp (TheTom's TurboQuant fork). Prompt processing 760 t/s, generation 22 t/s. Context window limited to 75k. KV cache quant turbo4/turbo2. Also tested BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS, Q3_K_XL, Q3_K_M, Q2_K_XL on L40S or RTX 5060 Ti. Quality comparison using chess board SVG generation task. Recommends IQ4_XS as minimum.

Tone: positive
throughput:
63.0 t/s gen
quant:
4-bit (mlx)
codingcreative-writing

MTPLX engine achieves 63 tok/s on Qwen3.6-27B 4-bit MLX on M5 Max 64GB, up from 28 tok/s baseline. Uses native MTP heads with temperature 0.6, top_p 0.95, top_k 20. Optimal depth D3. Custom patched MLX fork with Metal kernels.

Qwen3.6 27B

RTX 5000 PRO 48GB · vLLM · 196,608 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
FP8 (safetensors)
kv:
F16
flash attention:
on
codingagentic

Uses vLLM 0.20.1 with CUDA 12.9. BF16 KV cache. MTP=2 speculative decoding. Performance range 60-90 TPS, reported 80 TPS typical.

Tone: mixed
codingsummarization

Compared Qwen3.6-27B (with and without thinking) against Coder-Next. 27B with thinking disabled was most consistent (95.8% ship rate). 27B and Coder-Next statistically tied overall. Also mentions Qwen3.6-35B-A3B performed poorly and was dropped.

Qwen3.6 35B (3B active)

RTX 2060 Max-Q 6GB · llama.cpp · 65,536 ctx

Tone: positive
throughput:
23.0 t/s gen
kv:
Q8
flash attention:
on
agentic

Running on a 5-year-old laptop with RTX 2060 Max-Q 6GB VRAM and 24GB RAM. Uses llama.cpp with Q8 KV cache, flash attention, and 64k context. Also mentions a long context variant with 128k context using Tom's fork. Model is Qwen3.6-35B-A3B (MoE, 3B active).

Qwen3.6 27B

M5 Max 128GB · MLX · 290,000 ctx

Tone: mixed
throughput:
5.5 t/s gen · 160.0 t/s pp
quant:
Q8 (mlx)
long-context

User reports 160 tok/s prefill, 5-6 tok/s generation (later 4-5 tok/s) on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization 36-50%. User feels performance is lower than expected and seeks comparison.

Tone: mixed
throughput:
32.0 t/s pp
quant:
Q6 (gguf)
coding

User is considering adding a second 7900 XTX for 48GB VRAM to run larger models. Currently running Qwen 27B Q6 dense with 32K context at 32 t/s prompt processing. Main use case is coding via opencode.

Qwen3.6 27B

M5 Max 128GB · MLX · 290,000 ctx

Tone: mixed
throughput:
5.5 t/s gen · 160.0 t/s pp
quant:
Q8 (mlx)
long-context

User reports 160 tok/s prefill and 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Seeking comparison from others.

Tone: mixed
throughput:
32.0 t/s gen
coding

Qwen 3.6 27B on MacBook Pro M5 Max 64GB: 32 tokens/sec, 18m04s, 33946 tokens. Compared to Gemma 4 31B (27 t/s, 3m51s, 6209 tokens). Qwen showed more creativity but Gemma won for game logic.

Qwen3.6 27B

RTX 3090 · vLLM · 218,000 ctx

Tone: positive
throughput:
66.0 t/s gen
tool-usecoding

~218K context @ ~50/66 TPS (text, narr/code). Tool calls with ~25K-token outputs now complete without OOM after fixing Genesis patch (PN12). Lower TPS than earlier config but higher context + stability.

Qwen3.6 27B

M3 Max 128GB · MLX · 290,000 ctx

Tone: mixed
throughput:
5.5 t/s gen · 160.0 t/s pp
quant:
Q8 (mlx)
long-context

User reports 160 tok/s prefill, 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Asks for comparison with other setups.

Qwen3.6 27B

H100 80GB · vLLM · 128,000 ctx

Tone: positive
throughput:
45.0 t/s gen
codingagentic

User rents GPU instance with 2x H100s (160GB VRAM) to run Qwen3.6-27B at 45 t/s. Uses vLLM for inference. Runs multiple agents (Claude Code, QwenCode, social media bots) hitting the API simultaneously. Context length 128K. Cost ~$0.90/hr, spent $120 last month. Model outperformed 120B model in tests.

Qwen3.6 27B

RTX 4090 · LM Studio · 120,000 ctx

Tone: mixed
throughput:
25.0 t/s gen
quant:
Q4_0 (gguf)
kv:
Q4
coding

User reports 3000 tokens in ~2 minutes (25 t/s) with Q4_0 quant, 120k context, both caches quantized to 4_0. Seeking faster performance. Also includes a reply with vLLM benchmark on RTX 3090: 27B INT4 quant, 125K context, TurboQuant 3-bit NC KV cache, MTP speculative decoding, 82 tok/s generation, 0.3-0.6s TTFT.

Qwen3.6 35B (3B active)

16GB VRAM (Vulkan backend, unspecified GPU model) · llama.cpp (llama-server.exe, Vulkan backend) · 80,000 ctx

Tone: positive
throughput:
38.0 t/s gen · 1021.0 t/s pp
quant:
IQ4_XS (gguf)
kv:
Q8
flash attention:
on
codingtool-usesummarizationagentic

Poster used llama-server.exe with Vulkan backend on a 16GB VRAM GPU (model not specified). Model is Qwen3.6-35B-A3B (MoE, 34.66B params, ~4.25 bpw) in IQ4_XS quant from Unsloth. Used --n-gpu-layers 99, --n-cpu-moe 16 (offloading MoE experts to CPU), --threads 14, --batch-size 1024, --ubatch-size 1024, --flash-attn 1, --cache-type-k q8_0, --cache-type-v q8_0, --ctx-size 80000, --cache-ram 2048, --no-mmap. Prompt processing: 1021.05 ± 1.24 t/s (pp80000). Generation: 37.96 ± 0.10 t/s (tg1000). Combined with a pi coding agent for file operations, tool calls, summarizations, and MCP calls. Poster says it's usable for someone with a 16GB GPU.

Qwen3.6 27B

Unknown GPU

Tone: positive
throughput:
27.6 t/s gen
quant:
Q6_K (gguf)
vision

Used open-visual in Open WebUI for image generation. Multiple prompts with generation speeds around 27 t/s.

Qwen3.6 27B

RTX 5090 · vLLM · 262,144 ctx

Tone: positive
throughput:
106.5 t/s gen
quant:
INT4 (safetensors)
kv:
Q8

Qwen3.6-27B-INT4 via vllm 0.19 on 1x RTX 5090. Achieves 105-108 tps generation with 256k context. Uses fp8_e4m3 KV cache, flashinfer attention, MTP speculative decoding (3 tokens). Model from Lorbus quant (AutoRound).

Qwen3.6 35B (3B active)

5070 ti · LM Studio

Tone: positive
throughput:
55.0 t/s gen
quant:
IQ4_XS (gguf)
coding

Switched from Qwen3.6 35b-a3b IQ4_XS to Qwen3.6 27b IQ3_M. 35b-a3b got 50-60 t/s but slow prompt processing; 27b got ~40 t/s but consistent. 27b found a bug the 35b couldn't. Dense model handles compression better than MoE.

Qwen3.6 35B (3B active)

3070 8gb · llama.cpp · 128,000 ctx

Tone: positive
throughput:
30.0 t/s gen
quant:
Q5_K_S (gguf)

User found that larger quants (Q4_K_XL, Q5_K_S) gave better speed than smaller Q4 (IQ4_XS) on MoE model Qwen3.6-35B-A3B. With Q5_K_S, ~30 t/s at 128k context. Also tested Q4_K_XL: 32 t/s. IQ4_XS gave 25-30 t/s with 32k context. System: RTX 3070 8GB + 64GB DDR4.

Qwen3.6 27B

RTX 5090 · vLLM · 218,000 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
NVFP4 (safetensors)

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19.1rc1. Uses NVFP4 quantization.

Qwen3.6 26.9B

5070Ti 16GB, 2060 6GB · llama-server · 128,000 ctx

Tone: positive
throughput:
19.2 t/s gen · 186.8 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q8

User reports using a 5070Ti 16GB and a 2060 6GB to run Qwen3.6-27B Q4_K_M with llama-server. At 71k actual context, pp=186.76 t/s, tg=19.21 t/s. Also provides llama-bench results with CUDA showing tg speeds around 16-25 t/s depending on configuration.

Qwen3.6 27B

Unknown GPU · llama-cpp-python · 32,768 ctx

Tone: positive
throughput:
22.5 t/s gen
quant:
Q4_K_M (gguf)
codingtool-use

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks: HumanEval, HellaSwag, BFCL. Q4_K_M best practical variant: 1.45x faster than BF16, 48% less peak RAM, 68.8% smaller model file, nearly identical function calling score.

Qwen3.6 27B

RTX 3090 · Luce DFlash (custom C++/CUDA stack on ggml) · 256,000 ctx

Tone: positive
throughput:
913.0 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q4
flash attention:
on
codingmath

Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.

Qwen3.6 27B

RTX 5090 · llama.cpp · 200,000 ctx

Tone: positive
quant:
IQ4_XS (gguf)
kv:
Q8
rating:
5/5
codingtool-use

User reports Qwen 3.6 27B is excellent for pyspark/python and data transformation debugging. Running on ASUS ROG Strix SCAR 18 with RTX 5090 laptop (24GB VRAM) and 64GB DDR5 RAM. Using llama.cpp with IQ4_XS quant at 200k context with Q8_0 KV cache. Initially tried q4_k_m at q4_0. Cancelling cloud subscriptions due to local performance. No tokens/sec reported.