- throughput:
- 43.3 t/s gen · 456.1 t/s pp
- quant:
- Q4_K_S (gguf)
Dual RTX 3060 setup with tensor parallel. MTP enabled. Context 64k. Prefill 456 t/s, generation 43.26 t/s at 12k context. Without MTP, context 96k, generation 31 t/s. User praises value and stability of CUDA.
- throughput:
- 3500.0 t/s gen · 30000.0 t/s pp
Two benchmarks: Qwen3.6 27B BF16 and Qwen3.6 35B BF16. For 35B, best gen tps 3500 at 128 concurrency with MTP off, prompt tps 30000. Also tested 27B with MTP on/off.
- throughput:
- 46.9 t/s gen · 398.4 t/s pp
- quant:
- UD-Q5_K_XL (gguf)
- flash attention:
- on
- mtp (multi-token prediction):
- on
codingagentic
User runs two RX 9070 XTs with ROCm, uses MTP (spec-type = draft-mtp, spec-draft-n-max = 2). Prompt t/s varies; generation t/s around 45-52. Draft acceptance rate ~0.8-0.99. User praises speed, smarts, steerability for agentic coding tasks. Quant is UD-Q5_K_XL (unsloth GGUF).
- throughput:
- 110.2 t/s gen
- quant:
- IQ4_XS-4.19bpw (gguf)
- kv:
- Q8
- mtp (multi-token prediction):
- on
codingsummarizationmath
Benchmark comparing llama.cpp (89.76 t/s) vs ik_llama.cpp (110.24 t/s) with MTP on Qwen3.6-35B-A3B IQ4_XS quant. 23% speed increase. CPU: Ryzen 7 9700X, OS: CachyOS. GPU used as secondary with iGPU for display.
- throughput:
- 56.0 t/s gen · 1584.0 t/s pp
- quant:
- Q4_K_XL (gguf)
- kv:
- Q8
- flash attention:
- on
- mtp (multi-token prediction):
- off
codingagentic
Best config for 35B Q4_K_XL at 128k context: no MTP, --fit-target 1536. MTP doesn't help at 128k. 27B IQ3 fits fully on GPU and benefits from MTP (73 tok/s).
- throughput:
- 8.0 t/s gen
- quant:
- F16 (gguf)
- mtp (multi-token prediction):
- on
codingagentic
User benchmarked Qwen 3.6 27b F16 on M2 Max 96GB using llama.cpp with MTP speculative decoding. Generation speed varied 8-18 tok/s depending on task; without MTP got 6.6 tok/s. Used for agentic coding to create a Pacman game. Also tested Q8 quant but results were worse. Context up to 150k+ tokens usable. Chat template fixes were critical.
- quant:
- Q5_K_S (gguf)
- kv:
- Q8
long-context
Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
- throughput:
- 72.9 t/s gen · 1261.0 t/s pp
- quant:
- IQ4_KS (gguf)
- kv:
- Q8
- flash attention:
- on
- mtp (multi-token prediction):
- on
coding
Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.
- quant:
- Q8_0 (gguf)
- kv:
- Q8
- mtp (multi-token prediction):
- on
Benchmark comparing MTP KV cache quantization (Q8_0) vs no quantization on Qwen3.6-27B-Q8_0 with llama.cpp. Results show negligible wall time difference (~0.14s) with quantized draft KV cache. Also tested with tensor parallelism on 2xMI50 32GB. Aggregate accept rate ~0.735-0.741.
- throughput:
- 21.2 t/s gen
- quant:
- Q4_K_M (gguf)
- mtp (multi-token prediction):
- on
MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).
- quant:
- Q8_0 (gguf)
- flash attention:
- on
long-context
Benchmark compares MTP vs non-MTP for 27B and 35B-A3B models. 27B-MTP shows significant speedup in generation and overall wall time for long-context chat; 35B-MTP shows mixed results with faster generation but slower end-to-end due to prefill overhead.
- throughput:
- 21.1 t/s gen
- quant:
- Q4_K_M (gguf)
coding
Strix Halo 128GB, ROCm backend, Q4_K_M quant, chat workload. Also tested RTX 3090 and RTX 5070. Multiple models and quants reported.
- throughput:
- 29.0 t/s gen · 239.0 t/s pp
Benchmark of 4x RTX 3090 with Qwen3.6-27B FP16 using vLLM TP=4. Power limit sweep from 200W to unrestricted. Peak efficiency at 220W. User expresses satisfaction with Qwen 3.6 27B as daily driver.
- throughput:
- 3238.0 t/s gen
Benchmark of Qwen3.6-35B-A3B with LDLM diffusion model on RTX 5090 32GB. Throughput 3,238 tok/s at 10 diffusion steps, seq len 64, batch size 1. Also reports ~6,500 tok/s for 4 steps (extrapolated). Untrained weights. Also mentions Qwen3.6-27B at 745 tok/s (10 steps) and ~1,500 tok/s (4 steps).
- throughput:
- 70.0 t/s gen · 780.0 t/s pp
- quant:
- Q2-XS (gguf)
- kv:
- Q8
- flash attention:
- off
- rating:
- 4/5
codingcreative-writingtool-usesummarizationvisionagenticmultilingual
defnitly want to try an higher qwant. I vé took that one beacause gguf sise 11,9 go, ans barely offload this dense modèle to cpu
summarizationtool-use
User is a vet building a dictation/SOAP scribe. Reports inconsistent output from local models (Gemma 4, Qwen 3.6 35B A3B) compared to frontier models. System prompt is a 25-30k token markdown file. Hardware: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox, AnythingLLM + Ollama (llama.cpp).
- throughput:
- 80.0 t/s gen
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
MTP draft acceptance ~73%, TBQ4_0 KV cache, MTP draft 3. Fork: https://github.com/Indras-Mirror/llama.cpp-mtp
- throughput:
- 46.8 t/s gen · 914.0 t/s pp
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- flash attention:
- on
coding
Best plain llama-bench: pp512 ~914 t/s, tg128 ~46.8 t/s. Practical coding profile: 32k context, ~43.4 t/s generation. MTP gave ~47.7 t/s (2% improvement).
- throughput:
- 28.0 t/s gen
- quant:
- Q5_K_M (gguf)
- kv:
- Q4
codingagentic
MTP speculative decoding gives 2.5x speedup. Tested on M2 Max 96GB with Q5_K_M quant and q4_0 KV cache. Also provides hardware recommendations for various Apple Silicon and NVIDIA GPUs.
- throughput:
- 73.6 t/s gen · 2883.0 t/s pp
- quant:
- NVFP4 (safetensors)
- kv:
- Q8
- flash attention:
- on
coding
MTP enabled with 3 speculative tokens. KV cache fp8_e4m3. Prefix caching tested. Stability pass at 200k: 10/10 runs. Generation speed varies 59-111 tok/s. Mean MTP acceptance length 2.28.
- throughput:
- 215.1 t/s gen
- quant:
- Q4 (gguf)
MTP grafted model; Q4 speed increase only 6% on 5090. Also tested Q8 on 5090+3090: 148.20 t/s without MTP, 152.02 t/s with MTP.
- throughput:
- 50.0 t/s gen
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
- flash attention:
- on
coding
Uses MTP (Multi-Token Prediction) GGUF with am17an commit for faster inference. KV cache quantized to Q4_0. Speculative draft set to 2. At 100k context, VRAM usage ~19GB. Performance degrades above 90k context.
- throughput:
- 54.5 t/s gen
- kv:
- Q8
codingtool-use
MTP branch of llama.cpp by am17an. 29-30 t/s without MTP, 54-55 t/s with MTP at 150W power limit. Falls to 40-45 t/s after 50k tokens. Used as vscode copilot.
- throughput:
- 22.0 t/s gen · 760.0 t/s pp
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- flash attention:
- on
User tested Qwen3.6 27B IQ4_XS on RTX 5060 Ti 16GB with llama.cpp (TheTom's TurboQuant fork). Prompt processing 760 t/s, generation 22 t/s. Context window limited to 75k. KV cache quant turbo4/turbo2. Also tested BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS, Q3_K_XL, Q3_K_M, Q2_K_XL on L40S or RTX 5060 Ti. Quality comparison using chess board SVG generation task. Recommends IQ4_XS as minimum.
- throughput:
- 63.0 t/s gen
- quant:
- 4-bit (mlx)
codingcreative-writing
MTPLX engine achieves 63 tok/s on Qwen3.6-27B 4-bit MLX on M5 Max 64GB, up from 28 tok/s baseline. Uses native MTP heads with temperature 0.6, top_p 0.95, top_k 20. Optimal depth D3. Custom patched MLX fork with Metal kernels.
- throughput:
- 80.0 t/s gen
- quant:
- FP8 (safetensors)
- kv:
- F16
- flash attention:
- on
codingagentic
Uses vLLM 0.20.1 with CUDA 12.9. BF16 KV cache. MTP=2 speculative decoding. Performance range 60-90 TPS, reported 80 TPS typical.
codingsummarization
Compared Qwen3.6-27B (with and without thinking) against Coder-Next. 27B with thinking disabled was most consistent (95.8% ship rate). 27B and Coder-Next statistically tied overall. Also mentions Qwen3.6-35B-A3B performed poorly and was dropped.
- throughput:
- 23.0 t/s gen
- kv:
- Q8
- flash attention:
- on
agentic
Running on a 5-year-old laptop with RTX 2060 Max-Q 6GB VRAM and 24GB RAM. Uses llama.cpp with Q8 KV cache, flash attention, and 64k context. Also mentions a long context variant with 128k context using Tom's fork. Model is Qwen3.6-35B-A3B (MoE, 3B active).
- throughput:
- 5.5 t/s gen · 160.0 t/s pp
- quant:
- Q8 (mlx)
long-context
User reports 160 tok/s prefill, 5-6 tok/s generation (later 4-5 tok/s) on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization 36-50%. User feels performance is lower than expected and seeks comparison.
- throughput:
- 32.0 t/s pp
- quant:
- Q6 (gguf)
coding
User is considering adding a second 7900 XTX for 48GB VRAM to run larger models. Currently running Qwen 27B Q6 dense with 32K context at 32 t/s prompt processing. Main use case is coding via opencode.
- throughput:
- 5.5 t/s gen · 160.0 t/s pp
- quant:
- Q8 (mlx)
long-context
User reports 160 tok/s prefill and 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Seeking comparison from others.
- throughput:
- 32.0 t/s gen
coding
Qwen 3.6 27B on MacBook Pro M5 Max 64GB: 32 tokens/sec, 18m04s, 33946 tokens. Compared to Gemma 4 31B (27 t/s, 3m51s, 6209 tokens). Qwen showed more creativity but Gemma won for game logic.
- throughput:
- 66.0 t/s gen
tool-usecoding
~218K context @ ~50/66 TPS (text, narr/code). Tool calls with ~25K-token outputs now complete without OOM after fixing Genesis patch (PN12). Lower TPS than earlier config but higher context + stability.
- throughput:
- 5.5 t/s gen · 160.0 t/s pp
- quant:
- Q8 (mlx)
long-context
User reports 160 tok/s prefill, 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Asks for comparison with other setups.
- throughput:
- 45.0 t/s gen
codingagentic
User rents GPU instance with 2x H100s (160GB VRAM) to run Qwen3.6-27B at 45 t/s. Uses vLLM for inference. Runs multiple agents (Claude Code, QwenCode, social media bots) hitting the API simultaneously. Context length 128K. Cost ~$0.90/hr, spent $120 last month. Model outperformed 120B model in tests.
- throughput:
- 25.0 t/s gen
- quant:
- Q4_0 (gguf)
- kv:
- Q4
coding
User reports 3000 tokens in ~2 minutes (25 t/s) with Q4_0 quant, 120k context, both caches quantized to 4_0. Seeking faster performance. Also includes a reply with vLLM benchmark on RTX 3090: 27B INT4 quant, 125K context, TurboQuant 3-bit NC KV cache, MTP speculative decoding, 82 tok/s generation, 0.3-0.6s TTFT.
- throughput:
- 38.0 t/s gen · 1021.0 t/s pp
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- flash attention:
- on
codingtool-usesummarizationagentic
Poster used llama-server.exe with Vulkan backend on a 16GB VRAM GPU (model not specified). Model is Qwen3.6-35B-A3B (MoE, 34.66B params, ~4.25 bpw) in IQ4_XS quant from Unsloth. Used --n-gpu-layers 99, --n-cpu-moe 16 (offloading MoE experts to CPU), --threads 14, --batch-size 1024, --ubatch-size 1024, --flash-attn 1, --cache-type-k q8_0, --cache-type-v q8_0, --ctx-size 80000, --cache-ram 2048, --no-mmap. Prompt processing: 1021.05 ± 1.24 t/s (pp80000). Generation: 37.96 ± 0.10 t/s (tg1000). Combined with a pi coding agent for file operations, tool calls, summarizations, and MCP calls. Poster says it's usable for someone with a 16GB GPU.
- throughput:
- 27.6 t/s gen
- quant:
- Q6_K (gguf)
vision
Used open-visual in Open WebUI for image generation. Multiple prompts with generation speeds around 27 t/s.
- throughput:
- 106.5 t/s gen
- quant:
- INT4 (safetensors)
- kv:
- Q8
Qwen3.6-27B-INT4 via vllm 0.19 on 1x RTX 5090. Achieves 105-108 tps generation with 256k context. Uses fp8_e4m3 KV cache, flashinfer attention, MTP speculative decoding (3 tokens). Model from Lorbus quant (AutoRound).
- throughput:
- 55.0 t/s gen
- quant:
- IQ4_XS (gguf)
coding
Switched from Qwen3.6 35b-a3b IQ4_XS to Qwen3.6 27b IQ3_M. 35b-a3b got 50-60 t/s but slow prompt processing; 27b got ~40 t/s but consistent. 27b found a bug the 35b couldn't. Dense model handles compression better than MoE.
- throughput:
- 30.0 t/s gen
- quant:
- Q5_K_S (gguf)
User found that larger quants (Q4_K_XL, Q5_K_S) gave better speed than smaller Q4 (IQ4_XS) on MoE model Qwen3.6-35B-A3B. With Q5_K_S, ~30 t/s at 128k context. Also tested Q4_K_XL: 32 t/s. IQ4_XS gave 25-30 t/s with 32k context. System: RTX 3070 8GB + 64GB DDR4.
- throughput:
- 80.0 t/s gen
- quant:
- NVFP4 (safetensors)
Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19.1rc1. Uses NVFP4 quantization.
- throughput:
- 19.2 t/s gen · 186.8 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q8
User reports using a 5070Ti 16GB and a 2060 6GB to run Qwen3.6-27B Q4_K_M with llama-server. At 71k actual context, pp=186.76 t/s, tg=19.21 t/s. Also provides llama-bench results with CUDA showing tg speeds around 16-25 t/s depending on configuration.
- throughput:
- 22.5 t/s gen
- quant:
- Q4_K_M (gguf)
codingtool-use
Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks: HumanEval, HellaSwag, BFCL. Q4_K_M best practical variant: 1.45x faster than BF16, 48% less peak RAM, 68.8% smaller model file, nearly identical function calling score.
- throughput:
- 913.0 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
- flash attention:
- on
codingmath
Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- rating:
- 5/5
codingtool-use
User reports Qwen 3.6 27B is excellent for pyspark/python and data transformation debugging. Running on ASUS ROG Strix SCAR 18 with RTX 5090 laptop (24GB VRAM) and 64GB DDR5 RAM. Using llama.cpp with IQ4_XS quant at 200k context with Q8_0 KV cache. Initially tried q4_k_m at q4_0. Cancelling cloud subscriptions due to local performance. No tokens/sec reported.