Qwen3.6 35B (3B active)
RTX 5080 · llama.cpp · 131,072 ctx
- throughput: 56.0 t/s gen · 1584.0 t/s pp
- quant: Q4_K_XL (gguf)
- kv: Q8
- flash attention: on
- mtp (multi-token prediction): off
coding · agentic
Best config for the 35B Q4_K_XL at 128k context: MTP off, --fit-target 1536. MTP doesn't help at 128k. By comparison, the 27B IQ3 fits fully on the GPU and does benefit from MTP (73 t/s).
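For anyone wanting to reproduce this, here is a rough sketch of how the settings above might map onto a llama.cpp launch command. It is not the exact command used for the numbers on this card. Assumptions: the `llama-server` binary (the card only says llama.cpp), a placeholder model path, "Q8" KV meaning `q8_0` for both K and V caches, and a recent build where `--flash-attn` takes `on`/`off`. The `--fit-target 1536` value is copied from the note as-is. No MTP-related flag is passed, since MTP is off here.

```python
import shlex


def build_llama_server_args(model_path: str) -> list[str]:
    """Assemble a llama-server argument list mirroring the card's settings.

    model_path is a hypothetical placeholder; the card does not give a file name.
    """
    return [
        "llama-server",                # assumption: server binary, not stated on the card
        "--model", model_path,
        "--ctx-size", "131072",        # 131,072 ctx (128k)
        "--flash-attn", "on",          # flash attention: on (assumes on/off syntax)
        "--cache-type-k", "q8_0",      # kv: Q8, assumed to mean q8_0
        "--cache-type-v", "q8_0",
        "--fit-target", "1536",        # value quoted in the note above
        # MTP is off in this config, so nothing MTP-related is added.
    ]


if __name__ == "__main__":
    # Placeholder path for illustration only.
    args = build_llama_server_args("models/qwen-35b-Q4_K_XL.gguf")
    print(shlex.join(args))
```

Keeping the arguments in a list rather than a single string makes it easy to swap one value (for example a smaller `--ctx-size`) and rerun the benchmark without quoting mistakes.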