AMD Strix Halo 128GB

AMD · 128GB unified memory · 3 reports

Qwen3.6 27B

throughput:: 21.2 t/s gen
quant:: Q4_K_M (gguf)
mtp (multi-token prediction):: on

MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).

Qwen3.6 27B

AMD Strix Halo 128GB · llama.cpp · 128,000 ctx

quant:: Q8_0 (gguf)
flash attention:: on

long-context

Benchmark compares MTP vs non-MTP for 27B and 35B-A3B models. 27B-MTP shows significant speedup in generation and overall wall time for long-context chat; 35B-MTP shows mixed results with faster generation but slower end-to-end due to prefill overhead.

Qwen3.6 27B

AMD Strix Halo 128GB · llama.cpp

throughput:: 21.1 t/s gen
quant:: Q4_K_M (gguf)

coding

Strix Halo 128GB, ROCm backend, Q4_K_M quant, chat workload. Also tested RTX 3090 and RTX 5070. Multiple models and quants reported.