llamaperf

AMD Strix Halo 128GB

AMD · 128GB unified memory · 3 reports

See what fits on this GPU →
throughput:
21.2 t/s gen
quant:
Q4_K_M (gguf)
mtp (multi-token prediction):
on

MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).

Qwen3.6 27B

AMD Strix Halo 128GB · llama.cpp · 128,000 ctx

Tone: mixed
quant:
Q8_0 (gguf)
flash attention:
on
long-context

Benchmark compares MTP vs non-MTP for 27B and 35B-A3B models. 27B-MTP shows significant speedup in generation and overall wall time for long-context chat; 35B-MTP shows mixed results with faster generation but slower end-to-end due to prefill overhead.

throughput:
21.1 t/s gen
quant:
Q4_K_M (gguf)
coding

Strix Halo 128GB, ROCm backend, Q4_K_M quant, chat workload. Also tested RTX 3090 and RTX 5070. Multiple models and quants reported.