llamaperf

Best GPUs for 70B local LLMs

70B models in Q4_K_M quantization need roughly 40GB of VRAM plus headroom for context. That puts the realistic floor at a 48GB card (or two 24GB cards), or an Apple Silicon Mac with 64GB+ unified memory. The list below filters to hardware that has actually been reported running models at this scale.

Ranked from 35 community reports on llamaperf.

Ranked by community reports

#GPUVRAMReportsFastest t/s
1RTX 5090nvidia32GB103238.0
2M5 Max 128GBapple128GB57.5
3RTX Pro 6000 Blackwellnvidia96GB33500.0
4H100 80GBnvidia80GB3125.3
5M5 Max 64GBapple64GB397.0
6AMD Strix Halo 128GBamd128GB321.2
7M2 Max 96GBapple96GB228.0
8AMD MI50 32GBamd32GB29.7
9RTX A6000 48GBnvidia48GB116.9
10AMD Threadripper 256GBamd256GB17.5
11M3 Max 128GBapple128GB15.5
12DGX Sparknvidia128GB1

Models that fit

No reports yet

These match the profile but nobody has submitted a report yet.

What to look for

VRAM math for 70B models

At Q4_K_M, a 70B model is roughly 40–43GB of weights. At Q5/Q6 you're at 48–55GB. KV cache scales with context length (a 4K context on a 70B model adds another ~3GB at full precision; less with KV quantization). Plan for 48GB minimum for comfortable Q4, 64GB+ for higher quants or longer contexts.

Single big card vs multi-card

An RTX A6000 (48GB) or H100 (80GB) holds the whole model in one device — no inter-GPU communication overhead. Two RTX 3090s with NVLink approach this but pay a small latency penalty for tensor-parallel split. For interactive use both work; for serving at scale, single-card is simpler.

Apple Silicon as a 70B host

M2 Ultra and M3 Ultra Macs with 128GB+ run 70B models well, often at 5–12 tokens-per-second depending on quant. The advantage is total cost — a Mac Studio Ultra is competitive in price with a single H100 and runs the same model class without datacenter cooling.

Frequently asked

What's the minimum VRAM for a 70B local LLM?

Around 40GB at Q4 quantization, but practical use needs 48GB or more once you account for context and KV cache. Below that, you'll need to offload layers to system memory which sharply degrades tokens-per-second.

Can I run a 70B model on a single RTX 4090?

Not well. A 24GB card requires aggressive quantization (Q2/Q3) or offloading large portions to CPU memory, both of which degrade quality and speed substantially. Two 3090s/4090s, or a single A6000/H100, is the standard solution.

Is a Mac Studio Ultra good for 70B models?

Yes — M-Ultra Macs with 128GB+ unified memory run 70B models comfortably with no special setup. Throughput is lower than a discrete A100/H100 but the total cost is much lower.

How we rank

Hardware is sorted by the number of community submissions on llamaperf — a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second observed on each as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.

See also