llamaperf

Best GPUs for 30B local LLMs

30B-class models in Q4 fit comfortably in 24GB of VRAM with room for a useful context window — the sweet spot for a single consumer GPU. Apple Silicon Macs with 32GB+ unified memory also handle them well. Ranked from community reports.

Ranked from 52 community reports on llamaperf.

Ranked by community reports

#GPUVRAMReportsFastest t/s
1RTX 5090nvidia32GB103238.0
2RTX 3090nvidia24GB772.9
3M5 Max 128GBapple128GB57.5
4RTX Pro 6000 Blackwellnvidia96GB33500.0
5RTX 4090nvidia24GB3149.6
6H100 80GBnvidia80GB3125.3
7M5 Max 64GBapple64GB397.0
8RX 7900 XTXamd24GB358.0
9AMD Strix Halo 128GBamd128GB321.2
10M2 Max 96GBapple96GB228.0
11AMD MI50 32GBamd32GB29.7
12RTX 4070 Ti Supernvidia16GB1110.2
13RTX 5080nvidia16GB156.0
14RX 9070amd16GB146.9
15RTX 5060 Ti 16GBnvidia16GB122.0
16RTX A6000 48GBnvidia48GB116.9
17AMD Threadripper 256GBamd256GB17.5
18M3 Max 128GBapple128GB15.5
19DGX Sparknvidia128GB1

Models that fit

No reports yet

These match the profile but nobody has submitted a report yet.

What to look for

24GB cards are the sweet spot

RTX 3090s and 4090s (both 24GB) hold a 30B-class model in Q4 with plenty of headroom for an 8–16K context. This is arguably the best price/capability point in local LLM inference today — you get most of the quality of a 70B model at a fraction of the hardware cost.

16GB cards work with tighter quants

An RTX 4060 Ti 16GB or RTX 4070 Ti Super 16GB can run 30B models at Q3/Q4 with shorter contexts, though you'll feel the squeeze with longer prompts. Q3 quants noticeably hurt quality on most models — Q4 is the practical floor.

Frequently asked

What's the best GPU for a 30B local LLM?

RTX 3090 (used) or RTX 4090 (new) — both 24GB — are the standard recommendations. They hold a 30B model in Q4 with headroom for a useful context window and run at 25–50 tokens-per-second on most engines.

Can a 16GB GPU run 30B models?

Yes, with caveats. Q3/Q4 quants of 30B-class models fit in ~14–17GB depending on the architecture. You'll have less context room and may need to lower precision further than ideal. A 24GB card is meaningfully better.

How we rank

Hardware is sorted by the number of community submissions on llamaperf — a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second observed on each as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.

See also