Best GPUs for 30B local LLMs

30B-class models in Q4 fit comfortably in 24GB of VRAM with room for a useful context window — the sweet spot for a single consumer GPU. Apple Silicon Macs with 32GB+ unified memory also handle them well. Ranked from community reports.

Ranked from 52 community reports on llamaperf.

Ranked by community reports

#	GPU	VRAM	Reports	Fastest t/s
1	RTX 5090nvidia	32GB	10	3238.0
2	RTX 3090nvidia	24GB	7	72.9
3	M5 Max 128GBapple	128GB	5	7.5
4	RTX Pro 6000 Blackwellnvidia	96GB	3	3500.0
5	RTX 4090nvidia	24GB	3	149.6
6	H100 80GBnvidia	80GB	3	125.3
7	M5 Max 64GBapple	64GB	3	97.0
8	RX 7900 XTXamd	24GB	3	58.0
9	AMD Strix Halo 128GBamd	128GB	3	21.2
10	M2 Max 96GBapple	96GB	2	28.0
11	AMD MI50 32GBamd	32GB	2	9.7
12	RTX 4070 Ti Supernvidia	16GB	1	110.2
13	RTX 5080nvidia	16GB	1	56.0
14	RX 9070amd	16GB	1	46.9
15	RTX 5060 Ti 16GBnvidia	16GB	1	22.0
16	RTX A6000 48GBnvidia	48GB	1	16.9
17	AMD Threadripper 256GBamd	256GB	1	7.5
18	M3 Max 128GBapple	128GB	1	5.5
19	DGX Sparknvidia	128GB	1	—

Models that fit

Qwen3.6782 Gemma 496 Qwen3.52

No reports yet

These match the profile but nobody has submitted a report yet.

What to look for

24GB cards are the sweet spot

RTX 3090s and 4090s (both 24GB) hold a 30B-class model in Q4 with plenty of headroom for an 8–16K context. This is arguably the best price/capability point in local LLM inference today — you get most of the quality of a 70B model at a fraction of the hardware cost.

16GB cards work with tighter quants

An RTX 4060 Ti 16GB or RTX 4070 Ti Super 16GB can run 30B models at Q3/Q4 with shorter contexts, though you'll feel the squeeze with longer prompts. Q3 quants noticeably hurt quality on most models — Q4 is the practical floor.

Frequently asked

What's the best GPU for a 30B local LLM?

RTX 3090 (used) or RTX 4090 (new) — both 24GB — are the standard recommendations. They hold a 30B model in Q4 with headroom for a useful context window and run at 25–50 tokens-per-second on most engines.

Can a 16GB GPU run 30B models?

Yes, with caveats. Q3/Q4 quants of 30B-class models fit in ~14–17GB depending on the architecture. You'll have less context room and may need to lower precision further than ideal. A 24GB card is meaningfully better.

How we rank

Hardware is sorted by the number of community submissions on llamaperf — a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second observed on each as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.