Best GPUs for 30B local LLMs
30B-class models in Q4 fit comfortably in 24GB of VRAM with room for a useful context window, which makes them the sweet spot for a single consumer GPU. Apple Silicon Macs with 32GB+ unified memory also handle them well.
Ranked from 52 community reports on llamaperf.
| # | GPU | Vendor | VRAM | Reports | Fastest tokens/s |
|---|---|---|---|---|---|
| 1 | RTX 5090 | NVIDIA | 32GB | 10 | 3238.0 |
| 2 | RTX 3090 | NVIDIA | 24GB | 7 | 72.9 |
| 3 | M5 Max 128GB | Apple | 128GB | 5 | 7.5 |
| 4 | RTX Pro 6000 Blackwell | NVIDIA | 96GB | 3 | 3500.0 |
| 5 | RTX 4090 | NVIDIA | 24GB | 3 | 149.6 |
| 6 | H100 80GB | NVIDIA | 80GB | 3 | 125.3 |
| 7 | M5 Max 64GB | Apple | 64GB | 3 | 97.0 |
| 8 | RX 7900 XTX | AMD | 24GB | 3 | 58.0 |
| 9 | AMD Strix Halo 128GB | AMD | 128GB | 3 | 21.2 |
| 10 | M2 Max 96GB | Apple | 96GB | 2 | 28.0 |
| 11 | AMD MI50 32GB | AMD | 32GB | 2 | 9.7 |
| 12 | RTX 4070 Ti Super | NVIDIA | 16GB | 1 | 110.2 |
| 13 | RTX 5080 | NVIDIA | 16GB | 1 | 56.0 |
| 14 | RX 9070 | AMD | 16GB | 1 | 46.9 |
| 15 | RTX 5060 Ti 16GB | NVIDIA | 16GB | 1 | 22.0 |
| 16 | RTX A6000 48GB | NVIDIA | 48GB | 1 | 16.9 |
| 17 | AMD Threadripper 256GB | AMD | 256GB | 1 | 7.5 |
| 18 | M3 Max 128GB | Apple | 128GB | 1 | 5.5 |
| 19 | DGX Spark | NVIDIA | 128GB | 1 | — |
Models that fit
No reports yet
These match the profile but nobody has submitted a report yet.
What to look for
24GB cards are the sweet spot
RTX 3090s and 4090s (both 24GB) hold a 30B-class model in Q4 with plenty of headroom for an 8–16K context. This is arguably the best price/capability point in local LLM inference today: a 30B model gets you most of the quality of a 70B model at a fraction of the hardware cost.
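How much headroom an 8–16K context needs depends on the model's attention layout. A rough sketch of the usual key/value cache arithmetic follows; the layer count, KV-head count, and head dimension are hypothetical values chosen for illustration, not figures for any particular 30B model, and the cache is assumed to be stored in 16-bit precision.

```python
GB = 1e9  # decimal gigabytes, used here for simplicity


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer at a given context length."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical architecture, for illustration only.
ASSUMED = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (8192, 16384):
    size = kv_cache_bytes(context_len=ctx, **ASSUMED)
    print(f"{ctx:>6} tokens: {size / GB:.2f} GB of KV cache")
```

The cache grows linearly with context length, so doubling the context doubles this part of the memory budget while the weights stay fixed.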
16GB cards work with tighter quants
An RTX 4060 Ti 16GB or RTX 4070 Ti Super 16GB can run 30B models at Q3/Q4 with shorter contexts, though you'll feel the squeeze with longer prompts. Q3 quants noticeably hurt quality on most models, so treat Q4 as the practical floor.
Frequently asked
What's the best GPU for a 30B local LLM?
The RTX 3090 (used) and RTX 4090 (new), both 24GB, are the standard recommendations. Each holds a 30B model in Q4 with headroom for a useful context window and runs at 25–50 tokens per second on most engines.
Can a 16GB GPU run 30B models?
Yes, with caveats. Q3/Q4 quants of 30B-class models take up roughly 14–17GB depending on the architecture, so the larger ones leave little or no room on a 16GB card. You'll have less context room and may need to lower precision further than ideal. A 24GB card is meaningfully better.
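The 14–17GB range can be sanity-checked with simple arithmetic: parameter count times effective bits per weight, divided by eight. The sketch below is an approximation under stated assumptions. The 4.5 bits-per-weight figure corresponds to 4-bit values plus one 16-bit scale per block of 32 weights (as in llama.cpp's Q4_0 layout), the 3.5 figure is an illustrative stand-in for a Q3-class quant, and `reserve_gb` is a hypothetical allowance for the KV cache and runtime buffers.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params, bits_per_weight, vram_gb, reserve_gb=1.5):
    """True if the weights plus an assumed reserve fit in the given VRAM."""
    # reserve_gb is a hypothetical allowance, not a measured value.
    return weight_gb(n_params, bits_per_weight) + reserve_gb <= vram_gb


N_PARAMS = 30e9  # nominal 30B model

for label, bpw in (("Q4-class (assumed 4.5 bpw)", 4.5), ("Q3-class (assumed 3.5 bpw)", 3.5)):
    size = weight_gb(N_PARAMS, bpw)
    print(f"{label}: {size:.1f} GB weights, "
          f"fits 16GB: {fits(N_PARAMS, bpw, 16)}, fits 24GB: {fits(N_PARAMS, bpw, 24)}")
```

Under these assumptions the Q4-class weights alone exceed 16GB while the Q3-class weights fit with a small margin, which is consistent with the caveats above. Real quant formats mix precisions across tensors, so actual file sizes will differ somewhat.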
How we rank
Hardware is sorted by the number of community submissions on llamaperf, a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens per second observed on each card as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.
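As a minimal sketch, the ordering described here can be reproduced with a two-part sort key. The tie-break by fastest tokens per second is inferred from the table's row order rather than stated by llamaperf, and the `Entry` type and `rank` function are hypothetical names used only for this example. The sample rows are taken from the table above.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    gpu: str
    reports: int
    fastest_tps: Optional[float]  # None when no speed was reported


def rank(entries):
    """Sort by report count (descending), then fastest tokens/s (descending)."""
    def key(e):
        # Entries without a speed sort last within their report-count group.
        tps = e.fastest_tps if e.fastest_tps is not None else float("-inf")
        return (-e.reports, -tps)
    return sorted(entries, key=key)


sample = [
    Entry("RTX 4090", 3, 149.6),
    Entry("DGX Spark", 1, None),
    Entry("RTX 3090", 7, 72.9),
    Entry("RTX 5080", 1, 56.0),
    Entry("H100 80GB", 3, 125.3),
]

for position, entry in enumerate(rank(sample), start=1):
    print(position, entry.gpu, entry.reports, entry.fastest_tps)
```

Because report count comes first in the key, a widely reported card such as the RTX 3090 ranks above faster cards with fewer submissions.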