Best local LLMs by hardware tier

Rankings only make sense once you fix the hardware. Pick a tier below and the leaderboard re-sorts to show the models the community actually runs on it, weighted by report count, fastest observed tokens-per-second, and recency.

Hardware tier

NVIDIA ≤12GB

RTX 3060 12GB, RTX 4070

NVIDIA 16GB

RTX 4060 Ti 16GB, 4070 Ti Super, 4080, 5080

NVIDIA 24GB

RTX 3090, RTX 3090 Ti, RTX 4090

NVIDIA 32GB+

RTX 5090, A100, H100, RTX Pro 6000, DGX Spark

Apple ≤36GB

M4 24GB, M-Pro/M-Max 32–36GB

Apple 48–96GB

M-Pro/M-Max 48–96GB

Apple/Strix 128GB+

M-Max/M-Ultra 128GB+, AMD Strix Halo 128GB

AMD/CPU

RX 7900 XTX, MI50, Intel, CPU-only rigs

Model size

All sizes · Small (<14B) · Mid (14–40B) · Large (>40B)

NVIDIA 32GB+ workstation

Workstation and datacenter cards, able to run 70B-class models on a single device. Ranked from 18 reports.

#	Model family	Best variant tested	Reports	Fastest t/s
1	Qwen3.6 (Alibaba)	— on RTX Pro 6000 Blackwell	10	3500.0
2	Gemma 4 (Google DeepMind)	cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit · 26B-A4B · AWQ-4bit on RTX 5090	5	578.0
3	Qwen3.5 (Alibaba)	NuExtract3 · 4B on H100 80GB	1	—
4	GLM-5.1 (Zhipu AI)	NVFP4 on DGX Spark	1	—
5	Llama 3.3 (Meta)	— on RTX Pro 6000 Blackwell	1	—

How we rank

A single global "best models" list doesn't really exist: what runs well on a 5090 is often unrunnable on a 4060, and a 7B that screams on an M3 Max is usually a poor pick on an H100. So we fix the hardware first, then rank the families that actually have community reports on it. The score blends four signals: popularity (log-scaled report count), fastest observed tokens-per-second (normalized within the bucket), recency (90-day half-life), and a small bias for rows where the variant, quant, and GPU are all cleanly known. Click into a family for the full breakdown of records.
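
For readers who want to see how a blend like this could fit together, here is a minimal sketch in Python. It is not the site's actual scoring code: the four weights, the `FamilyRow` fields, and the `score` and `rank` helpers are all assumptions made for illustration. Only the four signals and the 90-day half-life come from the description above.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical weights: the page names the four signals but not how they are weighted.
W_POPULARITY = 0.4
W_SPEED = 0.3
W_RECENCY = 0.2
W_CLEAN = 0.1

HALF_LIFE_DAYS = 90.0  # recency half-life stated in the text


@dataclass
class FamilyRow:
    """One leaderboard row inside a single hardware bucket (illustrative shape)."""
    family: str
    reports: int
    fastest_tps: Optional[float]     # None when no speed was reported
    days_since_last_report: float    # assumed input; not shown on the page
    clean_metadata: bool             # variant, quant, and GPU all known


def score(row: FamilyRow, bucket_max_reports: int, bucket_max_tps: float) -> float:
    """Blend the four signals into one number in [0, 1]."""
    # Popularity: log-scaled report count, relative to the busiest family in the bucket.
    denom = math.log1p(bucket_max_reports)
    popularity = math.log1p(row.reports) / denom if denom > 0 else 0.0
    # Speed: fastest observed t/s, normalized within the bucket.
    speed = (row.fastest_tps or 0.0) / bucket_max_tps if bucket_max_tps > 0 else 0.0
    # Recency: exponential decay that halves every HALF_LIFE_DAYS.
    recency = 0.5 ** (row.days_since_last_report / HALF_LIFE_DAYS)
    # Small bias for rows with cleanly known metadata.
    clean = 1.0 if row.clean_metadata else 0.0
    return (W_POPULARITY * popularity + W_SPEED * speed
            + W_RECENCY * recency + W_CLEAN * clean)


def rank(rows: List[FamilyRow]) -> List[FamilyRow]:
    """Sort one bucket's rows from highest to lowest score."""
    max_reports = max(r.reports for r in rows)
    max_tps = max((r.fastest_tps or 0.0) for r in rows)
    return sorted(rows, key=lambda r: score(r, max_reports, max_tps), reverse=True)


if __name__ == "__main__":
    # Report counts and speeds are taken from the table; the report ages are made up.
    bucket = [
        FamilyRow("Llama 3.3", 1, None, 0.0, False),
        FamilyRow("Gemma 4", 5, 578.0, 0.0, True),
        FamilyRow("Qwen3.6", 10, 3500.0, 0.0, False),
    ]
    for position, row in enumerate(rank(bucket), start=1):
        print(position, row.family)
```

Normalizing speed and popularity within the bucket, rather than globally, is what keeps a fast small model on one tier from distorting the scores on another.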