Best local LLMs by hardware tier
Rankings only make sense once you fix the hardware. Pick a tier below — the leaderboard re-sorts to the models the community actually runs there, weighted by report count, fastest observed tokens-per-second, and recency.
NVIDIA 32 GB+ workstation
Workstation and datacenter cards. 70B-class models fit on a single device. Ranked from 18 reports.
| # | Model family | Best variant tested | Reports | Fastest t/s |
|---|---|---|---|---|
| 1 | Qwen3.6 (Alibaba) | — · on RTX Pro 6000 Blackwell | 10 | 3500.0 |
| 2 | Gemma 4 (Google DeepMind) | cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit · 26B-A4B · AWQ-4bit · on RTX 5090 | 5 | 578.0 |
| 3 | Qwen3.5 (Alibaba) | NuExtract3 · 4B · on H100 80GB | 1 | — |
| 4 | GLM-5.1 (Zhipu AI) | NVFP4 · on DGX Spark | 1 | — |
| 5 | Llama 3.3 (Meta) | — · on RTX Pro 6000 Blackwell | 1 | — |
How we rank
A single global "best models" list doesn't really exist — what runs well on a 5090 is often unrunnable on a 4060, and a 7B that screams on an M3 Max is usually a poor pick on an H100. So we fix the hardware first, then rank the families that actually have community reports on it. The score blends popularity (log-scaled report count), fastest observed tokens-per-second normalized within the bucket, recency (90-day half-life), and a small bias for rows where we know the variant + quant + GPU cleanly. Click into a family for the full breakdown of records.
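The blend described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual scoring code: the text names the four signals and the 90-day half-life, but the weights, the `FamilyRow` type, and the function names here are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical weights: the text names the signals but not their weighting.
W_POPULARITY = 0.4
W_SPEED = 0.3
W_RECENCY = 0.2
W_CLEAN = 0.1  # the "small bias" for rows with known variant, quant, and GPU

HALF_LIFE_DAYS = 90.0  # from the text: recency decays with a 90-day half-life


@dataclass
class FamilyRow:
    family: str
    reports: int
    fastest_tps: Optional[float]   # None when no speed was reported
    days_since_last_report: float  # hypothetical input, not shown in the table
    clean: bool                    # variant, quant, and GPU all known


def recency_weight(days: float) -> float:
    """Exponential decay that halves every HALF_LIFE_DAYS."""
    return 0.5 ** (days / HALF_LIFE_DAYS)


def score(row: FamilyRow, max_reports: int, max_tps: float) -> float:
    # Log-scaled report count, normalized to [0, 1] within the bucket.
    popularity = math.log1p(row.reports) / math.log1p(max_reports) if max_reports > 0 else 0.0
    # Fastest observed t/s, normalized against the fastest row in the bucket.
    speed = (row.fastest_tps or 0.0) / max_tps if max_tps > 0 else 0.0
    return (
        W_POPULARITY * popularity
        + W_SPEED * speed
        + W_RECENCY * recency_weight(row.days_since_last_report)
        + W_CLEAN * (1.0 if row.clean else 0.0)
    )


def rank(rows: List[FamilyRow]) -> List[FamilyRow]:
    """Sort one hardware bucket's rows by descending score."""
    if not rows:
        return []
    max_reports = max(r.reports for r in rows)
    max_tps = max((r.fastest_tps or 0.0) for r in rows)
    return sorted(rows, key=lambda r: score(r, max_reports, max_tps), reverse=True)
```

Normalizing both popularity and speed within the bucket keeps scores comparable only among rows on the same hardware tier, which matches the idea of fixing the hardware first. Rows with no reported speed simply contribute zero to the speed term in this sketch; the real leaderboard may treat missing values differently.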