llamaperf

llama.cpp

The reference C/C++ inference runtime for GGUF-quantized open-weight LLMs.

16 community reports

llama.cpp is the de facto reference implementation for running open-weight LLMs locally. It introduced the GGUF format that almost every other consumer-facing local LLM tool now consumes, and supports CPU, CUDA, ROCm, Metal, Vulkan, and SYCL backends.

Most of the friendlier front ends you may have heard of (Ollama, LM Studio, Jan, GPT4All) wrap llama.cpp under the hood. Reports tagged 'Ollama' or 'LM Studio' on llamaperf are typically running llama.cpp internally: the engine field captures the user-facing tool, not the underlying runtime.

Performance is competitive on every supported backend. On NVIDIA, exllamav2 and vLLM beat it on raw throughput for batched workloads, but llama.cpp leads on portability and quant variety (Q2_K through Q8_0, plus i-quants), and it is competitive or better on single-user latency.
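To make the quantization trade-off concrete, here is a back-of-the-envelope sketch that estimates weight storage from bits per weight. It assumes the block layouts of two simple llama.cpp formats: Q8_0 stores 32 int8 weights plus one fp16 scale (34 bytes per block, 8.5 bits per weight), and Q4_0 stores 32 4-bit weights plus one fp16 scale (18 bytes per block, 4.5 bits per weight). The 7-billion parameter count and the function names are illustrative, and the estimate ignores the KV cache, activations, and any tensors kept at higher precision, so real file sizes and memory use will differ.

```python
# Rough weight-size estimate for two block-quantized formats.
# Assumption: every weight is stored in the named format (real models mix types).

BLOCK_WEIGHTS = 32  # weights per quantization block in Q8_0 and Q4_0

# Bytes per block: one fp16 scale (2 bytes) plus the packed weights.
BLOCK_BYTES = {
    "Q8_0": 2 + 32,  # 32 weights at 8 bits each
    "Q4_0": 2 + 16,  # 32 weights at 4 bits each
}


def bits_per_weight(quant: str) -> float:
    """Effective bits per weight, including the per-block scale."""
    return BLOCK_BYTES[quant] * 8 / BLOCK_WEIGHTS


def weight_gib(n_params: float, quant: str) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight(quant) / 8 / 2**30


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    for quant in BLOCK_BYTES:
        print(f"{quant}: {bits_per_weight(quant)} bpw, "
              f"~{weight_gib(n_params, quant):.2f} GiB of weights")
```

Comparing the result against the VRAM column of a GPU gives a first-pass answer to whether a model can be fully offloaded, before accounting for context length.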

Top GPUs running llama.cpp

GPU                     Vendor   VRAM     Reports   Fastest t/s
RTX 5090                NVIDIA   32 GB    3         243.9
RTX 3060 12GB           NVIDIA   12 GB    3         70.0
AMD Strix Halo 128GB    AMD      128 GB   3         21.2
RTX 3090                NVIDIA   24 GB    2         50.0
M2 Max 96GB             Apple    96 GB    2         28.0
M5 Max 64GB             Apple    64 GB    1         97.0
RTX 4090                NVIDIA   24 GB    1         80.0
RTX 5080                NVIDIA   16 GB    1         56.0
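
The rows above are ordered by report count rather than by speed. As a small sketch, the same figures can be re-ranked by fastest reported tokens per second. The tuple layout and the names `rows`, `rank_by_speed`, and `total_reports` are illustrative choices, not a llamaperf export format. Each figure is the best single report for that GPU, and the reports may use different models and quantizations, so the ranking is not a like-for-like hardware comparison.

```python
# (gpu, vendor, vram_gb, reports, fastest_tps), copied from the table above.
rows = [
    ("RTX 5090", "nvidia", 32, 3, 243.9),
    ("RTX 3060 12GB", "nvidia", 12, 3, 70.0),
    ("AMD Strix Halo 128GB", "amd", 128, 3, 21.2),
    ("RTX 3090", "nvidia", 24, 2, 50.0),
    ("M2 Max 96GB", "apple", 96, 2, 28.0),
    ("M5 Max 64GB", "apple", 64, 1, 97.0),
    ("RTX 4090", "nvidia", 24, 1, 80.0),
    ("RTX 5080", "nvidia", 16, 1, 56.0),
]


def rank_by_speed(rows):
    """Return the rows sorted by fastest reported tokens/s, highest first."""
    return sorted(rows, key=lambda r: r[4], reverse=True)


def total_reports(rows):
    """Sum the per-GPU report counts."""
    return sum(r[3] for r in rows)


if __name__ == "__main__":
    for gpu, vendor, vram_gb, reports, tps in rank_by_speed(rows):
        print(f"{gpu:<22}{tps:>7.1f} t/s  (reports: {reports})")
    print(f"total reports: {total_reports(rows)}")
```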

Frequently asked

Is llama.cpp the fastest engine for local LLMs?

It depends on the workload. For single-user interactive inference on consumer hardware, llama.cpp is competitive with or faster than alternatives. For batched serving on NVIDIA, vLLM and exllamav2 are typically faster. On Apple Silicon, MLX often edges it out.

What hardware does llama.cpp support?

llama.cpp runs on CPU (any architecture with reasonable SIMD support), CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), Vulkan (cross-vendor), and SYCL (Intel). Few other local inference runtimes cover as many backends.

What is GGUF?

GGUF is the file format llama.cpp uses to package quantized model weights and metadata in a single file. It superseded the older GGML format and is now the most widely used local-LLM file format.
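
As a rough illustration of the single-file layout, the fixed-size header at the start of a GGUF file can be read with Python's `struct` module. This sketch assumes the version 2 and 3 layout described in the GGUF specification: a 4-byte magic `GGUF`, a little-endian uint32 version, then uint64 tensor and metadata key-value counts. The function name `read_gguf_header` is a placeholder, the demo header is synthetic rather than taken from a real model, and the sketch stops before the metadata and tensor data.

```python
import io
import struct
from typing import BinaryIO

GGUF_MAGIC = b"GGUF"


def read_gguf_header(f: BinaryIO) -> dict:
    """Parse the fixed-size GGUF header (assumes a little-endian v2/v3 file)."""
    magic = f.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    (version,) = struct.unpack("<I", f.read(4))
    if version < 2:
        # Version 1 used 32-bit counts; this sketch does not handle it.
        raise ValueError(f"unsupported GGUF version: {version}")
    tensor_count, metadata_kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


if __name__ == "__main__":
    # Synthetic 24-byte header for demonstration; the counts are made up.
    fake = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
    print(read_gguf_header(fake))
```

To inspect a real model, open the file in binary mode (`open(path, "rb")`) and pass the handle to the same function.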