Question 1

Is llama.cpp the fastest engine for local LLMs?

Accepted Answer

It depends on the workload. For single-user interactive inference on consumer hardware, llama.cpp is competitive with, and often faster than, the alternatives. For batched serving on NVIDIA GPUs, vLLM and ExLlamaV2 are typically faster. On Apple Silicon, MLX often edges it out.

Question 2

What hardware does llama.cpp support?

Accepted Answer

It runs on CPUs (any architecture with reasonable SIMD support) and on GPUs through CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), Vulkan (cross-vendor), and SYCL (Intel). That breadth of backends is its main portability advantage.

Question 3

What is GGUF?

Accepted Answer

GGUF is the file format llama.cpp uses to package quantized model weights and metadata in a single file. It superseded the older GGML format and is now the most widely used local-LLM file format.
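
To make the "single file with metadata" point concrete, here is a minimal sketch that reads only the fixed-size header at the start of a GGUF file. It assumes the version 2 or later layout (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata key/value counts) stored little-endian. Version 1 files and big-endian files would need different handling. The function name `read_gguf_header` is hypothetical and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"   # first four bytes of every GGUF file
HEADER_SIZE = 24       # 4 (magic) + 4 (version) + 8 (tensors) + 8 (metadata KVs)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes v2+ layout, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    # "<IQQ" = little-endian uint32 version, uint64 tensor count, uint64 KV count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    # Synthetic header for illustration only; the counts are made up.
    sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
    print(read_gguf_header(sample))
```

For a real model, you would pass the first 24 bytes of the file, for example `open(path, "rb").read(24)`. The metadata key/value pairs and tensor descriptions follow the header and are not parsed here.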

Top GPUs running llama.cpp

GPU	Vendor	VRAM	Reports	Fastest t/s
RTX 5090	NVIDIA	32GB	3	243.9
RTX 3060 12GB	NVIDIA	12GB	3	70.0
AMD Strix Halo 128GB	AMD	128GB	3	21.2
RTX 3090	NVIDIA	24GB	2	50.0
M2 Max 96GB	Apple	96GB	2	28.0
M5 Max 64GB	Apple	64GB	1	97.0
RTX 4090	NVIDIA	24GB	1	80.0
RTX 5080	NVIDIA	16GB	1	56.0