llamaperf

H100 80GB

NVIDIA · 80GB · 3 reports

Tone: positive
vision · summarization

Model based on Qwen3.5-4B, trained on 8x H100 for 3 days. Weights are available in Safetensors, GGUF, and MLX formats. Requires as little as 4GB VRAM. Multiple quantizations are available (GPTQ, W8A8, FP8, Q4, Q6). Tested with vLLM, SGLang, and llama.cpp.

Gemma 4 31B

H100 80GB · vLLM · 32,768 ctx

throughput: 125.3 t/s gen
coding · math

Benchmark of Gemma 4 31B (dense) with MTP and DFlash speculative decoding; Gemma 4 26B-A4B MoE (25.2B total, 3.8B active parameters) was also tested. Dense, concurrency 1: baseline 40.3 tok/s, MTP 125.3 tok/s (3.11x faster), DFlash 122.1 tok/s (3.03x faster). Dense, concurrency 16: baseline 375 tok/s, MTP 953 tok/s, DFlash 725 tok/s. MoE, concurrency 1: baseline 177.1 tok/s, MTP 264.2 tok/s, DFlash 306.4 tok/s. MoE, concurrency 16: baseline 975 tok/s, MTP 1808 tok/s, DFlash 1957 tok/s. Coding, math, STEM, and reasoning prompts benefited more than other task types.
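The speedup figures in this report are ratios of throughput with speculative decoding to baseline throughput at the same concurrency. The sketch below recomputes them from the reported numbers only; the dictionary layout and the `speedups` helper are illustrative names, not part of any benchmark tooling.

```python
# Throughput figures (tok/s) copied from the report above.
# Keys are (model variant, concurrency); the layout is just for illustration.
REPORTED = {
    ("dense", 1): {"baseline": 40.3, "MTP": 125.3, "DFlash": 122.1},
    ("dense", 16): {"baseline": 375.0, "MTP": 953.0, "DFlash": 725.0},
    ("moe", 1): {"baseline": 177.1, "MTP": 264.2, "DFlash": 306.4},
    ("moe", 16): {"baseline": 975.0, "MTP": 1808.0, "DFlash": 1957.0},
}


def speedups(row):
    """Return throughput relative to baseline for each non-baseline method."""
    base = row["baseline"]
    return {name: round(value / base, 2)
            for name, value in row.items() if name != "baseline"}


if __name__ == "__main__":
    for (variant, concurrency), row in REPORTED.items():
        print(variant, concurrency, speedups(row))
```

For the dense model at concurrency 1 this reproduces the 3.11x (MTP) and 3.03x (DFlash) figures. Applying the same ratio to the other rows suggests the relative gain shrinks at concurrency 16 for the dense model, and that DFlash leads MTP on the MoE model at both concurrency levels.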

Qwen3.6 27B

H100 80GB · vLLM · 128,000 ctx

Tone: positive
throughput: 45.0 t/s gen
coding · agentic

The user rents a GPU instance with 2x H100 (160GB VRAM total) and serves Qwen3.6-27B with vLLM at 45 t/s with a 128K context length. Multiple agents (Claude Code, QwenCode, social media bots) hit the API simultaneously. Cost is about $0.90/hr; the user spent $120 last month. In the user's tests, the model outperformed a 120B model.
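As a rough sanity check on the cost figures, the monthly spend divided by the hourly rate gives the implied rental time. This is a sketch under two assumptions not stated in the report: that the ~$0.90/hr rate covers the whole instance, and that the full $120 went to this instance. The function name and the 30-day month are illustrative.

```python
def rented_hours(total_spend_usd, rate_usd_per_hour):
    """Hours of rental implied by a total spend at a flat hourly rate."""
    return total_spend_usd / rate_usd_per_hour


if __name__ == "__main__":
    hours = rented_hours(120.0, 0.90)   # figures from the report
    per_day = hours / 30                # assumes a 30-day month
    print(f"{hours:.1f} hours total, about {per_day:.1f} hours per day")
```

Under those assumptions the spend corresponds to roughly 133 hours, or a little over 4 hours per day, which would mean the instance was rented on demand rather than kept running continuously.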