Model based on Qwen3.5-4B. Trained on 8× H100 for 3 days. Weights available as Safetensors, GGUF, and MLX. Requires as little as 4GB VRAM. Multiple quantizations available (GPTQ, W8A8, FP8, Q4, Q6). Tested with vLLM, SGLang, and llama.cpp.
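As a rough check on the VRAM figure, the sketch below estimates the weight footprint at each listed precision as parameters × bits / 8. It assumes "4B" means about 4 billion parameters and uses nominal bit widths; real quantized files carry extra overhead (scales, unquantized layers), and KV cache and activations are not counted. The names `PARAMS`, `NOMINAL_BITS`, and `weight_gb` are illustrative only.

```python
# Assumption: "4B" is read as ~4 billion parameters (not stated exactly in the listing).
PARAMS = 4e9

# Assumption: nominal bits per weight; actual quantization formats add overhead.
NOMINAL_BITS = {"FP16": 16, "FP8": 8, "W8A8": 8, "Q6": 6, "Q4": 4}


def weight_gb(params: float, bits: int) -> float:
    """Weight-only footprint in GB (10^9 bytes), ignoring KV cache and activations."""
    return params * bits / 8 / 1e9


estimates = {name: weight_gb(PARAMS, bits) for name, bits in NOMINAL_BITS.items()}

for name, gb in estimates.items():
    print(f"{name:>5}: ~{gb:.1f} GB of weights")
```

Under these assumptions, nominal 4-bit weights come to about 2 GB, which would fit within a 4GB card with room left for cache. The listing does not say which quantization the 4GB figure refers to.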
H100 80GB
NVIDIA · 80GB · 3 reports
Gemma 4 31B
H100 80GB · vLLM · 32,768 ctx
- throughput: 125.3 t/s gen
Benchmark of Gemma 4 31B (dense) with MTP and DFlash speculative decoding. Gemma 4 26B-A4B MoE (25.2B total, 3.8B active parameters) was also tested. Dense 31B at concurrency 1: baseline 40.3 tok/s, MTP 125.3 tok/s (3.11x faster), DFlash 122.1 tok/s (3.03x faster). Dense 31B at concurrency 16: baseline 375 tok/s, MTP 953 tok/s, DFlash 725 tok/s. MoE at concurrency 1: baseline 177.1 tok/s, MTP 264.2 tok/s, DFlash 306.4 tok/s. MoE at concurrency 16: baseline 975 tok/s, MTP 1808 tok/s, DFlash 1957 tok/s. Coding, math, STEM, and reasoning prompts benefited more from speculative decoding.
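The speedup factors in this report are the accelerated throughput divided by the baseline throughput at the same concurrency. The sketch below recomputes them for all four configurations from the reported tok/s figures; the dictionary keys and the `speedups` helper are illustrative names, not part of the benchmark.

```python
# Throughput figures (tok/s) copied from the report; keys are (model, concurrency).
results = {
    ("dense-31B", 1): {"baseline": 40.3, "MTP": 125.3, "DFlash": 122.1},
    ("dense-31B", 16): {"baseline": 375.0, "MTP": 953.0, "DFlash": 725.0},
    ("moe-26B-A4B", 1): {"baseline": 177.1, "MTP": 264.2, "DFlash": 306.4},
    ("moe-26B-A4B", 16): {"baseline": 975.0, "MTP": 1808.0, "DFlash": 1957.0},
}


def speedups(row: dict) -> dict:
    """Return throughput relative to the baseline, rounded to two decimals."""
    base = row["baseline"]
    return {
        method: round(tps / base, 2)
        for method, tps in row.items()
        if method != "baseline"
    }


table = {key: speedups(row) for key, row in results.items()}

for (model, concurrency), row in table.items():
    print(f"{model} @ concurrency {concurrency}: {row}")
```

The dense model at concurrency 1 reproduces the reported 3.11x (MTP) and 3.03x (DFlash). The same calculation gives smaller factors at concurrency 16 and for the MoE model, where DFlash comes out ahead of MTP.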
Qwen3.6 27B
2× H100 80GB · vLLM · 128,000 ctx
- throughput: 45.0 t/s gen
The user rents a GPU instance with 2× H100 (160GB VRAM total) to run Qwen3.6-27B at 45 t/s, served with vLLM. Multiple agents (Claude Code, QwenCode, social media bots) hit the API simultaneously. Context length is 128K. Cost is about $0.90/hr; the user spent $120 last month. The model outperformed a 120B model in the user's tests.
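The cost figures above imply a rough usage level. The sketch below derives the rented hours from the monthly spend and hourly rate, and a cost per million generated tokens. It assumes the $0.90/hr rate covers the whole 2-GPU instance and that 45 t/s is a sustained generation rate; neither is stated in the report, so treat the per-token figure as an illustration only.

```python
HOURLY_RATE_USD = 0.90    # from the report (approximate)
MONTHLY_SPEND_USD = 120.0 # from the report
GEN_TOKENS_PER_SEC = 45.0 # from the report; assumed sustained for this estimate

# Hours of rental implied by the monthly spend.
hours_rented = round(MONTHLY_SPEND_USD / HOURLY_RATE_USD, 1)

# Tokens generated per hour if the 45 t/s rate were held continuously (assumption).
tokens_per_hour = GEN_TOKENS_PER_SEC * 3600

# Cost per million generated tokens under that assumption.
usd_per_million_tokens = round(HOURLY_RATE_USD / (tokens_per_hour / 1e6), 2)

print(f"Implied rental time: ~{hours_rented} hours")
print(f"Tokens per hour at 45 t/s: {tokens_per_hour:,.0f}")
print(f"Cost per million generated tokens: ~${usd_per_million_tokens}")
```

With several agents hitting the API at once, aggregate throughput may differ from the single 45 t/s figure, which would change the per-token cost accordingly.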