PRICING / TCO

$ / M tokens leaderboard

TCO efficiency per accelerator, recomputed on every build from the live case corpus

Formula

$/M tokens = (hw_rent_per_hour + tdp_w × PUE / 1000 × kWh_price) × 1,000,000 / (decode_tok_s_per_card × 3600)

assumptions:
  hw_rent_per_hour = $2.50 / card / hour
  kWh_price        = $0.10 / kWh
  PUE              = 1.3 (dimensionless)
  TDP              = vendor-rated, per hardware
  decode_tok_s     = measured (Tier 0 case)

⚠ Compute-only BoM estimate — excludes datacenter amortization, networking, ops, and licensing. Real production $/M tokens are typically 1.5–3× this figure. Use for relative ranking, not absolute procurement quotes.
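The formula and assumptions above can be reproduced in a few lines of Python. This is a minimal sketch of the stated calculation, not the site's actual build code; the function name and defaults are ours, and the sample inputs are taken from the case table below:

```python
def usd_per_m_tokens(decode_tok_s_per_card: float,
                     tdp_w: float,
                     hw_rent_per_hour: float = 2.50,  # $/card/hour (assumption above)
                     kwh_price: float = 0.10,         # $/kWh (assumption above)
                     pue: float = 1.3) -> float:      # dimensionless PUE
    """Compute-only $/M tokens: per-card cost per hour (rent + power)
    divided by tokens decoded per card-hour, scaled to 1M tokens."""
    # Wall power in kW (TDP scaled by PUE), times electricity price
    power_cost_per_hour = tdp_w * pue / 1000 * kwh_price
    cost_per_hour = hw_rent_per_hour + power_cost_per_hour
    tokens_per_hour = decode_tok_s_per_card * 3600
    return cost_per_hour * 1_000_000 / tokens_per_hour

# H100 SXM5 running Gemma 4 26B: 1,700 tok/s/card at 700 W TDP
print(round(usd_per_m_tokens(1700, 700), 2))   # 0.42
# MI355X running Qwen3.5 397B: 563 tok/s/card at 1,400 W TDP
print(round(usd_per_m_tokens(563, 1400), 2))   # 1.32
```

Note that at these rates the power term is small: at 700 W and PUE 1.3 it adds only ~$0.09/hour on top of the $2.50 rent, which is why the $/h column below barely moves across very different TDPs.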

Best cost per card (18 cards with measured data)

#    | Hardware                    | Best $/M | Median $/M | Worst $/M | Cases
1 🏆 | NVIDIA H100 SXM5 80GB       | $0.42    | $1.37      | $3.11     | 3
2 🏆 | NVIDIA H200 SXM 141GB       | $1.20    | $2.40      | $2.40     | 2
3 🏆 | AMD Instinct MI355X         | $1.32    | $1.32      | $1.32     | 1
4    | AMD Instinct MI325X         | $1.89    | $1.89      | $1.89     | 1
5    | Intel Gaudi 3               | $2.01    | $2.01      | $2.01     | 1
6    | AMD Instinct MI300X         | $2.62    | $2.62      | $2.62     | 1
7    | NVIDIA A100 SXM4 80GB       | $3.83    | $3.83      | $3.83     | 1
8    | NVIDIA L40S                 | $4.88    | $4.88      | $4.88     | 1
9    | MetaX Xiyun C500 🇨🇳         | $4.88    | $4.88      | $4.88     | 1
10   | Hygon DCU K100 🇨🇳           | $6.74    | $6.74      | $6.74     | 1
11   | AWS Trainium 2              | $12.67   | $12.67     | $12.67    | 1
12   | Huawei Ascend 910B 🇨🇳       | $13.34   | $13.34     | $13.34    | 1
13   | Cambricon MLU590 🇨🇳         | $14.89   | $23.57     | $23.57    | 2
14   | Biren BR104 🇨🇳              | $23.51   | $23.51     | $23.51    | 1
15   | Google TPU Trillium (v6e)   | $31.05   | $31.05     | $31.05    | 1
16   | Moore Threads MTT S4000 🇨🇳  | $35.53   | $35.53     | $35.53    | 1
17   | Iluvatar Tiangai 100 🇨🇳     | $51.29   | $51.29     | $51.29    | 1
18   | Huawei Ascend 910C 🇨🇳       | $115.16  | $115.16    | $115.16   | 1

All cases · sorted by $/M tokens (22)

Case | Hardware | ×N | Model · Precision | decode tok/s/card | TDP (W) | $/h/card | $/M tokens
Gemma 4 26B on 4× H100 SXM with FP8 | h100-sxm5 | ×4 | gemma-4 · fp8-e4m3 | 1,700 | 700 | $2.59 | $0.42 🏆
DeepSeek V4 Flash with disaggregated prefill (H100) + decode (H200) via Mooncake | h200-sxm | ×16 | deepseek-v4-flash · fp8-e4m3 | 600 | 700 | $2.59 | $1.20
Qwen3.5 397B Reasoning on 8× MI355X with FP4 | mi355x | ×8 | qwen3.5-397b · fp4 | 563 | 1400 | $2.68 | $1.32
DeepSeek V4 Flash on 8× H100 SXM with vLLM FP8 | h100-sxm5 | ×8 | deepseek-v4-flash · fp8-e4m3 | 525 | 700 | $2.59 | $1.37
Qwen3.6 Plus on 8× MI325X with SGLang FP8 | mi325x | ×8 | qwen3.6-plus · fp8-e4m3 | 388 | 1000 | $2.63 | $1.89
GPT-OSS on 8× Intel Gaudi 3 with vLLM | gaudi-3 | ×8 | gpt-oss · fp8-e4m3 | 363 | 900 | $2.62 | $2.01
GLM-5.1 on 8× H200 SXM with vLLM BF16 | h200-sxm | ×8 | glm-5.1 · bf16 | 300 | 700 | $2.59 | $2.40
Llama 4 Scout on 8× MI300X with vLLM BF16 | mi300x | ×8 | llama-4-scout · bf16 | 275 | 750 | $2.60 | $2.62
Llama 4 Scout on 8× H100 SXM with vLLM (public benchmark) | h100-sxm5 | ×8 | llama-4-scout · bf16 | 231 | 700 | $2.59 | $3.11
Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM | a100-sxm4 | ×8 | llama-3.3-70b · bf16 | 185 | 400 | $2.55 | $3.83
Qwen2.5-Coder 32B on 4× L40S with vLLM FP8 | l40s | ×4 | qwen2.5-coder-32b · fp8-e4m3 | 145 | 350 | $2.55 | $4.88
Gemma 4 on 4× MetaX Xiyun C500 with INT8 | metax-c500 | ×4 | gemma-4 · int8 | 145 | 350 | $2.55 | $4.88
Llama 4 Scout on 8× Hygon DCU K100 with vLLM | dcu-k100 | ×8 | llama-4-scout · bf16 | 106 | 600 | $2.58 | $6.74
DeepSeek R1 on AWS Trainium 2 (64-chip Trn2 instance) | trainium-2 | ×64 | deepseek-r1 · bf16 | 56 | 500 | $2.56 | $12.67
DeepSeek R1 on 16× Ascend 910B with MindIE | ascend-910b | ×16 | deepseek-r1 · bf16 | 53 | 400 | $2.55 | $13.34
Qwen3.6 Plus on 8× Cambricon MLU590 with LMDeploy | mlu590 | ×8 | qwen3.6-plus · int8 | 48 | 350 | $2.55 | $14.89
GLM-5.1 on 8× Biren BR104 (export-control variant) | br104 | ×8 | glm-5.1 · int8 | 30 | 300 | $2.54 | $23.51
Kimi K2.6 on 16× Cambricon MLU590 (vLLM port) | mlu590 | ×16 | kimi-k2.6 · bf16 | 30 | 350 | $2.55 | $23.57
Llama 4 Maverick on TPU Trillium (v6e) 256-chip pod | trillium | ×256 | llama-4-maverick · bf16 | 23 | 250 | $2.53 | $31.05
DeepSeek V4 Flash on 16× MTT S4000 (Moore Threads KUAE) | mtt-s4000 | ×16 | deepseek-v4-flash · fp16 | 20 | 450 | $2.56 | $35.53
DeepSeek R1 on 16× Iluvatar Tiangai 100 (Iluvatar IxRT) | iluvatar-bi | ×16 | deepseek-r1 · int8 | 14 | 300 | $2.54 | $51.29
DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE | ascend-910c | ×384 | deepseek-v4-pro · bf16 | 6 | 700 | $2.59 | $115.16
Want to tweak the assumptions? Open the calculator: the TCO panel surfaces every assumption ($/card/hr, TDP, etc.) and accepts custom model/hardware/parallelism configs.