PRICING / TCO

$ / M tokens leaderboard

TCO efficiency per accelerator, recomputed on every build from the live case corpus

Formula

$/M tokens = (hw_rent_per_hour + tdp_w × PUE / 1000 × kWh_price) × 1,000,000 / (decode_tok_s_per_card × 3600)

assumptions:
  hw_rent_per_hour = $2.50 / card / hour
  kWh_price        = $0.10 / kWh
  PUE              = 1.3 (dimensionless)
  TDP              = vendor-rated, per hardware
  decode_tok_s     = measured (Tier 0 case)

⚠ Compute-only BoM estimate — excludes datacenter amortization, networking, ops, and licensing. Real production $/M tokens are typically 1.5–3× this figure. Use for relative ranking, not absolute procurement quotes.
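The formula and assumptions above can be reproduced in a few lines of Python. This is a minimal sketch of the stated calculation, not the site's actual build code; the function name and defaults are ours, and the sample inputs are taken from the case table below:

```python
def usd_per_m_tokens(decode_tok_s_per_card: float,
                     tdp_w: float,
                     hw_rent_per_hour: float = 2.50,  # $/card/hour (assumption above)
                     kwh_price: float = 0.10,         # $/kWh (assumption above)
                     pue: float = 1.3) -> float:      # dimensionless PUE
    """Compute-only $/M tokens: per-card cost per hour (rent + power)
    divided by tokens decoded per card-hour, scaled to 1M tokens."""
    # Wall power in kW (TDP scaled by PUE), times electricity price
    power_cost_per_hour = tdp_w * pue / 1000 * kwh_price
    cost_per_hour = hw_rent_per_hour + power_cost_per_hour
    tokens_per_hour = decode_tok_s_per_card * 3600
    return cost_per_hour * 1_000_000 / tokens_per_hour

# H100 SXM5 running Gemma 4 26B: 1,700 tok/s/card at 700 W TDP
print(round(usd_per_m_tokens(1700, 700), 2))   # 0.42
# MI355X running Qwen3.5 397B: 563 tok/s/card at 1,400 W TDP
print(round(usd_per_m_tokens(563, 1400), 2))   # 1.32
```

Note that at these rates the power term is small: at 700 W and PUE 1.3 it adds only ~$0.09/hour on top of the $2.50 rent, which is why the $/h column below barely moves across very different TDPs.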

Best cost per card (18 cards with measured data)

#    | Hardware                    | Best $/M | Median $/M | Worst $/M | Cases
1 🏆 | NVIDIA H100 SXM5 80GB       | $0.42    | $1.37      | $3.11     | 3
2 🏆 | NVIDIA H200 SXM 141GB       | $1.20    | $2.40      | $2.40     | 2
3 🏆 | AMD Instinct MI355X         | $1.32    | $1.32      | $1.32     | 1
4    | AMD Instinct MI325X         | $1.89    | $1.89      | $1.89     | 1
5    | Intel Gaudi 3               | $2.01    | $2.01      | $2.01     | 1
6    | AMD Instinct MI300X         | $2.62    | $2.62      | $2.62     | 1
7    | NVIDIA A100 SXM4 80GB       | $3.83    | $3.83      | $3.83     | 1
8    | NVIDIA L40S                 | $4.88    | $4.88      | $4.88     | 1
9    | MetaX Xiyun C500 🇨🇳         | $4.88    | $4.88      | $4.88     | 1
10   | Hygon DCU K100 🇨🇳           | $6.74    | $6.74      | $6.74     | 1
11   | AWS Trainium 2              | $12.67   | $12.67     | $12.67    | 1
12   | Huawei Ascend 910B 🇨🇳       | $13.34   | $13.34     | $13.34    | 1
13   | Cambricon MLU590 🇨🇳         | $14.89   | $23.57     | $23.57    | 2
14   | Biren BR104 🇨🇳              | $23.51   | $23.51     | $23.51    | 1
15   | Google TPU Trillium (v6e)   | $31.05   | $31.05     | $31.05    | 1
16   | Moore Threads MTT S4000 🇨🇳  | $35.53   | $35.53     | $35.53    | 1
17   | Iluvatar Tiangai 100 🇨🇳     | $51.29   | $51.29     | $51.29    | 1
18   | Huawei Ascend 910C 🇨🇳       | $115.16  | $115.16    | $115.16   | 1

All cases · sorted by $/M tokens (22)

Case | Hardware | ×N | Model · Precision | decode tok/s/card | TDP (W) | $/h/card | $/M tokens
Gemma 4 26B on 4× H100 SXM with FP8 | h100-sxm5 | ×4 | gemma-4 · fp8-e4m3 | 1,700 | 700 | $2.59 | $0.42 🏆
DeepSeek V4 Flash with disaggregated prefill (H100) + decode (H200) via Mooncake | h200-sxm | ×16 | deepseek-v4-flash · fp8-e4m3 | 600 | 700 | $2.59 | $1.20
Qwen3.5 397B Reasoning on 8× MI355X with FP4 | mi355x | ×8 | qwen3.5-397b · fp4 | 563 | 1400 | $2.68 | $1.32
DeepSeek V4 Flash on 8× H100 SXM with vLLM FP8 | h100-sxm5 | ×8 | deepseek-v4-flash · fp8-e4m3 | 525 | 700 | $2.59 | $1.37
Qwen3.6 Plus on 8× MI325X with SGLang FP8 | mi325x | ×8 | qwen3.6-plus · fp8-e4m3 | 388 | 1000 | $2.63 | $1.89
GPT-OSS on 8× Intel Gaudi 3 with vLLM | gaudi-3 | ×8 | gpt-oss · fp8-e4m3 | 363 | 900 | $2.62 | $2.01
GLM-5.1 on 8× H200 SXM with vLLM BF16 | h200-sxm | ×8 | glm-5.1 · bf16 | 300 | 700 | $2.59 | $2.40
Llama 4 Scout on 8× MI300X with vLLM BF16 | mi300x | ×8 | llama-4-scout · bf16 | 275 | 750 | $2.60 | $2.62
Llama 4 Scout on 8× H100 SXM with vLLM (public benchmark) | h100-sxm5 | ×8 | llama-4-scout · bf16 | 231 | 700 | $2.59 | $3.11
Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM | a100-sxm4 | ×8 | llama-3.3-70b · bf16 | 185 | 400 | $2.55 | $3.83
Qwen2.5-Coder 32B on 4× L40S with vLLM FP8 | l40s | ×4 | qwen2.5-coder-32b · fp8-e4m3 | 145 | 350 | $2.55 | $4.88
Gemma 4 on 4× MetaX Xiyun C500 with INT8 | metax-c500 | ×4 | gemma-4 · int8 | 145 | 350 | $2.55 | $4.88
Llama 4 Scout on 8× Hygon DCU K100 with vLLM | dcu-k100 | ×8 | llama-4-scout · bf16 | 106 | 600 | $2.58 | $6.74
DeepSeek R1 on AWS Trainium 2 (64-chip Trn2 instance) | trainium-2 | ×64 | deepseek-r1 · bf16 | 56 | 500 | $2.56 | $12.67
DeepSeek R1 on 16× Ascend 910B with MindIE | ascend-910b | ×16 | deepseek-r1 · bf16 | 53 | 400 | $2.55 | $13.34
Qwen3.6 Plus on 8× Cambricon MLU590 with LMDeploy | mlu590 | ×8 | qwen3.6-plus · int8 | 48 | 350 | $2.55 | $14.89
GLM-5.1 on 8× Biren BR104 (export-control variant) | br104 | ×8 | glm-5.1 · int8 | 30 | 300 | $2.54 | $23.51
Kimi K2.6 on 16× Cambricon MLU590 (vLLM port) | mlu590 | ×16 | kimi-k2.6 · bf16 | 30 | 350 | $2.55 | $23.57
Llama 4 Maverick on TPU Trillium (v6e) 256-chip pod | trillium | ×256 | llama-4-maverick · bf16 | 23 | 250 | $2.53 | $31.05
DeepSeek V4 Flash on 16× MTT S4000 (Moore Threads KUAE) | mtt-s4000 | ×16 | deepseek-v4-flash · fp16 | 20 | 450 | $2.56 | $35.53
DeepSeek R1 on 16× Iluvatar Tiangai 100 (Iluvatar IxRT) | iluvatar-bi | ×16 | deepseek-r1 · int8 | 14 | 300 | $2.54 | $51.29
DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE | ascend-910c | ×384 | deepseek-v4-pro · bf16 | 6 | 700 | $2.59 | $115.16
Want to tweak the assumptions? Open the calculator: the TCO panel surfaces every assumption ($/card/hr, TDP, etc.) and accepts custom model/hardware/parallelism configs.