Showcase · what the data tells us
Insights auto-computed from the data corpus, refreshed on every build
Adding a new case automatically refreshes every number on this page.
Gemma 4 26B on 4× H100 SXM with FP8
h100-sxm5 ×4 · assumes $2.50/h per card + $0.10/kWh + PUE 1.3
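For context, a minimal sketch of the cost model those assumptions imply. The 700 W H100 SXM5 board power and the treatment of energy as billed on top of the hourly rental rate are my assumptions, not the corpus's:

    # Hedged sketch: 700 W board power and separately billed energy
    # are assumptions, not corpus data.
    cards = 4
    rate_per_card = 2.50      # $/h per card, from the card above
    kwh_price = 0.10          # $/kWh
    pue = 1.3                 # datacenter power usage effectiveness
    tdp_kw = 0.700            # H100 SXM5 board power

    rental = cards * rate_per_card               # $10.00/h
    energy = cards * tdp_kw * pue * kwh_price    # ~$0.36/h
    print(f"total ~ ${rental + energy:.2f}/h")   # total ~ $10.36/h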
AMD Instinct MI300X
Measured/theoretical ratio across 1 case: 150% of the theoretical roofline. A ratio above 100% usually means the quoted theoretical peak is conservative (or, with only a single case, an outlier measurement).
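The ratio itself is easy to reproduce from raw case records. A minimal sketch; the field names are illustrative rather than the corpus's actual schema, and the values are placeholders chosen to match the 150% figure:

    from statistics import mean

    # Placeholder case record; field names and values are illustrative.
    cases = [
        {"measured_tflops": 1961.1, "theoretical_tflops": 1307.4},
    ]

    ratios = [c["measured_tflops"] / c["theoretical_tflops"] for c in cases]
    print(f"mean measured/theoretical = {mean(ratios):.2f} "
          f"across {len(ratios)} case(s)")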
Ascend 910C
0% of theoretical, leaving large headroom for kernel and op-library tuning. A common pattern for Chinese silicon: stacks like CANN/MUSA/MindIE close this gap year over year.
NVIDIA H100 SXM5 80GB
The data flywheel is spinning — this card has the most independent reproductions logged.
Ascend 910C
Current efficiency 0.00 vs an overseas mean of 1.38. Each +0.05 gain in efficiency is roughly +10% effective hardware throughput.
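As a sanity check on that conversion: effective throughput is theoretical peak times efficiency, so a +0.05 step is worth 0.05/efficiency in relative terms. The +10% reading implies a 0.5 efficiency baseline, which is my inference, not a stated corpus convention:

    # A +0.05 efficiency step in relative terms: ~10% at a 0.5
    # baseline (my assumption), but only ~3.6% at the overseas
    # mean of 1.38.
    for eff in (0.50, 1.38):
        print(f"eff={eff:.2f}: +0.05 -> +{0.05 / eff:.1%} throughput")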
NVIDIA L40S → NVIDIA B200 SXM 180GB
366 → 2250 TFLOPS BF16 · 2023 → 2024
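The implied jump is easy to sanity-check; note the two parts sit in different market segments (an Ada inference card vs a Blackwell flagship), so the factor overstates like-for-like generational gain:

    # Jump implied by the two corpus entries above.
    l40s, b200 = 366.0, 2250.0    # dense BF16 TFLOPS
    print(f"{b200 / l40s:.1f}x in {2024 - 2023} year")  # 6.1x in 1 year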
DeepSeek R1
Logged on 3 different accelerators: the widest hardware coverage, and hence the most deployment-friendly frontier model, in the corpus.
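That count is a distinct-accelerator tally per model. A minimal sketch with hypothetical records; the pairs below are placeholders, not actual corpus rows:

    from collections import defaultdict

    # Hypothetical (model, accelerator) pairs, placeholders only.
    cases = [
        ("DeepSeek R1", "NVIDIA H100 SXM5 80GB"),
        ("DeepSeek R1", "AMD Instinct MI300X"),
        ("DeepSeek R1", "Ascend 910C"),
    ]

    seen = defaultdict(set)
    for model, accel in cases:
        seen[model].add(accel)
    widest = max(seen, key=lambda m: len(seen[m]))
    print(widest, "on", len(seen[widest]), "accelerators")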
Google TPU v5p
ICI · 4800 Gbps per chip (600 GB/s). Holds an entire frontier MoE in a single scale-up domain.
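A back-of-envelope check on that claim, assuming a DeepSeek-R1-class 671B-parameter model in BF16 (my assumption) against the public v5p figures of 95 GB HBM per chip and up to 8960 chips per pod:

    # Weights-only footprint vs one v5p scale-up domain
    # (KV cache and activations are extra).
    params = 671e9                        # assumed frontier MoE size
    weights_gb = params * 2 / 1e9         # BF16 -> ~1342 GB
    hbm_per_chip_gb, pod_chips = 95, 8960
    chips_needed = weights_gb / hbm_per_chip_gb
    print(f"{weights_gb:.0f} GB of weights ~ {chips_needed:.0f} of "
          f"{pod_chips} chips in one pod")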