1. Roofline model — how Tier 1 is computed
Each operator's throughput is bounded by two ceilings:
- Compute ceiling: hardware peak FLOPS / op's FLOPs-per-token
- Memory bandwidth ceiling: hardware BW / op's bytes-per-token
Real throughput ≈ min(compute ceiling, memory ceiling) × an efficiency factor.
Where the two ceilings intersect is the ridge point:
- Arithmetic intensity (FLOPs/byte) above ridge → compute-bound, more FLOPS helps
- Below ridge → memory-bound, quantization or higher BW helps more
See the real Roofline chart in the calculator →
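A minimal sketch of the Tier 1 bound above; function and parameter names are illustrative, not the calculator's actual API.

```python
def roofline_tokens_per_s(peak_flops: float, mem_bw_bytes_s: float,
                          flops_per_token: float, bytes_per_token: float,
                          efficiency: float = 0.7) -> float:
    """Upper-bound throughput (tok/s) for one op on one card."""
    compute_ceiling = peak_flops / flops_per_token      # tok/s if compute-bound
    memory_ceiling = mem_bw_bytes_s / bytes_per_token   # tok/s if memory-bound
    return min(compute_ceiling, memory_ceiling) * efficiency

def ridge_flops_per_byte(peak_flops: float, mem_bw_bytes_s: float) -> float:
    # Arithmetic intensity (FLOPs/byte) at which the two ceilings intersect.
    return peak_flops / mem_bw_bytes_s

# An op is compute-bound when flops_per_token / bytes_per_token exceeds
# the ridge; otherwise it is memory-bound.
```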
2. Three-tier evidence model
Every number carries an evidence tier. We do not conflate vendor-claimed with measured:
- Vendor-claimed: from whitepapers, datasheets, and product pages. Most data starts here, especially for Chinese vendors with limited public benchmarks. Default assumption: real-world ≈ 60-80% of the claim.
- Measured: third-party or community-contributed measurements with attestation and a raw log. Long-term goal: grow this tier's share year over year.
- Estimated: derived from public information (back-calculated from MLPerf submissions, die-shot estimates). Used only to fill key gaps.
Live distribution at /quality →
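A hypothetical sketch of how a spec value might carry its tier; the class names are assumptions, and 0.7 is just the midpoint of the 60-80% derating above.

```python
from dataclasses import dataclass
from enum import Enum

class EvidenceTier(Enum):
    VENDOR_CLAIMED = "vendor-claimed"  # whitepapers, datasheets, product pages
    MEASURED = "measured"              # attested third-party logs
    ESTIMATED = "estimated"            # back-calculated from public data

@dataclass
class SpecValue:
    value: float                       # e.g., peak TFLOPS as published
    tier: EvidenceTier

    def expected_real_world(self) -> float:
        # Apply the default real-world ≈ 60-80% assumption to vendor claims;
        # measured and estimated values pass through unchanged.
        if self.tier is EvidenceTier.VENDOR_CLAIMED:
            return self.value * 0.7
        return self.value
```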
3. Chinese hardware ecosystem
Whether a Chinese accelerator can run a given model depends on:
- Programming model: CANN (Ascend) / BANG (Cambricon) / DTK+HIP (Hygon) / MUSA (Moore Threads) / etc.
- Operator library: most transformer ops are covered, but FP8/FP4/MoE features are still catching up
- Inference engine: vendor-official (MindIE) vs community ports (vllm-ascend / vllm-musa / lmdeploy-mlu)
- Quantization: BF16/FP16 are widely supported, INT8 mostly, FP8/FP4 rarely
Performance is improving rapidly post-export-control; track it via the /china genealogy.
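For illustration only, a toy lookup over the four axes above; every entry is a placeholder, not verified support data.

```python
SUPPORT_MATRIX = {
    "example-ascend-card": {
        "stack": "CANN",
        "engines": {"MindIE", "vllm-ascend"},
        "dtypes": {"BF16", "FP16", "INT8"},  # FP8/FP4 assumed unsupported
    },
}

def can_serve(card: str, engine: str, dtype: str) -> bool:
    entry = SUPPORT_MATRIX.get(card)
    return bool(entry) and engine in entry["engines"] and dtype in entry["dtypes"]

print(can_serve("example-ascend-card", "vllm-ascend", "FP8"))  # False
```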
4. Disaggregated inference
Traditional: the same GPU pool runs both prefill (compute-heavy) and decode (memory-bw-heavy), so each phase leaves part of the hardware underutilized.
Disaggregated:
- Prefill pool: high-compute cards (e.g., H100) handle prompt
- Decode pool: high-memory-bw cards (e.g., H200) handle generation
- KV cache is transferred between pools across the scale-out network
Cost: scheduling complexity + extra KV transfer latency. Benefit: 30-50% lower $/token.
Known systems: Mooncake, DistServe, and SGLang's disaggregated mode.
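A minimal simulation of the split; all classes and the byte-string KV hand-off are invented for illustration, and the real systems above add batching, KV paging, and failure handling.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class PrefillWorker:              # high-compute card (e.g., H100)
    def prefill(self, prompt: str) -> bytes:
        return prompt.encode()    # stand-in for the real KV cache

class DecodeWorker:               # high-memory-bw card (e.g., H200)
    def decode(self, kv: bytes, n: int) -> str:
        return f"<{n} tokens generated from a {len(kv)}-byte KV cache>"

def serve(req: Request, prefill: PrefillWorker, decode: DecodeWorker) -> str:
    kv = prefill.prefill(req.prompt)   # compute-bound phase
    # The KV cache crosses the scale-out network here: the latency cost.
    return decode.decode(kv, req.max_new_tokens)

print(serve(Request("hello", 32), PrefillWorker(), DecodeWorker()))
```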
5. TCO formula
$/M tokens = (hw $/h × cards + power_kW × $/kWh × PUE × cards) ÷ (decode_throughput tok/s × 3600 / 1M)

All inputs are adjustable in the calculator. PUE = data center cooling overhead; 1.3 is industry-typical.
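A worked sketch of the formula above, assuming decode_throughput is the aggregate tok/s of the whole deployment; the example numbers are placeholders, not quotes for any specific card.

```python
def usd_per_m_tokens(hw_usd_per_hour: float, cards: int,
                     power_kw_per_card: float, usd_per_kwh: float,
                     pue: float, decode_tok_s: float) -> float:
    # Hourly cost per card (rental + powered-and-cooled draw), times cards.
    hourly_cost = (hw_usd_per_hour + power_kw_per_card * usd_per_kwh * pue) * cards
    m_tokens_per_hour = decode_tok_s * 3600 / 1_000_000
    return hourly_cost / m_tokens_per_hour

# Placeholder inputs: 8 cards at $2/h, 0.7 kW/card, $0.10/kWh, PUE 1.3,
# 5,000 tok/s aggregate decode throughput.
print(usd_per_m_tokens(2.0, 8, 0.7, 0.10, 1.3, 5000))  # ≈ 0.93 $/M tokens
```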
6. How to use this site for selection
- Know your target model → check model details
- Pick candidate cards → use compare for radar + Roofline overlay
- Run calculator → Tier 0 (real cases) + Tier 1 (theoretical ceiling)
- Check /quality for evidence reliability
- Need a substitute? Every hardware page has a "closest substitute" widget (incl. cross-border markers)