LEARN

How to read this site

Methodology · evidence tiers · Chinese hardware context.

1. Roofline model — how Tier 1 is computed

Each operator's throughput is bounded by two ceilings:

  • Compute ceiling: hardware peak FLOPS / op's FLOPs-per-token
  • Memory bandwidth ceiling: hardware BW / op's bytes-per-token

Real throughput = min(compute, memory_bw) × efficiency.

Where the two ceilings intersect is the ridge point:

  • Arithmetic intensity (FLOPs/byte) above ridge → compute-bound, more FLOPS helps
  • Below ridge → memory-bound, quantization or higher BW helps more

See real Roofline chart in calculator →
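
A minimal sketch of the Tier 1 arithmetic described above (function and variable names are illustrative, not the calculator's actual code):

  def tier1_tokens_per_s(peak_flops, mem_bw_bytes_per_s,
                         flops_per_token, bytes_per_token,
                         efficiency=0.7):
      # Compute ceiling: tokens/s the ALUs could sustain at peak FLOPS
      compute_bound = peak_flops / flops_per_token
      # Memory ceiling: tokens/s the memory system could feed
      memory_bound = mem_bw_bytes_per_s / bytes_per_token
      # Real throughput = min of the two ceilings, scaled by efficiency
      return min(compute_bound, memory_bound) * efficiency

  def ridge_point(peak_flops, mem_bw_bytes_per_s):
      # Arithmetic intensity (FLOPs/byte) at which both ceilings meet;
      # operators above it are compute-bound, below it memory-bound
      return peak_flops / mem_bw_bytes_per_s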

2. Three-tier evidence model

Every number carries an evidence tier. We do not conflate vendor-claimed with measured:

📄 Vendor-claimed

From whitepapers, datasheets, product pages. Most data starts here, especially for Chinese vendors with limited public benchmarks.

Default assumption: real-world ≈ 60-80% of claim

✅ Measured

Third-party or community-contributed measurements with attestation + raw log.

Long-term goal: increase the share of measured numbers year-over-year

⚠️ Estimated

Derived from public information (back-calculated from MLPerf submissions, die-shot estimates).

Used only for filling key gaps

Live distribution at /quality →
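
As a sketch of how the vendor-claimed discount feeds downstream numbers (the record layout and the 400 TFLOPS card are hypothetical):

  VENDOR_CLAIM_DISCOUNT = (0.6, 0.8)   # real-world ≈ 60-80% of claim

  def usable_range(value, tier):
      # Vendor-claimed numbers get the default discount band;
      # measured and estimated values pass through unchanged in this sketch.
      if tier == "vendor-claimed":
          low, high = VENDOR_CLAIM_DISCOUNT
          return (value * low, value * high)
      return (value, value)

  # e.g. a card claiming 400 TFLOPS -> modeled at roughly 240-320 TFLOPS
  usable_range(400e12, "vendor-claimed")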

3. Chinese hardware ecosystem

Whether a Chinese accelerator can run a given model depends on:

  1. Programming model: CANN (Ascend) / BANG (Cambricon) / DTK+HIP (Hygon) / MUSA (Moore Threads) / etc.
  2. Operator library: most transformer ops are covered, but FP8/FP4 and MoE kernels are still catching up
  3. Inference engine: vendor-official (MindIE) vs community ports (vllm-ascend / vllm-musa / lmdeploy-mlu)
  4. Quantization: BF16/FP16 widely supported, INT8 mostly, FP8/FP4 rarely

Performance has been improving rapidly since export controls took effect; track progress via /china genealogy.
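
A rough way to encode the checklist above; the tables summarize points 1 and 4 only and are illustrative, not a per-card support matrix:

  PROGRAMMING_MODEL = {
      "Ascend": "CANN",
      "Cambricon": "BANG",
      "Hygon": "DTK+HIP",
      "Moore Threads": "MUSA",
  }

  DTYPE_SUPPORT = {       # rule of thumb, verify on the specific card
      "BF16": "widely supported",
      "FP16": "widely supported",
      "INT8": "mostly supported",
      "FP8": "rarely supported",
      "FP4": "rarely supported",
  }

  def quick_screen(vendor, dtype):
      # Coarse first pass before checking engine and operator coverage.
      return (PROGRAMMING_MODEL.get(vendor, "unknown"),
              DTYPE_SUPPORT.get(dtype, "unknown"))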

4. Disaggregated inference

Traditional: the same GPU pool runs both prefill (compute-heavy) and decode (memory-bw-heavy), so each phase leaves part of the hardware underutilized.

Disaggregated:

  • Prefill pool: high-compute cards (e.g., H100) handle prompt
  • Decode pool: high-memory-bw cards (e.g., H200) handle generation
  • KV cache transferred across scale-out network

Cost: scheduling complexity + extra KV transfer latency. Benefit: 30-50% lower $/token.
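
To get a feel for that KV transfer cost, a back-of-the-envelope sketch (the model shape and link bandwidth below are made-up placeholders):

  def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
      # K and V caches for one sequence: two tensors per layer.
      return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

  def kv_transfer_seconds(kv_bytes, link_gb_per_s):
      # Time to move the prefill-produced KV cache to the decode pool.
      return kv_bytes / (link_gb_per_s * 1e9)

  # Hypothetical: 80 layers, 8 KV heads of dim 128, 4k-token prompt,
  # FP16 cache, moved over a 50 GB/s scale-out link.
  kv_transfer_seconds(kv_cache_bytes(80, 8, 128, 4096), link_gb_per_s=50)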

Known systems: Mooncake, DistServe, SGLang disagg.

5. TCO formula

$/M tokens = hw $/h × cards + power_kW × $/kWh × PUE × cards
             ────────────────────────────────────────────────
                   decode_throughput tok/s × 3600 / 1M

All inputs are adjustable in the calculator. PUE (power usage effectiveness) captures data-center cooling and power-delivery overhead; 1.3 is a typical industry value.
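
The same formula as code, assuming power_kW is per card and decode throughput is the aggregate for the whole pool (all example inputs are placeholders):

  def dollars_per_million_tokens(hw_dollars_per_hour, cards, power_kw,
                                 dollars_per_kwh, pue, decode_tok_per_s):
      # Hourly cost of the pool: hardware rental plus wall power × PUE, per card.
      hourly_cost = (hw_dollars_per_hour + power_kw * dollars_per_kwh * pue) * cards
      # Millions of tokens the whole pool decodes per hour.
      mtok_per_hour = decode_tok_per_s * 3600 / 1e6
      return hourly_cost / mtok_per_hour

  # e.g. dollars_per_million_tokens(2.0, 8, 0.7, 0.12, 1.3, 20000)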

6. How to use this site for selection

  1. Know your target model → check model details
  2. Pick candidate cards → use compare for radar + Roofline overlay
  3. Run calculator → Tier 0 (real cases) + Tier 1 (theoretical ceiling)
  4. Check /quality for evidence reliability
  5. Need a substitute? Every hardware page has a "closest substitute" widget (incl. cross-border markers)

References