LEARN

How to read this site

Methodology · evidence tiers · Chinese hardware context.

1. Roofline model — how Tier 1 is computed

Each operator's throughput is bounded by two ceilings:

  • Compute ceiling: hardware peak FLOPS / op's FLOPs-per-token
  • Memory bandwidth ceiling: hardware BW / op's bytes-per-token

Real throughput = min(compute, memory_bw) × efficiency.

Where the two ceilings intersect is the ridge point:

  • Arithmetic intensity (FLOPs/byte) above ridge → compute-bound, more FLOPS helps
  • Below ridge → memory-bound, quantization or higher BW helps more

See real Roofline chart in calculator →
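
A minimal sketch of the Tier 1 arithmetic described above (function and variable names are illustrative, not the calculator's actual code):

  def tier1_tokens_per_s(peak_flops, mem_bw_bytes_per_s,
                         flops_per_token, bytes_per_token,
                         efficiency=0.7):
      # Compute ceiling: tokens/s the ALUs could sustain at peak FLOPS
      compute_bound = peak_flops / flops_per_token
      # Memory ceiling: tokens/s the memory system could feed
      memory_bound = mem_bw_bytes_per_s / bytes_per_token
      # Real throughput = min of the two ceilings, scaled by efficiency
      return min(compute_bound, memory_bound) * efficiency

  def ridge_point(peak_flops, mem_bw_bytes_per_s):
      # Arithmetic intensity (FLOPs/byte) at which both ceilings meet;
      # operators above it are compute-bound, below it memory-bound
      return peak_flops / mem_bw_bytes_per_s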

2. Three-tier evidence model

Every number carries an evidence tier. We do not conflate vendor-claimed with measured:

📄 Vendor-claimed

From whitepapers, datasheets, product pages. Most data starts here, especially for Chinese vendors with limited public benchmarks.

Default assumption: real-world ≈ 60-80% of claim

✅ Measured

Third-party or community-contributed measurements with attestation + raw log.

Long-term goal: increase the share of measured numbers year-over-year

⚠️ Estimated

Derived from public information (back-calculated from MLPerf submissions, die-shot estimates).

Used only for filling key gaps

Live distribution at /quality →
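
As a sketch of how the vendor-claimed discount feeds downstream numbers (the record layout and the 400 TFLOPS card are hypothetical):

  VENDOR_CLAIM_DISCOUNT = (0.6, 0.8)   # real-world ≈ 60-80% of claim

  def usable_range(value, tier):
      # Vendor-claimed numbers get the default discount band;
      # measured and estimated values pass through unchanged in this sketch.
      if tier == "vendor-claimed":
          low, high = VENDOR_CLAIM_DISCOUNT
          return (value * low, value * high)
      return (value, value)

  # e.g. a card claiming 400 TFLOPS -> modeled at roughly 240-320 TFLOPS
  usable_range(400e12, "vendor-claimed")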

3. Chinese hardware ecosystem

Whether a Chinese accelerator can run a given model depends on:

  1. Programming model: CANN (Ascend) / BANG (Cambricon) / DTK+HIP (Hygon) / MUSA (Moore Threads) / etc.
  2. Operator library: most transformer ops are covered, but FP8/FP4 and MoE kernels are still catching up
  3. Inference engine: vendor-official (MindIE) vs community ports (vllm-ascend / vllm-musa / lmdeploy-mlu)
  4. Quantization: BF16/FP16 widely supported, INT8 mostly, FP8/FP4 rarely

Performance has been improving rapidly since export controls took effect; track progress via /china genealogy.
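
A rough way to encode the checklist above; the tables summarize points 1 and 4 only and are illustrative, not a per-card support matrix:

  PROGRAMMING_MODEL = {
      "Ascend": "CANN",
      "Cambricon": "BANG",
      "Hygon": "DTK+HIP",
      "Moore Threads": "MUSA",
  }

  DTYPE_SUPPORT = {       # rule of thumb, verify on the specific card
      "BF16": "widely supported",
      "FP16": "widely supported",
      "INT8": "mostly supported",
      "FP8": "rarely supported",
      "FP4": "rarely supported",
  }

  def quick_screen(vendor, dtype):
      # Coarse first pass before checking engine and operator coverage.
      return (PROGRAMMING_MODEL.get(vendor, "unknown"),
              DTYPE_SUPPORT.get(dtype, "unknown"))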

4. Disaggregated inference

Traditional: the same GPU pool runs both prefill (compute-heavy) and decode (memory-bw-heavy), so each phase leaves part of the hardware underutilized.

Disaggregated:

  • Prefill pool: high-compute cards (e.g., H100) handle prompt
  • Decode pool: high-memory-bw cards (e.g., H200) handle generation
  • KV cache transferred across scale-out network

Cost: scheduling complexity + extra KV transfer latency. Benefit: 30-50% lower $/token.
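
To get a feel for that KV transfer cost, a back-of-the-envelope sketch (the model shape and link bandwidth below are made-up placeholders):

  def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
      # K and V caches for one sequence: two tensors per layer.
      return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

  def kv_transfer_seconds(kv_bytes, link_gb_per_s):
      # Time to move the prefill-produced KV cache to the decode pool.
      return kv_bytes / (link_gb_per_s * 1e9)

  # Hypothetical: 80 layers, 8 KV heads of dim 128, 4k-token prompt,
  # FP16 cache, moved over a 50 GB/s scale-out link.
  kv_transfer_seconds(kv_cache_bytes(80, 8, 128, 4096), link_gb_per_s=50)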

Known systems: Mooncake, DistServe, SGLang disagg.

5. TCO formula

$/M tokens = hw $/h × cards + power_kW × $/kWh × PUE × cards
             ────────────────────────────────────────────────
                   decode_throughput tok/s × 3600 / 1M

All inputs are adjustable in the calculator. PUE (power usage effectiveness) captures data-center cooling and power-delivery overhead; 1.3 is a typical industry value.
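
The same formula as code, assuming power_kW is per card and decode throughput is the aggregate for the whole pool (all example inputs are placeholders):

  def dollars_per_million_tokens(hw_dollars_per_hour, cards, power_kw,
                                 dollars_per_kwh, pue, decode_tok_per_s):
      # Hourly cost of the pool: hardware rental plus wall power × PUE, per card.
      hourly_cost = (hw_dollars_per_hour + power_kw * dollars_per_kwh * pue) * cards
      # Millions of tokens the whole pool decodes per hour.
      mtok_per_hour = decode_tok_per_s * 3600 / 1e6
      return hourly_cost / mtok_per_hour

  # e.g. dollars_per_million_tokens(2.0, 8, 0.7, 0.12, 1.3, 20000)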

6. How to use this site for selection

  1. Know your target model → check model details
  2. Pick candidate cards → use compare for radar + Roofline overlay
  3. Run calculator → Tier 0 (real cases) + Tier 1 (theoretical ceiling)
  4. Check /quality for evidence reliability
  5. Need a substitute? Every hardware page has a "closest substitute" widget (incl. cross-border markers)

References