DeepSeek R1 on 16× Ascend 910B with MindIE

Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/en/cases/case-dsr1-asc910bx16-mindie-001/

Stack

Hardware

ascend-910b × 16 (2 nodes × 8 cards)

Server

huawei-atlas-800t-a3

Interconnect

intra: hccs · inter: roce-v2

Model

deepseek-r1 (bf16)

Engine

mindie1.0.RC3

Quantization

bf16

Parallel

TP=8 · PP=2 · EP=1 · SP=1
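
The case does not state how the 16 ranks are placed. A minimal sketch of one plausible layout, assuming each pipeline stage occupies one node so that tensor-parallel traffic stays on HCCS and only stage-to-stage activations cross RoCE v2. The `placement` helper and the rank ordering are illustrative assumptions, not MindIE's actual mapping.

```python
# Sketch: one plausible rank layout for TP=8, PP=2 on 2 nodes x 8 cards.
# Assumption: global ranks are numbered node by node, and each pipeline
# stage is pinned to a single node.
TP, PP = 8, 2
CARDS_PER_NODE = 8

def placement(rank: int) -> dict:
    """Map a global rank to its pipeline stage, TP rank, node and local card."""
    pp_stage, tp_rank = divmod(rank, TP)
    node, card = divmod(rank, CARDS_PER_NODE)
    return {"pp_stage": pp_stage, "tp_rank": tp_rank, "node": node, "card": card}

layout = [placement(r) for r in range(TP * PP)]
for rank, p in enumerate(layout):
    print(rank, p)
```

With TP equal to the cards per node, every tensor-parallel group lands on exactly one node under this numbering.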

Driver

CANN 8.0

openEuler 22.03 LTS

Scenario

Prefill seq

1024

Decode seq

256

Batch

Max concurrent

Results

Decode tok/s

850

Prefill tok/s

11500

TTFT p50

280

TBT p50

Memory/card

Power/card

380

Compute util %

Memory BW util %
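
A few derived figures can be worked out from the numbers above. This is a sketch under loudly stated assumptions: the page omits units, so TTFT is taken as milliseconds and power as watts, and both throughput figures are taken as aggregates over all 16 cards. None of these readings is confirmed by the case.

```python
# Values copied from the Scenario and Results sections of this case.
CARDS = 16
DECODE_TOK_S = 850        # assumed aggregate across all cards
PREFILL_TOK_S = 11500     # assumed aggregate across all cards
PREFILL_LEN = 1024        # prompt tokens
POWER_PER_CARD_W = 380    # assumed watts

# Decode throughput attributable to each card.
decode_per_card = DECODE_TOK_S / CARDS

# Time to prefill one 1024-token prompt at the reported prefill rate.
prefill_ms = PREFILL_LEN / PREFILL_TOK_S * 1000.0

# Energy per generated token if all cards draw the reported power during decode.
joules_per_token = POWER_PER_CARD_W * CARDS / DECODE_TOK_S

print(f"decode per card: {decode_per_card:.1f} tok/s")
print(f"prefill time:    {prefill_ms:.1f} ms")
print(f"energy/token:    {joules_per_token:.2f} J")
```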

Same-model side-by-side

Throughput comparison: this case vs. other cases for the same model

Bottleneck — memory-bandwidth

Compute 41% · Memory BW 78% · Other 0%

Reproduction

mindie-server --config config/mindie-dsr1.json

Benchmark tool: mindie-benchmark + sharegpt
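
The output format of mindie-benchmark is not shown in this case. As a rough sketch of how p50 figures such as TTFT and TBT are typically derived from per-request timings, the following uses a hypothetical record schema and synthetic values that are not measurements from this run.

```python
from statistics import median

# Synthetic example records, NOT data from this case.
# Hypothetical schema: request send time and per-token arrival times, in seconds.
records = [
    {"sent": 0.0, "token_times": [0.2, 0.25, 0.31]},
    {"sent": 1.0, "token_times": [1.3, 1.36, 1.41]},
    {"sent": 2.0, "token_times": [2.4, 2.45, 2.5]},
]

def ttft_ms(rec: dict) -> float:
    """Time to first token: first token arrival minus send time."""
    return (rec["token_times"][0] - rec["sent"]) * 1000.0

def tbt_ms(rec: dict) -> list:
    """Time between tokens: gaps between consecutive token arrivals."""
    t = rec["token_times"]
    return [(b - a) * 1000.0 for a, b in zip(t, t[1:])]

ttft_p50 = median(ttft_ms(r) for r in records)
tbt_p50 = median(gap for r in records for gap in tbt_ms(r))
print(f"TTFT p50: {ttft_p50:.0f} ms, TBT p50: {tbt_p50:.0f} ms")
```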

Issues encountered

With EP=2, expert routing was unbalanced and long prompts produced load skew; reverted to EP=1.
The first startup took 11 min to load the model, so warm up in advance.
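
The EP=2 skew can be quantified with a max-over-mean load ratio per expert-parallel group. A minimal sketch, assuming experts are assigned to EP groups in contiguous blocks; the token counts are synthetic and the `ep_imbalance` helper is hypothetical, not part of MindIE.

```python
def ep_imbalance(tokens_per_expert: list, ep: int) -> float:
    """Max-over-mean token load across EP groups (1.0 means perfectly balanced)."""
    n = len(tokens_per_expert)
    if n % ep != 0:
        raise ValueError("expert count must be divisible by EP")
    per_group = n // ep
    # Assumption: group g owns the contiguous block of experts
    # [g * per_group, (g + 1) * per_group).
    loads = [sum(tokens_per_expert[g * per_group:(g + 1) * per_group])
             for g in range(ep)]
    return max(loads) / (sum(loads) / ep)

# Synthetic routing counts for 8 experts, skewed toward the first four.
counts = [900, 700, 650, 550, 120, 100, 90, 90]

skew_ep2 = ep_imbalance(counts, ep=2)
skew_ep1 = ep_imbalance(counts, ep=1)
print(skew_ep2, skew_ep1)
```

With EP=1 every expert sits in the same group, so the ratio is 1.0 by construction. That matches the workaround above: the imbalance between groups disappears, though routing itself is unchanged.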

Optimization patterns

memory-bound-decode-prefer-int8 · moe-expert-routing-on-domestic
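
The reasoning behind the first pattern can be sketched with simple arithmetic: when decode is memory-bandwidth bound, the bytes of weights read per token set a floor on step time, and INT8 halves that figure relative to BF16. The 37B activated-parameter count is taken from the public DeepSeek-R1 model card and is not stated in this case. The estimate ignores KV cache, activations, and the extra experts touched at larger batch sizes.

```python
# Assumption: ~37e9 parameters are activated per token (public model card figure).
ACTIVATED_PARAMS = 37e9
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}

def weight_gb_per_token(dtype: str) -> float:
    """Weight bytes read per decoded token at batch size 1, in GB."""
    return ACTIVATED_PARAMS * BYTES_PER_PARAM[dtype] / 1e9

bf16_gb = weight_gb_per_token("bf16")
int8_gb = weight_gb_per_token("int8")

# Upper bound on the decode speedup from weight traffic alone.
traffic_ratio = bf16_gb / int8_gb
print(f"bf16: {bf16_gb:.0f} GB/token, int8: {int8_gb:.0f} GB/token, "
      f"ratio: {traffic_ratio:.1f}x")
```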

Citations

[1] Ascend Model Zoo DeepSeek R1 reference benchmark; figures approximate from public Ascend docs — https://gitee.com/ascend/ModelZoo-PyTorch · verified by measurement on 2026-04-28
Attestation: Numbers extracted from Huawei Ascend public reference benchmark; not independently re-run by submitter.