DeepSeek R1 on 16× Ascend 910B with MindIE

Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/cases/case-dsr1-asc910bx16-mindie-001/

Stack

Hardware

ascend-910b × 16 (2 nodes × 8 cards)

Server

huawei-atlas-800t-a3

Interconnect

intra: hccs · inter: roce-v2

Model

deepseek-r1 (bf16)

Engine

mindie 1.0.RC3

Quantization

bf16

Parallelism

TP=8 · PP=2 · EP=1 · SP=1

Driver

CANN 8.0

openEuler 22.03 LTS
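
The parallelism setting in this stack (TP=8 · PP=2 on 2 nodes × 8 cards) can be sanity-checked with a short sketch. It assumes that the device count equals TP × PP and that each tensor-parallel group stays inside one node, so only the pipeline boundary crosses the inter-node link. The case does not state this mapping explicitly, and `ParallelLayout` and `check_layout` are hypothetical helper names, not part of MindIE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    tp: int      # tensor-parallel degree
    pp: int      # pipeline-parallel degree
    ep: int = 1  # expert-parallel degree (1 = disabled)
    sp: int = 1  # sequence-parallel degree (1 = disabled)

    def world_size(self) -> int:
        # Assumption: EP and SP reuse existing ranks, so only TP and PP
        # multiply into the total device count.
        return self.tp * self.pp


def check_layout(layout: ParallelLayout, nodes: int, cards_per_node: int) -> dict:
    """Check that the layout fills the cluster and report where TP groups land."""
    total = nodes * cards_per_node
    if layout.world_size() != total:
        raise ValueError(
            f"layout needs {layout.world_size()} devices, cluster has {total}"
        )
    # A TP group fits inside one node when it divides the per-node card count.
    tp_within_node = cards_per_node % layout.tp == 0
    return {"devices": total, "tp_within_node": tp_within_node}


layout = ParallelLayout(tp=8, pp=2)
report = check_layout(layout, nodes=2, cards_per_node=8)
print(report)
```

Under these assumptions the layout uses all 16 cards, and each 8-way TP group maps onto a single node.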

Scenario

Prefill seq

1024

Decode seq

256

Batch

Max concurrent

Results

Decode tok/s

850

Prefill tok/s

11500

TTFT p50

280

TBT p50

Memory/card

Power/card

380

Compute

util %

Memory BW

util %
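
A few derived figures can be computed from the results above. This is a rough sketch under two assumptions that the case does not confirm: the throughput numbers are aggregate across all 16 cards, and the Power/card value of 380 is in watts and applies during decode.

```python
# Values copied from the Results section of this case.
DECODE_TOK_PER_S = 850
PREFILL_TOK_PER_S = 11500
POWER_PER_CARD_W = 380  # assumption: unit is watts
NUM_CARDS = 16          # from the hardware line: ascend-910b x 16


def per_card(throughput_tok_per_s: float, cards: int) -> float:
    """Aggregate throughput divided evenly across cards."""
    return throughput_tok_per_s / cards


def joules_per_token(power_per_card_w: float, cards: int, tok_per_s: float) -> float:
    """Total board power (W = J/s) divided by token rate (tok/s) gives J/token."""
    return power_per_card_w * cards / tok_per_s


decode_per_card = per_card(DECODE_TOK_PER_S, NUM_CARDS)
prefill_per_card = per_card(PREFILL_TOK_PER_S, NUM_CARDS)
decode_j_per_tok = joules_per_token(POWER_PER_CARD_W, NUM_CARDS, DECODE_TOK_PER_S)

print(f"decode per card:  {decode_per_card:.2f} tok/s")
print(f"prefill per card: {prefill_per_card:.2f} tok/s")
print(f"decode energy:    {decode_j_per_tok:.2f} J/token")
```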

Cross-case comparison for the same model

Throughput of this case vs. other cases for the same model

Bottleneck analysis: memory-bandwidth

Compute 41% · Memory BW 78% · Other 0%
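
The memory-bandwidth label is consistent with a simple comparison of the two utilization figures. The sketch below is one possible heuristic, not the method the site uses: `classify_bottleneck` and its `margin` threshold are hypothetical.

```python
def classify_bottleneck(compute_util: float, mem_bw_util: float,
                        margin: float = 0.10) -> str:
    """Label the more saturated resource.

    Utilizations are fractions in [0, 1]. `margin` is a hypothetical
    threshold below which the two resources are treated as balanced.
    """
    if mem_bw_util - compute_util > margin:
        return "memory-bandwidth"
    if compute_util - mem_bw_util > margin:
        return "compute"
    return "balanced"


# Figures from this case: Compute 41%, Memory BW 78%.
label = classify_bottleneck(compute_util=0.41, mem_bw_util=0.78)
print(label)
```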

Reproduction steps

mindie-server --config config/mindie-dsr1.json

Benchmark tool: mindie-benchmark + sharegpt

Pitfalls

With EP=2, expert routing was unbalanced and long prompts skewed the load; reverted to EP=1.
First startup took 11 min to load the model; warm up ahead of time.
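
The EP=2 routing skew noted above can be quantified with a simple imbalance ratio. The sketch below assumes experts are placed in contiguous blocks per expert-parallel group, which the case does not confirm for MindIE. The token counts are made up for illustration and are not measurements from this run.

```python
def ep_imbalance(tokens_per_expert: list[int], ep: int) -> float:
    """Return max group load divided by mean group load.

    Experts are split into `ep` contiguous groups (an assumed placement).
    A value of 1.0 means perfectly even load; larger means more skew.
    """
    n = len(tokens_per_expert)
    if ep <= 0 or n % ep != 0:
        raise ValueError("expert count must be divisible by ep")
    size = n // ep
    loads = [sum(tokens_per_expert[i * size:(i + 1) * size]) for i in range(ep)]
    mean = sum(loads) / ep
    return max(loads) / mean


# Hypothetical routing counts for 8 experts on a long prompt.
counts = [900, 700, 650, 600, 120, 100, 80, 50]

ep1 = ep_imbalance(counts, ep=1)  # one group holds every expert
ep2 = ep_imbalance(counts, ep=2)  # hot experts land in the first group
print(f"EP=1 imbalance: {ep1:.3f}")
print(f"EP=2 imbalance: {ep2:.3f}")
```

With EP=1 there is only one group, so the ratio is 1.0 by construction. With EP=2 the slowest group sets the step time, so a ratio well above 1.0 translates directly into idle time on the lighter group.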

Optimization patterns

memory-bound-decode-prefer-int8 moe-expert-routing-on-domestic
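
The reasoning behind memory-bound-decode-prefer-int8 can be sketched with a back-of-the-envelope estimate: when decode is limited by memory bandwidth, the time to stream weights scales with bytes per parameter. The parameter count and bandwidth below are placeholders, not figures for DeepSeek R1 or the Ascend 910B. The estimate ignores KV-cache traffic, activations, and dequantization overhead, so it is an upper bound on the gain.

```python
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_read_time_s(params_read: float, dtype: str,
                       bandwidth_bytes_per_s: float) -> float:
    """Time to stream `params_read` weights once at the given bandwidth."""
    return params_read * BYTES_PER_PARAM[dtype] / bandwidth_bytes_per_s


# Hypothetical placeholders, chosen only to make the ratio visible.
PARAMS_READ_PER_STEP = 1e9  # weights touched per decode step
BANDWIDTH = 1e12            # bytes per second

t_bf16 = weight_read_time_s(PARAMS_READ_PER_STEP, "bf16", BANDWIDTH)
t_int8 = weight_read_time_s(PARAMS_READ_PER_STEP, "int8", BANDWIDTH)
speedup = t_bf16 / t_int8

print(f"bf16: {t_bf16 * 1e3:.2f} ms, int8: {t_int8 * 1e3:.2f} ms, "
      f"upper-bound speedup: {speedup:.1f}x")
```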

References

[1] Ascend Model Zoo DeepSeek R1 reference benchmark; figures approximate from public Ascend docs · https://gitee.com/ascend/ModelZoo-PyTorch · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from Huawei Ascend public reference benchmark; not independently re-run by submitter.