DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE

Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/cases/case-dsv4pro-cm384-mindie-001/

Stack

Hardware
ascend-910c × 384 (super-pod CloudMatrix 384)
Server
huawei-cloudmatrix-384
Interconnect
intra: lingqu · inter: roce-v2
Model
DeepSeek V4 Pro
Engine
mindie 1.0.RC3
Quantization
bf16
Parallelism
TP=16 · PP=4 · EP=6 · SP=1
Driver
CANN 8.1
OS
openEuler 22.03 LTS
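
As a quick sanity check on the parallel layout above, the sketch below multiplies the listed degrees and compares the product with the 384 cards in the pod. It assumes TP, PP, EP and SP are independent axes whose product equals the world size; the case page does not state how MindIE composes them, so treat this as an illustration rather than a description of the engine.

```python
# Sketch: does the listed parallel layout account for every card?
# Assumption (not stated in the case): the axes are independent and
# their product equals the number of cards in use.
from math import prod

layout = {"TP": 16, "PP": 4, "EP": 6, "SP": 1}  # values from the Stack section
total_cards = 384                               # ascend-910c x 384

world_size = prod(layout.values())
covers_all_cards = world_size == total_cards

print(f"product of degrees = {world_size}, matches card count: {covers_all_cards}")
```

Under that assumption the layout uses the full pod with no idle cards.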

Scenario

Prefill seq
4096
Decode seq
1024
Batch
64
Max concurrent
256

Results

Decode tok/s
2400
Prefill tok/s
38000
TTFT p50 (ms)
380
TBT p50 (ms)
32
Memory/card (GB)
102
Power/card (W)
680
Compute util (%)
38
Memory BW util (%)
71
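
A few derived figures can make the raw results easier to compare across cases. The sketch below is only arithmetic on the numbers reported above. It assumes the decode tok/s value is an aggregate over all 384 cards and that Power/card covers the accelerator only (no host, cooling or network); neither point is stated in the case.

```python
# Sketch: derived metrics from the reported results.
# Assumptions: decode tok/s is a whole-system aggregate, and power/card
# excludes host and network equipment. Neither is confirmed by the case.
cards = 384
decode_tok_s = 2400   # Decode tok/s
tbt_p50_ms = 32       # TBT p50 (ms)
power_per_card_w = 680

# Tokens per second seen by a single stream at the median inter-token gap.
per_stream_tok_s = 1000 / tbt_p50_ms

# Aggregate decode throughput divided evenly over all cards.
decode_tok_s_per_card = decode_tok_s / cards

# Accelerator power for the whole pod, and energy per decoded token.
pod_power_kw = power_per_card_w * cards / 1000
joules_per_decode_token = power_per_card_w * cards / decode_tok_s

print(f"per-stream decode rate : {per_stream_tok_s:.2f} tok/s")
print(f"decode per card        : {decode_tok_s_per_card:.2f} tok/s")
print(f"pod accelerator power  : {pod_power_kw:.2f} kW")
print(f"energy per decode token: {joules_per_decode_token:.1f} J")
```

These values inherit the approximate nature of the source figures, so they are best read as orders of magnitude.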

Bottleneck analysis: memory-bandwidth

Compute 38% · Memory BW 71% · Other 0%
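
One simple way to arrive at a label like the one in the heading is to pick the resource with the highest reported utilization. The sketch below does exactly that. The rule and the function name `dominant_bottleneck` are assumptions made for illustration; the case page does not say how its label is derived.

```python
# Sketch: label the bottleneck as the most-utilized resource.
# Assumption: a plain "highest utilization wins" rule; the case page
# does not document how its bottleneck label is chosen.
def dominant_bottleneck(utilization_pct):
    """Return the resource name with the highest utilization percentage."""
    return max(utilization_pct, key=utilization_pct.get)

utilization_pct = {"compute": 38, "memory-bandwidth": 71, "other": 0}
label = dominant_bottleneck(utilization_pct)

print(label)
```

With the reported 38% compute and 71% memory-bandwidth utilization, this rule agrees with the page's memory-bandwidth label.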

Reproduction steps

mindie-server --config config/mindie-dsv4-cm384.json

Benchmark tool: mindie-benchmark + sharegpt

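Pitfalls

  • With EP=6, expert routing was load-imbalanced on the first run; router warmup batches mitigated this
  • Cross-cabinet latency over the Lingqu fabric is about 18% higher than intra-cabinet HCCS latency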
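
To make the first-run routing imbalance concrete, the sketch below computes a simple imbalance factor (busiest expert group's token count divided by the mean) for two routing traces. Both traces, the function name `imbalance_factor` and the metric itself are hypothetical; the case reports no per-expert counts and does not describe how the warmup batches are issued.

```python
# Sketch: quantify expert-parallel load imbalance as max load / mean load.
# The traces below are invented for illustration only; the case does not
# publish per-expert token counts.
from collections import Counter

def imbalance_factor(group_ids, num_groups):
    """Ratio of the busiest group's token count to the mean token count."""
    counts = Counter(group_ids)
    loads = [counts.get(g, 0) for g in range(num_groups)]
    mean_load = sum(loads) / num_groups
    return max(loads) / mean_load if mean_load else 0.0

EP = 6  # expert-parallel degree from the Stack section

# Hypothetical cold start: half of all tokens land on group 0.
cold_trace = [0] * 50 + [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10 + [5] * 10
# Hypothetical post-warmup trace: tokens spread evenly over the groups.
warm_trace = [g for g in range(EP) for _ in range(20)]

cold = imbalance_factor(cold_trace, EP)
warm = imbalance_factor(warm_trace, EP)
print(f"cold imbalance: {cold:.2f}, warm imbalance: {warm:.2f}")
```

A factor near 1 means the groups share the load evenly; a larger factor means the busiest group gates each step, which is the kind of effect a warmup phase is meant to reduce.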

Optimization patterns

Citations

  [1] Huawei CloudMatrix 384 + DeepSeek V4 Pro reference benchmark (figures approximate, from public Huawei coverage): https://www.huawei.com/en/news/2024 · marked as measured and verified on 2026-04-28
    Disclaimer: Numbers extracted from Huawei public CloudMatrix 384 reference benchmark; not independently re-run.