DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE
Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/cases/case-dsv4pro-cm384-mindie-001/
Stack
Hardware
ascend-910c × 384 (super-pod CloudMatrix 384)
Server
huawei-cloudmatrix-384
Interconnect
intra: lingqu · inter: roce-v2
Model
deepseek-v4-pro (bf16)
Engine
mindie 1.0.RC3
Quantization
bf16
Parallelism
TP=16 · PP=4 · EP=6 · SP=1
Driver
CANN 8.1
OS
openEuler 22.03 LTS
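The parallelism degrees listed above can be sanity-checked against the card count. The sketch below assumes the four degrees (TP, PP, EP, SP) are independent mesh axes whose product equals the number of cards; the card does not state this, and some engines overlap EP with other axes, so treat it as one possible reading rather than the layout MindIE actually uses.

```python
# Parallel degrees and card count as listed in the Stack section.
TP, PP, EP, SP = 16, 4, 6, 1
CARDS = 384

# Assumption (not stated on the card): the degrees are independent mesh
# axes, so their product should equal the number of cards in the super-pod.
world_size = TP * PP * EP * SP

print(world_size)           # 384
print(world_size == CARDS)  # True
```

Under that reading, TP × PP alone covers 64 cards, and the EP=6 axis accounts for the remaining factor.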
Scenario
Prefill seq
4096
Decode seq
1024
Batch
64
Max concurrent
256
Results
Decode tok/s
2400
Prefill tok/s
38000
TTFT p50 (ms)
380
TBT p50 (ms)
32
Memory/card (GB)
102
Power/card (W)
680
Compute util (%)
38
Memory BW util (%)
71
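A few derived figures follow from the reported results by plain arithmetic. The sketch below assumes that the decode throughput is an aggregate over all 384 cards and that the per-card power figure holds during decode; neither is stated on the card, and host, switch, and cooling power are not included.

```python
# Reported values, copied from the Stack and Results sections.
CARDS = 384
DECODE_TOK_S = 2400      # assumed to be aggregate across the pod
TBT_P50_MS = 32          # median time between tokens for one stream
POWER_PER_CARD_W = 680

# Decode throughput attributed to each card.
decode_per_card = DECODE_TOK_S / CARDS                       # 6.25 tok/s

# One stream emitting a token every 32 ms produces this many tokens/s.
tok_s_per_stream = 1000 / TBT_P50_MS                         # 31.25 tok/s

# Number of concurrently decoding streams implied by the two figures above.
implied_streams = DECODE_TOK_S / tok_s_per_stream            # 76.8

# Accelerator power for the pod and energy per decoded token.
pod_power_kw = CARDS * POWER_PER_CARD_W / 1000               # 261.12 kW
joules_per_token = CARDS * POWER_PER_CARD_W / DECODE_TOK_S   # 108.8 J

print(f"{decode_per_card:.2f} tok/s per card")
print(f"{implied_streams:.1f} implied concurrent streams")
print(f"{pod_power_kw:.2f} kW, {joules_per_token:.1f} J/token")
```

The implied stream count (about 77) sits between the listed batch size of 64 and the maximum concurrency of 256, which is plausible but depends entirely on the assumptions above.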
Bottleneck analysis: memory-bandwidth
Compute 38% · Memory BW 71% · Other 0%
Reproduction steps
mindie-server --config config/mindie-dsv4-cm384.json
Benchmark tool: mindie-benchmark + sharegpt
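Pitfalls
- With EP=6, expert routing was load-imbalanced on the first run; a router warmup batch mitigated this
- Lingqu fabric cross-cabinet latency is about 18% higher than intra-cabinet HCCS latency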
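One way to quantify the routing imbalance described in the first pitfall is the ratio of the busiest expert-parallel rank's token count to the mean. The sketch below is illustrative only: `imbalance_factor` is a hypothetical helper, and the token counts are made-up inputs, not measurements from this run or output from MindIE.

```python
from typing import Sequence


def imbalance_factor(tokens_per_rank: Sequence[int]) -> float:
    """Return max load divided by mean load (1.0 means perfectly balanced)."""
    if not tokens_per_rank:
        raise ValueError("need at least one rank")
    mean = sum(tokens_per_rank) / len(tokens_per_rank)
    if mean == 0:
        raise ValueError("no tokens were routed")
    return max(tokens_per_rank) / mean


# Hypothetical token counts for the 6 expert-parallel ranks (EP=6).
cold_start = [3000, 600, 600, 600, 600, 600]       # skewed first batch
after_warmup = [1100, 900, 1000, 1000, 1000, 1000]

print(imbalance_factor(cold_start))    # 3.0
print(imbalance_factor(after_warmup))  # 1.1
```

Because the slowest rank gates each step, a factor well above 1.0 on the first batches would be consistent with the imbalance noted above, and tracking it before and after a warmup batch shows whether the mitigation took effect.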
Optimization patterns
Citations
- [1] Huawei CloudMatrix 384 + DeepSeek V4 Pro reference benchmark (figures approximate from public Huawei coverage): https://www.huawei.com/en/news/2024 · 2026-04-28 · Verified by measurement
Disclaimer: Numbers extracted from Huawei public CloudMatrix 384 reference benchmark; not independently re-run.