DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE
Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/en/cases/case-dsv4pro-cm384-mindie-001/
Stack
Hardware
ascend-910c × 384 (super-pod CloudMatrix 384)
Server
huawei-cloudmatrix-384
Interconnect
intra: lingqu · inter: roce-v2
Model
deepseek-v4-pro (bf16)
Engine
mindie 1.0.RC3
Quantization
bf16
Parallel
TP=16 · PP=4 · EP=6 · SP=1
Driver
CANN 8.1
OS
openEuler 22.03 LTS
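As a quick sanity check on the Parallel row, the sketch below multiplies the reported degrees and compares the product with the card count. It assumes the degrees compose multiplicatively across cards, which the case does not state; the function and constant names are illustrative only.

```python
from math import prod

# Values copied from the Stack section of this case.
CARDS = 384
PARALLEL = {"TP": 16, "PP": 4, "EP": 6, "SP": 1}


def ranks_required(degrees: dict[str, int]) -> int:
    """Product of all parallel degrees (assumes they compose multiplicatively)."""
    return prod(degrees.values())


required = ranks_required(PARALLEL)
print(f"ranks required: {required}, cards available: {CARDS}")
if required == CARDS:
    print("layout fills the pod")
else:
    print("layout does not match the card count")
```

Under that assumption, the reported layout accounts for all 384 cards.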
Scenario
Prefill seq
4096
Decode seq
1024
Batch
64
Max concurrent
256
Results
Decode tok/s
2400
Prefill tok/s
38000
TTFT p50 (ms)
380
TBT p50 (ms)
32
Memory/card (GB)
102
Power/card (W)
680
Compute util (%)
38
Memory BW util (%)
71
Bottleneck — memory-bandwidth
Compute 38% · Memory BW 71% · Other 0%
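The Results figures can be cross-checked with simple arithmetic. The sketch below applies Little's law (streams in flight ≈ throughput × time per token) to the decode numbers and picks the resource with the highest utilization. It assumes Decode tok/s is an aggregate figure and that TBT p50 is representative of the mean inter-token time; the case states neither, so treat the output as a rough consistency check rather than a measurement.

```python
# Values copied from the Results section of this case.
DECODE_TOK_PER_S = 2400
TBT_P50_MS = 32
UTILIZATION = {"compute": 38, "memory-bandwidth": 71, "other": 0}  # percent


def implied_decode_streams(tok_per_s: float, tbt_ms: float) -> float:
    """Little's law: streams in flight = throughput * time per token."""
    return tok_per_s * (tbt_ms / 1000.0)


def dominant_resource(util: dict[str, float]) -> str:
    """Name of the resource with the highest reported utilization."""
    return max(util, key=util.get)


streams = implied_decode_streams(DECODE_TOK_PER_S, TBT_P50_MS)
print(f"implied concurrent decode streams: {streams:.1f}")
print(f"dominant resource: {dominant_resource(UTILIZATION)}")
```

Under these assumptions the implied number of in-flight decode streams is roughly 77, which sits between the reported Batch (64) and Max concurrent (256), and the dominant resource matches the memory-bandwidth label.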
Reproduction
mindie-server --config config/mindie-dsv4-cm384.json
Benchmark tool: mindie-benchmark + sharegpt
Issues encountered
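- With EP=6, expert routing was load-imbalanced on the first run; a router warmup batch mitigated this.
- Cross-cabinet latency over the Lingqu fabric is about 18% higher than intra-cabinet HCCS latency.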
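One common way to quantify routing imbalance like that in the first issue is the ratio of the busiest expert group's token count to the mean across groups. The sketch below uses made-up token counts for six EP groups. The counts, the variable names, and the function name are hypothetical and are not taken from this case or from MindIE.

```python
from statistics import mean


def imbalance_factor(tokens_per_group: list[int]) -> float:
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    if not tokens_per_group or sum(tokens_per_group) == 0:
        raise ValueError("need at least one group with non-zero load")
    return max(tokens_per_group) / mean(tokens_per_group)


# Hypothetical per-group token counts for EP=6 (illustrative only).
cold_run = [900, 300, 250, 200, 200, 150]
after_warmup = [350, 340, 330, 330, 325, 325]

print(f"cold run imbalance:     {imbalance_factor(cold_run):.2f}")
print(f"after warmup imbalance: {imbalance_factor(after_warmup):.2f}")
```

A factor well above 1.0 means one group is doing a disproportionate share of the work, so the slowest group gates each step.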
Optimization patterns
Citations
- [1] Huawei CloudMatrix 384 + DeepSeek V4 Pro reference benchmark (figures approximate from public Huawei coverage) · https://www.huawei.com/en/news/2024 · 2026-04-28 · verified by measurement
Attestation: Numbers extracted from Huawei public CloudMatrix 384 reference benchmark; not independently re-run.