DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE

Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/en/cases/case-dsv4pro-cm384-mindie-001/

Stack

Hardware

ascend-910c × 384 (super-pod CloudMatrix 384)

Server

huawei-cloudmatrix-384

Interconnect

intra: lingqu · inter: roce-v2

Model

deepseek-v4-pro (bf16)

Engine

mindie 1.0.RC3

Quantization

bf16

Parallel

TP=16 · PP=4 · EP=6 · SP=1
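The listed degrees multiply out to the pod size (16 × 4 × 6 × 1 = 384). The sketch below checks that, assuming each degree is an independent axis of the device mesh. The case card does not state this, and real engines sometimes overlay EP on other axes, so treat it as a sanity check and not as a description of how MindIE lays out ranks.

```python
from math import prod

# Parallel degrees and card count as listed in the Stack section.
degrees = {"TP": 16, "PP": 4, "EP": 6, "SP": 1}
total_cards = 384  # ascend-910c x 384


def cards_required(parallel_degrees: dict[str, int]) -> int:
    """Cards needed if every degree is an independent mesh axis (assumption)."""
    return prod(parallel_degrees.values())


required = cards_required(degrees)
print(f"{required} cards required, {total_cards} available")
```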

Driver

CANN 8.1

OS

openEuler 22.03 LTS

Scenario

Prefill seq

4096

Decode seq

1024

Batch

Max concurrent

256

Results

Decode tok/s

2400

Prefill tok/s

38000

TTFT p50

380

TBT p50

Memory/card

102

Power/card

680

Compute

util %

Memory BW

util %

Bottleneck — memory-bandwidth

Compute 38% · Memory BW 71% · Other 0%
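The throughput figures are easier to compare across systems once normalized. The sketch below derives per-card and per-stream rates, assuming the reported tok/s values are aggregates over the whole 384-card pod and that all 256 concurrent streams decode at once. Neither assumption is confirmed by the case card, and the implied inter-token gap is a derived estimate, not a measured TBT.

```python
# Values copied from the Stack, Scenario, and Results sections.
CARDS = 384
MAX_CONCURRENT = 256
DECODE_TOK_S = 2400
PREFILL_TOK_S = 38000


def per_card(tok_s: float, cards: int = CARDS) -> float:
    """Aggregate throughput divided evenly across cards (assumption)."""
    return tok_s / cards


def per_stream(tok_s: float, streams: int = MAX_CONCURRENT) -> float:
    """Aggregate throughput divided evenly across concurrent streams (assumption)."""
    return tok_s / streams


decode_per_card = per_card(DECODE_TOK_S)
prefill_per_card = per_card(PREFILL_TOK_S)
decode_per_stream = per_stream(DECODE_TOK_S)
# Average gap between tokens on one stream implied by the per-stream rate.
implied_gap_ms = 1000.0 / decode_per_stream

print(f"decode:  {decode_per_card:.2f} tok/s per card")
print(f"prefill: {prefill_per_card:.2f} tok/s per card")
print(f"decode:  {decode_per_stream:.3f} tok/s per stream")
print(f"implied inter-token gap: {implied_gap_ms:.1f} ms")
```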

Reproduction

mindie-server --config config/mindie-dsv4-cm384.json

Benchmark tool: mindie-benchmark + sharegpt

Issues encountered

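With EP=6, expert routing was load-imbalanced on the first run; a router warmup batch mitigated this.
Cross-cabinet latency over the Lingqu fabric is about 18% higher than intra-cabinet HCCS latency.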
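One common way to quantify the routing imbalance in the first issue is the ratio of the busiest EP group's token count to the mean. The sketch below computes that ratio on made-up routing data. The function names and sample assignments are hypothetical, and it does not reproduce MindIE's router or its warmup mechanism.

```python
from collections import Counter

EP_GROUPS = 6  # EP=6 from the Stack section


def tokens_per_group(assignments: list[int], groups: int = EP_GROUPS) -> list[int]:
    """Count how many tokens were routed to each expert-parallel group."""
    counts = Counter(assignments)
    return [counts.get(g, 0) for g in range(groups)]


def imbalance(counts: list[int]) -> float:
    """Max load over mean load; 1.0 means perfectly balanced."""
    total = sum(counts)
    if total == 0:
        return 1.0
    return max(counts) * len(counts) / total


# Hypothetical data: a cold first batch that favors group 0,
# and a later batch spread evenly across all groups.
cold_batch = [0] * 50 + [g for g in range(1, EP_GROUPS) for _ in range(10)]
warm_batch = [g for g in range(EP_GROUPS) for _ in range(10)]

cold_ratio = imbalance(tokens_per_group(cold_batch))
warm_ratio = imbalance(tokens_per_group(warm_batch))
print(f"cold imbalance: {cold_ratio:.2f}, warm imbalance: {warm_ratio:.2f}")
```

Tracking this ratio before and after the warmup batch would show whether the mitigation is holding.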

Optimization patterns

moe-expert-routing-on-domestic · memory-bound-decode-prefer-int8
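The reasoning behind memory-bound-decode-prefer-int8 can be sketched with an Amdahl-style estimate: int8 weights take half the bytes of bf16, so only the memory-bound share of a decode step can speed up, by at most 2×. The `memory_bound_fraction` parameter is a hypothetical input, not a figure from this case, and the model ignores KV cache traffic, activations, and communication.

```python
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def ideal_decode_speedup(memory_bound_fraction: float,
                         src: str = "bf16", dst: str = "int8") -> float:
    """Upper-bound speedup when only weight-read time shrinks.

    memory_bound_fraction: share of step time spent reading weights
    (hypothetical input, must be within [0, 1]).
    """
    if not 0.0 <= memory_bound_fraction <= 1.0:
        raise ValueError("memory_bound_fraction must be within [0, 1]")
    byte_ratio = BYTES_PER_PARAM[dst] / BYTES_PER_PARAM[src]
    remaining = (1.0 - memory_bound_fraction) + memory_bound_fraction * byte_ratio
    return 1.0 / remaining


for fraction in (0.5, 1.0):
    speedup = ideal_decode_speedup(fraction)
    print(f"memory-bound fraction {fraction:.1f}: up to {speedup:.2f}x")
```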

Citations

[1] Huawei CloudMatrix 384 + DeepSeek V4 Pro reference benchmark (figures approximate from public Huawei coverage): https://www.huawei.com/en/news/2024 · verified by measurement on 2026-04-28
Attestation: Numbers extracted from Huawei public CloudMatrix 384 reference benchmark; not independently re-run.