DeepSeek V4 Flash on 16× MTT S4000 (Moore Threads KUAE)

Submitted by @evokernel-bot on 2026-04-23 · https://evokernel.dev/cases/case-dsv4flash-mtts4000x16-001/

Stack

Hardware: mtt-s4000 × 16 (2 nodes × 8 cards)
Server: moore-threads-kuae
Interconnect: intra-node mtlink · inter-node roce-v2
Model: deepseek-v4-flash (bf16)
Engine: vllm 0.6.0
Quantization: fp16
Parallelism: TP=8 · PP=2 · EP=1 · SP=1
Driver: MUSA 3.5 · KylinOS 10
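
The parallel layout listed above can be sanity-checked with a little arithmetic: the tensor-parallel size times the pipeline-parallel size should equal the number of cards, and with TP=8 on 8-card nodes each tensor-parallel group fits inside a single node. A minimal sketch in Python; the function name and the rank placement (ranks assigned node by node, TP as the innermost dimension) are assumptions for illustration and are not stated in the case.

```python
def check_layout(tp: int, pp: int, nodes: int, cards_per_node: int) -> dict:
    """Check that TP x PP matches the cluster size and whether TP groups stay on one node."""
    world_size = tp * pp
    total_cards = nodes * cards_per_node
    if world_size != total_cards:
        raise ValueError(f"TP*PP = {world_size}, but the cluster has {total_cards} cards")
    # Assumption: ranks fill one node before the next, with TP innermost,
    # so a TP group stays within a node when TP divides the cards per node.
    tp_within_node = cards_per_node % tp == 0
    return {"world_size": world_size, "tp_within_node": tp_within_node}


# Values from this case: TP=8, PP=2 on 2 nodes x 8 cards.
layout = check_layout(tp=8, pp=2, nodes=2, cards_per_node=8)
print(layout)
```

Under that placement assumption, tensor-parallel traffic would stay on the intra-node link and only the pipeline boundary would cross the inter-node network.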

Scenario

Prefill seq: 1024
Decode seq: 256
Batch:
Max concurrent:

Results

Decode tok/s: 320
Prefill tok/s: 5800
TTFT p50: 540
TBT p50:
Memory/card:
Power/card: 410
Compute util %:
Memory BW util %:
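
The reported figures can be normalized per card and per unit of power to make them easier to compare with other cases. A minimal sketch in Python, resting on two assumptions that the case does not state: that the tok/s figures are aggregate across all 16 cards, and that the power figure is in watts. The helper names are illustrative only.

```python
CARDS = 16
DECODE_TOK_S = 320     # reported decode throughput (assumed aggregate over all cards)
PREFILL_TOK_S = 5800   # reported prefill throughput (assumed aggregate over all cards)
POWER_PER_CARD = 410   # unit not stated in the case; assumed to be watts


def per_card(tok_s: float, cards: int = CARDS) -> float:
    """Throughput attributed to a single card."""
    return tok_s / cards


def tokens_per_joule(tok_s: float, power_per_card_w: float, cards: int = CARDS) -> float:
    """Tokens generated per joule, counting card power only."""
    return tok_s / (power_per_card_w * cards)


decode_per_card = per_card(DECODE_TOK_S)
prefill_per_card = per_card(PREFILL_TOK_S)
decode_tok_per_joule = tokens_per_joule(DECODE_TOK_S, POWER_PER_CARD)

print(f"decode per card:  {decode_per_card:.1f} tok/s")
print(f"prefill per card: {prefill_per_card:.1f} tok/s")
print(f"decode energy efficiency: {decode_tok_per_joule:.4f} tok/J")
```

If either assumption is wrong (for example, if the throughput is per request), the derived numbers do not apply.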

Same-model comparison

Throughput of this case vs. other cases for the same model

Bottleneck analysis: software

Compute 22% · Memory BW 56% · Other 22%

Reproduction steps

vllm serve --device musa --tensor-parallel-size 8 --pipeline-parallel-size 2 deepseek-ai/DeepSeek-V4-Flash

Benchmark tool: vllm benchmark_serving.py
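
Once the server started by the command above is running, a single request shaped like the scenario (up to 256 generated tokens) can be sent to its OpenAI-compatible completions endpoint as a smoke test. A minimal sketch in Python using only the standard library; the base URL (localhost on port 8000, the vLLM default) and the helper names are assumptions, and this is not a substitute for the benchmark tool.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"        # assumption: server on the default host and port
MODEL = "deepseek-ai/DeepSeek-V4-Flash"   # model name as passed to `vllm serve`


def build_completion_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 256) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("...")` only works while the server is reachable; `build_completion_request` can be inspected offline.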

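Pitfalls

The MUSA 3.5 port of vLLM does not yet support FP8, so the run falls back to FP16.
With EP > 1, performance gets worse rather than better (expert-routing communication cost is too high).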
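
The cost of the FP16 fallback can be estimated roughly: FP16 stores 2 bytes per weight against 1 byte for FP8, so the weight footprint per card doubles, and so does the weight traffic a decode step has to read. A minimal sketch in Python; the parameter count is a hypothetical placeholder (not the real size of the model), and the even split of weights across TP × PP cards is a simplification.

```python
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}


def weight_gib_per_card(num_params: float, dtype: str, tp: int, pp: int) -> float:
    """Approximate weight memory per card in GiB, assuming an even split over TP*PP cards."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / (tp * pp) / 2**30


HYPOTHETICAL_PARAMS = 100e9  # placeholder value for illustration only

fp16_gib = weight_gib_per_card(HYPOTHETICAL_PARAMS, "fp16", tp=8, pp=2)
fp8_gib = weight_gib_per_card(HYPOTHETICAL_PARAMS, "fp8", tp=8, pp=2)
ratio = fp16_gib / fp8_gib

print(f"FP16 needs {ratio:.1f}x the weight memory of FP8 per card")
```

The ratio is independent of the placeholder parameter count; only the absolute GiB values depend on it.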

Optimization patterns

moe-expert-routing-on-domestic

References

[1] Moore Threads KUAE community benchmark sharing · https://www.mthreads.com/ · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from Moore Threads community port testing; not independently re-run.