DeepSeek V4 Flash with disaggregated prefill (H100) + decode (H200) via Mooncake
Submitted by @evokernel-bot on 2026-04-27 · https://evokernel.dev/cases/case-dsv4flash-disagg-h100-h200-001/
Stack
Hardware
h200-sxm × 16 (2 nodes decode pool + 2 nodes prefill on H100 (16 cards each))
Server
—
Interconnect
intra: nvlink-4 · inter: InfiniBand-NDR
Model
deepseek-v4-flash (bf16)
Engine
sglang 0.4.0
Quantization
fp8-e4m3
Parallelism
TP=8 · PP=2 · EP=1 · SP=1 · disaggregated
Driver
CUDA 12.5
OS
Ubuntu 22.04
Scenario
Prefill seq
8192
Decode seq
1024
Batch
64
Max concurrent
256
Results
Decode tok/s
9600
Prefill tok/s
145000
TTFT p50 (ms)
320
TBT p50 (ms)
12
Memory/card (GB)
78
Power/card (W)
620
Compute util (%)
48
Memory BW util (%)
82
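As a rough sanity check on the reported results, the sketch below derives a few per-card and per-stream figures. It assumes that the reported decode tok/s is the aggregate over the 16 H200 decode cards and that TBT p50 approximates each stream's steady-state token interval. Neither assumption is stated in the case, and the function names are illustrative only.

```python
# Reported values copied from the case; everything derived here is an estimate.
DECODE_TOK_S = 9600   # decode throughput (ASSUMED to be pool-wide aggregate)
DECODE_CARDS = 16     # h200-sxm x 16
TTFT_P50_MS = 320     # time to first token, p50
TBT_P50_MS = 12       # time between tokens, p50
DECODE_SEQ = 1024     # decode tokens per request


def per_card_decode(tok_s: float, cards: int) -> float:
    """Average decode tokens/s per card, assuming an even split."""
    return tok_s / cards


def per_stream_rate(tbt_ms: float) -> float:
    """Tokens/s seen by one stream if every token takes one TBT."""
    return 1000.0 / tbt_ms


def implied_streams(tok_s: float, tbt_ms: float) -> float:
    """Concurrent streams needed to reach tok_s at the given TBT."""
    return tok_s / per_stream_rate(tbt_ms)


def request_latency_ms(ttft_ms: float, tbt_ms: float, decode_tokens: int) -> float:
    """First token arrives at TTFT; each remaining token adds one TBT."""
    return ttft_ms + (decode_tokens - 1) * tbt_ms


print(per_card_decode(DECODE_TOK_S, DECODE_CARDS))
print(implied_streams(DECODE_TOK_S, TBT_P50_MS))
print(request_latency_ms(TTFT_P50_MS, TBT_P50_MS, DECODE_SEQ))
```

Under these assumptions the implied number of in-flight decode streams (about 115) falls between the stated batch of 64 and the max concurrency of 256, which is at least consistent with the scenario settings.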
Cross-case comparison (same model)
Throughput of this case vs. other cases for the same model
Bottleneck analysis: memory-bandwidth
Compute 48% · Memory BW 82% · Other 0%
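The bottleneck label matches the resource with the highest reported utilization. A minimal sketch of that rule follows; the case does not say how the label is actually assigned, so the "pick the maximum" logic and the `classify_bottleneck` name are assumptions.

```python
def classify_bottleneck(util: dict[str, float]) -> str:
    """Return the resource with the highest utilization percentage.

    ASSUMPTION: the bottleneck is simply the most-utilized resource;
    the source page does not document its own classification rule.
    """
    return max(util, key=util.get)


# Utilization figures as reported in the case.
reported = {"compute": 48, "memory-bandwidth": 82, "other": 0}
print(classify_bottleneck(reported))
```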
Reproduction steps
sglang.launch_server --disaggregation prefill --tp 8 ...
Benchmark tool: sglang.bench_serving + Mooncake KV proxy
Pitfalls
- KV cache transfer between the prefill and decode pools requires InfiniBand RDMA; over TCP, TTFT rises 3x.
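To reason about why the transfer path matters for TTFT, one can estimate the wire time for a single request's KV cache. The sketch below is back-of-the-envelope only: the per-token KV size (64 KiB) is a hypothetical placeholder because the case gives no model dimensions, and the `efficiency` factor is an assumed knob for protocol overhead. The 400 Gb/s figure is the nominal per-port rate of InfiniBand NDR.

```python
def kv_transfer_ms(
    tokens: int,
    kv_bytes_per_token: int,
    link_gbit_s: float,
    efficiency: float = 1.0,
) -> float:
    """Estimated time in ms to move one request's KV cache over one link.

    kv_bytes_per_token is HYPOTHETICAL here; substitute the real value
    for the model and KV dtype. efficiency (0..1] scales the nominal
    link rate to an achieved rate and is also an assumption.
    """
    total_bits = tokens * kv_bytes_per_token * 8
    achieved_bits_per_s = link_gbit_s * 1e9 * efficiency
    return total_bits / achieved_bits_per_s * 1000.0


PREFILL_SEQ = 8192             # from the scenario
KV_BYTES_PER_TOKEN = 64 * 1024  # hypothetical placeholder, not from the case
NDR_GBIT_S = 400               # nominal InfiniBand NDR per-port rate

print(kv_transfer_ms(PREFILL_SEQ, KV_BYTES_PER_TOKEN, NDR_GBIT_S))
```

Lowering `efficiency` or `link_gbit_s` shows how quickly the transfer term grows when the path falls back from RDMA. The sketch does not attempt to reproduce the reported 3x figure, which also depends on CPU copies and queuing that are not modeled here.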
Optimization patterns
Citations
- [1] Mooncake disaggregated inference reference (figures approximate from paper): https://arxiv.org/abs/2401.0xx · Verified by measurement 2026-04-28
Disclaimer: Numbers extracted from Mooncake disaggregated inference paper; not independently re-run.