DeepSeek V4 Flash with disaggregated prefill (H100) + decode (H200) via Mooncake

Submitted by @evokernel-bot on 2026-04-27 · https://evokernel.dev/cases/case-dsv4flash-disagg-h100-h200-001/

Stack

Hardware

h200-sxm × 16 (decode pool, 2 nodes) + H100 × 16 (prefill pool, 2 nodes)

Server

—

Interconnect

intra: nvlink-4 · inter: InfiniBand-NDR

Model

deepseek-v4-flash (bf16)

Engine

sglang 0.4.0

Quantization

fp8-e4m3

Parallelism

TP=8 · PP=2 · EP=1 · SP=1 · disaggregated
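
As a quick sanity check on the parallelism line, the GPU count for one model replica is the product of the tensor-parallel and pipeline-parallel degrees. The sketch below assumes that EP=1 and SP=1 add no extra devices and that each pool hosts a single replica; the function name is illustrative and not part of any engine API.

```python
def gpus_per_replica(tp: int, pp: int) -> int:
    """GPUs needed for one model replica under tensor + pipeline parallelism."""
    return tp * pp


# Values taken from the parallelism line: TP=8, PP=2.
replica_gpus = gpus_per_replica(tp=8, pp=2)
print(replica_gpus)  # 16
```

Under these assumptions the result matches the 16 cards listed per pool in the hardware entry.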

Driver

CUDA 12.5

Ubuntu 22.04

Scenario

Prefill seq

8192

Decode seq

1024

Batch

Max concurrent

256

Results

Decode tok/s

9600

Prefill tok/s

145000

TTFT p50

320

TBT p50

Memory/card

Power/card

620

Compute util %

Memory BW util %
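
The headline throughput figures can be converted into rough per-request numbers. The sketch below assumes that the decode throughput is the aggregate over all 256 concurrent requests and that the prefill throughput is the aggregate for the prefill pool; it ignores queueing, batching effects, and KV transfer. These are assumptions for illustration, not measurements from this case.

```python
def per_stream_decode_rate(total_tok_s: float, concurrency: int) -> float:
    """Average decode rate seen by one request, assuming an even split."""
    return total_tok_s / concurrency


def implied_tbt_ms(per_stream_tok_s: float) -> float:
    """Average time between tokens (ms) implied by a per-stream rate."""
    return 1000.0 / per_stream_tok_s


def prefill_time_ms(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time (ms) to process one prompt at the given prefill throughput."""
    return 1000.0 * prompt_tokens / prefill_tok_s


# Figures from the scenario and results entries.
rate = per_stream_decode_rate(total_tok_s=9600, concurrency=256)
tbt = implied_tbt_ms(rate)
prefill = prefill_time_ms(prompt_tokens=8192, prefill_tok_s=145000)

print(f"{rate:.1f} tok/s per stream, ~{tbt:.1f} ms between tokens, "
      f"~{prefill:.1f} ms of prefill compute per 8192-token prompt")
```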

Comparison across cases for the same model

Throughput of this case vs. other cases for the same model

Bottleneck analysis: memory-bandwidth

Compute 48% · Memory BW 82% · Other 0%

Reproduction steps

sglang.launch_server --disaggregation prefill --tp 8 ...

Benchmark tool: sglang.bench_serving + Mooncake KV proxy
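
A benchmark invocation matching the scenario above might be assembled as follows. This is only a sketch: the flag names are assumed from recent sglang.bench_serving releases and should be checked against `python -m sglang.bench_serving --help` for the installed version. The host, port, and prompt count are hypothetical placeholders, and the Mooncake KV proxy setup is not shown.

```python
import shlex
from typing import List


def build_bench_command(host: str, port: int, input_len: int,
                        output_len: int, max_concurrency: int,
                        num_prompts: int) -> List[str]:
    """Build an argv list for sglang.bench_serving (flag names are assumed)."""
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--host", host,
        "--port", str(port),
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--max-concurrency", str(max_concurrency),
        "--num-prompts", str(num_prompts),
    ]


cmd = build_bench_command(
    host="127.0.0.1",     # hypothetical: address of the serving endpoint
    port=30000,           # hypothetical port
    input_len=8192,       # prefill sequence length from the scenario
    output_len=1024,      # decode sequence length from the scenario
    max_concurrency=256,  # max concurrent requests from the scenario
    num_prompts=1024,     # hypothetical: not stated in the source
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting issues.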

Pitfalls

KV cache transfer between the prefill and decode pools requires InfiniBand RDMA; over TCP, TTFT rises 3x.
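
To see why the transport matters, a back-of-the-envelope estimate of the per-request KV transfer time is the number of bytes moved divided by the effective link bandwidth. In the sketch below, the KV size per token and both effective bandwidths are hypothetical placeholders (the source gives neither), so the printed numbers are illustrative only and are not a derivation of the 3x figure.

```python
def kv_transfer_ms(prompt_tokens: int, kv_bytes_per_token: float,
                   link_gb_per_s: float) -> float:
    """Estimated time (ms) to move one request's KV cache across a link."""
    total_bytes = prompt_tokens * kv_bytes_per_token
    return 1000.0 * total_bytes / (link_gb_per_s * 1e9)


PROMPT_TOKENS = 8192             # prefill sequence length from the scenario
KV_BYTES_PER_TOKEN = 64 * 1024   # hypothetical: depends on the model config
RDMA_GB_PER_S = 40.0             # hypothetical effective RDMA bandwidth
TCP_GB_PER_S = 5.0               # hypothetical effective TCP bandwidth

rdma_ms = kv_transfer_ms(PROMPT_TOKENS, KV_BYTES_PER_TOKEN, RDMA_GB_PER_S)
tcp_ms = kv_transfer_ms(PROMPT_TOKENS, KV_BYTES_PER_TOKEN, TCP_GB_PER_S)

print(f"RDMA ~{rdma_ms:.1f} ms, TCP ~{tcp_ms:.1f} ms per request")
```

Because the transfer sits on the critical path between prefill and the first decoded token, any extra transfer time is added directly to TTFT.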

Optimization patterns

disaggregated-prefill-decode · memory-bound-decode-prefer-int8

References

[1] Mooncake disaggregated inference reference (figures approximate from paper): https://arxiv.org/abs/2401.0xx · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from the Mooncake disaggregated inference paper; not independently re-run.