DeepSeek V4 Flash on 8×H100 SXM with vLLM FP8
Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/cases/case-dsv4-flash-h100x8-vllm-fp8-001/
Stack
  Hardware: h100-sxm5 × 8 (single-node-hgx)
  Server: nvidia-hgx-h100
  Interconnect: intra: nvlink-4 · inter: none
  Model: deepseek-v4-flash (bf16)
  Engine: vllm 0.6.0
  Quantization: fp8-e4m3
  Parallelism: TP=8 · PP=1 · EP=1 · SP=1
  Driver: CUDA 12.5
  OS: Ubuntu 22.04
Scenario
  Prefill seq len: 2048
  Decode seq len: 512
  Batch size: 32
  Max concurrency: 128
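The scenario above pins down concurrency and sequence lengths, which is enough to sketch a KV-cache budget. A minimal sizing sketch in Python; the layer count, KV-head count, and head dimension below are placeholder assumptions, since this case report does not state DeepSeek V4 Flash's geometry:

```python
# Rough KV-cache sizing for the scenario above: up to 128 concurrent
# requests, each holding 2048 prefill + 512 decode tokens of KV state.
# Model dimensions are placeholders, NOT taken from this case report.

def kv_cache_bytes(num_seqs, tokens_per_seq, num_layers, num_kv_heads,
                   head_dim, bytes_per_elem):
    """Bytes of KV cache: 2 tensors (K and V) per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_seqs * tokens_per_seq * per_token

total = kv_cache_bytes(
    num_seqs=128,               # max concurrency from the scenario
    tokens_per_seq=2048 + 512,  # prefill + decode tokens
    num_layers=32,              # assumed
    num_kv_heads=8,             # assumed (GQA)
    head_dim=128,               # assumed
    bytes_per_elem=1,           # assumed FP8 KV cache
)
print(f"{total / 2**30:.1f} GiB of KV cache across the node")  # 20.0 GiB
```

With these placeholder dimensions the worst-case KV footprint is 20 GiB node-wide, i.e. ~2.5 GiB per card under TP=8, which would fit comfortably inside the 38 GB/card reported below; plug in the real model geometry to get an actual number.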
Results
  Decode throughput: 4200 tok/s
  Prefill throughput: 38000 tok/s
  TTFT p50: 220 ms
  TBT p50: 14 ms
  Memory/card: 38 GB
  Power/card: 640 W
  Compute util: 55%
  Memory BW util: 72%
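A quick consistency check on the decode figures: aggregate throughput divided by batch size gives an ideal per-sequence inter-token time, which should sit below the measured TBT p50 (the measured value also absorbs scheduling, sampling, and queuing overheads). A sketch assuming the effective running batch averages the configured 32:

```python
# Derive the ideal inter-token time implied by aggregate decode
# throughput and batch size, and compare it to the measured TBT p50.

decode_tok_s = 4200   # aggregate decode throughput, from Results
batch = 32            # batch size, from Scenario
tbt_p50_ms = 14       # measured TBT p50, from Results

per_seq_tok_s = decode_tok_s / batch    # ~131 tok/s per sequence
implied_tbt_ms = 1000 / per_seq_tok_s   # ideal ms between tokens
print(f"implied TBT floor: {implied_tbt_ms:.1f} ms "
      f"vs measured {tbt_p50_ms} ms")   # ~7.6 ms vs 14 ms
```

The ~7.6 ms floor vs the measured 14 ms leaves roughly half the inter-token time unaccounted for; with max concurrency at 128, a larger effective running batch than 32 would also widen this gap.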
Same-model comparison
  [Chart: throughput of this case vs other cases running the same model]
Bottleneck analysis: memory-bandwidth
  [Chart: Compute 55% · Memory BW 72% · Other 0%]
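The memory-bandwidth verdict follows directly from the utilization split (72% memory BW vs 55% compute). A minimal sketch of that classification, plus the absolute bandwidth it implies given the H100 SXM5's ~3.35 TB/s peak HBM3 bandwidth (spec value, not measured in this case):

```python
# Classify the bottleneck as the resource with the highest utilization,
# then convert the memory-BW utilization into an absolute figure using
# the H100 SXM5 peak HBM3 bandwidth from NVIDIA's spec sheet.

H100_SXM5_HBM_TBPS = 3.35  # peak HBM3 bandwidth per GPU, spec value

def classify(compute_util, mem_bw_util, other_util=0.0):
    """Name the resource with the highest utilization."""
    utils = {
        "compute": compute_util,
        "memory-bandwidth": mem_bw_util,
        "other": other_util,
    }
    return max(utils, key=utils.get)

print(classify(0.55, 0.72))  # memory-bandwidth
print(f"~{0.72 * H100_SXM5_HBM_TBPS:.2f} TB/s achieved per GPU")
```

This is only a first-order heuristic: 72% of peak HBM bandwidth sustained during decode is consistent with the weight-streaming-dominated profile typical of batched decode under TP.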
Reproduction steps
  vllm serve deepseek-ai/DeepSeek-V4-Flash --tensor-parallel-size 8 --quantization fp8
  Benchmark tool: vllm benchmark_serving.py
Pitfalls
  - FP8 calibration required ~30 minutes on first start
Optimization patterns
Citations
  [1] DeepSeek V4 release benchmark notes; figures approximate — https://api-docs.deepseek.com/news/news260424 · 2026-04-28
Verification statement: Numbers derived from DeepSeek V4 launch material; not independently re-run by submitter.