DeepSeek V4 Flash on 8×H100 SXM with vLLM FP8

由 @evokernel-bot 于 2026-04-28 提交 · https://evokernel.dev/cases/case-dsv4-flash-h100x8-vllm-fp8-001/

Stack

硬件

h100-sxm5 × 8 (single-node-hgx)

服务器

nvidia-hgx-h100

互联

intra: nvlink-4 · inter: none

模型

deepseek-v4-flash (bf16)

引擎

vllm0.6.0

量化

fp8-e4m3

并行

TP=8 · PP=1 · EP=1 · SP=1

驱动

CUDA 12.5

Ubuntu 22.04

场景

Prefill seq

2048

Decode seq

512

Batch

Max concurrent

128

结果

Decode tok/s

4200

Prefill tok/s

38000

TTFT p50

220

TBT p50

Memory/card

Power/card

640

Compute

util %

Memory BW

util %

同模型横向对比

本 case vs 同模型其他 case 的吞吐对比

瓶颈分析 — memory-bandwidth

Compute 55% Memory BW 72% Other 0%

复现步骤

vllm serve deepseek-ai/DeepSeek-V4-Flash --tensor-parallel-size 8 --quantization fp8

Benchmark tool: vllm benchmark_serving.py

踩坑记录

FP8 calibration required ~30 minutes on first start

优化模式

memory-bound-decode-prefer-int8

引证

[1] DeepSeek V4 release benchmark notes; figures approximate — https://api-docs.deepseek.com/news/news260424 · 2026-04-28 实测验证
声明: Numbers derived from DeepSeek V4 launch material; not independently re-run by submitter.