Llama 4 Scout on 8×H100 SXM with vLLM (public benchmark)
Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/cases/case-llama4-scout-h100x8-vllm-001/
Stack
Hardware: h100-sxm5 × 8 (single-node-hgx)
Server: nvidia-hgx-h100
Interconnect: intra: nvlink-4 · inter: none
Model: llama-4-scout (bf16)
Engine: vLLM 0.6.0
Quantization: none (weights in bf16)
Parallelism: TP=8 · PP=1 · EP=1 · SP=1
Driver: CUDA 12.4
OS: Ubuntu 22.04
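
As a sketch of how this stack maps onto vLLM's offline Python API (the author's illustration, not the submitter's script; the model ID is taken from the reproduction steps further down):

# Minimal sketch: the Stack block expressed via vLLM's Python API.
# Assumes vLLM 0.6.0 on one 8x H100 SXM node; TP=8 with PP/EP/SP left
# at their defaults of 1, matching the Parallelism row.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-4-Scout",  # bf16 checkpoint, no quantization
    dtype="bfloat16",
    tensor_parallel_size=8,            # TP=8 across the NVLink-connected GPUs
    max_model_len=16384,               # mirrors --max-model-len in the repro command
)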
Scenario
Prefill seq: 1024
Decode seq: 256
Batch: 16
Max concurrent: 64
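
One batch of this request shape can be approximated offline as below. This is a hedged sketch: the placeholder string stands in for the ~1024-token ShareGPT prompts, and ignore_eos forces the full decode length, as serving benchmarks typically do.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-4-Scout", dtype="bfloat16",
          tensor_parallel_size=8, max_model_len=16384)

# Decode seq = 256: cap generation and ignore EOS so every request
# decodes the full 256 tokens.
params = SamplingParams(max_tokens=256, ignore_eos=True, temperature=0.0)

# Batch = 16: one batch of requests; real prompts would be ~1024 tokens.
prompts = ["<placeholder for a ~1024-token prompt>"] * 16
outputs = llm.generate(prompts, params)

Note this offline path does not exercise the max-concurrency-64 serving behaviour; that comes from the benchmark client in the reproduction steps.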
Results
Decode throughput: 1850 tok/s
Prefill throughput: 26000 tok/s
TTFT p50: 145 ms
TBT p50: 18 ms
Memory/card: 28 GB
Power/card: 580 W
Compute util: 48%
Memory BW util: 62%
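
A back-of-the-envelope check on these figures (hedged: the ~109B total-parameter count for Llama 4 Scout is the author's assumption, not stated on this page):

# Sanity checks derived from the Results block above.
# Assumption (not from the case page): Llama 4 Scout has ~109e9 total
# params, stored in bf16 (2 bytes/param), sharded across TP=8.
total_params = 109e9
weights_per_gpu_gb = total_params * 2 / 8 / 1e9
print(f"weights/GPU  ~ {weights_per_gpu_gb:.1f} GB")    # ~27.3 GB vs 28 GB/card reported

decode_tps, gpus = 1850, 8
print(f"decode/GPU   ~ {decode_tps / gpus:.0f} tok/s")  # ~231 tok/s per GPU

tbt_p50_s = 0.018
print(f"per-seq rate ~ {1 / tbt_p50_s:.0f} tok/s")      # ~56 tok/s per sequence at TBT p50

The weights-only estimate landing within a gigabyte of the 28 GB/card reading suggests the reported memory figure may exclude vLLM's preallocated KV-cache pool.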
Same-model comparison
[Chart: throughput of this case vs. other cases on the same model]
Bottleneck analysis — memory-bandwidth
Compute 48% · Memory BW 62% · Other 0%
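
A rough roofline estimate shows why decode on this stack lands bandwidth-bound. The H100 peak numbers and the ~17B active-parameter count below are the author's assumptions, not figures from the case page:

# Roofline sketch for the memory-bandwidth verdict.
# Assumptions (not from the case page): H100 SXM ~989 TFLOPS dense bf16,
# ~3.35 TB/s HBM3; Llama 4 Scout activates ~17B params/token (MoE).
peak_flops = 989e12   # FLOP/s per GPU, dense bf16
peak_bw = 3.35e12     # bytes/s per GPU
ridge = peak_flops / peak_bw
print(f"ridge point      ~ {ridge:.0f} FLOP/byte")  # ~295

# Each decode step streams the active weights once (2 bytes each) and
# spends ~2 FLOPs per active weight per sequence in the batch.
active_params, batch = 17e9, 16
intensity = (2 * active_params * batch) / (2 * active_params)  # = batch
print(f"decode intensity ~ {intensity:.0f} FLOP/byte")         # ~16
# KV-cache reads add further bytes per step, pushing intensity lower still.

At roughly 16 FLOP/byte against a ~295 FLOP/byte ridge point, decode sits deep in the bandwidth-bound region, consistent with the 62% memory-BW versus 48% compute utilization split.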
Reproduction steps
vllm serve meta-llama/Llama-4-Scout --tensor-parallel-size 8 --max-model-len 16384
Benchmark tool: vLLM benchmark_serving.py + ShareGPT
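
Once the server is up, a single request against vLLM's OpenAI-compatible endpoint is a quick smoke test before the full benchmark run (assuming the default localhost:8000 bind; this is the author's addition, not one of the submitter's steps):

# Smoke test for the server started by the command above.
# Assumes vLLM's default OpenAI-compatible endpoint at localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-4-Scout",
        "prompt": "Explain tensor parallelism in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])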
Optimization patterns
Citations
[1] vLLM official Llama 4 Scout benchmark notes (figures approximate from blog) — https://blog.vllm.ai/ · 2026-04-28
Verification disclaimer: numbers extracted from the public vLLM benchmark blog; not independently re-run by the submitter.