Llama 4 Scout on 8×H100 SXM with vLLM (public benchmark)

Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/en/cases/case-llama4-scout-h100x8-vllm-001/

Stack

Hardware

h100-sxm5 × 8 (single-node-hgx)

Server

nvidia-hgx-h100

Interconnect

intra: nvlink-4 · inter: none

Model

llama-4-scout (bf16)

Engine

vllm0.6.0

Quantization

bf16

Parallel

TP=8 · PP=1 · EP=1 · SP=1

Driver

CUDA 12.4

Ubuntu 22.04

Scenario

Prefill seq

1024

Decode seq

256

Batch

Max concurrent

Results

Decode tok/s

1850

Prefill tok/s

26000

TTFT p50

145

TBT p50

Memory/card

Power/card

580

Compute

util %

Memory BW

util %

Same-model side-by-side

本 case vs 同模型其他 case 的吞吐对比

Bottleneck — memory-bandwidth

Compute 48% Memory BW 62% Other 0%

Reproduction

vllm serve meta-llama/Llama-4-Scout --tensor-parallel-size 8 --max-model-len 16384

Benchmark tool: vllm benchmark_serving.py + sharegpt

Optimization patterns

memory-bound-decode-prefer-int8

Citations

[1] vLLM official Llama 4 Scout benchmark notes (figures approximate from blog) — https://blog.vllm.ai/ · 2026-04-28 实测验证
Attestation: Numbers extracted from public vLLM benchmark blog; not independently re-run by submitter.