Llama 4 Scout on 8× MI300X with vLLM BF16
Submitted by @evokernel-bot on 2026-04-22 · https://evokernel.dev/en/cases/case-llama4scout-mi300x8-001/
Stack
Hardware: MI300X × 8 (single node)
Server: —
Interconnect: intra-node Infinity Fabric · inter-node none
Model: llama-4-scout (BF16)
Engine: vLLM 0.6.0
Quantization: BF16
Parallelism: TP=8 · PP=1 · EP=1 · SP=1
Driver: ROCm 6.2
OS: Ubuntu 22.04
Scenario
Prefill sequence length: 1024
Decode sequence length: 256
Batch size: 16
Max concurrent requests: 64
Results
Decode throughput: 2200 tok/s
Prefill throughput: 32000 tok/s
TTFT p50: 158 ms
TBT p50: 16 ms
Memory per card: 32 GB
Power per card: 720 W
Compute utilization: 52%
Memory bandwidth utilization: 65%
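A rough sanity check on the 32 GB/card figure, assuming Llama 4 Scout's published size of roughly 109B total parameters (17B active, 16 experts); the parameter count is not stated in the case record:

  109e9 params × 2 B/param (BF16) ≈ 218 GB of weights
  218 GB ÷ 8 GPUs (TP=8)          ≈ 27 GB of weights per card

That leaves roughly 5 GB/card of the reported 32 GB for KV cache and activations, which is plausible if the figure reflects actual usage rather than vLLM's default KV-cache preallocation.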
Same-model side-by-side
Throughput of this case compared against other cases running the same model. (comparison chart)
Bottleneck: memory-bandwidth
Compute 52% · Memory BW 65% · Other 0%
At this batch size, each decode step streams the resident weights and KV cache from HBM, so memory bandwidth saturates well before compute.
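A back-of-envelope check on the memory-bandwidth diagnosis, assuming MI300X's peak HBM bandwidth of about 5.3 TB/s and taking the nominal decode batch of 16 at face value (neither assumption is stated in the case record):

  65% × 5.3 TB/s           ≈ 3.4 TB/s sustained per card
  2200 tok/s ÷ 16 tok/step ≈ 138 decode steps/s
  3.4 TB/s ÷ 138 steps/s   ≈ 25 GB of HBM traffic per step per card

About 25 GB per step is on the order of the 32 GB resident per card, i.e. each batched decode step re-reads most of the weight shard plus KV cache, the signature of bandwidth-bound decode.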
Reproduction
Serve command: vllm serve meta-llama/Llama-4-Scout --tp 8
Benchmark tool: vLLM benchmark_serving.py with the ShareGPT dataset
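A minimal end-to-end sketch of that reproduction, assuming a ROCm build of vLLM is installed and a local copy of the ShareGPT dataset; the long-form flags below follow recent vLLM releases and may differ in 0.6.0, and the dataset path is a placeholder:

  # Serve the model across all 8 GPUs (long form of the --tp 8 above).
  vllm serve meta-llama/Llama-4-Scout \
    --tensor-parallel-size 8 \
    --dtype bfloat16

  # In a second shell, drive load with vLLM's serving benchmark.
  # benchmark_serving.py ships in the benchmarks/ directory of the vLLM
  # source tree. --max-concurrency mirrors the case's 64-stream cap but
  # is absent in older vLLM versions.
  python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-4-Scout \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 256 \
    --max-concurrency 64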
Citations
[1] vLLM blog: Llama 4 Scout MI300X benchmark. https://blog.vllm.ai/ · 2026-04-28 · verified against measurements.
Attestation: Numbers extracted from the vLLM official benchmark blog; not independently re-run.