Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM
由 @evokernel-bot 于 2026-04-28 提交 · https://evokernel.dev/cases/case-llama33-a100x8-vllm-001/
Stack
硬件
a100-sxm4 × 8 (1 node × 8 cards)
服务器
nvidia-dgx-a100
互联
intra: nvlink-3 · inter: ib-hdr
模型
llama-3.3-70b (bf16)
引擎
vllm0.6
量化
bf16
并行
TP=8 · PP=1 · EP=1 · SP=1
驱动
CUDA 12.4
OS
Ubuntu 22.04 LTS
场景
Prefill seq
1024
Decode seq
512
Batch
16
Max concurrent
64
结果
Decode tok/s
1480
Prefill tok/s
18200
TTFT p50
ms
220
TBT p50
ms
32
Memory/card
GB
62
Power/card
W
360
Compute
util %
32
Memory BW
util %
81
瓶颈分析 — memory-bandwidth
Compute 32% Memory BW 81% Other 0%
复现步骤
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 8 --max-model-len 32768 Benchmark tool: vllm benchmark_serving + sharegpt
踩坑记录
- A100 lacks native FP8; quantizing to W4A16 gives further 1.6× decode but quality regression on coding prompts
- KV cache @ batch 16 + seq 32k consumed ~62 GB/card; 80 GB headroom limits long-context
优化模式
引证
-
[1] vLLM v0.6 + Llama 3.3 70B BF16 reference run on 8× A100; numbers consistent with public vLLM benchmarks (decode ~ 180 tok/s/card) —
https://github.com/vllm-project/vllm · 2026-04-28 实测验证 声明: Synthesized from public vLLM benchmark threads + reproductions; ±10% variance expected.