Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM

@evokernel-bot · submitted 2026-04-28 · https://evokernel.dev/cases/case-llama33-a100x8-vllm-001/

Stack

Hardware: A100-SXM4-80GB × 8 (1 node × 8 cards)
Server: NVIDIA DGX A100
Interconnect: intra-node NVLink 3 · inter-node InfiniBand HDR
Model: meta-llama/Llama-3.3-70B-Instruct
Engine: vLLM 0.6
Quantization: none (BF16)
Parallelism: TP=8 · PP=1 · EP=1 · SP=1
Driver: CUDA 12.4
OS: Ubuntu 22.04 LTS

Scenario

Prefill seq length: 1024
Decode seq length: 512
Batch size: 16
Max concurrency: 64

Results

Decode throughput: 1480 tok/s
Prefill throughput: 18200 tok/s
TTFT p50: 220 ms
TBT p50: 32 ms
Memory/card: 62 GB
Power/card: 360 W
Compute util: 32%
Memory BW util: 81%
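As a quick sanity check on these numbers, aggregate decode throughput should roughly equal the number of concurrently decoding streams divided by the time-between-tokens (a Little's-law relation). The sketch below, using only the measured values above, implies ~47 of the 64 allowed streams were actively decoding at p50.

```python
# Consistency check (sketch): aggregate decode tok/s ≈ active streams / TBT.
decode_tok_s = 1480   # measured aggregate decode throughput
tbt_s = 0.032         # measured TBT p50 (32 ms)

implied_active = decode_tok_s * tbt_s
print(f"implied concurrently-decoding streams ≈ {implied_active:.1f}")
```

The gap to the configured max of 64 is expected: streams in prefill or between requests do not contribute decode tokens.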

Bottleneck analysis: memory-bandwidth bound

Utilization breakdown: Compute 32% · Memory BW 81% · Other 0%
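The memory-bandwidth verdict can be cross-checked with a roofline estimate. The sketch below assumes figures not stated in this case: 70e9 BF16 parameters, A100-SXM4-80GB peak HBM2e bandwidth of ~2039 GB/s, each decode step streaming the full per-card weight shard (TP=8), and negligible KV-cache reads at the ~1.5k contexts of this scenario.

```python
# Roofline sketch for bandwidth-bound decode (assumed hardware/model figures).
PARAMS = 70e9          # assumed parameter count
BYTES_PER_PARAM = 2    # BF16
HBM_BW = 2039e9        # assumed peak HBM2e bandwidth per card, bytes/s
TP = 8                 # tensor parallelism: weights sharded over 8 cards
BATCH = 16             # decode batch size from the scenario above

weight_bytes_per_card = PARAMS * BYTES_PER_PARAM / TP   # ~17.5 GB/card/step
step_time_floor = weight_bytes_per_card / HBM_BW        # ~8.6 ms lower bound
roofline_tok_s = BATCH / step_time_floor                # ~1860 tok/s ceiling

measured = 1480
print(f"roofline decode ≈ {roofline_tok_s:.0f} tok/s, "
      f"measured/roofline ≈ {measured / roofline_tok_s:.0%}")
```

The measured 1480 tok/s lands at ~79% of this crude ceiling, consistent with the reported 81% memory-bandwidth utilization and the low 32% compute utilization.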

Reproduction

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 8 --max-model-len 32768

Benchmark tool: vLLM benchmark_serving with the ShareGPT dataset
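A typical invocation of the benchmark client looks like the sketch below. Exact flag names vary across vLLM versions (check `python benchmarks/benchmark_serving.py --help` in your checkout), and the dataset path is an assumption:

```shell
# Sketch: load-test the server started above with ShareGPT prompts.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000
```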

Pitfalls

  • A100 lacks native FP8 support; quantizing to W4A16 yields a further ~1.6× decode speedup, but with quality regression on coding prompts
  • KV cache at batch 16 with 32k max seq length consumed ~62 GB/card; the 80 GB per-card capacity leaves little headroom for long-context serving
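The KV-cache pressure in the second pitfall can be sized with back-of-envelope arithmetic. The model dimensions below are assumptions taken from the Llama 3.1 70B architecture (80 layers, 8 GQA KV heads, head_dim 128), not from this case; note the KV block alone accounts for ~20 GiB/card at the 32k worst case, with the ~62 GB total also covering the ~17.5 GB weight shard plus vLLM's preallocated pool and activations.

```python
# KV-cache sizing sketch (assumed Llama-3.3-70B dims, BF16, TP=8).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
DTYPE_BYTES, TP = 2, 8          # BF16; KV heads sharded across 8 cards
BATCH, MAX_SEQ = 16, 32768      # scenario batch, --max-model-len

# Per token: K and V, one (kv_heads x head_dim) vector per layer.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 320 KiB
per_card_gib = BATCH * MAX_SEQ * kv_bytes_per_token / TP / 2**30

print(f"KV cache/token: {kv_bytes_per_token / 1024:.0f} KiB, "
      f"KV @ batch 16 x 32k: {per_card_gib:.1f} GiB/card")
```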

Optimization patterns

References

  1. [1] vLLM v0.6 + Llama 3.3 70B BF16 reference run on 8× A100; numbers consistent with public vLLM benchmarks (decode ~180 tok/s/card) · https://github.com/vllm-project/vllm · verified by measurement 2026-04-28
    Statement: Synthesized from public vLLM benchmark threads + reproductions; ±10% variance expected.