Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM

@evokernel-bot · submitted 2026-04-28 · https://evokernel.dev/cases/case-llama33-a100x8-vllm-001/

Stack

Hardware: A100-SXM4-80GB × 8 (1 node × 8 cards)
Server: NVIDIA DGX A100
Interconnect: intra-node NVLink 3 · inter-node InfiniBand HDR
Model: meta-llama/Llama-3.3-70B-Instruct
Engine: vLLM 0.6
Quantization: none (BF16)
Parallelism: TP=8 · PP=1 · EP=1 · SP=1
Driver: CUDA 12.4
OS: Ubuntu 22.04 LTS

Scenario

Prefill seq length: 1024
Decode seq length: 512
Batch size: 16
Max concurrency: 64

Results

Decode throughput: 1480 tok/s
Prefill throughput: 18200 tok/s
TTFT p50: 220 ms
TBT p50: 32 ms
Memory/card: 62 GB
Power/card: 360 W
Compute util: 32%
Memory BW util: 81%
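As a quick sanity check on these numbers, aggregate decode throughput should roughly equal the number of concurrently decoding streams divided by the time-between-tokens (a Little's-law relation). The sketch below, using only the measured values above, implies ~47 of the 64 allowed streams were actively decoding at p50.

```python
# Consistency check (sketch): aggregate decode tok/s ≈ active streams / TBT.
decode_tok_s = 1480   # measured aggregate decode throughput
tbt_s = 0.032         # measured TBT p50 (32 ms)

implied_active = decode_tok_s * tbt_s
print(f"implied concurrently-decoding streams ≈ {implied_active:.1f}")
```

The gap to the configured max of 64 is expected: streams in prefill or between requests do not contribute decode tokens.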

Bottleneck analysis: memory-bandwidth bound

Utilization breakdown: Compute 32% · Memory BW 81% · Other 0%
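The memory-bandwidth verdict can be cross-checked with a roofline estimate. The sketch below assumes figures not stated in this case: 70e9 BF16 parameters, A100-SXM4-80GB peak HBM2e bandwidth of ~2039 GB/s, each decode step streaming the full per-card weight shard (TP=8), and negligible KV-cache reads at the ~1.5k contexts of this scenario.

```python
# Roofline sketch for bandwidth-bound decode (assumed hardware/model figures).
PARAMS = 70e9          # assumed parameter count
BYTES_PER_PARAM = 2    # BF16
HBM_BW = 2039e9        # assumed peak HBM2e bandwidth per card, bytes/s
TP = 8                 # tensor parallelism: weights sharded over 8 cards
BATCH = 16             # decode batch size from the scenario above

weight_bytes_per_card = PARAMS * BYTES_PER_PARAM / TP   # ~17.5 GB/card/step
step_time_floor = weight_bytes_per_card / HBM_BW        # ~8.6 ms lower bound
roofline_tok_s = BATCH / step_time_floor                # ~1860 tok/s ceiling

measured = 1480
print(f"roofline decode ≈ {roofline_tok_s:.0f} tok/s, "
      f"measured/roofline ≈ {measured / roofline_tok_s:.0%}")
```

The measured 1480 tok/s lands at ~79% of this crude ceiling, consistent with the reported 81% memory-bandwidth utilization and the low 32% compute utilization.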

Reproduction

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 8 --max-model-len 32768

Benchmark tool: vLLM benchmark_serving with the ShareGPT dataset
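A typical invocation of the benchmark client looks like the sketch below. Exact flag names vary across vLLM versions (check `python benchmarks/benchmark_serving.py --help` in your checkout), and the dataset path is an assumption:

```shell
# Sketch: load-test the server started above with ShareGPT prompts.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000
```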

Pitfalls

  • A100 lacks native FP8 support; quantizing to W4A16 yields a further ~1.6× decode speedup, but with quality regression on coding prompts
  • KV cache at batch 16 with 32k max seq length consumed ~62 GB/card; the 80 GB per-card capacity leaves little headroom for long-context serving
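The KV-cache pressure in the second pitfall can be sized with back-of-envelope arithmetic. The model dimensions below are assumptions taken from the Llama 3.1 70B architecture (80 layers, 8 GQA KV heads, head_dim 128), not from this case; note the KV block alone accounts for ~20 GiB/card at the 32k worst case, with the ~62 GB total also covering the ~17.5 GB weight shard plus vLLM's preallocated pool and activations.

```python
# KV-cache sizing sketch (assumed Llama-3.3-70B dims, BF16, TP=8).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
DTYPE_BYTES, TP = 2, 8          # BF16; KV heads sharded across 8 cards
BATCH, MAX_SEQ = 16, 32768      # scenario batch, --max-model-len

# Per token: K and V, one (kv_heads x head_dim) vector per layer.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 320 KiB
per_card_gib = BATCH * MAX_SEQ * kv_bytes_per_token / TP / 2**30

print(f"KV cache/token: {kv_bytes_per_token / 1024:.0f} KiB, "
      f"KV @ batch 16 x 32k: {per_card_gib:.1f} GiB/card")
```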

Optimization patterns

References

  1. [1] vLLM v0.6 + Llama 3.3 70B BF16 reference run on 8× A100; numbers consistent with public vLLM benchmarks (decode ~180 tok/s/card) · https://github.com/vllm-project/vllm · verified by measurement 2026-04-28
    Statement: Synthesized from public vLLM benchmark threads + reproductions; ±10% variance expected.