Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM
Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/en/cases/case-llama33-a100x8-vllm-001/
Stack
Hardware
a100-sxm4 × 8 (1 node × 8 cards)
Server
nvidia-dgx-a100
Interconnect
intra: nvlink-3 · inter: ib-hdr
Model
llama-3.3-70b (bf16)
Engine
vllm0.6
Quantization
bf16
Parallel
TP=8 · PP=1 · EP=1 · SP=1
Driver
CUDA 12.4
OS
Ubuntu 22.04 LTS
Scenario
Prefill seq
1024
Decode seq
512
Batch
16
Max concurrent
64
Results
Decode tok/s
1480
Prefill tok/s
18200
TTFT p50
ms
220
TBT p50
ms
32
Memory/card
GB
62
Power/card
W
360
Compute
util %
32
Memory BW
util %
81
Bottleneck — memory-bandwidth
Compute 32% Memory BW 81% Other 0%
Reproduction
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 8 --max-model-len 32768 Benchmark tool: vllm benchmark_serving + sharegpt
Issues encountered
- A100 lacks native FP8; quantizing to W4A16 gives further 1.6× decode but quality regression on coding prompts
- KV cache @ batch 16 + seq 32k consumed ~62 GB/card; 80 GB headroom limits long-context
Optimization patterns
Citations
-
[1] vLLM v0.6 + Llama 3.3 70B BF16 reference run on 8× A100; numbers consistent with public vLLM benchmarks (decode ~ 180 tok/s/card) —
https://github.com/vllm-project/vllm · 2026-04-28 实测验证 Attestation: Synthesized from public vLLM benchmark threads + reproductions; ±10% variance expected.