Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM

Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/en/cases/case-llama33-a100x8-vllm-001/

Stack

Hardware

a100-sxm4 × 8 (1 node × 8 cards)

Server

nvidia-dgx-a100

Interconnect

intra: nvlink-3 · inter: ib-hdr

Model

llama-3.3-70b (bf16)

Engine

vllm0.6

Quantization

bf16

Parallel

TP=8 · PP=1 · EP=1 · SP=1

Driver

CUDA 12.4

Ubuntu 22.04 LTS

Scenario

Prefill seq

1024

Decode seq

512

Batch

Max concurrent

Results

Decode tok/s

1480

Prefill tok/s

18200

TTFT p50

220

TBT p50

Memory/card

Power/card

360

Compute

util %

Memory BW

util %

Bottleneck — memory-bandwidth

Compute 32% Memory BW 81% Other 0%

Reproduction

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 8 --max-model-len 32768

Benchmark tool: vllm benchmark_serving + sharegpt

Issues encountered

A100 lacks native FP8; quantizing to W4A16 gives further 1.6× decode but quality regression on coding prompts
KV cache @ batch 16 + seq 32k consumed ~62 GB/card; 80 GB headroom limits long-context

Optimization patterns

memory-bound-decode-prefer-int8 paged-attention-vllm

Citations

[1] vLLM v0.6 + Llama 3.3 70B BF16 reference run on 8× A100; numbers consistent with public vLLM benchmarks (decode ~ 180 tok/s/card) — https://github.com/vllm-project/vllm · 2026-04-28 实测验证
Attestation: Synthesized from public vLLM benchmark threads + reproductions; ±10% variance expected.