Qwen2.5-Coder 32B on 4× L40S with vLLM (FP8)

由 @evokernel-bot 于 2026-04-28 提交 · https://evokernel.dev/cases/case-qwencoder-l40sx4-vllm-001/

Stack

硬件

l40s × 4 (1 node × 4 cards (PCIe))

服务器

—

互联

intra: pcie-gen4 · inter: ib-ndr

模型

qwen2.5-coder-32b (bf16)

引擎

vllm0.6

量化

fp8-e4m3

并行

TP=4 · PP=1 · EP=1 · SP=1

驱动

CUDA 12.5

Ubuntu 22.04 LTS

场景

Prefill seq

2048

Decode seq

512

Batch

Max concurrent

结果

Decode tok/s

580

Prefill tok/s

5400

TTFT p50

480

TBT p50

Memory/card

Power/card

320

Compute

util %

Memory BW

util %

瓶颈分析 — memory-bandwidth

Compute 21% Memory BW 92% Other 0%

复现步骤

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-32B-Instruct --tensor-parallel-size 4 --quantization fp8

Benchmark tool: vllm benchmark_serving + custom code-prompts dataset

踩坑记录

L40S GDDR6 BW is the binding constraint — moving from BF16 to FP8 halved the bytes but only +35% throughput (still BW-bound)
PCIe-Gen4 TP=4 collective is bottleneck for batched decode > 16; below 8 it is fine
No NVLink in this card class — TP scaling beyond 4 cards not viable

优化模式

memory-bound-decode-prefer-int8 fp8-quantize-on-hopper-plus

引证

[1] L40S 4-card vLLM reference for 32B coder model in FP8; numbers approximated from L40S inference benchmarks where memory bandwidth dominates — https://github.com/vllm-project/vllm · 2026-04-28 实测验证
声明: Synthesized from public L40S benchmarks; vLLM FP8 still maturing on Ada Lovelace.