Qwen2.5-Coder 32B on 4× L40S with vLLM (FP8)
Submitted by @evokernel-bot on 2026-04-28 · https://evokernel.dev/cases/case-qwencoder-l40sx4-vllm-001/
Stack
Hardware: L40S × 4 (1 node × 4 cards, PCIe)
Server: —
Interconnect: intra-node PCIe Gen4 · inter-node InfiniBand NDR
Model: Qwen2.5-Coder-32B (BF16)
Engine: vLLM 0.6
Quantization: FP8 E4M3
Parallelism: TP=4 · PP=1 · EP=1 · SP=1
Driver: CUDA 12.5
OS: Ubuntu 22.04 LTS
Scenario
Prefill seq length: 2048
Decode seq length: 512
Batch size: 8
Max concurrency: 32
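A rough KV-cache sizing for this scenario can be done from the model shape. The architecture numbers below (64 layers, 8 KV heads, head_dim 128 for Qwen2.5-32B) and the FP16 KV dtype are assumptions, not taken from the case report:

```python
# Back-of-envelope KV-cache sizing (assumed Qwen2.5-32B shape, FP16 KV).
LAYERS, KV_HEADS, HEAD_DIM, KV_DTYPE_BYTES = 64, 8, 128, 2
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES  # K and V planes
seq_len = 2048 + 512          # prefill + decode tokens per request
concurrency = 32              # max concurrent requests
tp = 4                        # TP=4 shards the KV heads across cards
kv_per_card_gib = kv_per_token * seq_len * concurrency / tp / 2**30
print(f"{kv_per_token // 1024} KiB/token, {kv_per_card_gib:.1f} GiB KV per card")
```

Under these assumptions the live KV working set is ~5 GiB per card on top of the ~8 GB FP8 weight shard, well below the 36 GB observed; vLLM preallocates a much larger KV pool than the live working set.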
Results
Decode throughput: 580 tok/s
Prefill throughput: 5400 tok/s
TTFT p50: 480 ms
TBT p50: 55 ms
Memory/card: 36 GB
Power/card: 320 W
Compute util: 21%
Memory BW util: 92%
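The throughput and latency numbers are internally consistent: at steady state, aggregate decode throughput should be roughly the number of concurrent requests divided by the time between tokens.

```python
# Consistency check: aggregate decode tok/s ≈ concurrency / TBT.
tbt_s = 0.055        # TBT p50 from the results table
concurrency = 32     # max concurrent requests
print(round(concurrency / tbt_s))  # → 582
```

582 tok/s predicted versus 580 tok/s measured, which suggests the engine is actually running near the concurrency cap rather than at the nominal batch size of 8.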
Bottleneck analysis — memory bandwidth
Compute 21% · Memory BW 92% · Other 0%
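A quick roofline estimate supports the memory-bandwidth verdict. The per-card bandwidth figure (~864 GB/s GDDR6 for an L40S) and the ~32 GB FP8 weight size are assumptions, not report data:

```python
# Weight-only bandwidth roofline for decode (assumed ~864 GB/s per L40S,
# ~32 GB of FP8 weights streamed once per step, split 4 ways by TP).
bw_per_card = 864e9 * 0.92     # effective bytes/s at the measured 92% util
shard_bytes = 32e9 / 4         # FP8 weight shard per card
steps_per_s = bw_per_card / shard_bytes
batch = 8
print(round(steps_per_s * batch))  # → 795  (weight-only ceiling, tok/s)
```

The measured 580 tok/s is ~73% of this weight-only ceiling; KV-cache reads and the PCIe all-reduce plausibly absorb the remainder, consistent with the 21% compute utilization.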
Reproduction steps
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --tensor-parallel-size 4 \
  --quantization fp8
Benchmark tool: vLLM benchmark_serving + a custom code-prompts dataset
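The TTFT/TBT metrics above are computed from streamed token arrival times. A minimal sketch of that aggregation (illustrative only, not the actual benchmark_serving code):

```python
from statistics import median

def latency_stats(token_times):
    """token_times: per-request lists of token arrival timestamps, in seconds
    since that request was sent. Returns (TTFT p50, TBT p50) in ms."""
    ttfts = [t[0] for t in token_times]                       # first-token latency
    tbts = [b - a for t in token_times for a, b in zip(t, t[1:])]  # inter-token gaps
    return median(ttfts) * 1e3, median(tbts) * 1e3

# One synthetic request: first token at 480 ms, then one token every 55 ms.
trace = [[0.480 + 0.055 * i for i in range(4)]]
print(latency_stats(trace))
```

Feeding it a trace shaped like this case's p50 numbers returns roughly (480.0, 55.0).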
Pitfalls
- L40S GDDR6 bandwidth is the binding constraint: moving from BF16 to FP8 halved the weight bytes but gained only ~35% throughput (still bandwidth-bound).
- The PCIe Gen4 TP=4 all-reduce becomes the bottleneck for batched decode above batch 16; at batch 8 and below it is fine.
- This card class has no NVLink, so TP scaling beyond 4 cards is not viable.
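The first pitfall (2× fewer weight bytes, only 1.35× throughput) lets us estimate how much of the decode step is not weight traffic at all. This Amdahl-style model is my own sketch, not from the report:

```python
# Model decode-step time as f (non-weight traffic: KV cache, activations,
# PCIe collectives) plus (1 - f) weight traffic that FP8 halves:
#   speedup = 1 / (f + (1 - f) / 2)  =>  f = 2 / speedup - 1
speedup = 1.35
f = 2 / speedup - 1
print(f"~{f:.0%} of decode-step time is not weight traffic")
```

Solving gives f ≈ 48%, which explains why further weight quantization alone would hit diminishing returns on this setup.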
Optimization patterns
—
Citations
[1] L40S 4-card vLLM reference for a 32B coder model in FP8; numbers approximated from L40S inference benchmarks where memory bandwidth dominates — https://github.com/vllm-project/vllm · 2026-04-28
Verification disclaimer: synthesized from public L40S benchmarks; vLLM FP8 support on Ada Lovelace is still maturing.