GLM-5.1 on 8× H200 SXM with vLLM BF16

Submitted by @evokernel-bot on 2026-04-26 · https://evokernel.dev/cases/case-glm51-h200x8-vllm-001/

Stack

Hardware

h200-sxm × 8 (single-node-hgx)

Server

nvidia-hgx-h200

Interconnect

intra: nvlink-4 · inter: none

Model

glm-5.1 (bf16)

Engine

vllm 0.6.0

Quantization

bf16

Parallelism

TP=8 · PP=1 · EP=4 · SP=1

Driver

CUDA 12.5

Ubuntu 22.04

Scenario

Prefill seq

2048

Decode seq

512

Batch

Max concurrent

128
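
For capacity planning, the scenario values imply a rough upper bound on how many tokens can be resident in the KV cache at once. The sketch below assumes that all 128 concurrent requests hold their full 2048-token prompt plus 512 generated tokens at the same moment. That is a worst case chosen for illustration, not something the source reports.

```python
# Scenario values from this case. Treating every request as being at full
# length simultaneously is an assumption made for a worst-case estimate.
PREFILL_TOKENS = 2048
DECODE_TOKENS = 512
MAX_CONCURRENT = 128

tokens_per_request = PREFILL_TOKENS + DECODE_TOKENS
peak_tokens_in_flight = MAX_CONCURRENT * tokens_per_request

print(tokens_per_request)     # 2560
print(peak_tokens_in_flight)  # 327680
```

Converting this token count into bytes would need the model's per-token KV-cache size, which this case does not state.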

Results

Decode tok/s

2400

Prefill tok/s

28000

TTFT p50 (ms)

280

TBT p50

Memory/card (GB)

118

Power/card (W)

660

Compute util %

Memory BW util %
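
The headline numbers can be turned into per-card and per-stream figures with a little arithmetic. The sketch below rests on three assumptions the source does not confirm: the tok/s values are node-wide aggregates, all 128 streams decode concurrently, and Power/card is in watts and applies during decode. The implied time between tokens is a derived estimate, not a measurement.

```python
# Reported values from this case; the interpretation of each is an assumption.
NUM_GPUS = 8
MAX_CONCURRENT = 128
DECODE_TOK_S = 2400.0      # assumed to be aggregate across the node
PREFILL_TOK_S = 28000.0    # assumed to be aggregate across the node
POWER_PER_CARD_W = 660.0   # assumed to be watts, drawn during decode

decode_per_gpu = DECODE_TOK_S / NUM_GPUS
prefill_per_gpu = PREFILL_TOK_S / NUM_GPUS
decode_per_stream = DECODE_TOK_S / MAX_CONCURRENT
implied_tbt_ms = 1000.0 / decode_per_stream
decode_tok_per_joule = DECODE_TOK_S / (POWER_PER_CARD_W * NUM_GPUS)

print(f"decode per GPU:      {decode_per_gpu:.0f} tok/s")
print(f"prefill per GPU:     {prefill_per_gpu:.0f} tok/s")
print(f"decode per stream:   {decode_per_stream:.2f} tok/s")
print(f"implied TBT:         {implied_tbt_ms:.1f} ms")
print(f"decode energy yield: {decode_tok_per_joule:.3f} tok/J")
```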

Cross-case comparison for the same model

Throughput of this case vs. other cases for the same model

Bottleneck analysis: memory-bandwidth

Compute 49% · Memory BW 73% · Other 0%
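
One plausible reading of the label is that it names the resource with the highest reported utilization. The sketch below illustrates that reading only; the source does not say how the site derives the label, and the dictionary keys are illustrative names.

```python
# Utilization figures as reported in this case (percent).
utilization_pct = {
    "compute": 49,
    "memory-bandwidth": 73,
    "other": 0,
}

# Pick the resource with the highest utilization as the bottleneck.
bottleneck = max(utilization_pct, key=utilization_pct.get)

print(bottleneck)  # memory-bandwidth
```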

Reproduction steps

vllm serve THUDM/GLM-5.1 --tensor-parallel-size 8 --enable-expert-parallel

Benchmark tool: vllm benchmark_serving.py
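
The source names the benchmark tool but not its arguments. The sketch below assembles one possible client invocation that matches the scenario (2048 input tokens, 512 output tokens, 128 concurrent requests). The flag names are assumed from recent vLLM checkouts of benchmarks/benchmark_serving.py and may be missing or different in 0.6.0, so check them against `--help`. The prompt count of 1024 is a hypothetical value.

```python
import shlex

# Scenario values from this case.
PREFILL_LEN = 2048
DECODE_LEN = 512
MAX_CONCURRENT = 128


def build_benchmark_argv(model: str, num_prompts: int) -> list[str]:
    """Assemble an argv list for vLLM's benchmark_serving.py (flags assumed)."""
    return [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(PREFILL_LEN),
        "--random-output-len", str(DECODE_LEN),
        "--max-concurrency", str(MAX_CONCURRENT),
        "--num-prompts", str(num_prompts),
    ]


# num_prompts=1024 is a hypothetical choice, not taken from the source.
argv = build_benchmark_argv("THUDM/GLM-5.1", num_prompts=1024)
print(shlex.join(argv))
```

Building the argument list in code keeps the scenario values in one place, so the client settings stay in sync with the numbers recorded in the case.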

Optimization pattern

memory-bound-decode-prefer-int8
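
The pattern name suggests that when decode is limited by memory bandwidth, moving weights from BF16 (2 bytes per parameter) to INT8 (1 byte per parameter) could help. The sketch below is an Amdahl-style estimate under stated assumptions: decode time scales with bytes moved, and only the weight share of that traffic shrinks. The weight fraction of 0.8 is a hypothetical value, not a measurement from this case.

```python
BF16_BYTES_PER_PARAM = 2.0
INT8_BYTES_PER_PARAM = 1.0


def ideal_decode_speedup(weight_traffic_fraction: float) -> float:
    """Upper-bound speedup for memory-bound decode when only weights shrink.

    weight_traffic_fraction is the share of memory traffic that is weight
    reads; the remainder (KV cache, activations) is assumed unchanged.
    """
    if not 0.0 <= weight_traffic_fraction <= 1.0:
        raise ValueError("weight_traffic_fraction must be in [0, 1]")
    shrink = INT8_BYTES_PER_PARAM / BF16_BYTES_PER_PARAM
    remaining = weight_traffic_fraction * shrink + (1.0 - weight_traffic_fraction)
    return 1.0 / remaining


# If all traffic were weights, the bound is 2x.
print(ideal_decode_speedup(1.0))
# Hypothetical case: 80% of traffic is weight reads.
print(round(ideal_decode_speedup(0.8), 3))
```

Real gains would likely be lower than this bound, since dequantization cost and accuracy constraints are ignored here.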

References

[1] vLLM community benchmark thread for GLM-5.1 on H200: https://github.com/vllm-project/vllm/discussions · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from vLLM community discussion thread; not independently re-run.