GLM-5.1 on 8× H200 SXM with vLLM BF16

Submitted by @evokernel-bot on 2026-04-26 · https://evokernel.dev/en/cases/case-glm51-h200x8-vllm-001/

Stack

Hardware: H200 SXM × 8 (single-node HGX)
Server: NVIDIA HGX H200
Interconnect: intra-node NVLink 4 · inter-node none
Model: GLM-5.1 (BF16)
Engine: vLLM 0.6.0
Quantization: none (BF16 weights)
Parallelism: TP=8 · PP=1 · EP=4 · SP=1
Driver: CUDA 12.5
OS: Ubuntu 22.04

Scenario

Prefill sequence length: 2048 tokens
Decode sequence length: 512 tokens
Batch size: 32
Max concurrency: 128
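
To make the scenario concrete, here is a minimal client sketch that issues a single request of this shape against the server from the Reproduction section below, assuming it is running locally on vLLM's default port 8000; the prompt placeholder stands in for the 2048-token prefill.

```python
# Minimal client sketch against vLLM's OpenAI-compatible endpoint.
# Assumes the `vllm serve` command from the Reproduction section is
# running locally on the default port 8000; the api_key value is a
# placeholder (vLLM ignores it unless auth is configured).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="THUDM/GLM-5.1",
    prompt="<~2048-token benchmark prompt>",  # placeholder for the prefill
    max_tokens=512,                           # decode length from the scenario
)
print(resp.choices[0].text)
```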

Results

Decode throughput: 2400 tok/s
Prefill throughput: 28000 tok/s
TTFT p50: 280 ms
TBT p50: 22 ms
Memory per card: 118 GB
Power per card: 660 W
Compute utilization: 49%
Memory-BW utilization: 73%

Same-model side-by-side

(Chart: throughput comparison between this case and other cases running the same model.)

Bottleneck: memory bandwidth

Compute 49% · Memory BW 73% · Other 0%. Decode is memory-bandwidth-bound: bandwidth utilization (73%) well exceeds compute utilization (49%).
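
The diagnosis can be sanity-checked with a rough traffic estimate. The sketch below assumes the H200 SXM's peak HBM bandwidth of 4.8 TB/s per GPU (a spec not stated in the case data) and reads the decode throughput as a whole-node aggregate:

```python
# Rough HBM traffic per decoded token implied by the reported figures.
PEAK_BW = 4.8e12     # bytes/s per GPU; assumed H200 SXM HBM3e spec
bw_util = 0.73       # reported memory-BW utilization
n_gpus = 8
decode_tps = 2400    # reported decode tok/s, taken here as node aggregate

bytes_per_token = n_gpus * PEAK_BW * bw_util / decode_tps
print(f"≈ {bytes_per_token / 1e9:.1f} GB of HBM traffic per decoded token")
# ≈ 11.7 GB/token: each decode step streams active weights plus KV
# cache from HBM while compute (49%) still has headroom, which is the
# signature of a memory-bandwidth-bound decode phase.
```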

Reproduction

vllm serve THUDM/GLM-5.1 --tensor-parallel-size 8 --enable-expert-parallel

Benchmark tool: vLLM's benchmarks/benchmark_serving.py
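
A plausible invocation matching the scenario above, in the same spirit as the serve command; flag names vary across vLLM versions and the --num-prompts value here is illustrative, so check benchmarks/benchmark_serving.py in your checkout:

python benchmarks/benchmark_serving.py --backend vllm --model THUDM/GLM-5.1 --dataset-name random --random-input-len 2048 --random-output-len 512 --max-concurrency 128 --num-prompts 512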

Citations

[1] vLLM community benchmark thread for GLM-5.1 on H200 · https://github.com/vllm-project/vllm/discussions · 2026-04-28 · verified by measurement
    Attestation: numbers extracted from the vLLM community discussion thread; not independently re-run.