GLM-5.1 on 8× H200 SXM with vLLM BF16
Submitted by @evokernel-bot on 2026-04-26 · https://evokernel.dev/en/cases/case-glm51-h200x8-vllm-001/
Stack

8× NVIDIA H200 SXM · vLLM serving THUDM/GLM-5.1 in BF16 · tensor parallelism 8 with expert parallelism enabled
Scenario

Prefill sequence length: 2048
Decode sequence length: 512
Batch size: 32
Max concurrent requests: 128
Results

Decode throughput: 2,400 tok/s
Prefill throughput: 28,000 tok/s
TTFT p50: 280 ms
TBT p50: 22 ms
Memory per card: 118 GB
Power per card: 660 W
Compute utilization: 49%
Memory-bandwidth utilization: 73%
Same-model side-by-side

Throughput comparison between this case and other cases running the same model (interactive chart on the original page).
Bottleneck — memory bandwidth

Compute 49% · Memory BW 73% · Other 0%. Memory-bandwidth utilization runs well ahead of compute utilization: batched decode re-streams model weights from HBM on every step, so the GPUs exhaust bandwidth before they exhaust FLOPs.
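A back-of-envelope way to see the bound in the report's own numbers: multiply the achieved HBM bandwidth by the p50 step time to get the data each decode step moves. A sketch, assuming the 73% utilization applies uniformly across all eight GPUs; the 4.8 TB/s figure is the published H200 SXM peak spec, not from this report.

```python
# Estimate the data moved per decode step from the measured numbers alone.
# In bandwidth-bound decode, step_time ~= bytes_streamed / achieved_bandwidth,
# so bytes_streamed ~= achieved_bandwidth * step_time.
HBM_BW_PER_GPU = 4.8e12   # H200 SXM peak HBM3e bandwidth, bytes/s (spec)
NUM_GPUS = 8
BW_UTIL = 0.73            # measured memory-bandwidth utilization
TBT_P50_S = 0.022         # measured p50 time between tokens, seconds

achieved_bw = HBM_BW_PER_GPU * NUM_GPUS * BW_UTIL   # ~28 TB/s aggregate
bytes_per_step = achieved_bw * TBT_P50_S            # ~617 GB per decode step

print(f"achieved aggregate HBM bandwidth: {achieved_bw / 1e12:.1f} TB/s")
print(f"implied data streamed per decode step: {bytes_per_step / 1e9:.0f} GB")
```

That roughly 600 GB per step is on the same order as the total resident footprint (8 × 118 GB ≈ 944 GB, weights plus KV cache), consistent with decode re-reading a large fraction of the weights every step while compute sits at 49%.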
Reproduction

```bash
vllm serve THUDM/GLM-5.1 --tensor-parallel-size 8 --enable-expert-parallel
```

Benchmark tool: vLLM's `benchmarks/benchmark_serving.py`.
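For a quick client-side spot check of TTFT/TBT without the full benchmark harness, a single-stream probe against the OpenAI-compatible endpoint that `vllm serve` exposes (default `http://localhost:8000/v1`) is enough. A minimal sketch using the `openai` client; it measures one request at a time, so it will not reproduce the 128-way-concurrent numbers above:

```python
# Single-request TTFT/TBT probe against a running `vllm serve` instance.
# Treats each streamed chunk as one token, which is approximately true for
# vLLM's token-by-token SSE stream.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def probe(prompt: str, max_tokens: int = 512):
    start = time.perf_counter()
    first, last, gaps = None, None, []
    stream = client.completions.create(
        model="THUDM/GLM-5.1",
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,
    )
    for _chunk in stream:
        now = time.perf_counter()
        if first is None:
            first = now               # arrival of the first token -> TTFT
        if last is not None:
            gaps.append(now - last)   # inter-token gaps -> TBT samples
        last = now
    if first is None:
        raise RuntimeError("no tokens streamed")
    gaps.sort()
    ttft_ms = (first - start) * 1e3
    tbt_p50_ms = gaps[len(gaps) // 2] * 1e3 if gaps else float("nan")
    return ttft_ms, tbt_p50_ms

if __name__ == "__main__":
    ttft, tbt = probe("Explain tensor parallelism in one paragraph.")
    print(f"TTFT: {ttft:.0f} ms  TBT p50: {tbt:.1f} ms")
```

Under light load a single stream should see TTFT well below the reported 280 ms p50, since that figure was taken with up to 128 requests in flight.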
Optimization patterns
Citations
[1] vLLM community benchmark thread for GLM-5.1 on H200 — https://github.com/vllm-project/vllm/discussions · 2026-04-28 · measured-data verification. Attestation: numbers extracted from the vLLM community discussion thread; not independently re-run.