GLM-5.1 on 8× Biren BR104 (export-control variant)
Submitted by @evokernel-bot on 2026-04-20 · https://evokernel.dev/cases/case-glm51-br104x8-001/
Stack
Scenario
Prefill seq: 1024
Decode seq: 256
Batch: 8
Max concurrent: 32
Results
Decode tok/s: 240
Prefill tok/s: 3800
TTFT p50: 720 ms
TBT p50: 124 ms
Memory/card: 28 GB
Power/card: 280 W
Compute util: 18%
Memory BW util: 52%
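As a rough cross-check of the figures above, the sketch below relates decode throughput to the median time between tokens and derives two node-level totals. It assumes (the card does not say) that Decode tok/s is summed over all in-flight requests, that the power reading was taken during decode, and that all 8 cards draw the same power. The function names are illustrative only.

```python
# Figures copied from the results above.
DECODE_TOK_S = 240       # decode throughput, tokens/s (assumed aggregate)
TBT_P50_MS = 124         # median time between tokens, ms
MAX_CONCURRENT = 32
CARDS = 8
POWER_PER_CARD_W = 280
MEM_PER_CARD_GB = 28


def implied_streams(decode_tok_s: float, tbt_ms: float) -> float:
    """Average number of concurrently decoding requests implied by
    throughput = streams / TBT (a Little's-law style estimate)."""
    return decode_tok_s * tbt_ms / 1000.0


def joules_per_token(decode_tok_s: float, cards: int, watts_per_card: float) -> float:
    """Node-level energy per decoded token, assuming steady power draw."""
    return cards * watts_per_card / decode_tok_s


streams = implied_streams(DECODE_TOK_S, TBT_P50_MS)
energy = joules_per_token(DECODE_TOK_S, CARDS, POWER_PER_CARD_W)
total_mem_gb = CARDS * MEM_PER_CARD_GB

print(f"implied concurrent decode streams: {streams:.1f} (cap {MAX_CONCURRENT})")
print(f"energy per decoded token: {energy:.2f} J")
print(f"memory in use across cards: {total_mem_gb} GB")
```

Under these assumptions the implied stream count comes out just under 30, close to the Max concurrent setting of 32, so the throughput and TBT figures are at least mutually consistent.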
Same-model comparison
Throughput of this case vs. other cases for the same model
Bottleneck analysis: software
Compute 18% · Memory BW 52% · Other 30%
Reproduction steps
vllm serve THUDM/GLM-5.1 --device biren --tp 8 --quantization int8
Benchmark tool: vllm benchmark_serving.py
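The card does not say how the benchmark tool derives TTFT and TBT. A minimal sketch under the usual definitions (TTFT is the delay from request start to the first streamed token, TBT is the gap between consecutive tokens) might look like this. The helper names and the sample timestamps are hypothetical and are not taken from the benchmark tool or from this run.

```python
from statistics import median


def ttft_and_tbt(request_start: float, token_times: list[float]) -> tuple[float, list[float]]:
    """Return (TTFT, list of inter-token gaps) in seconds for one request.

    token_times are the arrival times of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, gaps


def p50_ms(values: list[float]) -> float:
    """Median of a list of durations, converted from seconds to milliseconds."""
    return median(values) * 1000.0


# Hypothetical timestamps for two requests (seconds), for illustration only.
requests = [
    (0.0, [0.70, 0.82, 0.95, 1.07]),
    (0.0, [0.74, 0.87, 0.99, 1.12]),
]
ttfts, all_gaps = [], []
for start, times in requests:
    ttft, gaps = ttft_and_tbt(start, times)
    ttfts.append(ttft)
    all_gaps.extend(gaps)

print(f"TTFT p50: {p50_ms(ttfts):.0f} ms, TBT p50: {p50_ms(all_gaps):.0f} ms")
```

Pooling the gaps from all requests before taking the median is one possible choice; a per-request median followed by a median across requests would give a different number when request lengths vary.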
Pitfalls
- The BR104 export-control-compliant variant has roughly 50% less compute than the BR100.
- Some custom kernels (FlashAttn replacements) are unoptimized, which limits decode performance.
Optimization patterns
References
- [1] Biren BR104 + GLM-5.1 community testing, https://www.birentech.com/ · verified by measurement 2026-04-28
Disclaimer: Numbers extracted from Biren BR104 community port; not independently re-run.