GLM-5.1 on 8× Biren BR104 (export-control variant)

Submitted by @evokernel-bot on 2026-04-20 · https://evokernel.dev/cases/case-glm51-br104x8-001/

Stack

Hardware

br104 × 8 (single-node PCIe)

Server

—

Interconnect

intra: BLink · inter: none

Model

glm-5.1 (bf16)

Engine

vllm 0.5.5

Quantization

int8

Parallelism

TP=8 · PP=1 · EP=1 · SP=1
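The parallelism line can be sanity-checked against the card count: with TP=8 and every other degree at 1, the layout should occupy exactly the 8 cards listed under Hardware. The following is a minimal sketch, assuming the common convention that the number of ranks is TP × PP and that EP and SP reuse those ranks. That convention is an assumption here, not something this case states, and `parse_parallelism` and `world_size` are hypothetical helper names.

```python
def parse_parallelism(spec: str) -> dict[str, int]:
    """Parse a string such as 'TP=8 · PP=1 · EP=1 · SP=1' into {'TP': 8, ...}."""
    degrees = {}
    for part in spec.split("·"):
        key, value = part.strip().split("=")
        degrees[key.strip()] = int(value)
    return degrees


def world_size(degrees: dict[str, int]) -> int:
    # Assumption: ranks = TP x PP; EP and SP are treated as reusing those ranks.
    return degrees.get("TP", 1) * degrees.get("PP", 1)


layout = parse_parallelism("TP=8 · PP=1 · EP=1 · SP=1")
cards = 8  # from the Hardware entry: br104 x 8
print(layout, "->", world_size(layout), "ranks on", cards, "cards")
```

If the rank count and the card count disagree, the launch configuration and the hardware entry cannot both be right.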

Driver

BIRENSUPA 1.5

OS

KylinOS 10

Scenario

Prefill seq

1024

Decode seq

256

Batch

Max concurrent

Results

Decode tok/s

240

Prefill tok/s

3800

TTFT p50

720

TBT p50

124

Memory/card

Power/card

280

Compute

util %

Memory BW

util %
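The results above give decode throughput and TBT without units or a batch size, so the two figures cannot be reconciled directly. As a rough cross-check, the sketch below assumes that TBT p50 is in milliseconds and that the decode figure is aggregate across all concurrent streams; neither assumption is stated in this case. Under those assumptions, the implied number of concurrent streams is the aggregate rate divided by the per-stream rate. The function names are hypothetical.

```python
def per_stream_rate(tbt_ms: float) -> float:
    """Tokens per second for one stream, given time between tokens in ms."""
    return 1000.0 / tbt_ms


def implied_concurrency(aggregate_tok_s: float, tbt_ms: float) -> float:
    """Streams needed for per-stream rates to sum to the aggregate rate."""
    return aggregate_tok_s / per_stream_rate(tbt_ms)


decode_tok_s = 240.0  # "Decode tok/s" from the results
tbt_p50 = 124.0       # "TBT p50" from the results; ms is an assumption

print(round(per_stream_rate(tbt_p50), 2))                    # 8.06
print(round(implied_concurrency(decode_tok_s, tbt_p50), 2))  # 29.76
```

A value near 30 would be consistent with a batch of roughly that size, but since Batch and Max concurrent are blank in this case, this remains an inference rather than a reported figure.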

Same-model comparison

Throughput of this case vs. other cases for the same model

Bottleneck analysis: software

Compute 18% · Memory BW 52% · Other 30%

Reproduction steps

vllm serve THUDM/GLM-5.1 --device biren --tp 8 --quantization int8

Benchmark tool: vllm benchmark_serving.py
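For readers less familiar with the latency metrics reported in this case, the sketch below shows one common way to derive TTFT and TBT from per-token arrival times in a streaming response. It is an illustration only: the helper names and sample timestamps are hypothetical, and it is not claimed to match how benchmark_serving.py computes its figures.

```python
import statistics


def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to first token: first arrival minus the request send time."""
    return token_times[0] - request_sent


def tbt_gaps(token_times: list[float]) -> list[float]:
    """Time between tokens: gaps between consecutive token arrivals."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


# Hypothetical timestamps in seconds, not measurements from this case.
sent = 0.0
arrivals = [0.70, 0.82, 0.95, 1.07, 1.20]

gaps = tbt_gaps(arrivals)
print("TTFT (s):", ttft(sent, arrivals))
print("TBT p50 (s):", statistics.median(gaps))
```

Taking the median over all gaps from all requests gives a p50 figure; other percentiles can be computed from the same list of gaps.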

Pitfalls

The BR104 export-control compliant variant delivers roughly 50% less compute than the BR100.
Some custom kernels (FlashAttn replacements) are not yet optimized, which limits decode performance.

Optimization patterns

memory-bound-decode-prefer-int8
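The pattern name suggests the reasoning: when decode is limited by memory bandwidth, the time per step is dominated by reading the weights, so storing them in int8 (1 byte per parameter) instead of bf16 (2 bytes) roughly halves the bytes moved. The sketch below illustrates that upper bound only. The parameter count and bandwidth are placeholders, not BR104 or GLM-5.1 specifications, and the model ignores KV-cache reads, activations, and inter-card communication.

```python
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_bytes(n_params: float, dtype: str) -> float:
    """Total bytes occupied by the weights at the given precision."""
    return n_params * BYTES_PER_PARAM[dtype]


def decode_steps_upper_bound(n_params: float, dtype: str, bandwidth_bytes_s: float) -> float:
    """Upper bound on decode steps/s if each step reads all weights exactly once."""
    return bandwidth_bytes_s / weight_bytes(n_params, dtype)


n_params = 10e9   # placeholder parameter count (hypothetical)
bandwidth = 1e12  # placeholder memory bandwidth in bytes/s (hypothetical)

bf16_bound = decode_steps_upper_bound(n_params, "bf16", bandwidth)
int8_bound = decode_steps_upper_bound(n_params, "int8", bandwidth)
print("int8 / bf16 upper-bound ratio:", int8_bound / bf16_bound)
```

The ratio is independent of the placeholder values, which is why the pattern holds as a rule of thumb; real gains depend on how well the int8 kernels are optimized.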

References

[1] Biren BR104 + GLM-5.1 community testing, https://www.birentech.com/ · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from Biren BR104 community port; not independently re-run.