Qwen3.6 Plus on 8× MI325X with SGLang FP8

Submitted by @evokernel-bot on 2026-04-26 · https://evokernel.dev/cases/case-qwen36-mi325x8-sglang-001/

Stack

Hardware

mi325x × 8 (single-node platform)

Server

amd-mi325x-platform

Interconnect

intra: infinity-fabric · inter: roce-v2

Model

qwen3.6-plus (bf16)

Engine

sglang 0.4.0

Quantization

fp8-e4m3

Parallelism

TP=8 · PP=1 · EP=1 · SP=1

Driver

ROCm 6.2

Ubuntu 22.04

Scenario

Prefill seq

2048

Decode seq

512

Batch

Max concurrent

128

Results

Decode tok/s

3100

Prefill tok/s

32000

TTFT p50

240

TBT p50

Memory/card

Power/card

880

Compute util %

Memory BW util %
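The headline figures can be broken down per GPU and per concurrent request with simple arithmetic. The sketch below assumes that the decode and prefill rates are aggregates over the whole 8-GPU node and that all 128 concurrent slots are busy during decode; neither assumption is stated in this case. The implied per-token interval is a derived average, not the measured TBT p50.

```python
# Figures copied from this case. Treating them as node-wide aggregates
# with all 128 slots busy is an assumption, not something the case states.
DECODE_TOK_S = 3100
PREFILL_TOK_S = 32000
NUM_GPUS = 8
MAX_CONCURRENT = 128


def derived_rates(decode_tok_s, prefill_tok_s, num_gpus, concurrency):
    """Split aggregate throughput per GPU and per concurrent request."""
    per_stream = decode_tok_s / concurrency
    return {
        "decode_tok_s_per_gpu": decode_tok_s / num_gpus,
        "prefill_tok_s_per_gpu": prefill_tok_s / num_gpus,
        "decode_tok_s_per_stream": per_stream,
        # Average gap between tokens seen by one request (derived, not measured).
        "implied_ms_per_token_per_stream": 1000.0 / per_stream,
    }


rates = derived_rates(DECODE_TOK_S, PREFILL_TOK_S, NUM_GPUS, MAX_CONCURRENT)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions each request decodes at roughly 24 tok/s, which is the figure to compare against a per-user latency target.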

Cross-case comparison for the same model

Throughput of this case vs. other cases for the same model

Bottleneck analysis: memory-bandwidth

Compute 52% Memory BW 68% Other 0%
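One way to read the breakdown above is to label the run by whichever resource shows the highest utilization. The sketch below does exactly that; the rule is an assumption for illustration, since the page does not say how it derives the "memory-bandwidth" label, and `classify_bottleneck` is a hypothetical helper name.

```python
def classify_bottleneck(utilization):
    """Return the resource with the highest utilization percentage.

    Hypothetical rule for illustration; the case page's own
    classification method is not documented.
    """
    return max(utilization, key=utilization.get)


# Percentages copied from the bottleneck analysis line.
utilization = {"compute": 52.0, "memory-bandwidth": 68.0, "other": 0.0}

bottleneck = classify_bottleneck(utilization)
headroom = {name: 100.0 - pct for name, pct in utilization.items()}

print(f"bottleneck: {bottleneck}")
print(f"remaining headroom on {bottleneck}: {headroom[bottleneck]:.0f}%")
```

With memory bandwidth at 68% and compute at 52%, neither resource is saturated, so the label describes the dominant pressure rather than a hard ceiling.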

Reproduction steps

python -m sglang.launch_server --model Qwen/Qwen3.6-Plus --tp 8 --quantization fp8

Benchmark tool: sglang.bench_serving
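The case names the benchmark tool but not its arguments. The sketch below assembles one plausible `sglang.bench_serving` invocation that matches the scenario (2048-token prefill, 512-token decode, 128 concurrent requests). The flag names, the random dataset choice, port 30000, and the prompt count are all assumptions rather than the submitter's actual command; check them against `python -m sglang.bench_serving --help` for the installed version. The script only prints the command and does not run it.

```python
import shlex


def build_bench_command(input_len=2048, output_len=512,
                        max_concurrency=128, num_prompts=512, port=30000):
    """Build an argv list for sglang.bench_serving.

    Flag names are assumed from recent SGLang releases and may differ in
    0.4.0; num_prompts and port are illustrative values, not from the case.
    """
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--port", str(port),
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--max-concurrency", str(max_concurrency),
        "--num-prompts", str(num_prompts),
    ]


cmd = build_bench_command()
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run(cmd)` without shell quoting issues once the flags have been verified.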

Pitfalls

ROCm FP8 calibration requires SGLang 0.4 with Python 3.11; on Python 3.10 it fails with an import error.
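A small preflight check can surface this pitfall before the server is launched. The sketch below classifies the running interpreter using only what this case reports: 3.11 works and 3.10 fails. Any other version is marked untested, because the case says nothing about it. `check_python` is a hypothetical helper name.

```python
import sys

KNOWN_GOOD = (3, 11)  # reported working in this case
KNOWN_BAD = (3, 10)   # reported to fail with an import error


def check_python(version_info=sys.version_info):
    """Classify a Python version against the versions this case mentions."""
    major_minor = tuple(version_info[:2])
    if major_minor == KNOWN_GOOD:
        return "known-good"
    if major_minor == KNOWN_BAD:
        return "known-bad"
    return "untested"


status = check_python()
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
if status == "known-bad":
    print("This case reports an import error during ROCm FP8 calibration on 3.10.")
```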

Optimization patterns

memory-bound-decode-prefer-int8

References

[1] SGLang community Qwen 3.6 Plus benchmark on MI325X (numbers approximate from public discussions), https://github.com/sgl-project/sglang · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from SGLang community thread; not independently re-run by submitter.