Qwen3.5 397B Reasoning on 8× MI355X with FP4
Submitted by @evokernel-bot on 2026-04-24 · https://evokernel.dev/cases/case-qwen35-397b-mi355x8-001/
Stack
Hardware: mi355x × 8 (single-node platform)
Server: amd-mi325x-platform
Interconnect: intra: infinity-fabric · inter: none
Model: qwen3.5-397b (bf16)
Engine: vllm 0.6.0
Quantization: fp4
Parallelism: TP=8 · PP=1 · EP=4 · SP=1
Driver: ROCm 6.3
OS: Ubuntu 22.04
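As a rough sanity check on the stack listing, the FP4 weight footprint can be estimated from the parameter count. This is only a sketch: it assumes exactly 397e9 parameters (read off the model name), 4 bits per weight with no overhead for quantization scales, and an even split across the 8 tensor-parallel ranks. None of these is confirmed by the case page.

```python
# Back-of-envelope weight footprint; every input is an assumption
# read off the stack listing, not a measured value.
PARAMS = 397e9          # assumed from the "397b" in the model name
BITS_PER_WEIGHT = 4     # fp4, ignoring quantization scale overhead
NUM_CARDS = 8           # TP=8, assumed to shard weights evenly


def weight_gb(params: float, bits: float) -> float:
    """Total weight size in GB, using 1 GB = 1e9 bytes."""
    return params * bits / 8 / 1e9


total_gb = weight_gb(PARAMS, BITS_PER_WEIGHT)
per_card_gb = total_gb / NUM_CARDS
print(f"total weights: {total_gb:.1f} GB, per card: {per_card_gb:.1f} GB")
```

Under these assumptions the weights come to roughly 25 GB per card, so most of the reported 142 GB per card would be something other than weights (for example KV cache and activations), although the page does not break this down.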
Scenario
Prefill seq: 4096
Decode seq: 1024
Batch: 32
Max concurrent: 128
Results
Decode tok/s: 4500
Prefill tok/s: 52000
TTFT p50 (ms): 220
TBT p50 (ms): 12
Memory/card (GB): 142
Power/card (W): 1180
Compute util (%): 58
Memory BW util (%): 72
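A few derived figures can help when comparing this case with others. The sketch below assumes that the decode throughput is aggregated across all 8 cards, that the power figure is a steady-state per-card value, and that every stream decodes at the p50 TBT pace. The formulas are illustrative and are not taken from the case page.

```python
# Derived metrics from the reported results (illustrative formulas only).
decode_tok_s = 4500        # reported decode throughput, assumed aggregate
tbt_p50_ms = 12            # reported time between tokens, p50
power_per_card_w = 1180    # reported power per card
num_cards = 8

# Throughput attributed to each card under an even split.
per_card_tok_s = decode_tok_s / num_cards

# Tokens per second seen by one stream if it always hits the p50 TBT.
per_stream_tok_s = 1000 / tbt_p50_ms

# Number of simultaneous streams that would produce the aggregate figure.
implied_streams = decode_tok_s / per_stream_tok_s

# Decode tokens per joule at the reported power draw.
tok_per_joule = decode_tok_s / (power_per_card_w * num_cards)

print(f"{per_card_tok_s:.1f} tok/s per card")
print(f"{per_stream_tok_s:.1f} tok/s per stream")
print(f"{implied_streams:.0f} implied concurrent streams")
print(f"{tok_per_joule:.3f} tok/J")
```

The implied stream count is only a consistency check: it sits between the listed batch size (32) and the maximum concurrency (128), and the page does not say which concurrency level the throughput was measured at.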
Bottleneck analysis: memory-bandwidth
Compute 58% · Memory BW 72% · Other 0%
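The page does not state how the bottleneck label is chosen. One plausible rule, sketched here as an assumption, is to pick the resource with the highest reported utilization; `classify_bottleneck` is a hypothetical helper, not part of any tool named on the page.

```python
def classify_bottleneck(util: dict[str, float]) -> str:
    """Return the resource with the highest utilization (hypothetical rule)."""
    return max(util, key=util.get)


# Utilization percentages as reported in the bottleneck breakdown.
util = {"compute": 58.0, "memory-bandwidth": 72.0, "other": 0.0}
label = classify_bottleneck(util)
print(label)
```

With the reported values this rule agrees with the page's label, which is consistent with decode being the memory-bound phase, but a single matching case does not confirm that this is the rule actually used.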
Reproduction steps
vllm serve Qwen/Qwen3.5-397B-Reasoning --tensor-parallel-size 8 --quantization fp4
Benchmark tool: vllm benchmark_serving.py
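The page names the benchmark tool but not its arguments. The sketch below builds one possible command line from the scenario values (prefill 4096, decode 1024). The flag names are assumptions based on vLLM's `benchmark_serving.py` and may differ between versions, and the prompt count of 128 is a hypothetical value borrowed from the listed maximum concurrency. Check the flags against the installed script before running.

```python
import shlex

# Scenario values copied from the case page.
MODEL = "Qwen/Qwen3.5-397B-Reasoning"
PREFILL_LEN = 4096
DECODE_LEN = 1024


def benchmark_argv(num_prompts: int) -> list[str]:
    """Build a benchmark_serving.py command line.

    Flag names are assumptions; verify them against the installed version.
    """
    return [
        "python", "benchmark_serving.py",
        "--backend", "vllm",
        "--model", MODEL,
        "--dataset-name", "random",
        "--random-input-len", str(PREFILL_LEN),
        "--random-output-len", str(DECODE_LEN),
        "--num-prompts", str(num_prompts),
    ]


# num_prompts=128 is a hypothetical choice, not stated on the page.
print(shlex.join(benchmark_argv(num_prompts=128)))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.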
Optimization patterns
Citations
[1] AMD MI355X + Qwen3.5 reference benchmark: https://www.amd.com/en/products/accelerators/instinct/mi355x.html · measured and verified 2026-04-28
Note: Numbers approximated from AMD MI355X reference benchmark coverage.