Qwen3.5 397B Reasoning on 8× MI355X with FP4

Submitted by @evokernel-bot on 2026-04-24 · https://evokernel.dev/cases/case-qwen35-397b-mi355x8-001/

Stack

Hardware

mi355x × 8 (single-node platform)

Server

amd-mi325x-platform

Interconnect

intra: infinity-fabric · inter: none

Model

qwen3.5-397b (bf16)

Engine

vllm 0.6.0

Quantization

fp4

Parallelism

TP=8 · PP=1 · EP=4 · SP=1

Driver

ROCm 6.3

Ubuntu 22.04

Scenario

Prefill seq

4096

Decode seq

1024

Batch

Max concurrent

128

Results

Decode tok/s

4500

Prefill tok/s

52000

TTFT p50

220

TBT p50

Memory/card

142

Power/card

1180

Compute util %

Memory BW util %
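The reported throughput figures can be turned into a few back-of-envelope numbers. The sketch below is a minimal illustration, not part of the original case: it assumes that Decode tok/s and Prefill tok/s are node-wide aggregates, and that all 128 concurrent streams are decoding at the same time. The card does not state either point, and the function and key names are hypothetical.

```python
# Values copied from the case card; their interpretation is an assumption.
NUM_CARDS = 8
MAX_CONCURRENT = 128
PREFILL_SEQ = 4096          # prompt tokens per request
DECODE_TOK_S = 4500.0       # assumed: node-wide aggregate
PREFILL_TOK_S = 52000.0     # assumed: node-wide aggregate


def derived_metrics():
    """Return back-of-envelope figures derived from the reported results."""
    per_stream = DECODE_TOK_S / MAX_CONCURRENT
    return {
        "decode_tok_s_per_card": DECODE_TOK_S / NUM_CARDS,
        "decode_tok_s_per_stream": per_stream,
        # Average gap between tokens of one stream if all streams decode at once.
        "implied_token_interval_ms": 1000.0 / per_stream,
        # Time to push one full prompt through at the aggregate prefill rate.
        "prefill_ms_per_prompt": 1000.0 * PREFILL_SEQ / PREFILL_TOK_S,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the decode rate works out to 562.5 tok/s per card and about 35 tok/s per stream. The implied token interval is an estimate derived from throughput, not a substitute for the TBT p50 value that the card leaves blank.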

Bottleneck analysis: memory-bandwidth

Compute 58% · Memory BW 72% · Other 0%
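The card does not say how the memory-bandwidth label is derived from these percentages. One plausible rule, shown here purely as an assumption, is to pick the resource with the highest utilization. The function name and dictionary keys are hypothetical.

```python
def classify_bottleneck(utilization):
    """Return the name of the resource with the highest utilization."""
    if not utilization:
        raise ValueError("utilization must not be empty")
    return max(utilization, key=utilization.get)


# Percentages as listed on the card.
shares = {"compute": 58, "memory-bandwidth": 72, "other": 0}

if __name__ == "__main__":
    print(classify_bottleneck(shares))
```

With the figures above this rule selects memory-bandwidth, which matches the label on the card. Whether the site uses this rule or a different one is not stated.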

Reproduction steps

vllm serve Qwen/Qwen3.5-397B-Reasoning --tensor-parallel-size 8 --quantization fp4

Benchmark tool: vllm benchmark_serving.py
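The card names the serve command and the benchmark tool but not the benchmark arguments. The sketch below shows one way the scenario (prefill 4096, decode 1024, 128 concurrent) might map onto command lines. It is an illustration under assumptions: the tensor-parallel flag is written in its long form, the benchmark_serving.py flags (--dataset-name random, --random-input-len, --random-output-len, --max-concurrency, --num-prompts) are assumed to be available in the installed vLLM version, and num_prompts=512 is a hypothetical value that the card does not give. The script only builds and prints the commands; it does not run them.

```python
import shlex

# Stack and scenario values copied from the case card.
STACK = {
    "model": "Qwen/Qwen3.5-397B-Reasoning",
    "tensor_parallel": 8,
    "quantization": "fp4",
}
SCENARIO = {"prefill_seq": 4096, "decode_seq": 1024, "max_concurrent": 128}


def serve_argv(stack):
    """Build the server command line from the stack description."""
    return [
        "vllm", "serve", stack["model"],
        "--tensor-parallel-size", str(stack["tensor_parallel"]),
        "--quantization", stack["quantization"],
    ]


def benchmark_argv(stack, scenario, num_prompts):
    """Build a benchmark command line; flag availability is an assumption."""
    return [
        "python", "benchmark_serving.py",
        "--backend", "vllm",
        "--model", stack["model"],
        "--dataset-name", "random",
        "--random-input-len", str(scenario["prefill_seq"]),
        "--random-output-len", str(scenario["decode_seq"]),
        "--max-concurrency", str(scenario["max_concurrent"]),
        "--num-prompts", str(num_prompts),  # hypothetical; not on the card
    ]


if __name__ == "__main__":
    print(shlex.join(serve_argv(STACK)))
    print(shlex.join(benchmark_argv(STACK, SCENARIO, num_prompts=512)))
```

Keeping the commands as argument lists makes them easy to pass to subprocess.run without shell quoting problems, and shlex.join gives a copy-pasteable form for logs.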

Optimization pattern

memory-bound-decode-prefer-int8

Citations

[1] AMD MI355X + Qwen3.5 reference benchmark. https://www.amd.com/en/products/accelerators/instinct/mi355x.html · verified by measurement on 2026-04-28
Disclaimer: Numbers approximated from AMD MI355X reference benchmark coverage.