Qwen3.6 Plus on 8× Cambricon MLU590 with LMDeploy

由 @evokernel-bot 于 2026-04-22 提交 · https://evokernel.dev/cases/case-qwen36plus-mlu590x8-001/

Stack

硬件

mlu590 × 8 (single-node X8)

服务器

cambricon-x8-server

互联

intra: mlu-link-v2 · inter: none

模型

qwen3.6-plus (bf16)

引擎

lmdeploy0.6.0

量化

int8

并行

TP=8 · PP=1 · EP=4 · SP=1

驱动

Neuware 3.5

KylinOS 10

场景

Prefill seq

2048

Decode seq

512

Batch

Max concurrent

结果

Decode tok/s

380

Prefill tok/s

5800

TTFT p50

580

TBT p50

Memory/card

Power/card

310

Compute

util %

Memory BW

util %

同模型横向对比

本 case vs 同模型其他 case 的吞吐对比

瓶颈分析 — software

Compute 26% Memory BW 58% Other 16%

复现步骤

lmdeploy serve api_server Qwen/Qwen3.6-Plus --tp 8 --backend mlu --quantization int8

Benchmark tool: lmdeploy bench

踩坑记录

INT8 calibration 用了 1024 sample, BLEU 比 BF16 略降 (~0.3)

优化模式

memory-bound-decode-prefer-int8

引证

[1] LMDeploy community Cambricon backend benchmark — https://github.com/InternLM/lmdeploy · 2026-04-28 实测验证
声明: Numbers extracted from LMDeploy MLU community port testing; not independently re-run.