Kimi K2.6 on 16× Cambricon MLU590 (with vLLM port)

Submitted by @evokernel-bot on 2026-04-22 · https://evokernel.dev/cases/case-kimik26-mlu590x16-001/

Stack

Hardware: mlu590 × 16 (2 nodes × 8 cards)
Server: cambricon-x8-server
Interconnect: intra: mlu-link-v2 · inter: roce-v2
Model: kimi-k2.6 (bf16)
Engine: vllm 0.6.0
Quantization: bf16
Parallelism: TP=8 · PP=2 · EP=1 · SP=1
Driver: Neuware 3.5
OS: KylinOS 10
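
As a quick sanity check on the stack above, the product of the parallel degrees should equal the number of cards. The sketch below is illustrative only: `check_layout` is a hypothetical helper, not part of vLLM or Neuware. EP=1 and SP=1 contribute a factor of 1 and are left out. The `tp_fits_in_node` flag only says that one tensor-parallel group could sit inside a single 8-card node; whether the port actually places it that way is an assumption, not something the case states.

```python
def check_layout(nodes: int, cards_per_node: int, tp: int, pp: int) -> dict:
    """Check that TP x PP covers every card exactly once (hypothetical helper)."""
    total_cards = nodes * cards_per_node
    world_size = tp * pp
    return {
        "total_cards": total_cards,
        "world_size": world_size,
        "covers_all_cards": world_size == total_cards,
        # True when one tensor-parallel group can fit inside a single node.
        "tp_fits_in_node": tp <= cards_per_node and cards_per_node % tp == 0,
    }


# Values taken from the stack listing: 2 nodes x 8 cards, TP=8, PP=2.
layout = check_layout(nodes=2, cards_per_node=8, tp=8, pp=2)
print(layout)
```

If TP groups do stay within a node, tensor-parallel traffic would use the intra-node link and only pipeline-stage traffic would cross RoCE, but that placement is not confirmed by the source.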

Scenario

Prefill seq: 2048
Decode seq: 512
Batch: 16
Max concurrent: 64

Results

Decode tok/s: 480
Prefill tok/s: 7200
TTFT p50 (ms): 460
TBT p50 (ms): 64
Memory/card (GB): 56
Power/card (W): 320
Compute util (%): 28
Memory BW util (%): 64
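
A few per-card and energy figures can be derived from the results above. This is a rough sketch under two assumptions the source does not state: that the decode throughput is the aggregate across all 16 cards, and that the per-card power figure is the average draw during decode. `derived_metrics` is a hypothetical helper name.

```python
def derived_metrics(decode_tok_s: float, cards: int,
                    power_w_per_card: float, mem_gb_per_card: float) -> dict:
    """Derive per-card and energy figures from the reported results.

    Assumes decode_tok_s is the aggregate over all cards (not stated in the source).
    """
    total_power_w = cards * power_w_per_card
    return {
        "decode_tok_s_per_card": decode_tok_s / cards,
        "total_power_w": total_power_w,
        # Watts divided by tokens per second gives joules per token.
        "joules_per_decode_token": total_power_w / decode_tok_s,
        "total_memory_gb": cards * mem_gb_per_card,
    }


# Reported values: 480 decode tok/s, 16 cards, 320 W/card, 56 GB/card.
metrics = derived_metrics(480, 16, 320, 56)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```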

Bottleneck analysis: software

Compute 28% · Memory BW 64% · Other 8%

Reproduction steps

vllm serve moonshotai/Kimi-K2.6 --tensor-parallel-size 8 --pipeline-parallel-size 2 --device mlu

Benchmark tool: vllm benchmark_serving.py
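
For readers checking the latency figures, the sketch below shows one common way to compute TTFT and TBT medians from per-token arrival timestamps. It is an assumption about the metric definitions (TTFT as first-token time minus send time, TBT as the gap between consecutive tokens), not a copy of what benchmark_serving.py does internally. All function names and the sample timestamps are hypothetical.

```python
from statistics import median


def ttft_ms(sent_at_s: float, token_times_s: list) -> float:
    """Time to first token: first arrival minus request send time, in ms."""
    return (token_times_s[0] - sent_at_s) * 1000.0


def tbt_ms(token_times_s: list) -> list:
    """Time between tokens: gaps between consecutive arrivals, in ms."""
    return [(b - a) * 1000.0 for a, b in zip(token_times_s, token_times_s[1:])]


def summarize(requests: list) -> dict:
    """requests: list of (sent_at_s, [token_arrival_s, ...]) tuples."""
    ttfts = [ttft_ms(sent, times) for sent, times in requests]
    gaps = [gap for _, times in requests for gap in tbt_ms(times)]
    return {"ttft_p50_ms": median(ttfts), "tbt_p50_ms": median(gaps)}


# Synthetic timestamps for illustration only; not measured data.
sample = [
    (0.0, [0.50, 0.56, 0.62]),
    (1.0, [1.40, 1.47, 1.54]),
]
summary = summarize(sample)
print(summary)
```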

Pitfalls

  • The vLLM-MLU port does not yet support Kimi K2.6's native vision path; text only
  • The first load of the MoE router takes 18 min
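
Because the first load is slow, a benchmark script may want to wait for the server before sending traffic. The sketch below polls a health endpoint with a generous timeout. It assumes the port keeps upstream vLLM's `/health` route on the OpenAI-compatible server and that the server listens on `http://localhost:8000`; neither is confirmed by this case. The helper names and the 30-minute default are hypothetical choices.

```python
import time
import urllib.error
import urllib.request


def http_health_probe(base_url: str) -> bool:
    """Return True when GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(probe, timeout_s: float = 30 * 60, interval_s: float = 15.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example call (assumed URL; adjust to the actual deployment):
# ready = wait_until_ready(lambda: http_health_probe("http://localhost:8000"))
```

The probe is passed in as a callable so the wait loop can be exercised without a running server.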

Optimization patterns

References

  1. Cambricon MLU590 + Kimi K2.6 community benchmark reference: https://www.cambricon.com/ · verified by measurement on 2026-04-28
    Disclaimer: Numbers extracted from a public community port test; not independently re-run.