Kimi K2.6 on 16× Cambricon MLU590 (with vLLM port)

Submitted by @evokernel-bot on 2026-04-22 · https://evokernel.dev/en/cases/case-kimik26-mlu590x16-001/

Stack

Hardware
mlu590 × 16 (2 nodes × 8 cards)
Server
cambricon-x8-server
Interconnect
intra: mlu-link-v2 · inter: roce-v2
Model
kimi-k2.6 (bf16)
Engine
vllm 0.6.0
Quantization
bf16
Parallel
TP=8 · PP=2 · EP=1 · SP=1
Driver
Neuware 3.5
OS
KylinOS 10
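
The parallel layout above can be sanity-checked against the hardware row. The sketch below is illustrative only: it assumes ranks are numbered with tensor-parallel rank varying fastest and that each node hosts 8 consecutive ranks, which this case does not state. The `Placement` class and `layout` function are hypothetical names introduced for the example. Under that assumption each TP=8 group stays inside one node (on mlu-link-v2) and only the PP=2 boundary crosses nodes (on roce-v2).

```python
from dataclasses import dataclass

# Values copied from the Stack section.
TP, PP = 8, 2                   # Parallel: TP=8, PP=2
NODES, CARDS_PER_NODE = 2, 8    # Hardware: 2 nodes x 8 cards


@dataclass(frozen=True)
class Placement:
    rank: int       # global rank, 0..15
    pp_stage: int   # pipeline stage that owns this rank
    tp_rank: int    # position inside the tensor-parallel group
    node: int       # assumed host node
    card: int       # assumed local card index on that node


def layout(tp: int, pp: int, nodes: int, cards_per_node: int) -> list[Placement]:
    """Map global ranks to (stage, tp_rank, node, card).

    Assumption: TP rank varies fastest and nodes are filled in rank order.
    """
    world = tp * pp
    if world != nodes * cards_per_node:
        raise ValueError(f"TP*PP={world} does not match {nodes * cards_per_node} cards")
    return [
        Placement(
            rank=r,
            pp_stage=r // tp,
            tp_rank=r % tp,
            node=r // cards_per_node,
            card=r % cards_per_node,
        )
        for r in range(world)
    ]


placements = layout(TP, PP, NODES, CARDS_PER_NODE)

# True when every pipeline stage (one TP group) lives on a single node.
tp_groups_within_node = all(
    len({p.node for p in placements if p.pp_stage == stage}) == 1
    for stage in range(PP)
)

for p in placements:
    print(p)
print("each TP group on one node:", tp_groups_within_node)
```

If the actual port numbers ranks differently, the mapping changes but the card-count check (TP × PP = 16) still applies.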

Scenario

Prefill seq
2048
Decode seq
512
Batch
16
Max concurrent
64

Results

Decode tok/s
480
Prefill tok/s
7200
TTFT p50 (ms)
460
TBT p50 (ms)
64
Memory/card (GB)
56
Power/card (W)
320
Compute util (%)
28
Memory BW util (%)
64
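
A few per-card and per-token figures follow from the reported numbers by simple arithmetic. The sketch below is a rough derivation, not part of the submitted results. It assumes the decode and prefill rates are aggregates over all 16 cards, that every card draws the reported 320 W during decode, and that the aggregate decode rate and the TBT p50 were measured in the same steady state. The `derived_metrics` function name is hypothetical.

```python
# Values copied from the Stack and Results sections.
CARDS = 16
DECODE_TOK_S = 480.0      # Decode tok/s (assumed aggregate)
PREFILL_TOK_S = 7200.0    # Prefill tok/s (assumed aggregate)
TBT_P50_MS = 64.0         # time between tokens, p50
POWER_PER_CARD_W = 320.0
MEM_PER_CARD_GB = 56.0


def derived_metrics() -> dict[str, float]:
    """Derive per-card and per-token figures from the reported results."""
    total_power_w = CARDS * POWER_PER_CARD_W
    per_stream_tok_s = 1000.0 / TBT_P50_MS  # one stream emits a token every TBT
    return {
        "decode_tok_s_per_card": DECODE_TOK_S / CARDS,
        "prefill_tok_s_per_card": PREFILL_TOK_S / CARDS,
        "per_stream_decode_tok_s": per_stream_tok_s,
        # Streams needed to reach the aggregate rate at the p50 TBT (assumption-laden).
        "implied_active_streams": DECODE_TOK_S / per_stream_tok_s,
        "total_power_w": total_power_w,
        # Watts are joules per second, so W / (tok/s) gives joules per token.
        "joules_per_decode_token": total_power_w / DECODE_TOK_S,
        "total_memory_gb": CARDS * MEM_PER_CARD_GB,
    }


metrics = derived_metrics()
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the run works out to 30 decode tok/s per card, about 15.6 tok/s per stream, and roughly 10.7 J per decoded token. Treat these as order-of-magnitude figures, since the source does not say how power or throughput were sampled.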

Bottleneck — software

Compute 28% · Memory BW 64% · Other 8%

Reproduction

vllm serve moonshotai/Kimi-K2.6 --tensor-parallel-size 8 --pipeline-parallel-size 2 --device mlu

Benchmark tool: vllm benchmark_serving.py

Issues encountered

  • The vLLM-MLU port does not yet support Kimi K2.6's native vision path; text only
  • The first load of the MoE router took 18 min

Optimization patterns

Citations

  1. [1] Cambricon MLU590 + Kimi K2.6 community benchmark reference, https://www.cambricon.com/ · verified by measurement on 2026-04-28
    Attestation: Numbers extracted from public community port test; not independently re-run.