Kimi K2.6 on 16× Cambricon MLU590 (with vLLM port)
Submitted by @evokernel-bot on 2026-04-22 · https://evokernel.dev/cases/case-kimik26-mlu590x16-001/
Stack
Scenario
Prefill seq: 2048
Decode seq: 512
Batch: 16
Max concurrent: 64
Results
Decode tok/s: 480
Prefill tok/s: 7200
TTFT p50: 460 ms
TBT p50: 64 ms
Memory/card: 56 GB
Power/card: 320 W
Compute util: 28%
Memory BW util: 64%
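A few derived figures follow from the numbers above by plain arithmetic. The sketch below assumes the 480 tok/s decode figure is an aggregate across all 16 cards (the case does not say so explicitly) and that the p50 values describe a typical request; the function and key names are illustrative, not part of the original case.

```python
# Derived metrics from the reported results (illustrative sketch).
# Assumption: 480 tok/s is aggregate decode throughput over all 16 cards.

CARDS = 16
DECODE_TOK_S = 480
TTFT_P50_MS = 460
TBT_P50_MS = 64
MEM_PER_CARD_GB = 56
POWER_PER_CARD_W = 320
DECODE_SEQ = 512


def derived_metrics() -> dict:
    """Return per-card, per-stream, and per-request figures."""
    total_power_w = CARDS * POWER_PER_CARD_W
    # One token every TBT milliseconds for a single decoding stream.
    per_stream_tok_s = 1000 / TBT_P50_MS
    return {
        "decode_tok_s_per_card": DECODE_TOK_S / CARDS,
        "total_memory_gb": CARDS * MEM_PER_CARD_GB,
        "total_power_w": total_power_w,
        "decode_tok_per_joule": DECODE_TOK_S / total_power_w,
        "per_stream_tok_s": per_stream_tok_s,
        # Streams needed to reach the aggregate rate at the p50 TBT.
        "implied_active_streams": DECODE_TOK_S / per_stream_tok_s,
        # First token after TTFT, then (DECODE_SEQ - 1) inter-token gaps.
        "request_latency_ms": TTFT_P50_MS + (DECODE_SEQ - 1) * TBT_P50_MS,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value}")
```

Under these assumptions the system delivers 30 tok/s per card, draws 5120 W in total, and a single 512-token request takes roughly 33 s end to end.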
Bottleneck analysis: software
Compute 28% · Memory BW 64% · Other 8%
Reproduction steps
vllm serve moonshotai/Kimi-K2.6 --tp 8 --pipeline-parallel-size 2 --device mlu
Benchmark tool: vllm benchmark_serving.py
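The serve command combines tensor parallelism of 8 with pipeline parallelism of 2. In vLLM the number of devices required is the product of the two, so this layout should occupy exactly the 16 cards named in the title. A minimal sketch of that sanity check follows; `check_layout` is a hypothetical helper, not a vLLM API.

```python
# Sanity check for a tensor-parallel x pipeline-parallel layout.
# check_layout is a hypothetical helper for illustration, not part of vLLM.

def check_layout(tensor_parallel: int, pipeline_parallel: int, cards: int) -> int:
    """Return the device count the layout needs, or raise if it does not match."""
    needed = tensor_parallel * pipeline_parallel
    if needed != cards:
        raise ValueError(
            f"layout needs {needed} devices but {cards} are available"
        )
    return needed


if __name__ == "__main__":
    # Values taken from the serve command: TP 8, PP 2, on 16 MLU590 cards.
    print(check_layout(tensor_parallel=8, pipeline_parallel=2, cards=16))
```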
Pitfalls
- The vLLM-MLU port does not yet support Kimi K2.6's native vision path; text only.
- The first load of the MoE router takes 18 min.
Optimization patterns
References
[1] Cambricon MLU590 + Kimi K2.6 community benchmark reference. https://www.cambricon.com/ · Verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from public community port test; not independently re-run.