DeepSeek R1 on 16× Iluvatar 天垓 100 (Iluvatar IxRT)

Submitted by @evokernel-bot on 2026-04-15 · https://evokernel.dev/cases/case-dsr1-tianhe100x16-001/

Stack

Hardware

iluvatar-bi × 16 (2 nodes × 8 cards)

Server

—

Interconnect

intra: PCIe-Gen4 · inter: roce-v2

Model

deepseek-r1 (bf16)

Engine

lmdeploy 0.6.0

Quantization

int8

Parallelism

TP=8 · PP=2 · EP=1 · SP=1

Driver

IxRT 1.8

KylinOS 10

Scenario

Prefill seq

1024

Decode seq

256

Batch

Max concurrent

Results

Decode tok/s

220

Prefill tok/s

3200

TTFT p50

980

TBT p50

152

Memory/card

Power/card

290

Compute

util %

Memory BW

util %
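
The headline figures can be cross-checked with a little arithmetic. The sketch below rests on assumptions the card does not confirm: that the decode throughput is the aggregate across all 16 cards, that TBT p50 is in milliseconds, and that power per card is in watts. The derived values are estimates, not reported results.

```python
# Figures copied from the card above.
TP, PP = 8, 2
NODES, CARDS_PER_NODE = 2, 8
DECODE_TOK_S = 220        # ASSUMPTION: aggregate over all cards
TBT_P50 = 152             # ASSUMPTION: milliseconds
POWER_PER_CARD = 290      # ASSUMPTION: watts

# The parallel layout should cover exactly the cards in the cluster.
world_size = TP * PP
total_cards = NODES * CARDS_PER_NODE
assert world_size == total_cards

# Average decode throughput contributed by each card.
decode_per_card = DECODE_TOK_S / total_cards

# If every stream emits one token per TBT interval, the ratio of aggregate
# to per-stream throughput estimates the number of concurrent streams.
per_stream_tok_s = 1000 / TBT_P50
implied_concurrency = DECODE_TOK_S / per_stream_tok_s

# Decode tokens per joule, counting card power only.
tokens_per_joule = DECODE_TOK_S / (POWER_PER_CARD * total_cards)

print(f"world size:          {world_size}")
print(f"decode tok/s/card:   {decode_per_card:.2f}")
print(f"implied concurrency: {implied_concurrency:.1f}")
print(f"decode tok/J:        {tokens_per_joule:.4f}")
```

The implied concurrency is only a rough stand-in for the missing Batch and Max concurrent fields; it says nothing about how the benchmark was actually configured.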

Same-model comparison

Throughput of this case vs. other cases for the same model

Bottleneck analysis: software

Compute 18% · Memory BW 42% · Other 40%

Reproduction steps

lmdeploy serve api_server deepseek-ai/DeepSeek-R1 --tp 8 --pp 2 --backend ixrt

Benchmark tool: lmdeploy bench
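
The card does not say how the benchmark tool derives TTFT and TBT. A common approach is to timestamp each streamed token and compute the metrics from those arrival times. The sketch below illustrates that approach; `latency_stats` is a hypothetical helper, not part of lmdeploy, and the timestamps are synthetic values chosen to mirror the p50 figures on this card.

```python
from statistics import median


def latency_stats(request_start, token_times):
    """Return (ttft, tbt_p50) in the same time unit as the inputs.

    request_start: time the request was sent.
    token_times:   arrival time of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    # Time to first token: delay between sending and the first arrival.
    ttft = token_times[0] - request_start
    # Time between tokens: gaps between consecutive arrivals.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt_p50 = median(gaps) if gaps else None
    return ttft, tbt_p50


# Synthetic arrival times (illustrative only, not measured data).
ttft, tbt_p50 = latency_stats(0, [980, 1132, 1284, 1436])
print(ttft, tbt_p50)
```

In a real run the timestamps would come from a streaming client reading the server's responses, and the p50 would be taken across many requests rather than within a single one.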

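Pitfalls

Cross-card communication over PCIe-Gen4 becomes the bottleneck; intra-TP communication accounts for about 35% of step time
IxRT 1.8 does not yet support FP8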
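
To get a feel for what the roughly 35% communication share implies, an Amdahl-style bound estimates how much faster a step could run if that communication were reduced. This is a sketch under a strong assumption: communication is fully serialized with compute, with no overlap. The function name is hypothetical and the outputs are upper bounds, not measurements.

```python
COMM_FRACTION = 0.35  # from the card: intra-TP communication share of step time


def speedup_if_comm_reduced(comm_fraction, reduction):
    """Upper-bound step speedup when communication time shrinks.

    comm_fraction: share of step time spent communicating (0..1).
    reduction:     fraction of that communication removed (0..1).
    ASSUMPTION: communication and compute do not overlap.
    """
    remaining = (1 - comm_fraction) + comm_fraction * (1 - reduction)
    return 1 / remaining


halved = speedup_if_comm_reduced(COMM_FRACTION, 0.5)
removed = speedup_if_comm_reduced(COMM_FRACTION, 1.0)
print(f"comm halved:  {halved:.2f}x")
print(f"comm removed: {removed:.2f}x")
```

Even removing the communication cost entirely caps the gain at about 1.5x under this model, so the remaining time (compute, memory bandwidth, and other overhead) still dominates.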

Optimization patterns

moe-expert-routing-on-domestic · memory-bound-decode-prefer-int8

References

[1] Iluvatar 天垓 100 + DeepSeek R1 community port testing · https://www.iluvatar.com/ · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from the Iluvatar community port; not independently re-run.