DeepSeek V3 on AWS Trainium 2 (64-chip Trn2 instance)

Submitted by @evokernel-bot on 2026-04-19 · https://evokernel.dev/cases/case-dsv3-trainium2-x64-001/

Stack

Hardware

trainium-2 × 64 (Trn2 ring-mesh)

Server

—

Interconnect

intra: NeuronLink · inter: EFA

Model

deepseek-r1 (bf16)

Engine

vllm 0.6.0

Quantization

bf16

Parallelism

TP=16 · PP=4 · EP=1 · SP=1

Driver

Neuron SDK 2.20

Amazon Linux 2023
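
A quick sanity check on the parallelism line above: with EP=1 and SP=1, the tensor-parallel and pipeline-parallel degrees should multiply out to the 64 chips listed under Hardware. The sketch below assumes that simple product rule; `world_size` is a hypothetical helper, not part of vLLM or the Neuron SDK.

```python
def world_size(tp: int, pp: int, ep: int = 1, sp: int = 1) -> int:
    """Devices needed for a TP x PP layout (hypothetical helper).

    Assumption: EP=1 and SP=1 add no extra devices; other values are
    out of scope for this sketch and are rejected.
    """
    if ep != 1 or sp != 1:
        raise ValueError("this sketch only covers EP=1 and SP=1")
    return tp * pp


# Values copied from the card: TP=16, PP=4, EP=1, SP=1 on 64 chips.
layout = {"tp": 16, "pp": 4, "ep": 1, "sp": 1}
total_chips = 64

needed = world_size(**layout)
print(needed, needed == total_chips)  # 64 True
```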

Scenario

Prefill seq

2048

Decode seq

512

Batch

Max concurrent

256
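
For a rough sense of scale, the scenario figures can be turned into a peak token count. This is back-of-the-envelope arithmetic only, assuming every one of the 256 concurrent requests holds its full 2048-token prompt plus 512 generated tokens at the same moment; the variable names are illustrative.

```python
# Values copied from the Scenario block above.
PREFILL_LEN = 2048     # prompt tokens per request
DECODE_LEN = 512       # generated tokens per request
MAX_CONCURRENT = 256   # simultaneous requests

# Tokens one request holds once its generation is complete.
tokens_per_request = PREFILL_LEN + DECODE_LEN

# Upper bound on tokens resident at once (assumes all requests are at full length).
peak_tokens_in_flight = tokens_per_request * MAX_CONCURRENT

print(tokens_per_request, peak_tokens_in_flight)  # 2560 655360
```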

Results

Decode tok/s

3600

Prefill tok/s

48000

TTFT p50

320

TBT p50

Memory/card

Power/card

480

Compute

util %

Memory BW

util %
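
The throughput figures above can be normalized per chip and per stream. The sketch assumes both tok/s values are aggregates over the whole 64-chip deployment and that all 256 streams decode simultaneously with an even share; neither assumption is stated on the card, so the implied inter-token gap is an estimate and not a substitute for the TBT p50 value that is missing above.

```python
# Figures copied from the card.
DECODE_TOK_S = 3600
PREFILL_TOK_S = 48000
CHIPS = 64
MAX_CONCURRENT = 256

# Assumption: throughput is aggregated over all chips.
decode_per_chip = DECODE_TOK_S / CHIPS
prefill_per_chip = PREFILL_TOK_S / CHIPS

# Assumption: every stream is decoding and gets an equal share.
decode_per_stream = DECODE_TOK_S / MAX_CONCURRENT
implied_gap_ms = 1000 * MAX_CONCURRENT / DECODE_TOK_S

print(decode_per_chip, prefill_per_chip)  # 56.25 750.0
print(decode_per_stream, round(implied_gap_ms, 1))  # 14.0625 71.1
```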

Same-model comparison

Throughput comparison: this case vs. other cases for the same model

Bottleneck analysis: memory-bandwidth

Compute 46% Memory BW 64% Other 0%
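
One simple way to arrive at a label like the one above is to pick the resource with the highest utilization. Whether the page derives its label this way is an assumption; `dominant_bottleneck` is a hypothetical helper shown only to make the reading of the three percentages concrete.

```python
def dominant_bottleneck(utilization: dict[str, float]) -> str:
    """Return the resource with the highest utilization (hypothetical heuristic)."""
    return max(utilization, key=utilization.get)


# Percentages copied from the line above.
util = {"compute": 46.0, "memory-bandwidth": 64.0, "other": 0.0}

print(dominant_bottleneck(util))  # memory-bandwidth
```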

Reproduction steps

vllm serve deepseek-ai/DeepSeek-R1 --device neuron --tensor-parallel-size 16 --pipeline-parallel-size 4

Benchmark tool: vllm benchmark_serving.py
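
Latency metrics such as TTFT p50 and TBT p50 can be derived from per-token arrival timestamps. The sketch below is a minimal illustration on synthetic data, not the benchmark tool's own implementation; the function names and the integer-millisecond timestamp format are assumptions.

```python
from statistics import median


def ttft_ms(sent_ms: int, token_times_ms: list[int]) -> int:
    """Time to first token: first token arrival minus request send time."""
    return token_times_ms[0] - sent_ms


def tbt_ms(token_times_ms: list[int]) -> list[int]:
    """Time between tokens: gaps between consecutive token arrivals."""
    return [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]


# Synthetic (send time, token arrival times) pairs in milliseconds.
# These are made-up values, not measurements from this case.
requests = [
    (0, [100, 110, 120]),
    (0, [120, 135, 150]),
    (0, [150, 160, 175]),
]

ttft_p50 = median([ttft_ms(sent, times) for sent, times in requests])
tbt_p50 = median([gap for _, times in requests for gap in tbt_ms(times)])

print(ttft_p50, tbt_p50)  # 120 12.5
```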

References

[1] AWS Trainium 2 + DeepSeek R1 reference benchmark, https://aws.amazon.com/ai/machine-learning/trainium/ · verified by measurement 2026-04-28
Disclaimer: Numbers extracted from AWS public Trainium 2 benchmark coverage; not independently re-run.