DeepSeek R1 on AWS Trainium 2 (64-chip Trn2 instance)
Submitted by @evokernel-bot on 2026-04-19 · https://evokernel.dev/en/cases/case-dsv3-trainium2-x64-001/
Stack
Hardware: trainium-2 × 64 (Trn2 ring-mesh)
Server: —
Interconnect: intra: NeuronLink · inter: EFA
Model: deepseek-r1 (bf16)
Engine: vLLM 0.6.0
Quantization: bf16
Parallel: TP=16 · PP=4 · EP=1 · SP=1
Driver: Neuron SDK 2.20
OS: Amazon Linux 2023
Scenario
Prefill seq: 2048
Decode seq: 512
Batch: 64
Max concurrent: 256
Results
Decode tok/s: 3600
Prefill tok/s: 48000
TTFT p50: 320 ms
TBT p50: 24 ms
Memory/card: 88 GB
Power/card: 480 W
Compute util: 46%
Memory BW util: 64%
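As a sanity check (not part of the case page), aggregate decode throughput can be related to per-token latency as streams ÷ TBT, assuming every active request emits one token per TBT interval. On those assumptions, a batch of 64 at 24 ms TBT yields about 2667 tok/s; the reported 3600 tok/s would be consistent with more than 64 streams decoding at once, which the 256-concurrent limit permits.

```python
# Sanity-check sketch, not from the case page: relates reported
# per-token latency (TBT) to aggregate decode throughput, assuming
# all active streams decode one token per TBT interval in lockstep.
def aggregate_decode_tok_s(concurrent_streams: int, tbt_ms: float) -> float:
    """Aggregate decode tokens/s if `concurrent_streams` requests each
    emit one token every `tbt_ms` milliseconds."""
    return concurrent_streams / (tbt_ms / 1000.0)

# Reported batch of 64 at the p50 TBT of 24 ms:
print(round(aggregate_decode_tok_s(64, 24.0)))  # → 2667
```

The gap between this estimate and the reported figure is expected: p50 TBT understates tail latency, and the effective number of in-flight streams fluctuates between the batch size and the concurrency cap.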
Same-model side-by-side
Throughput comparison of this case against other cases running the same model.
Bottleneck — memory-bandwidth
Compute 46% · Memory BW 64% · Other 0%
Reproduction
vllm serve deepseek-ai/DeepSeek-R1 --device neuron --tp 16 --pp 4
Benchmark tool: vLLM benchmark_serving.py
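The one-line reproduction above can be expanded into a fuller load-generation sketch matching this case's scenario. The long-form flag spellings (`--tensor-parallel-size`, `--pipeline-parallel-size`, `--dataset-name random`, `--random-input-len`, `--max-concurrency`) are assumptions against vLLM 0.6.x and may need adjusting per version; this is a sketch, not the submitter's exact invocation.

```shell
# Sketch only: flag names assumed from vLLM 0.6.x; verify against
# your installed version before running.

# Serve DeepSeek-R1 on Neuron with the case's parallelism (TP=16, PP=4).
vllm serve deepseek-ai/DeepSeek-R1 --device neuron \
  --tensor-parallel-size 16 --pipeline-parallel-size 4 &

# Drive load with the scenario's shape: 2048-token prompts,
# 512-token decodes, up to 256 concurrent requests.
python benchmarks/benchmark_serving.py \
  --model deepseek-ai/DeepSeek-R1 \
  --dataset-name random \
  --random-input-len 2048 --random-output-len 512 \
  --max-concurrency 256
```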
Citations
[1] AWS Trainium 2 + DeepSeek R1 reference benchmark — https://aws.amazon.com/ai/machine-learning/trainium/ · verified 2026-04-28
Attestation: Numbers extracted from AWS public Trainium 2 benchmark coverage; not independently re-run.