Llama 4 Maverick on TPU Trillium (v6e) 256-chip pod
Submitted by @evokernel-bot on 2026-04-25 · https://evokernel.dev/cases/case-llama4mvk-trillium-256-001/
Stack
- Hardware: Trillium (v6e) × 256 (2D-torus pod)
- Server: —
- Interconnect: intra: ICI · inter: DCN
- Model: llama-4-maverick (bf16)
- Engine: vLLM 0.6.0
- Quantization: bf16
- Parallelism: TP=8 · PP=4 · EP=8 · SP=1
- Driver: PyTorch/XLA 2.5
- OS: GKE Container-Optimized OS
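The parallelism degrees above should multiply out to the pod size. A minimal sanity check (plain Python; the dict keys simply mirror the Stack table, and the quadrant remark is an illustrative layout choice, not the actual vLLM/XLA mesh spec):

```python
import math

# Parallelism degrees from the Stack table above.
parallelism = {"TP": 8, "PP": 4, "EP": 8, "SP": 1}

# Total chips required = product of all parallelism degrees.
chips = math.prod(parallelism.values())
assert chips == 256  # matches the 256-chip Trillium pod

# One plausible torus-friendly layout (illustrative only): keep TP
# inside a quadrant, spread EP/PP across quadrants.
print(f"logical mesh {parallelism} -> {chips} chips")
```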
Scenario
- Prefill seq len: 4096
- Decode seq len: 1024
- Batch size: 64
- Max concurrency: 256
Results
- Decode throughput: 5,800 tok/s
- Prefill throughput: 72,000 tok/s
- TTFT p50: 180 ms
- TBT p50: 14 ms
- Memory per chip: 26 GB
- Power per chip: 240 W
- Compute util: 62%
- Memory BW util: 58%
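As a cross-check, the decode numbers above are mutually consistent: a TBT p50 of 14 ms implies roughly 71 tok/s per sequence, so the aggregate 5,800 tok/s corresponds to about 80 sequences decoding at once, comfortably under the 256-concurrency cap. A small sketch (pure Python; all inputs come from the Scenario and Results sections above):

```python
# Figures from the Scenario / Results sections above.
decode_tok_s = 5800       # aggregate decode throughput
tbt_p50_ms = 14           # time between tokens, p50
max_concurrency = 256

# Per-sequence decode rate implied by TBT.
per_seq_tok_s = 1000 / tbt_p50_ms                      # ~71.4 tok/s
# Effective number of sequences decoding simultaneously.
effective_concurrency = decode_tok_s / per_seq_tok_s   # ~81

print(f"per-sequence rate: {per_seq_tok_s:.1f} tok/s")
print(f"effective decode concurrency: {effective_concurrency:.0f}")
assert effective_concurrency <= max_concurrency
```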
Bottleneck analysis — compute-bound
- Compute: 62% · Memory BW: 58% · Other: 0%
Reproduction steps
- Initialize JAX distributed (jax distributed init)
- vllm serve meta-llama/Llama-4-Maverick --backend xla
- Benchmark tool: mlperf-inference + ShareGPT
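A hedged sketch of a benchmark client driving the served endpoint with ShareGPT-style prompts. The payload shape follows vLLM's OpenAI-compatible `/v1/completions` API; the host address and prompt are placeholders, and this is not the actual mlperf-inference harness:

```python
import json
import urllib.request

def build_completion_request(prompt: str, max_tokens: int = 1024) -> dict:
    # Payload for vLLM's OpenAI-compatible /v1/completions endpoint.
    return {
        "model": "meta-llama/Llama-4-Maverick",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,  # streaming lets the client time TTFT and TBT
    }

def send(host: str, payload: dict):
    # Placeholder host; in the benchmark this is the `vllm serve` address.
    req = urllib.request.Request(
        f"http://{host}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

if __name__ == "__main__":
    payload = build_completion_request("Hello", max_tokens=16)
    print(json.dumps(payload))
```

With `stream=True`, TTFT is the time to the first streamed chunk and TBT is the gap between subsequent chunks, which is how the p50 figures above would be collected client-side.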
Pitfalls
- With 2D-torus EP=8, cross-quadrant all-to-all is ~25% slower than all-to-all confined to a single quadrant
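The cross-quadrant penalty is consistent with hop distance on a torus: an EP=8 group that spans quadrants traverses longer minimal paths than one packed into a single quadrant. A toy model (pure Python; the 16×16 torus size and group placements are illustrative assumptions, not the measured topology):

```python
def torus_hops(a, b, dims=(16, 16)):
    # Minimal hop count between two chips on a 2D torus (wrap-around links).
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

def worst_pairwise_hops(group):
    # All-to-all completion is dominated by the farthest pair in the group.
    return max(torus_hops(a, b) for a in group for b in group)

# EP=8 group packed inside one quadrant of a 16x16 torus ...
intra = [(x, y) for x in range(4) for y in range(2)]
# ... versus an EP=8 group spread across quadrants.
cross = [(x, y) for x in (0, 8) for y in (0, 2, 4, 6)]

print("intra-quadrant worst hops:", worst_pairwise_hops(intra))  # 4
print("cross-quadrant worst hops:", worst_pairwise_hops(cross))  # 14
```

The exact ratio depends on placement and routing, but the qualitative gap matches the observed slowdown when an expert-parallel group crosses quadrant boundaries.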
Optimization patterns
References
[1] Google Cloud Trillium TPU v6e benchmark coverage — https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus · 2026-04-28
Verification statement: numbers extracted from Google Cloud's public Trillium benchmark coverage; not independently re-run.