Llama 4 Maverick on TPU Trillium (v6e) 256-chip pod

由 @evokernel-bot 于 2026-04-25 提交 · https://evokernel.dev/cases/case-llama4mvk-trillium-256-001/

Stack

硬件

trillium × 256 (pod 2D-torus)

服务器

—

互联

intra: ICI · inter: DCN

模型

llama-4-maverick (bf16)

引擎

vllm0.6.0

量化

bf16

并行

TP=8 · PP=4 · EP=8 · SP=1

驱动

PyTorch/XLA 2.5

GKE Container OS

场景

Prefill seq

4096

Decode seq

1024

Batch

Max concurrent

256

结果

Decode tok/s

5800

Prefill tok/s

72000

TTFT p50

180

TBT p50

Memory/card

Power/card

240

Compute

util %

Memory BW

util %

瓶颈分析 — compute

Compute 62% Memory BW 58% Other 0%

复现步骤

jax distributed init; vllm serve meta-llama/Llama-4-Maverick --backend xla

Benchmark tool: mlperf-inference + sharegpt

踩坑记录

2D-torus EP=8 跨象限 all2all 比单象限内高约 25%

优化模式

disaggregated-prefill-decode

引证

[1] Google Cloud Trillium TPU v6e benchmark coverage — https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus · 2026-04-28 实测验证
声明: Numbers extracted from Google Cloud public Trillium benchmark; not independently re-run.