
Google TPU Trillium (v6e)

PROPRIETARY · Available · Released 2024 · tpu-v6
BF16: 918 TFLOP/s (vendor claimed)
FP8: 918 TFLOP/s (vendor claimed)
FP4: not supported
Memory: 32 GB (vendor claimed)
Mem BW: 1640 GB/s (vendor claimed)
TDP: 250 W (vendor claimed)

Full specifications

Compute

FP4 TFLOPS: not supported
FP8 TFLOPS: 918
BF16 TFLOPS: 918
FP16 TFLOPS: 918
INT8 TOPS: 1836

Memory

Capacity: 32 GB
Bandwidth: 1640 GB/s
Type: HBM2e

Chip architecture 🟢 vendor floorplan

XPU count: 1
HBM stacks: 2
Process node: 5 nm

Scale-Up (intra-node)

Protocol: ICI
Per-link bandwidth: 3200 GB/s
World size: 256
Topology: 2D torus
Switches: not listed

Scale-Out (inter-node)

Per-card egress: 100 Gbps
Protocol: DCN
NIC: not listed

Topology diagram

Topology · 256-card scale-up domain

[Figure: die-level architecture · 1 XPU (darker block = tensor/matrix engine) · 2× HBM stacks · L2/shared cache + NoC · L1/register file per XPU · 918 TFLOPS BF16 · 918 FP8 · 32 GB HBM2e @ 1.6 TB/s · 250 W TDP · 🟢 vendor floorplan · 5 nm]

[Figure: cluster topology · ICI @ 3200 GB/s · spine (ICI fabric) and leaf switches · super-pod (rack-scale) with 256 cards in a single scale-up domain · 3200 GB/s/link · 2-tier Clos fabric]
Scale-Up (intra-domain): ICI · 3200 GB/s · topology: 2D torus · world_size = 256
Scale-Out (cross-domain): DCN · 100 Gbps/card NIC
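
As a back-of-envelope check on the interconnect figures above, the sketch below estimates a bandwidth-bound ring all-reduce over the ICI domain. It assumes the 3200 GB/s per-link figure is fully usable and models the collective as a flat ring; the 2D-torus fabric can in practice match or beat this, so treat it as a rough estimate, not a measured number.

```python
def ring_allreduce_seconds(message_bytes: float,
                           world_size: int = 256,
                           link_gb_per_s: float = 3200.0) -> float:
    """Bandwidth-only ring all-reduce estimate: each chip sends and
    receives 2*(N-1)/N of the buffer over its ICI link."""
    traffic_bytes = 2 * (world_size - 1) / world_size * message_bytes
    return traffic_bytes / (link_gb_per_s * 1e9)

# Example: all-reducing a 1 GB buffer across the 256-chip pod.
print(f"{ring_allreduce_seconds(1e9) * 1e3:.2f} ms")  # ~0.62 ms under these assumptions
```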

Which models can it run?

Quick estimates · decode tok/s/card upper bound

TP=8 · FP8 · batch=16 · prefill=1024 · decode=256 · efficiency calibration applied

Adjust in the calculator →
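
The "insufficient memory" verdicts in the table below are a capacity check that runs before any throughput math. Here is a minimal sketch of such a check, assuming FP8 weights (1 byte per parameter) over a model's total parameter count (for MoE models all experts must be resident, not just the active parameters listed) pooled across the TP=8 group, with an assumed flat fraction reserved for KV cache and activations:

```python
def fits_in_memory(total_params_billions: float,
                   tp: int = 8,
                   hbm_gb: float = 32.0,
                   kv_overhead_frac: float = 0.3) -> bool:
    """True if FP8 weights fit in the tensor-parallel group's pooled HBM.
    kv_overhead_frac is an assumed reservation, not a measured value."""
    weight_gb = total_params_billions          # 1 byte/param at FP8
    budget_gb = tp * hbm_gb * (1.0 - kv_overhead_frac)
    return weight_gb <= budget_gb

print(fits_in_memory(22))    # True: a 22B dense model such as Mistral Small 4 fits
print(fits_in_memory(400))   # False: a hypothetical 400B-total MoE would not
```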
| Model | Vendor | Params (active) | Decode tok/s/card | Bottleneck |
|---|---|---|---|---|
| DeepSeek V4 Pro | deepseek | 49B | n/a | insufficient memory |
| DeepSeek V4 Flash | deepseek | 13B | n/a | insufficient memory |
| Mistral Small 4 | mistral | 22B | 23 | memory bandwidth |
| GLM-5 Reasoning | zhipu | 32B | 19 | memory bandwidth |
| GLM-5.1 | zhipu | 32B | n/a | insufficient memory |
| Qwen3.6 Plus | alibaba | 35B | n/a | insufficient memory |
| Kimi K2.6 | moonshot | 32B | n/a | insufficient memory |
| MiniMax M2.7 | minimax | 46B | n/a | insufficient memory |

Operator-level fit · bottleneck type + upper bound for any model

Operator-level fit (per-token roofline)

Computed from each model's operator_decomposition and this card's 918 BF16 TFLOPS / 1,640 GB/s · ridge point ≈ 560 FLOPs/byte

Upper bound = min(compute roof, memory-bandwidth roof) · efficiency not applied
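
A minimal sketch of the per-token roofline behind the table below, using this card's spec numbers; the per-model FLOPs and bytes per token would come from its operator_decomposition (the exact inputs here are assumptions for illustration):

```python
PEAK_FLOPS = 918e12   # BF16 TFLOPS, from the spec above
PEAK_BW    = 1640e9   # HBM bytes/s

RIDGE = PEAK_FLOPS / PEAK_BW   # ~560 FLOPs/byte, matching the header above

def tok_per_s_upper_bound(flops_per_token: float, bytes_per_token: float) -> float:
    """min(compute roof, memory-bandwidth roof); no efficiency factor applied."""
    return min(PEAK_FLOPS / flops_per_token, PEAK_BW / bytes_per_token)

print(round(RIDGE))   # 560
```

Any model whose dominant operator sits below the ~560 FLOPs/byte ridge lands on the memory roof, which is why every row below is bandwidth-bound.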
| Model | Domain | Dominant operator | AI (FLOPs/byte) | Bottleneck | tok/s upper bound |
|---|---|---|---|---|---|
| DeepSeek V4 Pro | llm | matmul | 245.5 | 💾 memory bandwidth | 67k |
| GraphCast | scientific | graph-message-passing | 0.9 | 💾 memory bandwidth | 3026 |
| AlphaFold 3 | scientific | pair-bias-attention | 2.3 | 💾 memory bandwidth | 909 |
| GPT-OSS | llm | matmul | 0.7 | 💾 memory bandwidth | 133 |
| Gemma 4 26B | llm | matmul | 0.7 | 💾 memory bandwidth | 99 |
| DeepSeek V4 Flash | llm | matmul | 0.8 | 💾 memory bandwidth | 93 |
| Mistral Small 4 | llm | matmul | 0.6 | 💾 memory bandwidth | 42 |
| Llama 4 Maverick | llm | matmul | 0.8 | 💾 memory bandwidth | 42 |
Needs efficiency calibration + concurrency sweep + TCO estimate → evaluate in the calculator →

Operator support & optimization headroom

Per-operator support derived from software_support.engines + scale-up topology. Optimization headroom from measured efficiency factor.

Optimization headroom: +46 pp (moderate)

Reaching 54% of roofline. Moderate headroom; focus on attention + MoE kernel fusion.

Communication (collective)
All-to-All 🟢 mature
all-to-all via ICI world_size=256
AllReduce 🟢 mature
ICI ring all-reduce
Attention
Multi-Head Attention 🟢 mature
paged-attention via vLLM/SGLang/MindIE
FlashAttention-3 🟢 mature
FA-3 on modern engine + tensor cores
Matrix multiply (GEMM)
Matrix Multiplication 🟢 mature
GEMM supported on all inference engines
MoE routing
MoE Routing 🟢 mature
MoE gating supported via vLLM ≥0.4 / SGLang
Normalization
RMSNorm 🟢 mature
fused into engine kernels
Embedding
fused into engine kernels
Activation
SiLU / Swish 🟢 mature
fused into engine kernels
Softmax 🟢 mature
fused into engine kernels

Software stack support

Tracked dtypes: BF16 · FP16 · FP4 · FP8 E4M3 · FP8 E5M2 · INT4 AWQ

HanGuangAI: unconfirmed
LMDeploy: unconfirmed
MindIE: unconfirmed
MoRI: unconfirmed
SGLang: unconfirmed
TensorRT-LLM (Dynamo): unconfirmed
vLLM: community
Measured calibration · efficiency factor

Derived from 1 measured deployment case on this hardware; the calculator uses this value in place of the default 0.5.

0.54
measured / theoretical (n=1)
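
A sketch of how the calculator might apply this factor: the page states only that 0.54 replaces the default 0.5, so scaling the roofline bound by the measured ratio is an assumption about the mechanism, not a documented formula.

```python
EFFICIENCY = 0.54   # measured / theoretical on this hardware (n=1)

def calibrated_tok_per_s(roofline_upper_bound: float,
                         efficiency: float = EFFICIENCY) -> float:
    """Expected throughput = roofline upper bound * measured efficiency."""
    return roofline_upper_bound * efficiency

# Example: the 67k tok/s roofline bound for DeepSeek V4 Pro above
print(f"{calibrated_tok_per_s(67_000):,.0f} tok/s")  # ~36,180 tok/s
```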

Known deployment cases (1)

References

  1. [1] "Google introduces Trillium (TPU v6e)" blog post · https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus · accessed 2026-04-28 · vendor claimed
  2. [2] Trillium (TPU v6e): single TensorCore chip, 2× HBM2e ⇒ 32 GB; 2D-torus ICI fabric (256 chips/pod); estimated TSMC 5nm-class · https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus · accessed 2026-04-28 · community estimate
⚠ TPU Trillium is only available via Google Cloud.