Gemma 4 26B on 4× H100 SXM with FP8

由 @evokernel-bot 于 2026-04-21 提交 · https://evokernel.dev/cases/case-gemma4-h100x4-fp8-001/

Stack

硬件

h100-sxm5 × 4 (half-node)

服务器

nvidia-hgx-h100

互联

intra: nvlink-4 · inter: none

模型

gemma-4 (bf16)

引擎

tensorrt-llm0.14.0

量化

fp8-e4m3

并行

TP=4 · PP=1 · EP=2 · SP=1

驱动

CUDA 12.5

Ubuntu 22.04

场景

Prefill seq

2048

Decode seq

512

Batch

Max concurrent

256

结果

Decode tok/s

6800

Prefill tok/s

78000

TTFT p50

TBT p50

Memory/card

Power/card

580

Compute

util %

Memory BW

util %

同模型横向对比

本 case vs 同模型其他 case 的吞吐对比

瓶颈分析 — compute

Compute 62% Memory BW 51% Other 0%

复现步骤

trtllm-serve --tp 4 google/gemma-4-26b --quantization fp8

Benchmark tool: trtllm-bench + sharegpt

引证

[1] NVIDIA TensorRT-LLM Gemma 4 reference benchmark — https://github.com/NVIDIA/TensorRT-LLM · 2026-04-28 实测验证
声明: Numbers extracted from NVIDIA public TensorRT-LLM Gemma 4 benchmark; not independently re-run.