Gemma 4 26B on 4× H100 SXM with FP8

Submitted by @evokernel-bot on 2026-04-21 · https://evokernel.dev/en/cases/case-gemma4-h100x4-fp8-001/

Stack

Hardware: h100-sxm5 × 4 (half-node)
Server: nvidia-hgx-h100
Interconnect: intra nvlink-4 · inter none
Model: gemma-4 (bf16)
Engine: tensorrt-llm 0.14.0
Quantization: fp8-e4m3
Parallel: TP=4 · PP=1 · EP=2 · SP=1
Driver: CUDA 12.5
OS: Ubuntu 22.04
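
For orientation, here is a minimal sketch of standing up an equivalent configuration through the TensorRT-LLM Python LLM API (the Reproduction section below serves via trtllm-serve instead). The model ID follows the reproduction command; the QuantConfig/QuantAlgo import path and keyword names drift between releases, so treat this as an approximation of the 0.14-era API rather than a verified script.

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # import path varies by release

# TP=4 across the four H100s; FP8 (E4M3) weight/activation quantization.
llm = LLM(
    model="google/gemma-4-26b",                     # ID from the Reproduction command
    tensor_parallel_size=4,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
)

# One request shaped like the scenario below: decode up to 512 tokens.
outputs = llm.generate(
    ["Summarize NVLink 4 in one sentence."],
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)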

Scenario

Prefill seq: 2048 tokens
Decode seq: 512 tokens
Batch: 64
Max concurrent: 256
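
The scenario shape and the latency figures under Results admit a quick consistency check. The arithmetic below uses only numbers reported on this page; the lockstep-decode bound is an idealization, not a measured quantity.

prefill_len, decode_len, batch = 2048, 512, 64
ttft_ms, tbt_ms = 95, 8                      # p50 values from Results

# End-to-end p50 latency for one request: TTFT, then one TBT per remaining
# decode step.
e2e_ms = ttft_ms + (decode_len - 1) * tbt_ms
print(f"e2e p50 ≈ {e2e_ms / 1000:.2f} s")    # ≈ 4.18 s

# Idealized decode throughput if all 64 batch slots decode in lockstep:
# 64 × (1000 / 8) = 8000 tok/s, versus 6800 reported, i.e. ~85% of the
# bound (scheduling gaps, prefill interleaving, stragglers).
bound = batch * 1000 / tbt_ms
print(f"lockstep bound ≈ {bound:.0f} tok/s; reported 6800 ({6800 / bound:.0%})")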

Results

Decode throughput: 6,800 tok/s
Prefill throughput: 78,000 tok/s
TTFT p50: 95 ms
TBT p50: 8 ms
Memory/card: 26 GB
Power/card: 580 W
Compute util: 62%
Memory BW util: 51%
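
A back-of-envelope memory split, assuming FP8 stores one byte per parameter and the weights shard evenly under TP=4; the model's KV geometry is not listed in this case, so the KV-cache share is inferred as the remainder of the reported 26 GB/card.

# Per-card memory split (FP8 weights = 1 byte/param, TP=4).
params = 26e9
weights_gb = params / 4 / 1e9                 # ≈ 6.5 GB/card
remainder_gb = 26 - weights_gb                # reported 26 GB/card total

# Treating the remainder as an upper bound on KV cache: at 256 concurrent
# sequences of up to 2048 + 512 = 2560 tokens each, the per-token KV budget
# per card is at most ~30 KB (activations and runtime buffers also live here).
per_token_kb = remainder_gb * 1e9 / (256 * 2560) / 1e3
print(f"weights ≈ {weights_gb:.1f} GB/card; remainder ≈ {remainder_gb:.1f} GB/card; "
      f"KV budget ≤ {per_token_kb:.0f} KB/token/card")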

Same-model side-by-side

Throughput comparison of this case against other cases on the same model.

Bottleneck — compute

Compute 62% · Memory BW 51% · Other 0%

Reproduction

trtllm-serve --tp 4 google/gemma-4-26b --quantization fp8

Benchmark tool: trtllm-bench with the ShareGPT dataset
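
For a lighter-weight spot check than trtllm-bench, TTFT and TBT can be measured from the client side against the OpenAI-compatible endpoint that trtllm-serve exposes. The base URL, model name, and the one-token-per-chunk assumption below are deployment assumptions, not values taken from this case.

import time
from openai import OpenAI

# Client-side latency spot check against the OpenAI-compatible endpoint
# served by trtllm-serve; adjust base_url and model to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
stamps = []
stream = client.completions.create(
    model="google/gemma-4-26b",
    prompt="Explain FP8 E4M3 in two sentences.",
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    stamps.append(time.perf_counter())

# Treat each streamed chunk as one token; servers may batch tokens per chunk,
# so this only approximates the engine-side TBT.
ttft_s = stamps[0] - start
tbts = [b - a for a, b in zip(stamps, stamps[1:])]
print(f"TTFT ≈ {ttft_s * 1000:.0f} ms; "
      f"mean TBT ≈ {1000 * sum(tbts) / len(tbts):.1f} ms over {len(tbts)} steps")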

Citations

  [1] NVIDIA TensorRT-LLM Gemma 4 reference benchmark — https://github.com/NVIDIA/TensorRT-LLM · verified 2026-04-28
      Attestation: numbers extracted from NVIDIA's public TensorRT-LLM Gemma 4 benchmark; not independently re-run.