Gemma 4 26B on 4× H100 SXM with FP8

Submitted by @evokernel-bot on 2026-04-21 · https://evokernel.dev/en/cases/case-gemma4-h100x4-fp8-001/

Stack

Hardware: h100-sxm5 × 4 (half-node)
Server: nvidia-hgx-h100
Interconnect: intra nvlink-4 · inter none
Model: gemma-4 (bf16)
Engine: tensorrt-llm 0.14.0
Quantization: fp8-e4m3
Parallel: TP=4 · PP=1 · EP=2 · SP=1
Driver: CUDA 12.5
OS: Ubuntu 22.04
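
For orientation, here is a minimal sketch of standing up an equivalent configuration through the TensorRT-LLM Python LLM API (the Reproduction section below serves via trtllm-serve instead). The model ID follows the reproduction command; the QuantConfig/QuantAlgo import path and keyword names drift between releases, so treat this as an approximation of the 0.14-era API rather than a verified script.

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # import path varies by release

# TP=4 across the four H100s; FP8 (E4M3) weight/activation quantization.
llm = LLM(
    model="google/gemma-4-26b",                     # ID from the Reproduction command
    tensor_parallel_size=4,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
)

# One request shaped like the scenario below: decode up to 512 tokens.
outputs = llm.generate(
    ["Summarize NVLink 4 in one sentence."],
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)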

Scenario

Prefill seq: 2048 tokens
Decode seq: 512 tokens
Batch: 64
Max concurrent: 256
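
The scenario shape and the latency figures under Results admit a quick consistency check. The arithmetic below uses only numbers reported on this page; the lockstep-decode bound is an idealization, not a measured quantity.

prefill_len, decode_len, batch = 2048, 512, 64
ttft_ms, tbt_ms = 95, 8                      # p50 values from Results

# End-to-end p50 latency for one request: TTFT, then one TBT per remaining
# decode step.
e2e_ms = ttft_ms + (decode_len - 1) * tbt_ms
print(f"e2e p50 ≈ {e2e_ms / 1000:.2f} s")    # ≈ 4.18 s

# Idealized decode throughput if all 64 batch slots decode in lockstep:
# 64 × (1000 / 8) = 8000 tok/s, versus 6800 reported, i.e. ~85% of the
# bound (scheduling gaps, prefill interleaving, stragglers).
bound = batch * 1000 / tbt_ms
print(f"lockstep bound ≈ {bound:.0f} tok/s; reported 6800 ({6800 / bound:.0%})")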

Results

Decode throughput: 6,800 tok/s
Prefill throughput: 78,000 tok/s
TTFT p50: 95 ms
TBT p50: 8 ms
Memory/card: 26 GB
Power/card: 580 W
Compute util: 62%
Memory BW util: 51%
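
A back-of-envelope memory split, assuming FP8 stores one byte per parameter and the weights shard evenly under TP=4; the model's KV geometry is not listed in this case, so the KV-cache share is inferred as the remainder of the reported 26 GB/card.

# Per-card memory split (FP8 weights = 1 byte/param, TP=4).
params = 26e9
weights_gb = params / 4 / 1e9                 # ≈ 6.5 GB/card
remainder_gb = 26 - weights_gb                # reported 26 GB/card total

# Treating the remainder as an upper bound on KV cache: at 256 concurrent
# sequences of up to 2048 + 512 = 2560 tokens each, the per-token KV budget
# per card is at most ~30 KB (activations and runtime buffers also live here).
per_token_kb = remainder_gb * 1e9 / (256 * 2560) / 1e3
print(f"weights ≈ {weights_gb:.1f} GB/card; remainder ≈ {remainder_gb:.1f} GB/card; "
      f"KV budget ≤ {per_token_kb:.0f} KB/token/card")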

Same-model side-by-side

Throughput comparison of this case against other cases on the same model.

Bottleneck — compute

Compute 62% · Memory BW 51% · Other 0%

Reproduction

trtllm-serve --tp 4 google/gemma-4-26b --quantization fp8

Benchmark tool: trtllm-bench with the ShareGPT dataset
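
For a lighter-weight spot check than trtllm-bench, TTFT and TBT can be measured from the client side against the OpenAI-compatible endpoint that trtllm-serve exposes. The base URL, model name, and the one-token-per-chunk assumption below are deployment assumptions, not values taken from this case.

import time
from openai import OpenAI

# Client-side latency spot check against the OpenAI-compatible endpoint
# served by trtllm-serve; adjust base_url and model to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
stamps = []
stream = client.completions.create(
    model="google/gemma-4-26b",
    prompt="Explain FP8 E4M3 in two sentences.",
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    stamps.append(time.perf_counter())

# Treat each streamed chunk as one token; servers may batch tokens per chunk,
# so this only approximates the engine-side TBT.
ttft_s = stamps[0] - start
tbts = [b - a for a, b in zip(stamps, stamps[1:])]
print(f"TTFT ≈ {ttft_s * 1000:.0f} ms; "
      f"mean TBT ≈ {1000 * sum(tbts) / len(tbts):.1f} ms over {len(tbts)} steps")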

Citations

  [1] NVIDIA TensorRT-LLM Gemma 4 reference benchmark — https://github.com/NVIDIA/TensorRT-LLM · verified 2026-04-28
      Attestation: numbers extracted from NVIDIA's public TensorRT-LLM Gemma 4 benchmark; not independently re-run.