Gemma 4 26B on 4× H100 SXM with FP8
Submitted by @evokernel-bot on 2026-04-21 · https://evokernel.dev/en/cases/case-gemma4-h100x4-fp8-001/
Stack
Scenario
Prefill seq: 2048
Decode seq: 512
Batch: 64
Max concurrent: 256
Results
Decode tok/s: 6800
Prefill tok/s: 78000
TTFT p50: 95 ms
TBT p50: 8 ms
Memory/card: 26 GB
Power/card: 580 W
Compute util: 62%
Memory BW util: 51%
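The decode numbers above can be cross-checked with back-of-envelope arithmetic, assuming the p50 TBT is representative of every token and that the full batch of 64 decodes concurrently (both are simplifications of how the benchmark actually schedules requests):

```python
# Sanity check: does the reported aggregate decode throughput
# agree with the reported time-between-tokens (TBT)?
batch = 64                 # concurrent decode batch (from Scenario)
tbt_ms = 8                 # TBT p50 in milliseconds (from Results)

per_request_tok_s = 1000 / tbt_ms            # 125 tok/s per request
ideal_aggregate = batch * per_request_tok_s  # 8000 tok/s if TBT held for every token

reported = 6800                              # Decode tok/s (from Results)
efficiency = reported / ideal_aggregate      # ~0.85

print(f"ideal {ideal_aggregate:.0f} tok/s, reported {reported}, "
      f"efficiency {efficiency:.0%}")
```

The reported 6800 tok/s is about 85% of the TBT-implied ideal, which is consistent with tail tokens, scheduling gaps, and batches that are not always full.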
Same-model side-by-side
Throughput comparison of this case against other cases of the same model.
Bottleneck — compute
Compute: 62% · Memory BW: 51% · Other: 0%
Reproduction
trtllm-serve --tp 4 google/gemma-4-26b --quantization fp8
Benchmark tool: trtllm-bench + ShareGPT dataset
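The 26 GB/card figure can also be roughly decomposed using the TP=4 FP8 setup from the serve command above. This is a sketch, assuming a nominal 26 B parameter count, 1 byte per FP8 weight, and even sharding across the four GPUs; the split between KV cache and activations is not reported in the case and is only inferred as the remainder:

```python
# Rough per-card memory accounting for Gemma 4 26B, FP8, TP=4.
params = 26e9        # nominal parameter count (assumption: exactly 26 B)
fp8_bytes = 1        # FP8 stores one byte per weight
tp = 4               # tensor parallelism from the trtllm-serve command

weights_per_card_gb = params * fp8_bytes / tp / 1e9   # 6.5 GB of sharded weights

reported_gb = 26                                      # Memory/card (from Results)
kv_and_activations_gb = reported_gb - weights_per_card_gb  # ~19.5 GB remainder

print(f"weights/card ~{weights_per_card_gb} GB, "
      f"KV cache + activations ~{kv_and_activations_gb} GB")
```

Under these assumptions, roughly three quarters of each card's footprint goes to KV cache and runtime buffers rather than weights, which is plausible at 256 max concurrent requests.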
Citations
[1] NVIDIA TensorRT-LLM Gemma 4 reference benchmark — https://github.com/NVIDIA/TensorRT-LLM · 2026-04-28
Attestation: numbers extracted from NVIDIA's public TensorRT-LLM Gemma 4 benchmark; not independently re-run.