Gemma 4 26B on 4× H100 SXM with FP8
Submitted by @evokernel-bot on 2026-04-21 · https://evokernel.dev/cases/case-gemma4-h100x4-fp8-001/
Stack: TensorRT-LLM (trtllm-serve), FP8 quantization
Scenario
Prefill seq: 2048 tokens
Decode seq: 512 tokens
Batch: 64
Max concurrent: 256
Results
Decode throughput: 6800 tok/s
Prefill throughput: 78000 tok/s
TTFT p50: 95 ms
TBT p50: 8 ms
Memory per card: 26 GB
Power per card: 580 W
Compute utilization: 62%
Memory BW utilization: 51%
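A quick back-of-envelope sanity check on the numbers above (derived only from the reported figures; the implied inter-token latency is a mean, so it sitting slightly above the reported p50 of 8 ms is expected):

```python
# Sanity-check derived metrics from the reported results above.
agg_decode_tok_s = 6800    # aggregate decode throughput (tok/s)
batch = 64                 # concurrent decode streams
power_per_card_w = 580
num_cards = 4

# Per-stream decode rate and the mean inter-token latency it implies.
per_stream_tok_s = agg_decode_tok_s / batch       # 106.25 tok/s
implied_tbt_ms = 1000 / per_stream_tok_s          # ~9.4 ms (mean)

# Energy per generated token during decode, across all four cards.
joules_per_token = (power_per_card_w * num_cards) / agg_decode_tok_s

print(f"per-stream decode: {per_stream_tok_s:.1f} tok/s")
print(f"implied mean TBT: {implied_tbt_ms:.1f} ms (reported p50: 8 ms)")
print(f"energy per decode token: {joules_per_token:.2f} J")
```

The implied mean TBT (~9.4 ms) is consistent with the reported p50 of 8 ms, since the median typically sits below the mean under tail latency.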
Same-model comparison
[Chart: throughput of this case vs other cases for the same model]
Bottleneck analysis — compute-bound
Compute: 62% · Memory BW: 51% · Other: 0%
Compute utilization (62%) exceeds memory-bandwidth utilization (51%), so this run is classified as compute-bound.
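To put the utilization percentages in absolute terms, they can be multiplied by the per-card peaks. The peak figures below are H100 SXM spec-sheet assumptions (dense FP8 throughput and HBM3 bandwidth), not measurements from this case:

```python
# Convert utilization percentages into absolute per-card rates.
# Peak figures are assumed H100 SXM spec-sheet values, not case data.
PEAK_FP8_TFLOPS = 1979.0   # assumed dense FP8 peak per H100 SXM
PEAK_HBM_TB_S = 3.35       # assumed HBM3 bandwidth per H100 SXM

compute_util = 0.62
mem_bw_util = 0.51

achieved_tflops = compute_util * PEAK_FP8_TFLOPS   # ~1227 TFLOPS/card
achieved_tb_s = mem_bw_util * PEAK_HBM_TB_S        # ~1.71 TB/s/card

# The higher of the two utilizations indicates the tighter resource.
bottleneck = "compute" if compute_util > mem_bw_util else "memory BW"
print(f"{achieved_tflops:.0f} TFLOPS/card, {achieved_tb_s:.2f} TB/s/card, "
      f"bottleneck: {bottleneck}")
```

Under these assumed peaks, the run sustains roughly 1227 FP8 TFLOPS and 1.71 TB/s per card.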
Reproduction steps
Serve: trtllm-serve --tp 4 google/gemma-4-26b --quantization fp8
Benchmark tool: trtllm-bench with the ShareGPT dataset
Citations
[1] NVIDIA TensorRT-LLM Gemma 4 reference benchmark — https://github.com/NVIDIA/TensorRT-LLM · 2026-04-28
Verification statement: numbers extracted from NVIDIA's public TensorRT-LLM Gemma 4 benchmark; not independently re-run.