FP8 E4M3
fp8 · lossy
4-bit exponent, 3-bit mantissa; preferred for activations due to its greater dynamic range.
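For concreteness, here is a minimal sketch of decoding one E4M3 byte, assuming the common OCP/NVIDIA E4M3 variant (exponent bias 7, no infinities, only the pattern S.1111.111 reserved for NaN); the function name is ours:

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (OCP variant: bias 7, no infinities,
    only S.1111.111 reserved for NaN)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F          # 4-bit exponent field
    man = byte & 0x07                 # 3-bit mantissa field
    if exp == 0x0F and man == 0x07:   # the single NaN encoding
        return float("nan")
    if exp == 0:                      # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Largest normal value: 0x7E = 0.1111.110 -> 1.75 * 2**8 = 448.0
assert decode_e4m3(0x7E) == 448.0
```

The 448 maximum (vs. 57344 for E5M2) is the dynamic range the format trades away for an extra mantissa bit.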
Bits/weight: 8
Bits/activation: 8
Supported hardware: 22 of 39
Benchmarked cases: 6
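At 8 bits per weight, FP8 halves weight memory relative to a 16-bit baseline; a quick back-of-envelope, with the parameter count purely illustrative:

```python
def weight_gb(n_params: float, bits_per_weight: int = 8) -> float:
    """Weight memory in GB; ignores KV cache and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_gb(70e9))        # 70.0 GB at 8 bits/weight (FP8)
print(weight_gb(70e9, 16))    # 140.0 GB at 16 bits/weight (FP16/BF16)
```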
Supported hardware (22)
Domestic (1)
Overseas (21): AMD Instinct MI300A, AMD Instinct MI300X, AMD Instinct MI325X, AMD Instinct MI355X, AWS Trainium 2, Cerebras WSE-3, Etched Sohu, Google TPU Trillium (v6e), Groq LPU (TSP v1), Intel Gaudi 2, Intel Gaudi 3, NVIDIA B200 SXM 180GB, NVIDIA B300 SXM 288GB, NVIDIA GB200 NVL72, NVIDIA GB300 NVL72, NVIDIA H100 SXM5 80GB, NVIDIA H200 SXM 141GB, NVIDIA L40S, NVIDIA R200 SXM (Vera Rubin), SambaNova SN40L, Tenstorrent Wormhole n300
Cases using this quantization (6); a vLLM launch sketch follows the list.
- DeepSeek V4 Flash on 8× H100 SXM with vLLM (FP8) · h100-sxm5 ×8 · deepseek-v4-flash · 4200 tok/s
- Qwen2.5-Coder 32B on 4× L40S with vLLM (FP8) · l40s ×4 · qwen2.5-coder-32b · 580 tok/s
- DeepSeek V4 Flash with disaggregated prefill (H100) + decode (H200) via Mooncake · h200-sxm ×16 · deepseek-v4-flash · 9600 tok/s
- Qwen3.6 Plus on 8× MI325X with SGLang (FP8) · mi325x ×8 · qwen3.6-plus · 3100 tok/s
- Gemma 4 26B on 4× H100 SXM with FP8 · h100-sxm5 ×4 · gemma-4 · 6800 tok/s
- GPT-OSS on 8× Intel Gaudi 3 with vLLM · gaudi-3 ×8 · gpt-oss · 2900 tok/s
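For reference, the vLLM cases above would be launched along these lines; a minimal sketch using vLLM's offline API, where the model id is a placeholder taken from the first case's slug rather than a real checkpoint path:

```python
from vllm import LLM, SamplingParams

# Mirrors the first case: deepseek-v4-flash on 8x H100 SXM with FP8.
llm = LLM(
    model="deepseek-v4-flash",   # placeholder; substitute the actual checkpoint
    quantization="fp8",          # vLLM's online FP8 (E4M3) quantization path
    tensor_parallel_size=8,      # one shard per GPU
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```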