Prefer INT8/INT4 quantization for the memory-bound decode phase

Category: quantization

When to apply

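When the arithmetic intensity of the decode phase falls below the hardware ridge point (peak compute / peak memory bandwidth), decode throughput is limited by memory bandwidth. In that regime, weight-only INT8 or INT4 quantization reduces the number of bytes read per token and can raise decode tok/s noticeably, typically by 1.5-2.5x.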
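
The roofline check described above can be sketched in a few lines. This is a simplified model, not a profiler: it assumes a dense layer where each weight is read once per decode step and used in one multiply-add per sequence in the batch, and it ignores KV-cache and activation traffic. The function names and the hardware figures are hypothetical and chosen only for illustration.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """FLOPs per byte at which compute time and memory time balance."""
    return peak_flops / peak_bandwidth


def decode_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """Approximate FLOPs per weight byte read in one decode step.

    Assumption: every weight is read once per step and takes part in one
    multiply-add (2 FLOPs) for each sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_weight


def is_memory_bound(batch_size: int, bytes_per_weight: float,
                    peak_flops: float, peak_bandwidth: float) -> bool:
    """True when decode intensity is below the hardware ridge point."""
    return (decode_intensity(batch_size, bytes_per_weight)
            < ridge_point(peak_flops, peak_bandwidth))


def ideal_speedup(bytes_per_weight_before: float,
                  bytes_per_weight_after: float) -> float:
    """Upper bound on speedup if step time scaled only with weight bytes."""
    return bytes_per_weight_before / bytes_per_weight_after


# Hypothetical accelerator figures, for illustration only.
PEAK_FLOPS = 1.0e15       # FLOP/s
PEAK_BANDWIDTH = 3.0e12   # bytes/s

fp16_is_memory_bound = is_memory_bound(8, 2.0, PEAK_FLOPS, PEAK_BANDWIDTH)
int8_upper_bound = ideal_speedup(2.0, 1.0)   # FP16 -> INT8
int4_upper_bound = ideal_speedup(2.0, 0.5)   # FP16 -> INT4
```

The ideal figures are upper bounds. Measured gains are usually lower, consistent with the 1.5-2.5x range quoted above, because KV-cache reads, activations, and dequantization overhead do not shrink with the weights.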

Applicability conditions

The decode-phase batch size is small (≤ 16)
The model has a large number of active parameters
The hardware supports a dequantization path for INT8/INT4 weights with FP16 activations
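
To make the "quantized weight + floating-point activation" path concrete, here is a minimal sketch of weight-only INT8 quantization using plain round-to-nearest with one symmetric scale per output row. It is not AWQ or GPTQ, and it uses Python lists instead of a fused kernel; all function names are hypothetical. Real kernels would dequantize to FP16 on the fly, whereas this sketch uses Python floats.

```python
def quantize_row_int8(row):
    """Symmetric per-row quantization: returns (int8-range values, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to nearest and clamp to the symmetric INT8 range [-127, 127].
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floating-point weights from quantized values."""
    return [v * scale for v in q]


def matvec_weight_only(q_rows, scales, x):
    """Compute y = W x with quantized weights and float activations.

    The integer dot product is rescaled by the per-row scale, which is
    equivalent to dequantizing the row before multiplying.
    """
    return [scale * sum(qv * xv for qv, xv in zip(q, x))
            for q, scale in zip(q_rows, scales)]


# Small illustrative weight matrix and activation vector.
weights = [[0.5, -1.0, 0.25],
           [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

quantized = [quantize_row_int8(row) for row in weights]
q_rows = [q for q, _ in quantized]
scales = [s for _, s in quantized]

y_quantized = matvec_weight_only(q_rows, scales, x)
y_reference = [sum(w * xv for w, xv in zip(row, x)) for row in weights]
```

Each weight is stored in one byte plus a shared per-row scale, which is where the reduction in bytes read per token comes from. The per-weight rounding error is at most half a scale step, so the output error grows with the magnitude of the activations.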

Side effects

Slight loss of model quality (typically a perplexity increase of < 0.5)
Calibration is required (AWQ / GPTQ)
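
One way to check the quality side effect is to compare perplexity before and after quantization on the same evaluation text. The sketch below assumes per-token natural-log probabilities have already been collected from both models; the function names and the sample values are hypothetical, and the 0.5 threshold is taken from the figure above.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def within_quality_budget(baseline_logprobs, quantized_logprobs,
                          max_increase=0.5):
    """True if quantization raises perplexity by less than max_increase."""
    delta = perplexity(quantized_logprobs) - perplexity(baseline_logprobs)
    return delta < max_increase


# Made-up log-probabilities, for illustration only.
baseline = [math.log(0.25)] * 4
mild_loss = [math.log(0.24)] * 4
heavy_loss = [math.log(0.20)] * 4

mild_ok = within_quality_budget(baseline, mild_loss)
heavy_ok = within_quality_budget(baseline, heavy_loss)
```

Both models must be scored on identical tokens for the comparison to be meaningful, since perplexity depends on the tokenizer and the evaluation set.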

Supporting cases (14)

DeepSeek R1 on 16× Ascend 910B with MindIE
DeepSeek V4 Flash on 8× H100 SXM with vLLM FP8
DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE
Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM
Llama 4 Scout on 8× H100 SXM with vLLM (public benchmark)
Qwen2.5-Coder 32B on 4× L40S with vLLM (FP8)
DeepSeek V4 Flash with disaggregated prefill (H100) + decode (H200) via Mooncake
GLM-5.1 on 8× H200 SXM with vLLM BF16
Qwen3.6 Plus on 8× MI325X with SGLang FP8
Qwen3.5 397B Reasoning on 8× MI355X with FP4
Qwen3.6 Plus on 8× Cambricon MLU590 with LMDeploy
GLM-5.1 on 8× Biren BR104 (export-control variant)
Gemma 4 on 4× MetaX 曦云 C500 with INT8
DeepSeek R1 on 16× Iluvatar 天垓 100 (Iluvatar IxRT)