BF16
floating point · lossy
Brain float 16; 8-bit exponent, 7-bit mantissa; default training precision since 2020
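The layout above (FP32's sign bit and 8-bit exponent kept, mantissa truncated from 23 to 7 bits) can be sketched as a pure-Python conversion. This is an illustrative sketch, not how frameworks actually do it (they use hardware instructions); function names are made up here, and NaN/overflow edge cases are ignored.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Convert a float to its 16-bit BF16 pattern (illustrative sketch)."""
    # Reinterpret the float as its 32-bit IEEE 754 pattern.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    # Round-to-nearest-even on the 16 mantissa bits being dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    # Keep the top 16 bits: sign (1) + exponent (8) + mantissa (7).
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_to_fp32(b: int) -> float:
    """Widen a BF16 bit pattern back to float by zero-padding the mantissa."""
    return struct.unpack('>f', struct.pack('>I', b << 16))[0]

# The 7-bit mantissa gives ~2-3 decimal digits of precision:
print(bf16_to_fp32(fp32_to_bf16_bits(3.14159)))  # 3.140625
```

Because the exponent field is identical to FP32's, BF16 covers the same dynamic range (~1e-38 to ~3e38), which is why it tolerates training gradients that overflow in FP16.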
Bits/weight: 16
Bits/activation: 16
Supported hardware: 31 of 39
Measured deployments: 10
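As a sanity check on the 16 bits/weight figure, a quick sketch of the BF16 weight-memory footprint for a hypothetical model size (the 70B figure below is just an example, not from this card):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int = 16) -> float:
    """Weight storage in GB (1e9 bytes): params * bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model in BF16 needs 140 GB just for weights,
# before KV cache and activations.
print(weight_memory_gb(70e9))  # 140.0
```

This is why BF16 serving of large models in the list below spans 8-384 accelerators: weights alone exceed any single device's memory.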
Supported hardware (31)
International
- AMD Instinct MI300A
- AMD Instinct MI300X
- AMD Instinct MI325X
- AMD Instinct MI355X
- Apple M4 Max Neural Engine
- AWS Inferentia 2
- AWS Trainium 2
- Cerebras WSE-3
- Etched Sohu
- Google TPU v5p
- Google TPU Trillium (v6e)
- Groq LPU (TSP v1)
- Intel Gaudi 2
- Intel Gaudi 3
- NVIDIA A100 SXM4 80GB
- NVIDIA B200 SXM 180GB
- NVIDIA B300 SXM 288GB
- NVIDIA GB200 NVL72
- NVIDIA GB300 NVL72
- NVIDIA H100 SXM5 80GB
- NVIDIA H200 SXM 141GB
- NVIDIA L40S
- NVIDIA R200 SXM (Vera Rubin)
- SambaNova SN40L
- Tenstorrent Wormhole n300
Deployments using this quantization (10)
- DeepSeek R1 on 16× Ascend 910B with MindIE (ascend-910b ×16 · deepseek-r1 · 850 tok/s)
- DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE (ascend-910c ×384 · deepseek-v4-pro · 2400 tok/s)
- Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM (a100-sxm4 ×8 · llama-3.3-70b · 1480 tok/s)
- Llama 4 Scout on 8× H100 SXM with vLLM, public benchmark (h100-sxm5 ×8 · llama-4-scout · 1850 tok/s)
- GLM-5.1 on 8× H200 SXM with vLLM BF16 (h200-sxm ×8 · glm-5.1 · 2400 tok/s)
- Llama 4 Maverick on TPU Trillium (v6e) 256-chip pod (trillium ×256 · llama-4-maverick · 5800 tok/s)
- Llama 4 Scout on 8× Hygon DCU K100 with vLLM (dcu-k100 ×8 · llama-4-scout · 850 tok/s)
- Kimi K2.6 on 16× Cambricon MLU590 with vLLM port (mlu590 ×16 · kimi-k2.6 · 480 tok/s)
- Llama 4 Scout on 8× MI300X with vLLM BF16 (mi300x ×8 · llama-4-scout · 2200 tok/s)
- DeepSeek V3 on AWS Trainium 2, 64-chip Trn2 instance (trainium-2 ×64 · deepseek-r1 · 3600 tok/s)