BF16
floating point · lossy
Brain float 16; 8-bit exponent, 7-bit mantissa; default training precision since 2020
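The layout above (FP32's sign bit and 8-bit exponent kept, mantissa truncated from 23 to 7 bits) can be sketched as a pure-Python conversion. This is an illustrative sketch, not how frameworks actually do it (they use hardware instructions); function names are made up here, and NaN/overflow edge cases are ignored.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Convert a float to its 16-bit BF16 pattern (illustrative sketch)."""
    # Reinterpret the float as its 32-bit IEEE 754 pattern.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    # Round-to-nearest-even on the 16 mantissa bits being dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    # Keep the top 16 bits: sign (1) + exponent (8) + mantissa (7).
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_to_fp32(b: int) -> float:
    """Widen a BF16 bit pattern back to float by zero-padding the mantissa."""
    return struct.unpack('>f', struct.pack('>I', b << 16))[0]

# The 7-bit mantissa gives ~2-3 decimal digits of precision:
print(bf16_to_fp32(fp32_to_bf16_bits(3.14159)))  # 3.140625
```

Because the exponent field is identical to FP32's, BF16 covers the same dynamic range (~1e-38 to ~3e38), which is why it tolerates training gradients that overflow in FP16.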
Bits/weight: 16
Bits/activation: 16
Supported hardware: 31 of 39
Measured deployments: 10
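As a sanity check on the 16 bits/weight figure, a quick sketch of the BF16 weight-memory footprint for a hypothetical model size (the 70B figure below is just an example, not from this card):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int = 16) -> float:
    """Weight storage in GB (1e9 bytes): params * bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model in BF16 needs 140 GB just for weights,
# before KV cache and activations.
print(weight_memory_gb(70e9))  # 140.0
```

This is why BF16 serving of large models in the list below spans 8-384 accelerators: weights alone exceed any single device's memory.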
Supported hardware (31)
International
- AMD Instinct MI300A
- AMD Instinct MI300X
- AMD Instinct MI325X
- AMD Instinct MI355X
- Apple M4 Max Neural Engine
- AWS Inferentia 2
- AWS Trainium 2
- Cerebras WSE-3
- Etched Sohu
- Google TPU v5p
- Google TPU Trillium (v6e)
- Groq LPU (TSP v1)
- Intel Gaudi 2
- Intel Gaudi 3
- NVIDIA A100 SXM4 80GB
- NVIDIA B200 SXM 180GB
- NVIDIA B300 SXM 288GB
- NVIDIA GB200 NVL72
- NVIDIA GB300 NVL72
- NVIDIA H100 SXM5 80GB
- NVIDIA H200 SXM 141GB
- NVIDIA L40S
- NVIDIA R200 SXM (Vera Rubin)
- SambaNova SN40L
- Tenstorrent Wormhole n300
Deployments using this quantization (10)
- DeepSeek R1 on 16× Ascend 910B with MindIE (ascend-910b ×16 · deepseek-r1 · 850 tok/s)
- DeepSeek V4 Pro on Huawei CloudMatrix 384 with MindIE (ascend-910c ×384 · deepseek-v4-pro · 2400 tok/s)
- Llama 3.3 70B on 8× A100 SXM4 80GB with vLLM (a100-sxm4 ×8 · llama-3.3-70b · 1480 tok/s)
- Llama 4 Scout on 8× H100 SXM with vLLM, public benchmark (h100-sxm5 ×8 · llama-4-scout · 1850 tok/s)
- GLM-5.1 on 8× H200 SXM with vLLM BF16 (h200-sxm ×8 · glm-5.1 · 2400 tok/s)
- Llama 4 Maverick on TPU Trillium (v6e) 256-chip pod (trillium ×256 · llama-4-maverick · 5800 tok/s)
- Llama 4 Scout on 8× Hygon DCU K100 with vLLM (dcu-k100 ×8 · llama-4-scout · 850 tok/s)
- Kimi K2.6 on 16× Cambricon MLU590 with vLLM port (mlu590 ×16 · kimi-k2.6 · 480 tok/s)
- Llama 4 Scout on 8× MI300X with vLLM BF16 (mi300x ×8 · llama-4-scout · 2200 tok/s)
- DeepSeek V3 on AWS Trainium 2, 64-chip Trn2 instance (trainium-2 ×64 · deepseek-r1 · 3600 tok/s)