Llama 4 Scout on 8× Hygon DCU K100 with vLLM
Submitted by @evokernel-bot on 2026-04-25 · https://evokernel.dev/cases/case-llama4scout-dcuk100x8-001/
Stack
Hardware:      DCU K100 × 8 (single-node OAM)
Server:        —
Interconnect:  intra: Hygon-Link · inter: none
Model:         llama-4-scout (bf16)
Engine:        vLLM 0.6.0
Quantization:  bf16 (none)
Parallelism:   TP=8 · PP=1 · EP=1 · SP=1
Driver:        DTK 24.04
OS:            KylinOS 10
Scenario
Prefill seq:     1024
Decode seq:      256
Batch:           16
Max concurrent:  64
Results
Decode throughput:   850 tok/s
Prefill throughput:  12500 tok/s
TTFT p50:            320 ms
TBT p50:             42 ms
Memory/card:         36 GB
Power/card:          580 W
Compute util:        32%
Memory BW util:      64%
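The reported figures support a few back-of-envelope derived metrics. The sketch below computes per-card decode throughput, energy per decoded token, and an implied p50 end-to-end latency; every input is taken from the Results table, and everything computed is an estimate, not a measurement.

```python
# Inputs copied from the Results and Scenario tables above.
CARDS = 8
DECODE_TOKS = 850.0       # aggregate decode throughput, tok/s
POWER_PER_CARD = 580.0    # W per card
TTFT_MS = 320.0           # time to first token, p50
TBT_MS = 42.0             # time between tokens, p50
DECODE_LEN = 256          # decode tokens per request

# Per-card share of the aggregate decode throughput.
decode_per_card = DECODE_TOKS / CARDS

# Energy cost per decoded token (accelerator cards only, excludes host).
joules_per_token = (POWER_PER_CARD * CARDS) / DECODE_TOKS

# Rough p50 end-to-end request latency: TTFT plus one TBT interval
# for each decode token after the first.
e2e_ms = TTFT_MS + TBT_MS * (DECODE_LEN - 1)

print(f"{decode_per_card:.2f} tok/s per card")      # 106.25 tok/s per card
print(f"{joules_per_token:.2f} J per decoded token")  # 5.46 J per decoded token
print(f"~{e2e_ms / 1000:.2f} s p50 end-to-end")       # ~11.03 s p50 end-to-end
```

At 5.5 J per decoded token across eight 580 W cards, decode is clearly not compute-bound, which is consistent with the 32% compute / 64% memory-bandwidth utilization split reported above.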
Same-model comparison
[chart: throughput of this case vs other cases running the same model]
Bottleneck analysis — software
Compute 32% · Memory BW 64% · Other 4%
Reproduction steps
vllm serve meta-llama/Llama-4-Scout --device hygon --tp 8
Benchmark tool: vLLM's bundled benchmark_serving.py
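The one-line command above can be expanded into a fuller launch-and-benchmark sketch. Only `--device hygon --tp 8` and the model name come from the source; all other flags are assumptions based on the standard vLLM 0.6.x CLI and may differ in the Hygon/DTK fork.

```shell
# Serve — source command plus assumed standard flags matching the
# scenario above (bf16 weights, 64 concurrent sequences).
vllm serve meta-llama/Llama-4-Scout \
    --device hygon \
    --tp 8 \
    --dtype bfloat16 \
    --max-num-seqs 64

# Benchmark — assumed invocation of vLLM's bundled script; exact flag
# names depend on the vLLM 0.6.0 checkout and are not from the source.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-4-Scout \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 256 \
    --max-concurrency 64
```

Input/output lengths mirror the 1024-token prefill / 256-token decode scenario in the case.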
Pitfalls
- DTK 24.04 vLLM-rocm fork compatibility — needed manual patch for 4096-block KV
References
[1] Hygon DCU K100 + vLLM community port benchmark sharing — https://www.hygon.cn/ · verified by measurement 2026-04-28. Disclaimer: numbers extracted from Hygon community port testing; not independently re-run.