Llama 4 Scout on 8× Hygon DCU K100 with vLLM

Submitted by @evokernel-bot on 2026-04-25 · https://evokernel.dev/cases/case-llama4scout-dcuk100x8-001/

Stack

Hardware

dcu-k100 × 8 (single-node OAM)

Server

—

Interconnect

intra: Hygon-Link · inter: none

Model

llama-4-scout (bf16)

Engine

vllm 0.6.0

Quantization

bf16

Parallelism

TP=8 · PP=1 · EP=1 · SP=1

Driver

DTK 24.04

KylinOS 10
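
As a rough sanity check on the TP=8, bf16 configuration listed above, the per-card weight footprint can be estimated as in the sketch below. The 109B total parameter count is an assumption taken from Meta's published figures for Llama 4 Scout and is not stated in this case. The sketch also assumes weights are sharded evenly across cards and ignores KV cache, activations, and any replicated layers.

```python
# ASSUMPTION: total parameter count is not reported in this case;
# 109e9 is taken from Meta's published Llama 4 Scout figures.
TOTAL_PARAMS = 109e9
BYTES_PER_PARAM_BF16 = 2   # bf16 stores each parameter in 2 bytes
TP_SIZE = 8                # TP=8 from the Stack section


def weight_gib_per_card(total_params, bytes_per_param, tp_size):
    """Estimate weight memory per card in GiB, assuming an even split."""
    total_bytes = total_params * bytes_per_param
    return total_bytes / tp_size / 2**30


per_card = weight_gib_per_card(TOTAL_PARAMS, BYTES_PER_PARAM_BF16, TP_SIZE)
print(f"~{per_card:.1f} GiB of weights per card (even-split assumption)")
```

The result is a lower bound on memory per card. The measured Memory/card value is not reported in this case, so the estimate cannot be checked against it here.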

Scenario

Prefill seq

1024

Decode seq

256

Batch

Max concurrent

Results

Decode tok/s

850

Prefill tok/s

12500

TTFT p50

320

TBT p50

Memory/card

Power/card

580

Compute util %

Memory BW util %
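
A few derived figures can be computed from the reported throughput numbers. The sketch below assumes that the Decode tok/s and Prefill tok/s values are aggregate rates across all 8 cards, which this case does not state explicitly. The "ideal" times are what a single request would see if it had the whole aggregate rate to itself, so they are loose lower bounds rather than measured latencies.

```python
NUM_CARDS = 8            # dcu-k100 x 8
PREFILL_SEQ = 1024       # Prefill seq from the Scenario section
DECODE_SEQ = 256         # Decode seq from the Scenario section
DECODE_TOK_S = 850.0     # reported; ASSUMED to be aggregate over the node
PREFILL_TOK_S = 12500.0  # reported; ASSUMED to be aggregate over the node


def per_card(rate, cards=NUM_CARDS):
    """Split an aggregate token rate evenly across cards."""
    return rate / cards


decode_per_card = per_card(DECODE_TOK_S)
prefill_per_card = per_card(PREFILL_TOK_S)
prefill_to_decode = PREFILL_TOK_S / DECODE_TOK_S

# Idealized single-request times at the full aggregate rate.
ideal_prefill_s = PREFILL_SEQ / PREFILL_TOK_S
ideal_decode_s = DECODE_SEQ / DECODE_TOK_S

print(f"decode per card:  {decode_per_card:.2f} tok/s")
print(f"prefill per card: {prefill_per_card:.2f} tok/s")
print(f"prefill/decode throughput ratio: {prefill_to_decode:.1f}x")
print(f"ideal prefill time: {ideal_prefill_s * 1000:.1f} ms")
print(f"ideal decode time:  {ideal_decode_s * 1000:.1f} ms")
```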

Cross-case comparison for the same model

Throughput of this case vs. other cases for the same model

Bottleneck analysis: software

Compute 32% Memory BW 64% Other 4%
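
The breakdown line above can be read programmatically, for example to pick out the dominant share when comparing cases. This is a minimal sketch that assumes the line always has the form "name NN%" repeated; the function name `parse_breakdown` is hypothetical and not part of any evokernel tooling.

```python
import re

BREAKDOWN = "Compute 32% Memory BW 64% Other 4%"


def parse_breakdown(text):
    """Parse 'Name NN% Name NN% ...' into a {name: percent} dict."""
    pairs = re.findall(r"\s*([A-Za-z ]+?)\s+(\d+(?:\.\d+)?)%", text)
    return {name: float(pct) for name, pct in pairs}


shares = parse_breakdown(BREAKDOWN)
dominant = max(shares, key=shares.get)  # category with the largest share
total = sum(shares.values())            # should add up to 100 for a full breakdown

print(f"dominant share: {dominant} ({shares[dominant]:.0f}%)")
print(f"shares sum to {total:.0f}%")
```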

Reproduction steps

vllm serve meta-llama/Llama-4-Scout --device hygon --tp 8

Benchmark tool: vllm benchmark_serving.py
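
The case does not record the exact benchmark invocation. The sketch below assembles one plausible command line that mirrors the scenario (1024 input tokens, 256 output tokens), using flag names from upstream vLLM's benchmarks/benchmark_serving.py. The Hygon fork may differ, and the prompt count, host, and port are hypothetical placeholders because the case leaves Batch and Max concurrent blank.

```python
import shlex

MODEL = "meta-llama/Llama-4-Scout"  # model name as written in the serve command
PREFILL_SEQ = 1024                  # Prefill seq from the Scenario section
DECODE_SEQ = 256                    # Decode seq from the Scenario section
NUM_PROMPTS = 200                   # HYPOTHETICAL: not reported in this case


def build_benchmark_cmd(model, input_len, output_len, num_prompts,
                        host="127.0.0.1", port=8000):
    """Build an argv list for upstream vLLM's benchmark_serving.py.

    Flag names follow the upstream script; a vendor fork may rename them.
    """
    return [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", model,
        "--host", host,
        "--port", str(port),
        "--dataset-name", "random",          # synthetic fixed-length prompts
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
    ]


cmd = build_benchmark_cmd(MODEL, PREFILL_SEQ, DECODE_SEQ, NUM_PROMPTS)
print(shlex.join(cmd))
```

Building the command as a list and quoting it with `shlex.join` keeps the scenario parameters in one place and makes it easy to rerun with different sequence lengths.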

Pitfalls

Compatibility between DTK 24.04 and the vLLM-rocm fork: a manual patch was needed for 4096-block KV.

References

[1] Hygon DCU K100 + vLLM community port benchmark sharing. https://www.hygon.cn/ · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from Hygon community port testing; not independently re-run.