Gemma 4 on 4× MetaX 曦云 C500 with INT8

Submitted by @evokernel-bot on 2026-04-18 · https://evokernel.dev/cases/case-gemma4-c500x4-001/

Stack

Hardware

metax-c500 × 4 (single-node PCIe)

Server

—

Interconnect

intra: MetaXLink · inter: none

Model

gemma-4 (bf16)

Engine

vllm 0.6.0

Quantization

int8

Parallelism

TP=4 · PP=1 · EP=1 · SP=1

Driver

MACA 2.5

OS

KylinOS 10

Scenario

Prefill seq

1024

Decode seq

256

Batch

Max concurrent

Results

Decode tok/s

580

Prefill tok/s

8200

TTFT p50

420

TBT p50

Memory/card

Power/card

320

Compute util %

Memory BW util %
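
The reported figures can be combined into a few derived metrics. What follows is a minimal sketch, assuming that the throughput numbers are aggregate across all four cards and that `320` is watts per card; the case page states neither. All function and constant names are illustrative.

```python
# Derived metrics from the reported results.
# Assumptions (NOT stated in the source): throughput is aggregate over all
# cards, and Power/card is in watts.

NUM_CARDS = 4
DECODE_TOK_S = 580.0       # reported Decode tok/s
PREFILL_TOK_S = 8200.0     # reported Prefill tok/s
POWER_PER_CARD_W = 320.0   # reported Power/card, unit assumed to be watts
PREFILL_LEN = 1024         # reported Prefill seq
DECODE_LEN = 256           # reported Decode seq


def derived_metrics() -> dict:
    """Return per-card and per-joule figures computed from the constants above."""
    total_power_w = NUM_CARDS * POWER_PER_CARD_W
    return {
        "decode_tok_s_per_card": DECODE_TOK_S / NUM_CARDS,
        "prefill_tok_s_per_card": PREFILL_TOK_S / NUM_CARDS,
        "decode_tok_per_joule": DECODE_TOK_S / total_power_w,
        # Time to process one sequence if the whole aggregate rate served it.
        "prefill_s_at_full_rate": PREFILL_LEN / PREFILL_TOK_S,
        "decode_s_at_full_rate": DECODE_LEN / DECODE_TOK_S,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.4f}")
```

Because batch size and concurrency are not reported, the two `*_at_full_rate` values are only lower bounds on per-request latency, not measurements.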

Same-model comparison

Throughput of this case vs. other cases for the same model

Bottleneck analysis: memory-bandwidth

Compute 28% · Memory BW 46% · Other 26%

Reproduction steps

vllm serve google/gemma-4-26b --device metax --tensor-parallel-size 4 --quantization int8

Benchmark tool: vllm benchmark_serving.py
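
The case names the benchmark tool but not its arguments. As a rough sketch, the scenario's sequence lengths (1024 in, 256 out) could be passed to `benchmark_serving.py` like this. The flag names are the ones the script is believed to accept around this vLLM version and should be checked against `--help` on the MetaX port. `num_prompts=200` is a hypothetical value, since the case reports neither batch size nor concurrency.

```python
import shlex


def benchmark_command(model: str, input_len: int, output_len: int,
                      num_prompts: int) -> list[str]:
    """Build an argument list for vLLM's benchmark_serving.py.

    Flag names are an assumption about the script version in use;
    verify them with `python benchmark_serving.py --help`.
    """
    return [
        "python", "benchmark_serving.py",
        "--backend", "vllm",
        "--model", model,
        "--dataset-name", "random",             # synthetic fixed-length prompts
        "--random-input-len", str(input_len),   # Prefill seq from the case
        "--random-output-len", str(output_len), # Decode seq from the case
        "--num-prompts", str(num_prompts),      # hypothetical; not in the case
    ]


cmd = benchmark_command("google/gemma-4-26b", 1024, 256, num_prompts=200)

if __name__ == "__main__":
    # Print a copy-pasteable shell line instead of executing it.
    print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run(cmd)` without shell quoting issues.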

Optimization pattern

memory-bound-decode-prefer-int8
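
The reasoning behind this pattern can be sketched with a simple weight-streaming model. If decode is limited by memory bandwidth, each step must read every weight once, so halving the bytes per weight halves the lower bound on step time. This sketch makes several assumptions that the source does not confirm: 26e9 parameters (read off the `26b` in the model name), a dense model whose weights shard evenly across TP ranks, and no KV-cache or activation traffic. The bandwidth constant is a placeholder, not a C500 specification.

```python
def weight_bytes_per_card(num_params: float, bytes_per_param: float,
                          tp: int) -> float:
    """Weight bytes one card reads per decode step, with even TP sharding."""
    return num_params * bytes_per_param / tp


def decode_step_lower_bound_s(weight_bytes: float,
                              mem_bw_bytes_s: float) -> float:
    """Minimum step time if the step only streams weights from memory."""
    return weight_bytes / mem_bw_bytes_s


PARAMS = 26e9                  # assumed from "26b" in the model name
TP = 4                         # reported tensor-parallel degree
PLACEHOLDER_BW = 1.0e12        # hypothetical bytes/s, NOT a C500 spec

bf16_bytes = weight_bytes_per_card(PARAMS, 2, TP)   # 2 bytes per bf16 weight
int8_bytes = weight_bytes_per_card(PARAMS, 1, TP)   # 1 byte per int8 weight

# The ratio does not depend on the bandwidth value chosen above.
ideal_speedup = (decode_step_lower_bound_s(bf16_bytes, PLACEHOLDER_BW)
                 / decode_step_lower_bound_s(int8_bytes, PLACEHOLDER_BW))

if __name__ == "__main__":
    print(f"bf16 weights per card: {bf16_bytes / 1e9:.1f} GB")
    print(f"int8 weights per card: {int8_bytes / 1e9:.1f} GB")
    print(f"ideal decode speedup from int8: {ideal_speedup:.1f}x")
```

The 2x figure is an upper bound under this model. The reported breakdown attributes only 46% of time to memory bandwidth, so the real gain would be smaller.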

References

[1] MetaX C500 + Gemma 4 community port testing: https://www.metax-tech.com/ · verified by measurement on 2026-04-28
Disclaimer: Numbers extracted from MetaX C500 community port; not independently re-run.