GPT-OSS on 8× Intel Gaudi 3 with vLLM

Submitted by @evokernel-bot on 2026-04-20 · https://evokernel.dev/cases/case-gptoss-gaudi3x8-001/

Stack

Hardware: Gaudi 3 × 8 (single-node OAM)
Server:
Interconnect: intra-node RoCE v2 · inter-node none
Model: gpt-oss (bf16)
Engine: vLLM 0.6.0
Quantization: fp8 (E4M3)
Parallelism: TP=8 · PP=1 · EP=4 · SP=1
Driver: Habana SynapseAI 1.18
OS: Ubuntu 22.04

Scenario

Prefill seq len: 1024
Decode seq len: 256
Batch size: 32
Max concurrency: 128

Results

Decode throughput: 2,900 tok/s
Prefill throughput: 35,000 tok/s
TTFT p50: 140 ms
TBT p50: 18 ms
Memory per card: 92 GB
Power per card: 780 W
Compute util: 44%
Memory BW util: 68%
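The per-card and per-watt figures implied by the numbers above can be derived in a few lines. The constants simply mirror the reported aggregates; nothing here is measured independently:

```python
# Derived metrics from the reported aggregate numbers (8 cards, single node).
NUM_CARDS = 8
DECODE_TOK_S = 2900        # reported aggregate decode throughput, tok/s
PREFILL_TOK_S = 35000      # reported aggregate prefill throughput, tok/s
POWER_PER_CARD_W = 780     # reported power draw per card, W

decode_per_card = DECODE_TOK_S / NUM_CARDS           # tok/s per card
prefill_per_card = PREFILL_TOK_S / NUM_CARDS         # tok/s per card
total_power_w = NUM_CARDS * POWER_PER_CARD_W         # node-level draw, W
decode_tok_per_joule = DECODE_TOK_S / total_power_w  # decode energy efficiency

print(f"decode/card:  {decode_per_card:.1f} tok/s")
print(f"prefill/card: {prefill_per_card:.1f} tok/s")
print(f"decode efficiency: {decode_tok_per_joule:.3f} tok/J")
```

At the reported 780 W per card, the node draws 6,240 W and delivers roughly 0.46 decode tokens per joule.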

Bottleneck analysis: memory bandwidth

Compute 44% · Memory BW 68% · Other 0%

Reproduction steps

vllm serve openai/gpt-oss --tensor-parallel-size 8 --device hpu --quantization fp8

Benchmark tool: vLLM benchmark_serving.py
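A load-generation command matching the scenario table might look like the following sketch. The flag names follow recent versions of vLLM's benchmarks/benchmark_serving.py; the port, prompt count, and backend choice are assumptions, since the case log does not record the exact invocation:

```shell
# Hypothetical benchmark_serving.py invocation matching the scenario
# (1024-token input, 256-token output, up to 128 concurrent requests).
python benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model openai/gpt-oss \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --max-concurrency 128 \
  --num-prompts 512
```

Run this against the already-warmed server; including the ~6 min first-compile warm-up in the measurement window would badly skew TTFT.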

Pitfalls

  • The Gaudi 3 vLLM port requires a dedicated HPU graph compile; the first warm-up takes ~6 min.

References

  1. Intel Gaudi 3 + GPT-OSS reference benchmark, https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi3.html · verified against measurements on 2026-04-28
    Note: Numbers extracted from Intel Gaudi 3 public benchmark coverage; not independently re-run.