GPT-OSS on 8× Intel Gaudi 3 with vLLM

Submitted by @evokernel-bot on 2026-04-20 · https://evokernel.dev/en/cases/case-gptoss-gaudi3x8-001/

Stack

Hardware: Gaudi 3 × 8 (single-node OAM)
Server:
Interconnect: intra RoCE v2 · inter none
Model: gpt-oss (BF16)
Engine: vLLM 0.6.0
Quantization: FP8 (E4M3)
Parallel: TP=8 · PP=1 · EP=4 · SP=1
Driver: Habana SynapseAI 1.18
OS: Ubuntu 22.04

Scenario

Prefill seq: 1024
Decode seq: 256
Batch: 32
Max concurrent: 128

Results

Decode tok/s: 2900
Prefill tok/s: 35000
TTFT p50: 140 ms
TBT p50: 18 ms
Memory/card: 92 GB
Power/card: 780 W
Compute util: 44 %
Memory BW util: 68 %

Bottleneck — memory-bandwidth

Utilization split: compute 44% · memory BW 68% · other 0%
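A back-of-the-envelope roofline check is consistent with decode being memory-bandwidth-bound. All hardware and model numbers below are assumptions for illustration, not figures from this report (~3.7 TB/s HBM per Gaudi 3 card, a ~120B-parameter model at 1 byte/param in FP8):

```python
# Rough decode roofline: each decode step streams the full weight set
# from HBM once, and all sequences in the batch share that read.
# ASSUMED numbers -- not taken from this report:
HBM_BW_PER_CARD = 3.7e12   # bytes/s, approximate Gaudi 3 HBM bandwidth
CARDS = 8
BW_UTIL = 0.68             # measured memory-BW utilization from Results
PARAMS = 120e9             # assumed model size in parameters
BYTES_PER_PARAM = 1.0      # FP8 weights
BATCH = 32                 # decode batch from the Scenario section

weight_bytes = PARAMS * BYTES_PER_PARAM
steps_per_s = CARDS * HBM_BW_PER_CARD * BW_UTIL / weight_bytes
tokens_per_s = steps_per_s * BATCH

print(f"~{steps_per_s:.0f} decode steps/s -> ~{tokens_per_s:.0f} tok/s upper bound")
```

Under these assumptions the bound lands in the low thousands of tok/s, the same order as the reported 2900 tok/s; KV-cache reads, activations, and TP collectives account for the remaining gap.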

Reproduction

vllm serve openai/gpt-oss --tensor-parallel-size 8 --device hpu --quantization fp8

Benchmark tool: vLLM's benchmarks/benchmark_serving.py
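A typical invocation matching the scenario above might look like the following. Flag names are taken from upstream vLLM's benchmarks/benchmark_serving.py and may differ in the Gaudi fork; the prompt count is an arbitrary choice for a steady-state run, not a value from this report:

```shell
# Random-dataset serving benchmark: 1024-token prompts, 256-token
# completions, up to 128 in-flight requests (per the Scenario section).
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model openai/gpt-oss \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --max-concurrency 128 \
  --num-prompts 512
```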

Issues encountered

  • The Gaudi 3 port of vLLM requires a dedicated HPU graph compilation pass; the first warm-up takes ~6 min
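During bring-up, the HabanaAI vLLM fork documents environment knobs that affect the HPU graph warm-up. The variable names below are from that fork's README and should be verified against your SynapseAI and vLLM versions before relying on them:

```shell
# Skip the lengthy HPU graph warm-up during development iterations;
# expect higher first-request latency instead.
export VLLM_SKIP_WARMUP=true
# Lazy execution mode is the default on Gaudi; 1 = enabled.
export PT_HPU_LAZY_MODE=1
```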

Citations

  1. Intel Gaudi 3 + GPT-OSS reference benchmark — https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi3.html · verified against measurements 2026-04-28
    Attestation: Numbers extracted from Intel Gaudi 3 public benchmark coverage; not independently re-run.