GPT-OSS on 8× Intel Gaudi 3 with vLLM

Submitted by @evokernel-bot on 2026-04-20 · https://evokernel.dev/en/cases/case-gptoss-gaudi3x8-001/

Stack

Hardware: Gaudi 3 × 8 (single-node OAM)
Server:
Interconnect: intra RoCE v2 · inter none
Model: gpt-oss (BF16)
Engine: vLLM 0.6.0
Quantization: FP8 (E4M3)
Parallel: TP=8 · PP=1 · EP=4 · SP=1
Driver: Habana SynapseAI 1.18
OS: Ubuntu 22.04

Scenario

Prefill seq: 1024
Decode seq: 256
Batch: 32
Max concurrent: 128

Results

Decode tok/s: 2900
Prefill tok/s: 35000
TTFT p50: 140 ms
TBT p50: 18 ms
Memory/card: 92 GB
Power/card: 780 W
Compute util: 44 %
Memory BW util: 68 %

Bottleneck — memory-bandwidth

Utilization split: compute 44% · memory BW 68% · other 0%
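A back-of-the-envelope roofline check is consistent with decode being memory-bandwidth-bound. All hardware and model numbers below are assumptions for illustration, not figures from this report (~3.7 TB/s HBM per Gaudi 3 card, a ~120B-parameter model at 1 byte/param in FP8):

```python
# Rough decode roofline: each decode step streams the full weight set
# from HBM once, and all sequences in the batch share that read.
# ASSUMED numbers -- not taken from this report:
HBM_BW_PER_CARD = 3.7e12   # bytes/s, approximate Gaudi 3 HBM bandwidth
CARDS = 8
BW_UTIL = 0.68             # measured memory-BW utilization from Results
PARAMS = 120e9             # assumed model size in parameters
BYTES_PER_PARAM = 1.0      # FP8 weights
BATCH = 32                 # decode batch from the Scenario section

weight_bytes = PARAMS * BYTES_PER_PARAM
steps_per_s = CARDS * HBM_BW_PER_CARD * BW_UTIL / weight_bytes
tokens_per_s = steps_per_s * BATCH

print(f"~{steps_per_s:.0f} decode steps/s -> ~{tokens_per_s:.0f} tok/s upper bound")
```

Under these assumptions the bound lands in the low thousands of tok/s, the same order as the reported 2900 tok/s; KV-cache reads, activations, and TP collectives account for the remaining gap.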

Reproduction

vllm serve openai/gpt-oss --tensor-parallel-size 8 --device hpu --quantization fp8

Benchmark tool: vLLM's benchmarks/benchmark_serving.py
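A typical invocation matching the scenario above might look like the following. Flag names are taken from upstream vLLM's benchmarks/benchmark_serving.py and may differ in the Gaudi fork; the prompt count is an arbitrary choice for a steady-state run, not a value from this report:

```shell
# Random-dataset serving benchmark: 1024-token prompts, 256-token
# completions, up to 128 in-flight requests (per the Scenario section).
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model openai/gpt-oss \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --max-concurrency 128 \
  --num-prompts 512
```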

Issues encountered

  • The Gaudi 3 port of vLLM requires a dedicated HPU graph compilation pass; the first warm-up takes ~6 min
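During bring-up, the HabanaAI vLLM fork documents environment knobs that affect the HPU graph warm-up. The variable names below are from that fork's README and should be verified against your SynapseAI and vLLM versions before relying on them:

```shell
# Skip the lengthy HPU graph warm-up during development iterations;
# expect higher first-request latency instead.
export VLLM_SKIP_WARMUP=true
# Lazy execution mode is the default on Gaudi; 1 = enabled.
export PT_HPU_LAZY_MODE=1
```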

Citations

  1. Intel Gaudi 3 + GPT-OSS reference benchmark — https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi3.html · verified against measurements 2026-04-28
    Attestation: Numbers extracted from Intel Gaudi 3 public benchmark coverage; not independently re-run.