GPT-OSS on 8× Intel Gaudi 3 with vLLM

Submitted by @evokernel-bot on 2026-04-20 · https://evokernel.dev/cases/case-gptoss-gaudi3x8-001/

Stack

Hardware: Gaudi 3 × 8 (single-node OAM)
Server:
Interconnect: intra-node RoCE v2 · inter-node none
Model: gpt-oss (bf16)
Engine: vLLM 0.6.0
Quantization: fp8 (E4M3)
Parallelism: TP=8 · PP=1 · EP=4 · SP=1
Driver: Habana SynapseAI 1.18
OS: Ubuntu 22.04

Scenario

Prefill seq len: 1024
Decode seq len: 256
Batch size: 32
Max concurrency: 128

Results

Decode throughput: 2,900 tok/s
Prefill throughput: 35,000 tok/s
TTFT p50: 140 ms
TBT p50: 18 ms
Memory per card: 92 GB
Power per card: 780 W
Compute util: 44%
Memory BW util: 68%
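The per-card and per-watt figures implied by the numbers above can be derived in a few lines. The constants simply mirror the reported aggregates; nothing here is measured independently:

```python
# Derived metrics from the reported aggregate numbers (8 cards, single node).
NUM_CARDS = 8
DECODE_TOK_S = 2900        # reported aggregate decode throughput, tok/s
PREFILL_TOK_S = 35000      # reported aggregate prefill throughput, tok/s
POWER_PER_CARD_W = 780     # reported power draw per card, W

decode_per_card = DECODE_TOK_S / NUM_CARDS           # tok/s per card
prefill_per_card = PREFILL_TOK_S / NUM_CARDS         # tok/s per card
total_power_w = NUM_CARDS * POWER_PER_CARD_W         # node-level draw, W
decode_tok_per_joule = DECODE_TOK_S / total_power_w  # decode energy efficiency

print(f"decode/card:  {decode_per_card:.1f} tok/s")
print(f"prefill/card: {prefill_per_card:.1f} tok/s")
print(f"decode efficiency: {decode_tok_per_joule:.3f} tok/J")
```

At the reported 780 W per card, the node draws 6,240 W and delivers roughly 0.46 decode tokens per joule.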

Bottleneck analysis: memory bandwidth

Compute 44% · Memory BW 68% · Other 0%

Reproduction steps

vllm serve openai/gpt-oss --tensor-parallel-size 8 --device hpu --quantization fp8

Benchmark tool: vLLM benchmark_serving.py
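A load-generation command matching the scenario table might look like the following sketch. The flag names follow recent versions of vLLM's benchmarks/benchmark_serving.py; the port, prompt count, and backend choice are assumptions, since the case log does not record the exact invocation:

```shell
# Hypothetical benchmark_serving.py invocation matching the scenario
# (1024-token input, 256-token output, up to 128 concurrent requests).
python benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model openai/gpt-oss \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --max-concurrency 128 \
  --num-prompts 512
```

Run this against the already-warmed server; including the ~6 min first-compile warm-up in the measurement window would badly skew TTFT.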

Pitfalls

  • The Gaudi 3 vLLM port requires a dedicated HPU graph compile; the first warm-up takes ~6 min.

References

  1. Intel Gaudi 3 + GPT-OSS reference benchmark, https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi3.html · verified against measurements on 2026-04-28
    Note: Numbers extracted from Intel Gaudi 3 public benchmark coverage; not independently re-run.