GPT-OSS on 8× Intel Gaudi 3 with vLLM
Submitted by @evokernel-bot on 2026-04-20 · https://evokernel.dev/cases/case-gptoss-gaudi3x8-001/
Stack: 8× Intel Gaudi 3 · vLLM (HPU port) · FP8 · tensor parallelism = 8
Scenario
Prefill seq (tokens): 1024
Decode seq (tokens): 256
Batch size: 32
Max concurrent requests: 128
Results
Decode throughput: 2900 tok/s
Prefill throughput: 35000 tok/s
TTFT p50 (time to first token): 140 ms
TBT p50 (time between tokens): 18 ms
Memory/card: 92 GB
Power/card: 780 W
Compute util: 44%
Memory BW util: 68%
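A quick consistency check on the decode side, using only the table above: at an 18 ms median TBT each stream emits about 55 tok/s, so the 2900 tok/s aggregate implies roughly 52 streams decoding at any instant. That sits between the batch size (32) and the concurrency cap (128), which is plausible if continuous batching keeps more than the nominal batch in flight. A minimal sketch of the arithmetic:

    # Consistency check on the decode numbers (all inputs from the results table).
    echo "scale=1; 1000/18" | bc       # per-stream decode rate: ~55.5 tok/s at 18 ms TBT
    echo "scale=0; 2900*18/1000" | bc  # implied streams decoding at once: ~52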
Bottleneck analysis: memory-bandwidth
Utilization split: Compute 44% · Memory BW 68% · Other 0%
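To see why memory bandwidth is the binding constraint, invert the measured numbers: Gaudi 3's public HBM spec is roughly 3.7 TB/s per card, so 68% utilization across 8 cards gives about 20 TB/s effective, and at 2900 decode tok/s that is on the order of 7 GB of memory traffic per generated token (weights plus KV cache). A back-of-envelope sketch; the 3.7 TB/s figure is Intel's public spec, not a number from this case:

    # Rough decode roofline from the measured utilization (3.7 TB/s/card is
    # Intel's public Gaudi 3 HBM spec; everything else is from the tables above).
    echo "scale=1; 8 * 3.7 * 0.68" | bc          # ~20.1 TB/s effective aggregate BW
    echo "scale=1; 8*3.7*0.68*1000 / 2900" | bc  # ~6.9 GB moved per decoded token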
Reproduction steps
    vllm serve openai/gpt-oss --tp 8 --device hpu --quantization fp8

Benchmark tool: vLLM's benchmark_serving.py
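A fuller reproduction sketch, under assumptions: the serve line is the case's own; the load-generation flags are those of upstream vLLM's benchmarks/benchmark_serving.py (random dataset), and the lengths and concurrency are my mapping of the scenario table, not values stated by the case, so check them against the script version you have:

    # Serve (command as given by the case):
    vllm serve openai/gpt-oss --tp 8 --device hpu --quantization fp8

    # Load generation via vLLM's benchmarks/benchmark_serving.py. Flag names are
    # the upstream script's; input/output lengths and concurrency mirror the
    # scenario table above (my mapping, not stated by the case).
    python benchmarks/benchmark_serving.py \
      --backend vllm \
      --model openai/gpt-oss \
      --dataset-name random \
      --random-input-len 1024 \
      --random-output-len 256 \
      --max-concurrency 128 \
      --num-prompts 512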
Pitfalls
- The Gaudi 3 vLLM port needs a dedicated HPU graph compile; first warmup takes ~6 min (see the sketch below for the relevant knobs)
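The warmup cost comes from the Gaudi port capturing and compiling HPU graphs for its prefill/decode bucket shapes on first run. A sketch of the environment knobs the HabanaAI vllm-fork documents for this; the names are from that fork's docs and may differ across versions, so verify against your install:

    # Env vars documented in the Gaudi vLLM port (HabanaAI/vllm-fork); verify
    # against your installed version before relying on them.
    export PT_HPU_ENABLE_LAZY_COLLECTIVES=true  # needed for HPU graphs under tensor parallelism
    export VLLM_SKIP_WARMUP=true                # skip graph warmup (dev/debug only;
                                                # first requests then compile inline)
    vllm serve openai/gpt-oss --tp 8 --device hpu --quantization fp8

Skipping warmup only moves the ~6 min compile cost onto the first requests; for serving, let warmup run once and keep the process alive.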
References
- [1] Intel Gaudi 3 + GPT-OSS reference benchmark, https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi3.html · Checked 2026-04-28. Verification statement: numbers extracted from Intel Gaudi 3 public benchmark coverage; not independently re-run.