10 operators · FLOPs / bytes formulas · cross-hardware implementation differences

SiLU / Swish (activation): Sigmoid-weighted Linear Unit; default activation in modern LLMs (Llama, Qwen)
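A minimal NumPy sketch of the elementwise definition (function name and use of NumPy are illustrative; production kernels fuse this with the gated-FFN multiply):

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU / Swish: x * sigmoid(x), applied elementwise."""
    # Elementwise op: a handful of FLOPs per element, one read + one write per element,
    # so it is memory-bandwidth-bound rather than compute-bound.
    return x * (1.0 / (1.0 + np.exp(-x)))
```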
Softmax: applied over attention scores; numerically stable form
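A sketch of the numerically stable form in NumPy (illustrative only): subtracting the row max before exponentiating leaves the result unchanged, because the factor exp(-max) cancels between numerator and denominator, but it prevents overflow for large scores.

```python
import numpy as np

def softmax_stable(scores: np.ndarray, axis: int = -1) -> np.ndarray:
    """Softmax with the max-subtraction trick for numerical stability."""
    m = scores.max(axis=axis, keepdims=True)   # row-wise max
    e = np.exp(scores - m)                     # all exponents <= 0, so no overflow
    return e / e.sum(axis=axis, keepdims=True)
```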
Multi-Head Attention (MHA): standard multi-head attention; FLOPs and bytes are counted per layer, per token
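A rough per-layer, per-token cost model under common assumptions (a decode step with FP16/BF16 weights and KV cache; the symbols h for hidden size and s for context length are illustrative, not from the original):

```python
def mha_cost_per_token(h: int, s: int, dtype_bytes: int = 2):
    """Rough per-layer, per-token decode cost for standard MHA.

    h           - hidden size (n_heads * head_dim)
    s           - current context length (tokens already in the KV cache)
    dtype_bytes - bytes per element (2 for FP16/BF16)
    """
    proj_flops = 2 * 4 * h * h          # Q, K, V, O projections: four h x h GEMVs
    attn_flops = 2 * 2 * h * s          # q @ K^T and p @ V over s cached tokens
    flops = proj_flops + attn_flops
    weight_bytes = 4 * h * h * dtype_bytes   # projection weights streamed from HBM
    kv_bytes = 2 * s * h * dtype_bytes       # K and V cache reads
    return flops, weight_bytes + kv_bytes
```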
FlashAttention-3: Hopper-optimized FlashAttention v3 using TMA and FP8 paths
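The Hopper-specific TMA and FP8 paths do not reduce to a short host-side example, but the online-softmax tiling that all FlashAttention versions build on can be sketched in NumPy for a single query (shapes and block size are illustrative):

```python
import numpy as np

def online_softmax_attention(q, k, v, block: int = 64):
    """One query attended over (S, d) keys/values without materializing the full score row."""
    d = q.shape[0]
    m = -np.inf                                   # running max of the scores
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros_like(v[0], dtype=np.float64)   # running weighted sum of values
    for start in range(0, k.shape[0], block):
        s = k[start:start + block] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)                     # rescale what was accumulated so far
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v[start:start + block]
        m = m_new
    return acc / l
```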
All-to-All: used in expert parallelism for token dispatch and combine
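A single-process NumPy simulation of the dispatch/combine data movement (function and variable names are hypothetical; a real system would perform the exchange with an NCCL or MPI all-to-all):

```python
import numpy as np

def dispatch_combine_sim(tokens: np.ndarray, expert_rank: np.ndarray, n_ranks: int):
    """Simulate all-to-all dispatch/combine.

    tokens      - (T, d) activations held on this rank
    expert_rank - (T,) destination rank of the expert chosen for each token (non-negative ints)
    Returns the tokens grouped by destination rank, the per-rank send counts
    (the all-to-all payload sizes), and the permutation that combines expert
    outputs back into the original token order.
    """
    order = np.argsort(expert_rank, kind="stable")             # group tokens by destination
    send_counts = np.bincount(expert_rank, minlength=n_ranks)  # payload per destination rank
    dispatched = tokens[order]                                 # what would go over NVLink / IB
    inverse = np.argsort(order)                                # combine: undo the permutation
    return dispatched, send_counts, inverse
```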
All-Reduce: ring or tree all-reduce across the TP group; bandwidth-dominated
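The bandwidth-dominated claim follows from the standard ring all-reduce traffic estimate; a small helper (illustrative):

```python
def ring_allreduce_bytes_per_rank(message_bytes: int, n_ranks: int) -> float:
    """Ring all-reduce sends and receives 2 * (N - 1) / N * message_size bytes per rank.

    As N grows this approaches 2x the message size, so for large tensors the time is
    set by link bandwidth rather than by per-hop latency.
    """
    return 2.0 * (n_ranks - 1) * message_bytes / n_ranks
```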
RoPE: rotary positional embedding applied to the Q and K projections
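A NumPy sketch using the interleaved-pair convention (implementations differ: some rotate interleaved channel pairs, others split the head dimension in half; base 10000 is a common default but is model-specific):

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each channel pair (2i, 2i+1) of x (seq, head_dim) by pos * base**(-2i/d)."""
    seq, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,) per-pair frequencies
    angles = positions[:, None] * inv_freq[None, :]     # (seq, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```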
Matmul / GEMM: general matrix multiply; base operator for FFN and attention projections
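A small helper for the roofline-style accounting implied by the FLOPs / bytes framing (the dimension names m, n, k and the FP16 default are illustrative):

```python
def gemm_flops_bytes(m: int, n: int, k: int, dtype_bytes: int = 2):
    """FLOPs and minimum HBM traffic for an (m x k) @ (k x n) GEMM.

    FLOPs = 2 * m * n * k (one multiply + one add per term); bytes assume each
    matrix is read or written exactly once. Their ratio is the arithmetic
    intensity that decides compute- vs. memory-bound on a roofline plot.
    """
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * dtype_bytes
    return flops, bytes_moved, flops / bytes_moved
```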
MoE Router (top-k gating): top-k expert selection via softmax over router logits
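A NumPy sketch of top-k gating (illustrative; whether the gate weights are renormalized over the selected experts, and whether softmax is taken before or after the top-k, varies by model):

```python
import numpy as np

def topk_route(logits: np.ndarray, k: int = 2):
    """Softmax over router logits, keep the k largest experts per token.

    logits: (tokens, n_experts). Returns expert indices (tokens, k) and their
    gate weights renormalized over the chosen experts.
    """
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    idx = np.argsort(-probs, axis=-1)[:, :k]               # top-k expert ids per token
    gates = np.take_along_axis(probs, idx, axis=-1)
    gates /= gates.sum(axis=-1, keepdims=True)             # renormalize over chosen experts
    return idx, gates
```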
RMSNorm: root mean square normalization; cheaper than LayerNorm, common in modern LLMs
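A NumPy sketch; compared with LayerNorm it skips the mean subtraction and the bias term, which is where the savings come from (the eps value is illustrative):

```python
import numpy as np

def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last dim: scale by the inverse RMS, then by a learned weight."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```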