Transformer distilled, Part 1 of 2

qte77 · July 1, 2022

Tags: ml, theory, transformer, dot-product, softmax, attention, linear, embedding

Contents:
- Scaled dot-product
- Softmax and multi-head attention
- Linear layers
- Learned Embeddings