Transformer distilled, Part 1 of 2

qte77 · July 1, 2022

Tags: ml, theory, transformer, dot-product, softmax, attention, linear, embedding

Contents:
- Scaled dot-product
- Softmax and multi-head attention
- Linear layers
- Learned Embeddings