- [[Attention|Self-attention]]
    - Weighting the significance of parts of the input (see the attention sketch at the end of this note)
        - Including the model's own recursive output
    - Similar to [[RNN]]s
        - Process sequential data
        - Used for translation & text summarisation
    - Differences
        - Process the whole input at once
        - Largely replaced [[LSTM]] and gated recurrent unit (GRU) models, which had attention mechanisms added on
        - No recurrent structure

![[transformer-arch.png]]

## Examples

- BERT
    - Bidirectional Encoder Representations from Transformers
    - Google
- Original GPT

[transformers-explained-visually-part-1-overview-of-functionality](https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452)

# Architecture

## Input

- Byte-pair encoding tokeniser
- Tokens mapped via a word embedding into vectors
- Positional information added (sketched in code at the end of this note)

## Encoder/Decoder

- Similar to seq2seq models
    - Create an internal representation
- Encoder layers
    - Create encodings that capture which parts of the input are relevant to each other
    - Each subsequent encoder layer receives the previous encoder layer's output
- Decoder layers
    - Do the opposite: take the encodings and use their incorporated contextual information to produce the output sequence
    - Use attention to draw information from the outputs of previous decoders before drawing from the encoders
- Both use [[attention]]
- Both use [[MLP|dense]] layers for additional processing of their outputs
    - Contain residual connections & layer norm steps (see the encoder-layer sketch below)
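
A minimal sketch of the "weighting significance of parts of the input" idea: single-head scaled dot-product self-attention, softmax(QK^T / sqrt(d_k)) V. NumPy only; the dimensions and the projection matrices `w_q`, `w_k`, `w_v` are made-up placeholders, not from any library, and a real transformer uses multiple heads with learned weights.

```python
# Minimal single-head scaled dot-product self-attention sketch (NumPy).
# All names and dimensions are illustrative; weights are random, not learned.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # relevance of each position to every other
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v                           # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # (5, 8)
```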
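
A sketch of the input path described above: integer token ids (assumed to come from a BPE tokeniser, which is not shown) are looked up in an embedding table, then sinusoidal positional encodings are added. Vocabulary size, dimensions, and `token_ids` are placeholders.

```python
# Sketch of the input path: BPE token ids -> embedding lookup -> add positional encoding.
# The BPE tokeniser itself is not shown; token_ids below are made up.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])        # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])        # cosine on odd dimensions
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 16
embedding_table = rng.normal(size=(vocab_size, d_model))   # stand-in for learned embeddings

token_ids = np.array([12, 7, 401, 3])                      # pretend BPE output
x = embedding_table[token_ids]                             # (4, 16) word embeddings
x = x + positional_encoding(len(token_ids), d_model)       # inject order information
print(x.shape)                                             # (4, 16)
```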
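
Finally, a sketch of one encoder layer tying the pieces together: a self-attention sublayer and a dense (feed-forward) sublayer, each wrapped in a residual connection followed by layer norm. This is a simplified single-head, post-norm version with random placeholder weights, not a faithful reimplementation of any particular model.

```python
# One encoder layer: self-attention sublayer + dense sublayer,
# each with residual connection and layer norm. Weights are random placeholders.
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(x, w_q, w_k, w_v, w_o, w1, w2):
    # Sublayer 1: self-attention softmax(QK^T / sqrt(d_k)) V, then output projection
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    x = layer_norm(x + (weights @ v) @ w_o)      # residual + layer norm

    # Sublayer 2: position-wise dense (feed-forward) network with ReLU
    ff = np.maximum(0.0, x @ w1) @ w2
    return layer_norm(x + ff)                    # residual + layer norm

rng = np.random.default_rng(0)
d_model, d_k, d_ff, seq_len = 16, 8, 32, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
w_o = rng.normal(size=(d_k, d_model))
w1, w2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
print(encoder_layer(x, w_q, w_k, w_v, w_o, w1, w2).shape)   # (5, 16)
```

A decoder layer would look similar, with the addition of masking in its self-attention and a second attention step over the encoder's output.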