Attention Is All You Need
This paper introduces the Transformer, a novel neural network architecture that relies solely on attention mechanisms, dispensing with recurrence and convolutions. It achieves state-of-the-art results in machine translation while requiring significantly less training time.
Title: The Transformer introduces a self-attention–based architecture that replaces recurrence and convolution, enabling parallel training and achieving strong translation results.
Introduction: The Transformer replaces recurrence with attention to model dependencies and enables significant parallelization, achieving state-of-the-art translation performance.
Model Architecture: The Transformer uses an encoder–decoder framework to map input symbol sequences to output sequences.
Encoder and Decoder Stacks: The encoder stacks six layers with multi-head self-attention and feed-forward networks; the decoder stacks six layers with masked self-attention and encoder–decoder attention.
Scaled Dot-Product Attention: Attention computes a weighted sum of values from queries and keys using a scaled dot product and softmax.
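The formula summarized here, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, can be sketched directly in NumPy. This is a minimal illustration with toy dimensions, not the paper's batched implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled by sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights  # weighted sum of values, plus the weights

# toy example: 2 queries attending over 3 key/value pairs, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

The sqrt(d_k) scaling keeps the dot products from growing large with dimension, which would push the softmax into regions of tiny gradients.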
Multi-Head Attention: Multiple attention heads operate in parallel to attend to information from different representation subspaces.
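A minimal NumPy sketch of the multi-head mechanism: project the input, split it into per-head subspaces, run scaled dot-product attention in each, then concatenate and apply an output projection. The projection matrices (`Wq`, `Wk`, `Wv`, `Wo`) and the toy dimensions are illustrative placeholders, not the paper's trained weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); W*: (d_model, d_model) projections."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads          # each head works in a d_k-dim subspace
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_k, (h + 1) * d_k)
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    return np.concatenate(heads, axis=-1) @ Wo  # concat heads, project back

# toy usage: sequence of 5 tokens, d_model = 8, 2 heads
rng = np.random.default_rng(1)
d_model, seq_len, h = 8, 5, 2
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, h, Wq, Wk, Wv, Wo)
```

Because each head attends within its own subspace, the heads can specialize on different relations between positions.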
Applications of Attention: Attention is used in encoder–decoder attention, encoder self-attention, and decoder self-attention with masking.
Position-wise Feed-Forward Networks: Each encoder and decoder layer includes a position-wise feed-forward network with two linear transforms and a ReLU.
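The feed-forward network described here is FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position independently. A small NumPy sketch with the paper's default dimensions (d_model = 512, inner dimension d_ff = 2048):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU between two linears

d_model, d_ff, seq_len = 512, 2048, 3
rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
x = rng.standard_normal((seq_len, d_model))
y = position_wise_ffn(x, W1, b1, W2, b2)  # same shape as the input
```

Because the same weights are applied at every position, this is equivalent to two convolutions with kernel size 1.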
Embeddings and Softmax: Token embeddings are learned and shared with the pre-softmax projection, followed by a softmax to produce probabilities.
Per-layer Complexity: Table 1 compares per-layer complexity, sequential operations, and maximum path length across layer types.
Positional Encoding: Positional encodings are added to embeddings to inject sequence order using sinusoidal functions.
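The sinusoidal scheme uses PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A direct NumPy sketch (assuming an even d_model for simplicity):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings; rows are positions, columns are dims."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)  # encodings for 50 positions, d_model = 16
```

Each dimension is a sinusoid of a different wavelength, so relative offsets between positions correspond to fixed linear transformations of the encoding.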
Why Self-Attention: Self-attention reduces sequential computation and shortens dependency paths, enabling faster training than recurrence or convolution.
Restricting Self-Attention: Restricting attention to a neighborhood reduces computation but increases the maximum path length.
Convolution vs Self-Attention: Convolutions require multiple layers to connect all positions, whereas self-attention with a feed-forward layer provides efficient, global connectivity.
Training: The Transformer is trained on large translation datasets with Adam and a warmup-based learning rate schedule.
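The warmup schedule mentioned here follows the paper's formula lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)): the rate grows linearly for the first warmup_steps steps, then decays with the inverse square root of the step number. A small sketch using the paper's defaults (d_model = 512, warmup_steps = 4000):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given step: linear warmup, then 1/sqrt(step) decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# the rate rises during warmup and peaks exactly at step == warmup_steps
rates = [transformer_lrate(s) for s in (100, 2000, 4000, 10000, 100000)]
```

Both branches of the min() agree at step = warmup_steps, so the schedule is continuous at the crossover.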