Scaling Laws for Neural Language Models
Cross-entropy loss for language models scales as a power law in each of three main factors: non-embedding model size N, dataset size D, and training compute C. Larger models are more sample-efficient, and under a fixed compute budget the optimal strategy is to train very large models on a modest amount of data and stop well before convergence to maximize performance.
Title: Neural language model performance scales with model size, dataset size, and training compute, following predictable power-law relationships.
Introduction and Key Findings: Language model loss follows smooth power-law scaling with model size, dataset size, and compute, and optimal compute allocation favors larger models trained for fewer steps.
Background, Notation, and Methods: Experiments use Transformers trained on WebText2, define non-embedding model size N and training compute C, and employ the Adam and Adafactor optimizers with early stopping to evaluate generalization.
Empirical Power Laws: Test loss exhibits simple power-law scaling with model size, dataset size, and compute when none of the other factors is a bottleneck, with Transformers outperforming LSTMs at long context positions.
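The three independent power laws can be sketched numerically. This is an illustrative sketch, not the paper's code: the exponents and scale constants below are the approximate fitted values reported in the paper (alpha_N ≈ 0.076, alpha_D ≈ 0.095, alpha_C ≈ 0.050), and the function names are my own.

```python
# Illustrative sketch of the three single-factor power laws for test loss,
# using approximate fitted constants reported in the paper; exact values
# depend on tokenization and dataset, so treat them as order-of-magnitude.

ALPHA_N, N_C = 0.076, 8.8e13   # N = non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D = dataset size in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # C_min = compute in PF-days (efficient frontier)

def loss_vs_model_size(n_params: float) -> float:
    """L(N) = (N_c / N)^alpha_N, valid when data and compute are not bottlenecks."""
    return (N_C / n_params) ** ALPHA_N

def loss_vs_dataset_size(n_tokens: float) -> float:
    """L(D) = (D_c / D)^alpha_D, valid for large models trained with early stopping."""
    return (D_C / n_tokens) ** ALPHA_D

def loss_vs_compute(pf_days: float) -> float:
    """L(C_min) = (C_c / C_min)^alpha_C along the compute-efficient frontier."""
    return (C_C / pf_days) ** ALPHA_C

# Doubling model size shrinks loss by the constant factor 2**(-alpha_N) ~ 0.949,
# independent of where you start -- that is what "power law" buys you.
print(loss_vs_model_size(2e9) / loss_vs_model_size(1e9))
```

The small exponents are the point: loss falls slowly but predictably over many orders of magnitude in each factor.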
Joint Dependence on Model Size and Dataset Size: A joint power-law function describes early-stopped test loss as a function of model and dataset size, yielding a relation for how fast the dataset must grow with model size to avoid overfitting.
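The joint fit proposed in the paper can be sketched as follows; the constants are the same approximate fitted values as the single-factor laws, and the structural point is that the model-size and data-size terms add inside the outer power.

```python
# Illustrative sketch of the paper's joint loss surface,
#   L(N, D) = [ (N_c/N)^(alpha_N/alpha_D) + D_c/D ]^alpha_D,
# with approximate fitted constants (treat as illustrative, not exact).

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def joint_loss(n_params: float, n_tokens: float) -> float:
    """Early-stopped test loss as a joint function of model and dataset size."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Keeping the data term in proportion to the model term as N grows requires
# D to scale like N^(alpha_N/alpha_D): sublinear growth in model size.
# With these rounded constants that exponent is 0.8; the paper's own fit
# quotes roughly N^0.74.
growth_exponent = ALPHA_N / ALPHA_D
print(growth_exponent)
```

Because the dataset only needs to grow sublinearly with model size, most of a growing budget can go into parameters rather than tokens, which foreshadows the compute-allocation result.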
Training Dynamics, Critical Batch Size, and Learning Curves: Training dynamics are characterized by a critical batch size that depends only on the current loss, with larger models exhibiting greater sample efficiency, reaching a given loss in fewer serial steps and on fewer data points.
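The loss-dependent critical batch size can be sketched from the paper's fitted form B_crit(L) = B* / L^(1/alpha_B); the constants below (B* ≈ 2e8 tokens, alpha_B ≈ 0.21) are the approximate reported values and should be read as illustrative.

```python
# Illustrative sketch of the critical batch size, the boundary between
# time-efficient and compute-efficient training, as a function of loss only.

B_STAR = 2e8      # tokens; approximate fitted scale from the paper
ALPHA_B = 0.21    # approximate fitted exponent

def critical_batch_size(loss: float) -> float:
    """B_crit(L) = B* / L^(1/alpha_B), in tokens."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

# As training progresses and the loss falls, the critical batch size grows,
# so the batch size can be ramped up over the course of a run.
print(critical_batch_size(3.0) < critical_batch_size(2.0))  # True
```

Training at roughly the critical batch size balances the two costs: far below it you waste serial steps, far above it you waste compute.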
Optimal Allocation of a Fixed Compute Budget: As the compute budget grows, most of the increase should go into model size N, with only modest increases in batch size and serial steps, leading to compute-efficient training of very large models.
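The compute split can be sketched with the paper's approximate fitted exponents, N ∝ C^0.73, B ∝ C^0.24, S ∝ C^0.03; note they sum to about 1, consistent with compute scaling roughly as the product of model size, batch size, and steps.

```python
# Illustrative sketch of compute-efficient allocation: how model size,
# batch size, and serial steps should each grow when the budget grows.
# Exponents are the approximate fitted values reported in the paper.

P_N, P_B, P_S = 0.73, 0.24, 0.03

def scale_factors(compute_multiplier: float) -> dict:
    """Multiplicative growth of each quantity when compute grows
    by `compute_multiplier` along the compute-efficient frontier."""
    return {
        "model_size":   compute_multiplier ** P_N,
        "batch_size":   compute_multiplier ** P_B,
        "serial_steps": compute_multiplier ** P_S,
    }

# With 10x more compute, model size grows ~5.4x, batch size ~1.7x,
# and serial steps barely move (~1.07x): almost all the budget goes to N.
print(scale_factors(10.0))
```

This is the quantitative form of the section's takeaway: bigger models, only slightly bigger batches, and nearly the same number of steps.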
Contradictions at Extreme Scale: Extrapolating the scaling laws far enough yields mutually inconsistent predictions, suggesting a breakdown or qualitative change in data requirements or model behavior as those limits are approached.
Discussion and Practical Takeaways: The empirical scaling laws advise practitioners to favor larger models and earlier stopping for better performance per unit of compute, with architectural choices such as depth versus width having only minor effects.
Appendices: The appendices summarize the key power-law fits, list fitted exponents and scale constants, and note caveats regarding theoretical explanations and experimental limitations.