Distilling the Knowledge in a Neural Network
Distillation transfers knowledge from a large ensemble or cumbersome model to a smaller model by training the smaller model on soft targets produced by the large model with temperature scaling. This enables a compact model to achieve performance close to that of the ensemble, and is demonstrated on MNIST, a speech recognition task, and a large-scale image dataset.
Abstract: Knowledge from an ensemble of models can be compressed into a single, deployable model using distillation, significantly improving performance on tasks such as MNIST and speech recognition.
Introduction and Motivation: Distillation transfers knowledge from a large, cumbersome model trained on massive datasets into a smaller model suited to deployment. The authors draw an analogy to insect metamorphosis: training and deployment, like larval and adult stages, have different requirements and can use different forms.
Using Soft Targets to Transfer Generalization: Distillation trains the smaller model on soft targets, the full probability distributions produced by the cumbersome model, so that it learns to generalize in the same way; this outperforms standard training on hard targets alone.
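The soft targets above come from a temperature-scaled softmax: dividing the logits by a temperature T > 1 before normalizing reveals the relative probabilities the model assigns to wrong classes. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Convert logits to probabilities, softened by temperature T.

    Higher T yields a softer distribution that exposes the model's
    relative confidences across incorrect classes.
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [10.0, 5.0, 1.0]
hard = softmax_with_temperature(logits, T=1.0)  # nearly one-hot
soft = softmax_with_temperature(logits, T=5.0)  # much softer targets
```

At T=1 the first class absorbs almost all the probability mass; at T=5 the second and third classes receive visible probabilities, which is the extra information the student learns from.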
Distillation Method and Matching Logits: The smaller model is trained to match the cumbersome model's softmax outputs computed at a high temperature (the soft targets); directly matching logits emerges as a special case in the high-temperature limit.
Preliminary Experiments on MNIST: Distillation markedly improves a smaller MNIST model by training it to match soft targets from a larger model, even when the transfer set omits all examples of certain classes.
Experiments on Speech Recognition and Distillation Results: Distillation transfers most of the performance gain of an ensemble of ten deep neural network acoustic models into a single deployable model, which achieves results comparable to the ensemble.
Training Ensembles of Specialists on Very Large Datasets: Specialist models, each focused on a confusable subset of classes, make ensemble training feasible on very large datasets; each specialist is initialized from the weights of a generalist model, and the specialists are trained in parallel.
Specialist Ensemble Results on JFT and the Power of Soft Targets: Combining specialists with soft-target regularization yields significant accuracy gains on the large-scale JFT dataset, showing that soft targets both transfer knowledge effectively and act as a strong regularizer when data is limited.
Relationship to Mixtures of Experts, Discussion and Conclusions: Distillation effectively transfers knowledge from ensembles or large models into smaller ones, and specialist ensembles improve performance on very large datasets; distilling the specialists' knowledge back into a single large net remains an open problem.