Distilling the Knowledge in a Neural Network
Distillation transfers knowledge from a large ensemble or cumbersome model to a smaller model by training the smaller model on soft targets produced by the large model with temperature scaling. This enables a compact model to achieve performance close to that of the ensemble, and is demonstrated on MNIST, a speech recognition task, and a large-scale image dataset.
Abstract: Knowledge from an ensemble of models can be compressed into a single, deployable model using distillation, significantly improving performance on tasks such as MNIST and speech recognition.
Introduction and Motivation: Distillation transfers knowledge from a large, cumbersome model trained on massive datasets into a smaller model suited to deployment. The authors draw an analogy to insect metamorphosis: training and deployment, like larval and adult stages, have different requirements and can use different forms.
Using Soft Targets to Transfer Generalization: Distillation trains the smaller model on soft targets, the full probability distributions produced by the cumbersome model, so that it learns to generalize in the same way; this outperforms standard training on hard targets alone.
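The soft targets above come from a temperature-scaled softmax: dividing the logits by a temperature T > 1 before normalizing reveals the relative probabilities the model assigns to wrong classes. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Convert logits to probabilities, softened by temperature T.

    Higher T yields a softer distribution that exposes the model's
    relative confidences across incorrect classes.
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [10.0, 5.0, 1.0]
hard = softmax_with_temperature(logits, T=1.0)  # nearly one-hot
soft = softmax_with_temperature(logits, T=5.0)  # much softer targets
```

At T=1 the first class absorbs almost all the probability mass; at T=5 the second and third classes receive visible probabilities, which is the extra information the student learns from.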
Distillation Method and Matching Logits: The smaller model is trained to match the cumbersome model's softmax outputs computed at a high temperature (the soft targets); directly matching logits emerges as a special case in the high-temperature limit.
Preliminary Experiments on MNIST: Distillation markedly improves a smaller MNIST model by training it to match soft targets from a larger model, even when the transfer set omits all examples of certain classes.
Experiments on Speech Recognition and Distillation Results: Distillation transfers most of the performance gain of an ensemble of ten deep neural network acoustic models into a single deployable model, which achieves results comparable to the ensemble.
Training Ensembles of Specialists on Very Large Datasets: Specialist models, each focused on a confusable subset of classes, make ensemble training feasible on very large datasets; each specialist is initialized from the weights of a generalist model, and the specialists are trained in parallel.
Specialist Ensemble Results on JFT and the Power of Soft Targets: Combining specialists with soft-target regularization yields significant accuracy gains on the large-scale JFT dataset, showing that soft targets both transfer knowledge effectively and act as a strong regularizer when data is limited.
Relationship to Mixtures of Experts, Discussion and Conclusions: Distillation effectively transfers knowledge from ensembles or large models into smaller ones, and specialist ensembles improve performance on very large datasets; distilling the specialists' knowledge back into a single large net remains an open problem.