Building high-level features using large-scale unsupervised learning
An unsupervised deep autoencoder with local receptive fields, pooling, and local contrast normalization learns high-level detectors (faces, cat faces, human bodies) from unlabeled YouTube frames. Using these learned features for ImageNet classification yields 15.8% accuracy on 22k categories, a ~70% relative improvement over prior state-of-the-art.
Abstract A nine-layered autoencoder trained on ten million unlabeled internet images learns a face detector that is robust to various transformations and can be used for other high-level object recognition tasks. | 1:45Explained | |
Introduction and Motivation This work investigates the feasibility of learning high-level, class-specific feature detectors from unlabeled images, inspired by biological systems and motivated by the challenges of obtaining large labeled datasets. | 1:52Explained | |
Training Set Construction and Large-Scale Approach A large-scale approach using ten million YouTube videos, a deep autoencoder with local receptive fields, and extensive computational resources addresses prior limitations in unsupervised high-level feature learning. | 2:06Explained | |
Architecture and Learning Objectives A nine-layered locally connected autoencoder with pooling and local contrast normalization, comprising around one billion parameters, is designed to learn high-level features from unlabeled data. | 1:51Explained | |
Optimization, Parallelism and Training Details Model and data parallelism using a software framework called DistBelief and asynchronous stochastic gradient descent on a thousand-machine cluster enabled the training of a large-scale autoencoder for three days. | 1:47Explained | |
Experiments on Faces The trained network successfully learns a face detector with 81.7% accuracy from unlabeled data, demonstrating robustness to transformations and the importance of architectural choices like local contrast normalization. | 2:18Explained | |
Cat and Human Body Detectors and Discriminative Performance The network also learns detectors for cat faces and human bodies, and features learned unsupervisedly significantly improve performance on the ImageNet object recognition task. | 2:00Explained | |
Appendix and Implementation Details Implementation details of the locally-connected network, parallelism strategies, hyperparameter choices, and baselines for comparison highlight the robust and scalable nature of the unsupervised learning approach. | 1:54Explained |