Transcript

Deep Residual Learning for Image Recognition

The paper introduces residual learning with identity shortcut connections, reformulating layers to learn residual functions (F(x) = H(x) - x), which makes very deep networks easier to train. It demonstrates that extremely deep ResNets (up to 152 layers) achieve state-of-the-art results on ImageNet and COCO, showing that depth can improve performance when residual learning eases optimization.

Abstract

# Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Microsoft Research.

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won first place in the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are the foundations of our submissions to the ILSVRC and COCO 2015 competitions, where we also won first place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Introduction and Motivation

## 1. Introduction and Motivation

Deep convolutional neural networks have led to a series of breakthroughs for image classification. Deep networks naturally integrate low-, mid-, and high-level features and classifiers in an end-to-end multi-layer fashion, and the levels of features can be enriched by the number of stacked layers, i.e., depth. Recent evidence reveals that network depth is of crucial importance, and the leading results on the challenging ImageNet dataset all exploit very deep models with a depth from sixteen to thirty layers or more. Many other non-trivial visual recognition tasks have also greatly benefited from very deep models.

Driven by the significance of depth, a question arises: is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing and exploding gradients, which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent with backpropagation.

Once deeper networks are able to start converging, a degradation problem is exposed: as network depth increases, accuracy saturates and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in prior work and verified by our experiments.

The degradation of training accuracy indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mappings and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. Experiments show, however, that our current solvers are unable to find solutions that are comparably good or better than the constructed solution in feasible time.
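The construction argument can be checked numerically. The following is a minimal sketch in plain Python, assuming toy fully connected ReLU layers with made-up weights; the helper names (`relu`, `linear`, `forward`, `identity`) and the matrices are illustrative and do not come from the paper. It shows that appending identity layers to a shallower network leaves its output unchanged, so the deeper network can in principle match the shallower one.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def forward(layers, x):
    """Apply each weight matrix followed by ReLU."""
    for W in layers:
        x = relu(linear(W, x))
    return x


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical "learned" shallow model (arbitrary toy weights).
shallow = [
    [[0.5, -1.0], [1.5, 0.25]],
    [[1.0, 2.0], [-0.5, 0.75]],
]

# Deeper counterpart: copy the shallow layers, then add identity layers.
# ReLU outputs are non-negative, so relu(I h) == h for the added layers.
deeper = shallow + [identity(2), identity(2)]

x = [1.0, 2.0]
out_shallow = forward(shallow, x)
out_deeper = forward(deeper, x)
print(out_shallow, out_deeper)
```

The sketch only shows that such a solution exists; the degradation problem is that solvers do not find it, or anything as good, in practice.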

Residual Learning Framework

## 2. Residual Learning Framework and Related Work

In this paper we address the degradation problem by introducing a deep residual learning framework. Instead of hoping that each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) = H(x) - x, and the original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation F(x) + x can be realized by feedforward neural networks with shortcut connections that skip one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation and can be easily implemented using common libraries without modifying the solvers.

Residual representations have analogies in other fields, such as VLAD and Fisher vector encodings, where residuals are encoded with respect to a dictionary, and numerical solvers like Multigrid, where reformulation into residual subproblems accelerates convergence. Shortcut connections have been studied in earlier work, including practices that add a linear layer directly from input to output and methods that connect intermediate layers to auxiliary classifiers. Concurrent methods such as highway networks introduce parameterized gating functions on shortcuts, whereas our identity shortcuts are parameter-free and always pass information through.
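The reformulation can be illustrated with a small numeric sketch. It is written in plain Python under stated assumptions: the function names and the example target H(x) = 1.5 x are hypothetical and chosen only for illustration. The block computes F(x) + x with a parameter-free identity shortcut. When the desired mapping is the identity, the residual the layers must produce is zero.

```python
def make_residual_block(residual_fn):
    """Return a block computing F(x) + x with an identity shortcut."""
    def block(x):
        fx = residual_fn(x)
        # Element-wise addition; the shortcut itself has no parameters.
        return [f + a for f, a in zip(fx, x)]
    return block


def zero_residual(x):
    """Case 1: the optimal mapping is the identity, so F(x) = 0."""
    return [0.0 for _ in x]


def target(x):
    """Case 2: a hypothetical desired mapping H(x) = 1.5 * x."""
    return [1.5 * a for a in x]


def residual_of_target(x):
    """The residual the layers must fit: F(x) = H(x) - x."""
    return [h - a for h, a in zip(target(x), x)]


identity_block = make_residual_block(zero_residual)
scaled_block = make_residual_block(residual_of_target)

x = [2.0, -4.0, 0.5]
print(identity_block(x))  # same values as x
print(scaled_block(x))    # same values as target(x)
```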

Design and Implementation

## 3. Deep Residual Learning Design and Implementation

Let H(x) be an underlying mapping to be fit by a few stacked layers, with x denoting the input to the first of these layers. Rather than expecting the stacked layers to approximate H(x) directly, we explicitly let these layers approximate a residual function F(x) = H(x) - x, so that the original function becomes F(x) + x. If identity mappings are optimal, the solvers may simply drive the weights of the nonlinear layers toward zero to approach identity mappings.

Formally, a building block is defined as y = F(x, {W_i}) + x, where x and y are the input and output vectors of the considered layers and F(x, {W_i}) represents the residual mapping to be learned. For example, with two layers, F = W2 σ(W1 x), where σ denotes ReLU and biases are omitted. The operation F + x is performed by a shortcut connection and an element-wise addition, and we adopt the second nonlinearity after the addition. The dimensions of x and F must be equal in this formulation; when they are not, a linear projection Ws can be performed by the shortcut connections to match dimensions, so that y = F(x, {W_i}) + Ws x. In practice, identity mappings are sufficient for addressing the degradation problem, and projection shortcuts are used only to match dimensions.

We consider plain network baselines inspired by VGG-style designs, in which convolutional layers mostly have 3 by 3 filters and two design rules are followed: layers operating on feature maps of the same size have the same number of filters, and the number of filters is doubled when the feature map size is halved, to preserve time complexity per layer. Downsampling is performed by convolutional layers with stride 2, and the network ends with global average pooling and a fully connected layer with softmax. Residual networks are obtained by inserting shortcut connections into the plain baselines.
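The building block described above can be sketched as follows. This is a minimal illustration in plain Python on vectors rather than convolutional feature maps; the function names and all weight values are assumptions made for the example. It applies the second ReLU after the addition and uses the projection Ws only when the dimensions of F(x) and x differ.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2, Ws=None):
    """Compute relu(F(x) + shortcut(x)) with F(x) = W2 relu(W1 x).

    Biases are omitted, as in the text. Ws is an optional linear
    projection for the case where F(x) and x differ in dimension.
    """
    fx = matvec(W2, relu(matvec(W1, x)))
    shortcut = x if Ws is None else matvec(Ws, x)
    if len(fx) != len(shortcut):
        raise ValueError("F(x) and the shortcut must have equal dimensions")
    # The second nonlinearity is applied after the addition.
    return relu([f + s for f, s in zip(fx, shortcut)])


x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]

# Identity shortcut: F(x) has the same dimension as x.
W2_same = [[0.5, 0.0], [0.0, 0.5]]
y_identity = residual_block(x, W1, W2_same)

# Projection shortcut: F(x) has 3 entries, so Ws maps x from 2 to 3 entries.
W2_wider = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y_projected = residual_block(x, W1, W2_wider, Ws)

print(y_identity, y_projected)
```

Passing a wider `W2` without `Ws` raises an error, which mirrors the requirement that the two operands of the addition have equal dimensions.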
When dimensions increase, the shortcut either still performs identity mapping with extra zero entries padded for the added dimensions, or uses a projection implemented by 1 by 1 convolutions. In both cases, when the shortcuts go across feature maps of two sizes, they are performed with stride 2.

We implement the ImageNet experiments following standard practices: scale augmentation, random cropping, color augmentation, batch normalization after each convolution and before activation, weight initialization by current best methods, and SGD with a mini-batch size of 256. The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Models are trained for up to six hundred thousand iterations with weight decay and momentum, and without dropout. For testing, we adopt 10-crop evaluation for comparison studies and fully convolutional multi-scale testing for best results.
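The learning-rate rule (start at 0.1, divide by 10 when the error plateaus) can be sketched as below. The text does not say how a plateau is detected, so the `patience` and `min_delta` parameters, the function name, and the error values are all assumptions made for illustration; this is not the authors' procedure.

```python
def plateau_schedule(errors, initial_lr=0.1, factor=10.0,
                     patience=2, min_delta=1e-3):
    """Return the learning rate in effect at each evaluation step.

    The rate is divided by `factor` once the error has failed to improve
    by at least `min_delta` for `patience` consecutive evaluations.
    `patience` and `min_delta` are illustrative assumptions.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    rates = []
    for err in errors:
        rates.append(lr)
        if err < best - min_delta:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr /= factor
                stale = 0
    return rates


# Made-up validation errors: improvement, a plateau, then improvement again.
errors = [0.60, 0.50, 0.45, 0.45, 0.45, 0.40, 0.40, 0.40]
rates = plateau_schedule(errors)
print(rates)
```

With these made-up errors the rate stays at 0.1 for the first five evaluations and drops by a factor of ten once the plateau at 0.45 has lasted two evaluations.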

Experiments on ImageNet and CIFAR-10

## 4. Experiments on ImageNet and CIFAR-10, and Analysis

We evaluate our approach on the ImageNet 2012 classification dataset, which contains one thousand classes. Models are trained on 1.28 million training images and evaluated on 50k validation images, and final numbers are reported on the 100k test images through the test server.

We compare plain and residual networks of varying depths and find that deeper plain networks suffer from the degradation problem, exhibiting higher training and validation error as depth increases. For instance, a 34-layer plain net has higher training error than an 18-layer plain net throughout the whole training procedure, even though the solution space of the 18-layer net is a subspace of that of the 34-layer net. We argue that this optimization difficulty is unlikely to be caused by vanishing gradients, because the plain networks are trained with batch normalization, which ensures non-zero variances of forward propagated signals and healthy norms of backward gradients.

By contrast, residual networks overcome the degradation problem and show lower training error and better validation performance as depth increases. In particular, the 34-layer residual network outperforms its 18-layer residual counterpart and has considerably lower training error than its plain counterpart. We investigate identity versus projection shortcuts and find that identity shortcuts with zero-padding for dimension increase already greatly alleviate the degradation, while projection shortcuts give small additional gains at the cost of extra parameters.

To build very deep networks economically, we adopt bottleneck residual blocks with a stack of three layers consisting of 1 by 1, 3 by 3, and 1 by 1 convolutions, where the 1 by 1 layers reduce and then restore dimensions, leaving the 3 by 3 layer as a bottleneck with smaller input and output dimensions. Using this design we construct 50-, 101-, and 152-layer ResNets that remain computationally cheaper than the VGG nets while delivering significantly better accuracy.
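The saving from the 1 by 1 reduce-and-restore layers can be illustrated by counting convolution weights. The sketch below is plain Python arithmetic; the channel widths (256 at the block boundary, 64 inside the bottleneck) and the function names are assumptions chosen for the example, since the text here does not state them. It compares a bottleneck stack with two full-width 3 by 3 layers on the same feature map.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (biases omitted)."""
    return kernel * kernel * c_in * c_out


def bottleneck_weights(width, reduced):
    """1x1 reduce, then 3x3 at the reduced width, then 1x1 restore."""
    return (conv_weights(1, width, reduced)
            + conv_weights(3, reduced, reduced)
            + conv_weights(1, reduced, width))


def plain_pair_weights(width):
    """Two 3x3 convolutions operating at the full width."""
    return 2 * conv_weights(3, width, width)


# Assumed widths for illustration only.
bottleneck = bottleneck_weights(256, 64)
plain = plain_pair_weights(256)
print(bottleneck, plain, plain / bottleneck)
```

Under these assumed widths the 3 by 3 layer sees only 64 input and output channels, so the three-layer stack uses far fewer weights than two 3 by 3 layers at width 256.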
On ImageNet, the 152-layer ResNet achieves a single-model top-5 validation error of 4.49%, and an ensemble of models attains 3.57% top-5 error on the test set, which won first place in ILSVRC 2015.

We further study CIFAR-10 using simple architectures with stacks of 3 by 3 convolutions and residual shortcuts. Plain CIFAR nets again suffer from degradation with increased depth, while residual versions gain accuracy as depth increases, up to very deep models. A 110-layer residual network converges well and is among the state-of-the-art single-model results on CIFAR-10.

We analyze layer responses by measuring the standard deviations of layer outputs after batch normalization and before the nonlinearities. Residual networks generally have smaller responses than their plain counterparts, which supports the motivation that residual functions are often closer to zero than non-residual functions. We also explore aggressively deep models with over a thousand layers, training a 1202-layer residual network that converges to very low training error. Its test error is nevertheless worse than that of the 110-layer model, possibly due to overfitting on this small dataset in the absence of stronger regularization.

Object Detection, Localization, and Generalization

## 5. Object Detection, Localization, and Generalization

Deep residual networks generalize well to other recognition tasks, including object detection and localization. We adopt Faster R-CNN as the detection framework and replace the VGG-16 backbone with ResNet-101. Full-image shared convolutional feature maps are computed using the layers whose stride on the image is no greater than sixteen pixels, and the later conv5 layers are treated as post-RoI processing, analogous to fully connected layers. For detection fine-tuning, we fix the batch normalization statistics computed on the ImageNet training set, so that BN layers act as constant linear transforms; this is done to reduce memory usage.

On Pascal VOC 2007 and 2012, our ResNet-101 baseline improves mAP by more than three percentage points over VGG-16 using the same detection implementation. On the MS COCO dataset, ResNet-101 yields an increase of 6.0 points in COCO's standard mAP metric, averaged across IoU thresholds from 0.5 to 0.95, a 28 percent relative improvement that is attributed solely to the learned representations.

For competition entries, several additional improvements further increase detection performance: box refinement, which pools features from the regressed boxes and re-scores them; global context, which pools features over the entire image and concatenates them with region features; and multi-scale testing with feature pyramids and maxout merging. With these improvements and ensembling, our COCO results reached state-of-the-art levels and won the detection challenge.

For ImageNet localization, we adopt a per-class regression strategy and design a per-class RPN that produces class-specific proposals and box regressors. Using ResNet-101 based proposal networks and an R-CNN style RoI-centric classifier and regressor, we substantially reduce top-5 localization error, and both our single-model and ensemble results outperform prior work by a wide margin, winning the ImageNet localization task.
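The remark that BN layers with fixed statistics act as constant linear transforms can be made concrete. The following sketch, in plain Python for a single channel, folds fixed BN statistics into one scale and one shift. The parameter names, the value of `eps`, and the numbers are assumptions for illustration, not values from the paper.

```python
import math


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Batch normalization with fixed (precomputed) statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def frozen_bn_affine(gamma, beta, mean, var, eps=1e-5):
    """Fold fixed BN statistics into a per-channel scale and shift.

    With mean and var held constant, BN equals scale * x + shift.
    """
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift


# Made-up per-channel values.
gamma, beta, mean, var = 2.0, 0.5, 1.0, 4.0
scale, shift = frozen_bn_affine(gamma, beta, mean, var)

x = 3.0
folded = scale * x + shift
direct = batch_norm(x, gamma, beta, mean, var)
print(folded, direct)
```

Because `scale` and `shift` do not depend on the current mini-batch, no batch statistics need to be computed or stored during fine-tuning.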
These experimental results demonstrate that deep residual learning not only eases optimization for very deep networks but also yields powerful and general image representations for downstream tasks.