Transcript

Identity Mappings in Deep Residual Networks

This paper analyzes propagation in deep residual networks, showing that identity skip connections and identity after-addition activation enable direct forward and backward signal flow; it then proposes a pre-activation residual unit that eases optimization and improves accuracy, enabling very deep ResNets with strong performance.

Abstract

Welcome to a closer look at a foundational paper in deep learning: Identity Mappings in Deep Residual Networks by Kaiming He and his team at Microsoft Research. This paper builds on their original, groundbreaking work on ResNets, which proved that you could build incredibly deep neural networks by using shortcuts, or skip connections, to help information flow. In this opening section, the authors are looking under the hood to understand exactly why these skip connections work and how they can be optimized to make networks even deeper. To do this, they break down the math of a single residual unit. Imagine information flowing into a block of neural network layers. The data splits into two parallel paths. One path goes through the complex mathematical transformations of the neural network, which the authors call the residual function. The other path is the shortcut, skipping those complex layers entirely. Finally, the results of both paths are added back together. The authors realized that the secret to successfully training massively deep networks is keeping that shortcut path completely clean and uninterrupted. They call this an identity mapping, meaning the input data is passed forward exactly as it is, without any alterations. By analyzing how signals propagate during training, the team discovered a powerful rule. If both the shortcut path and the final combining step are pure identity mappings, information can travel directly between any two blocks in the entire network, both forward and backward. They tested several different types of complicated shortcuts, but found that the absolute simplest one, doing nothing at all to the data on the shortcut path, performed the best. By embracing this incredibly clean shortcut path, the authors propose a slightly modified residual block. 
This new design made training easier and allowed them to successfully train mind-bogglingly deep architectures, including a one thousand and one layer network, while pushing accuracy to new heights.
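In symbols, the two-path picture described above can be sketched as follows. The symbols are labels introduced here for illustration, since the narration does not spell them out: x_l is the input to unit l, F is the residual function with weights W_l, h is whatever happens on the shortcut, and f is whatever happens after the addition.

```latex
\begin{aligned}
y_l     &= h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l), \\
x_{l+1} &= f(y_l).
\end{aligned}
% When both h and f are identity mappings, h(x_l) = x_l and f(y_l) = y_l,
% the unit reduces to a plain addition:
%   x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l).
```

Read this way, "keeping the shortcut clean" means choosing h as the identity, and "a clean combining step" means choosing f as the identity.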

Introduction

In the introduction, the authors share a crucial discovery about how to design skip connections in neural networks. They found that altering these shortcuts by adding extra computations, like scaling or gating, actually leads to higher errors. Instead, the network performs best when the shortcut is kept completely clean, acting as a direct, uninterrupted path for information to flow. To achieve this perfectly clean path, they had to rethink conventional network design. Normally, Batch Normalization and ReLU, which the authors group together under the word activation, are applied after the weight layers. The authors propose flipping this order. By treating these functions as a pre-activation step that happens before the weights, they keep the main highway of information unobstructed. The text breaks this down mathematically into what they call a Residual Unit. You can think of this unit as having two parallel paths that are eventually added together. One path is the residual function, which does the heavy lifting of learning through convolutional weight layers. The other path is the pure shortcut connection, which simply passes the original input forward untouched. This structural tweak might sound small, but it makes the network significantly easier to optimize. Because of this cleaner design, the researchers were able to train networks to unprecedented depths without the model overfitting. They successfully built a massive one thousand and one layer network for the CIFAR dataset and a two hundred layer model for ImageNet, showing that safely increasing network depth is a major key to the success of modern deep learning.

Analysis of Deep Residual Networks

Let us unpack the mathematical engine of Deep Residual Networks described in this section. The authors start by imagining what happens if the function passing information between layers is completely transparent, which they call an identity mapping. When this happens, a beautiful recursive pattern emerges. In a standard, plain neural network, data is multiplied by weights at every single step. But in a residual network, the signals are simply added together. This means the feature output of any deep layer is exactly equal to the output of an earlier layer, plus the sum of all the small adjustments made by the layers in between. You can think of it like a continuous assembly line where each station simply adds a new piece to a moving object, rather than rebuilding it from scratch. This addition-based design creates a massive advantage when it is time to train the network using backward propagation. During training, an error signal needs to travel backward through the layers to update the network's weights. In a standard network, this backward signal has to multiply through every single layer, often shrinking so much that it disappears entirely. But in a residual network, the calculus reveals that the backward gradient naturally splits into two separate parts. One part travels normally through the weight layers, while the other is a pure error signal that flows directly backward along the skip connections, completely untouched by the weights. Because this pure signal bypasses the weight layers entirely, it guarantees that information will successfully reach the shallower, earlier layers. The authors point out that it is mathematically highly unlikely for these two parts to cancel each other out. This means the training signal will not vanish, even if the weights themselves become arbitrarily small. Ultimately, the network forms an uninterrupted highway where signals can flow directly from any unit to another, both forward and backward. 
This entire phenomenon rests on one foundational idea: keeping the skip connections as pure identity mappings so the signal is never distorted on its journey.
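The additive forward pattern and the two-part backward gradient described above can be checked numerically on a toy chain. This is a minimal sketch, not the authors' code: each unit is reduced to a single scalar, and the residual function `residual`, the weights, and the helper names are all made up for illustration.

```python
import math

def residual(x, w):
    """Toy residual function F(x, w): one scalar weight and a tanh nonlinearity."""
    return w * math.tanh(x)

def run(x0, weights):
    """Apply x_{l+1} = x_l + F(x_l, w_l); return (final x, sum of all residuals)."""
    x, total = x0, 0.0
    for w in weights:
        r = residual(x, w)
        total += r
        x += r
    return x, total

def derivative(fn, x, eps=1e-6):
    """Central finite difference of a scalar function."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

def plain_gradient(x0, weights):
    """Gradient of a plain chain x_{l+1} = F(x_l, w_l): a pure product of per-layer terms."""
    x, g = x0, 1.0
    for w in weights:
        g *= w * (1.0 - math.tanh(x) ** 2)
        x = residual(x, w)
    return g

weights = [0.05, -0.02, 0.03, 0.01, -0.04]  # deliberately small weights
x0 = 0.7
x_last, residual_sum = run(x0, weights)

# Forward: the deep feature is the shallow feature plus the sum of the residuals.
forward_gap = abs(x_last - (x0 + residual_sum))

# Backward: d x_L / d x_0 = 1 + d(sum of residuals) / d x_0.
d_output = derivative(lambda x: run(x, weights)[0], x0)
d_residuals = derivative(lambda x: run(x, weights)[1], x0)
backward_gap = abs(d_output - (1.0 + d_residuals))

print("residual-chain gradient:", d_output)
print("plain-chain gradient:   ", plain_gradient(x0, weights))
```

With small weights, the residual chain keeps a gradient near one thanks to the constant term contributed by the shortcuts, while the plain chain's gradient is a product of small numbers and collapses toward zero.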

Discussions

The authors begin by focusing on the idea of a clean information path. For the network to pass information optimally, the shortcut connections need to be free of extra operations, doing nothing more than a simple addition. There are a few necessary exceptions, such as when the network needs to reduce the size of a feature map or change its dimensions. However, because these resizing steps happen so rarely across the entire network, they do not significantly disrupt the overall flow of information. To show why this clean path is so important, the authors propose a hypothetical scenario. They ask what would happen if we modified the clean shortcut by multiplying the passing data by a simple scaling factor, which they call lambda. Instead of passing the data exactly as it is, this modification scales the information slightly up or slightly down at every single unit. When you apply this across an extremely deep network, a major problem emerges during backpropagation, which is the process the network uses to learn. Because the chain rule multiplies together the contribution of every unit the signal crosses, that scaling factor gets multiplied by itself once per unit. If the factor is even slightly greater than one, the part of the learning signal carried by the shortcuts grows exponentially with depth and can explode. If it is less than one, that part shrinks exponentially and effectively vanishes. When the signal vanishes, the shortcut is effectively blocked. The network is then forced to push all its information through the complex weight layers instead, which brings back the exact training difficulties these shortcut connections were meant to solve. The authors note that this problem isn't limited to simple scaling numbers. If you add more complicated operations to the shortcut, like gating mechanisms or small convolutions, their derivatives will also multiply together, which can block the flow of information and hamper the training process. 
Therefore, keeping the skip connection as a pure, unmodified identity mapping is crucial.
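The compounding effect of a scaled shortcut is easy to see with a few lines of arithmetic. This is a small sketch under assumptions: the helper name `shortcut_gain`, the depth of 50 units, and the sample lambda values are all chosen here for illustration.

```python
def shortcut_gain(lam, num_units):
    """Factor applied to a signal that crosses `num_units` shortcuts, each scaled by `lam`."""
    gain = 1.0
    for _ in range(num_units):
        gain *= lam
    return gain

NUM_UNITS = 50  # arbitrary depth, assumed for illustration

gains = {lam: shortcut_gain(lam, NUM_UNITS) for lam in (0.5, 0.9, 1.0, 1.1)}
for lam, gain in gains.items():
    print(f"lambda={lam}: gain over {NUM_UNITS} units = {gain:.3e}")
```

Only a factor of exactly one leaves the signal unchanged at every depth, which is the identity mapping the authors argue for.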

On the Importance of Identity Skip Connections

To understand the true value of skip connections, the authors set up an experiment using a 110-layer ResNet on the CIFAR-10 image classification dataset. This is an extremely deep network, making it a perfect, challenging candidate to study optimization. The unmodified baseline model achieves a test error of 6.61 percent. To ensure their results are reliable and not just a result of random chance, they run each architecture five times and report the median accuracy. In this first set of tests, they want to see what happens if they interfere with the shortcut connections. They try a technique called constant scaling, where they multiply the signal passing through the shortcut by 0.5, effectively halving its strength. Alongside this, they also test variations on the main processing path, referred to as F. They try either leaving this main path unscaled, or scaling it down by half to match the shortcut. The results show exactly why tampering with the shortcut is a bad idea. When the shortcut is scaled down but the main path is left alone, the network struggles to even converge during training. When both paths are scaled by half, the model does eventually converge, but the test error nearly doubles to 12.35 percent. Most importantly, the authors note that the training error itself is noticeably higher than the baseline model. This reveals a critical insight about how these networks learn. Scaling down the shortcut signal directly interferes with the optimization process. ResNets rely on that unmodified, direct path to pass information and error gradients smoothly across dozens of layers. When you reduce that signal, you make the entire network significantly harder to train.

Experiments on Skip Connections

In this section, the authors put the standard identity skip connection to the test by experimenting with a few modifications. The overarching question is whether adding complexity to the shortcut path can improve the network, or if a completely clean, unaltered path is actually the best approach. They start by testing a gating mechanism, inspired by a different architecture known as Highway Networks. In an exclusive gating setup, a mathematical gate dynamically controls the flow of information like a valve. The more signal it routes through the main convolutional path, the less it allows down the shortcut path, and vice versa. However, the authors found that even with carefully tuned settings, this setup lagged far behind a standard ResNet. The core issue is a mathematical catch-22. If the gate leaves the shortcut wide open to help the signal flow easily through the deep network, it simultaneously chokes off the main path, preventing the network from learning the necessary features. To test this further, they tried gating only the shortcut path without restricting the main path, but the network still yielded poor results. Next, the researchers tried replacing the clean identity shortcut with a 1 by 1 convolutional layer. While earlier studies showed this could be useful on a shallower 34-layer network, it failed on a much deeper 110-layer model. By forcing the signal to pass through dozens of additional convolutional layers along the shortcut route, the network effectively bogged down the signal propagation. Finally, they experimented with applying Dropout directly to the shortcut, which randomly drops parts of the signal during training to prevent overfitting. This also caused the network to fail. Because a 50 percent dropout rate statistically cuts the overall signal strength in half, it essentially acts as a roadblock. 
Ultimately, every single attempt to manipulate, scale, or gate the shortcut path simply impeded the flow of information and increased the training error.
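The catch-22 of exclusive gating can be sketched with a single scalar gate. This is a simplified illustration, not the experimental setup: the gate is reduced to a sigmoid of a bias term, and the names `gate_scales` and `gated_unit` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_scales(gate_bias, gate_input=0.0, gate_weight=1.0):
    """Return (shortcut_scale, residual_scale) = (1 - g, g) for one gate value g."""
    g = sigmoid(gate_weight * gate_input + gate_bias)
    return 1.0 - g, g

def gated_unit(x, residual_out, gate_bias):
    """Exclusive gating: output = (1 - g) * x + g * F(x)."""
    shortcut_scale, residual_scale = gate_scales(gate_bias)
    return shortcut_scale * x + residual_scale * residual_out

for bias in (-6.0, 0.0, 6.0):
    shortcut_scale, residual_scale = gate_scales(bias)
    print(f"bias={bias:+.1f}  shortcut x{shortcut_scale:.3f}  residual x{residual_scale:.3f}")
```

Because the two scales always sum to one, a strongly negative bias opens the shortcut almost fully but nearly silences the residual path, and a strongly positive bias does the opposite.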

On the Usage of Activation Functions

The authors begin by clarifying a crucial point about shortcut connections: simpler is often better. You might assume that adding trainable features to the shortcut, like gating mechanisms or one-by-one convolutions, would make the network smarter. Theoretically, these complex shortcuts have more representational power. But in practice, they act like traffic jams. They interfere with the direct flow of information through the network and cause optimization problems. Because of this, the cleanest path—a pure identity shortcut—is actually the most effective. Knowing that an untouched shortcut is ideal, the authors identify a minor bottleneck in the original ResNet architecture. In the original design, after the shortcut and the main residual path are added together, the combined result goes through a ReLU activation function. This means the shortcut isn't a completely unhindered path, because its signal is altered at the very end of every block. To fix this, the authors decide to rearrange the activation functions, which include ReLU and Batch Normalization, to see if they can create a truly pure identity mapping. They run a few experiments to test different arrangements. First, they try putting Batch Normalization after the addition step, but this performs poorly because it actively distorts the shortcut signal and impedes information flow. Next, they try moving the ReLU activation to just before the addition. However, ReLU forces all outputs to be non-negative. A true residual function needs the flexibility to make negative adjustments, conceptually taking values ranging from negative to positive infinity. Because moving the ReLU forces the signal to only ever increase, it hurts the model's ability to learn. These experiments lead to a highly effective solution known as the pre-activation design. Instead of applying the activation function at the end of the block where it affects the merged signal, the authors move it to the very beginning of the residual branch. 
In this asymmetric setup, the activation only impacts the complex transformation path. The shortcut path is entirely bypassed and left completely untouched. This subtle rearrangement successfully creates a pure identity mapping for the shortcut, solving the optimization issues while preserving the network's full learning capacity.
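The point about ReLU before the addition can be illustrated with a scalar toy. This sketch makes strong simplifications that are assumptions of the example: each weight layer is a single number, Batch Normalization is left out, and the function names are invented here.

```python
def relu(v):
    return max(0.0, v)

def branch_relu_before_addition(x, w):
    """Weight first, ReLU as the last step of the branch: the output is never negative."""
    return relu(w * x)

def branch_pre_activation(x, w):
    """ReLU first, then weight: the output can take either sign."""
    return w * relu(x)

def run(branch, x0, weights):
    """Stack units of the form x_{l+1} = x_l + branch(x_l, w_l); return every x."""
    xs = [x0]
    for w in weights:
        xs.append(xs[-1] + branch(xs[-1], w))
    return xs

weights = [0.5, -0.8, 0.3, -0.6]  # made-up weights, two of them negative
relu_last = run(branch_relu_before_addition, 1.0, weights)
pre_act = run(branch_pre_activation, 1.0, weights)

print("ReLU before addition:", relu_last)
print("pre-activation:      ", pre_act)
```

With ReLU as the final step of the branch, the running signal can only stay flat or grow. With the activation moved to the start of the branch, negative weights can pull the signal back down.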

Experiments on Activation

Let's explore where exactly to place the activation functions, specifically Batch Normalization and ReLU, in a residual network. In a simple, straight-line neural network, it doesn't really matter if you think of these activation steps as happening before or after the weight layers. But in a ResNet, where the flow of data splits into branches and merges back together through addition, the exact placement of these functions becomes critical. The researchers tested two new arrangements. First, they tried moving only the ReLU activation to before the weight layers, which they called ReLU-only pre-activation. This didn't change performance much compared with the baseline, likely because the ReLU was no longer paired with Batch Normalization, meaning it missed out on the benefits of that step. However, when they moved both Batch Normalization and ReLU to sit right before the weights, creating a full pre-activation setup, the results improved. This held true across a variety of network depths, including a one thousand and one layer architecture. So why does full pre-activation work so well? The authors highlight two main reasons. First, applying Batch Normalization before the weights acts as a better regularizer for the model. Second, and perhaps more importantly, it makes optimizing the network much easier because it keeps the shortcut path completely clear. In the original ResNet design, a ReLU activation was applied right after the shortcut and the main path merged. Because a ReLU function turns any negative number into a zero, it could truncate the signal flowing through the network whenever the merged value was negative. While the network can adjust its weights to mostly avoid this in shallower models, this truncation noticeably slows down learning at the beginning of training in a one thousand and one layer network. By moving the activations before the weights, the shortcut path becomes a pure, unobstructed identity mapping. 
Signals can travel cleanly from any one unit to another without being zeroed out, allowing even extremely deep networks to learn rapidly right from the start.
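The truncation described above shows up even in a scalar toy. This is a hedged sketch rather than either real architecture: Batch Normalization is omitted, each unit has one made-up weight `w`, and the weight is set to zero so that only the shortcut carries the signal.

```python
def relu(v):
    return max(0.0, v)

def original_unit(x, w):
    """Original ordering (BN omitted): weight, add the shortcut, then ReLU on the merged signal."""
    return relu(x + w * x)

def pre_activation_unit(x, w):
    """Pre-activation ordering (BN omitted): ReLU, weight, then add to an untouched shortcut."""
    return x + w * relu(x)

def stack(unit, x, w, depth):
    """Pass x through `depth` copies of the same unit."""
    for _ in range(depth):
        x = unit(x, w)
    return x

x_negative = -2.0
after_original = stack(original_unit, x_negative, w=0.0, depth=10)
after_pre_act = stack(pre_activation_unit, x_negative, w=0.0, depth=10)

print("original ordering:      ", after_original)
print("pre-activation ordering:", after_pre_act)
```

In the original ordering the negative input is zeroed by the very first unit and never recovers, while the pre-activation ordering hands it through all ten units unchanged.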

Analysis

The authors now analyze the broader impacts of their new pre-activation design, starting with how it reduces overfitting. To understand why, we have to look at how Batch Normalization interacts with the network's shortcut connections. In the original residual design, a signal was normalized, but then immediately added to the shortcut path. Because of this addition, the final merged signal entering the next weight layer was no longer properly normalized. By moving Batch Normalization to the very beginning of the block in the new pre-activation design, the inputs to all weight layers stay consistently normalized. The authors argue that this regularizes the network, leading to better performance on new, unseen test data. This improved generalization shows up in their results on the CIFAR image datasets. The researchers achieved highly competitive results without needing to manually tune network widths or filter sizes, or rely on common anti-overfitting tricks like dropout. Instead, their success came from a simple but powerful concept: the new pre-activation structure allowed them to build much deeper networks without the training process breaking down. Finally, they put this to the test on the much larger ImageNet dataset. To ensure a fair comparison with other state-of-the-art models, they tested using a larger 320 by 320 image crop size. The most striking result came from the very deep ResNet-200. The original version of ResNet-200 actually performed worse than the shallower ResNet-152, which the authors attribute to overfitting on the training data. The new pre-activation version of ResNet-200 largely removed this problem. It brought the top-1 error rate down to 20.7 percent, beating the shallower baseline models and suggesting that this minor structural tweak helps unlock the potential of extreme network depth.
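The claim that a normalized signal stops being normalized once it is added to the shortcut can be checked on a handful of numbers. This is only a sketch: `normalize` stands in for Batch Normalization without its learned scale and shift, and both feature lists are made-up values.

```python
import statistics

def normalize(values):
    """Shift and scale to zero mean and unit population variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

shortcut = [0.5, 2.0, -1.0, 3.5, 1.0, -0.5]  # made-up features arriving on the shortcut
branch = [0.2, -0.4, 0.9, 0.1, -0.7, 0.3]    # made-up residual-branch outputs

# Original ordering: the branch output is normalized, then merged with the shortcut.
merged = [s + b for s, b in zip(shortcut, normalize(branch))]
merged_variance = statistics.pvariance(merged)

# Pre-activation ordering: the merged signal is normalized again before the next weights.
next_input = normalize(merged)
next_input_variance = statistics.pvariance(next_input)

print("variance entering next weights, original ordering:", merged_variance)
print("variance entering next weights, pre-activation:   ", next_input_variance)
```

In the first case the next weight layer sees a signal whose statistics depend on whatever the shortcut carried. In the second case it always sees a unit-variance input.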

Conclusions

This final section wraps up the paper by summarizing its core discovery: for deep residual networks to perform at their absolute best, information needs a clear, unimpeded path. Through mathematical derivation and careful experiments, the authors proved that keeping shortcut connections pure, without any extra activations or modifications, is essential. By moving the activation functions to a pre-activation position, the network allows data to flow smoothly from end to end. This breakthrough is what made it possible to train incredibly deep networks, up to one thousand layers, that are not only easy to train but also highly accurate. After the conclusion, the text transitions into an appendix detailing the exact implementation steps used for the experiments. Think of this as the recipe for reproducing their work. For both the CIFAR and ImageNet datasets, the authors largely rely on standard training routines across multiple GPUs, such as gradually dropping the learning rate over time to fine-tune the model. One interesting note from the CIFAR experiments is the mention of a learning rate warm up. Usually, researchers might start training with a very small learning rate for a few hundred iterations to prevent the model from becoming unstable early on. However, the authors proudly note that because their newly proposed residual unit is so naturally stable, this cautious warm up phase is no longer strictly necessary. The appendix also provides crucial guidance on how to handle the very beginning and the very end of this new network design. Because the new design relies on pre-activation—meaning the activation functions happen before the weights rather than after—the edges of the network need special treatment. At the very beginning, an activation is applied right after the first convolution but before the network splits into its residual paths. At the very end, before the final pooling and classification steps, one extra activation is added. 
Finally, the authors clarify that if you are using narrower bottleneck units to save on computing power, any shortcuts used to resize the data should also follow this same pre-activation rule, ensuring the whole network speaks the exact same architectural language.
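The training recipe mentioned above, a learning rate that drops in steps with an optional warm-up at the start, can be sketched as a small function. Every numeric default below (base rate, milestones, warm-up length, warm-up rate) is an illustrative assumption rather than a value quoted in this narration, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(iteration, base_lr=0.1, drop_at=(32_000, 48_000), drop_factor=0.1,
                  warmup_iters=0, warmup_lr=0.01):
    """Step schedule: optional constant warm-up, then multiply by drop_factor at each milestone."""
    if iteration < warmup_iters:
        return warmup_lr
    drops = sum(1 for milestone in drop_at if iteration >= milestone)
    return base_lr * drop_factor ** drops

# Without warm-up (the default) training starts directly at the base rate.
print(learning_rate(0))
# With a warm-up of a few hundred iterations the first steps use the smaller rate.
print(learning_rate(0, warmup_iters=400))
```

Leaving `warmup_iters` at zero corresponds to the authors' remark that the warm-up phase is not strictly necessary for the new unit.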