Normalization Layers

Normalization can reduce training time by rescaling feature values into a common, ‘normalized’ range, which helps the network learn from all features rather than only those with the largest values. According to “Normalization Techniques in Deep Learning Networks” by Aakash Bindal, normalization has the following benefits:

1.) Feature normalization – placing feature values into a more normalized range helps each feature contribute to learning in a more balanced manner.

2.) Reduces Internal Covariate Shift – reduces the “change in distribution of activations due to the change in network parameters during training.”

3.) Smoother Optimization Landscape – Batch Normalization “makes the landscape of the corresponding optimization problem significantly more smooth.”[1]

4.) Speeds up Optimization – weights are more constrained and do not explode as often.

5.) Slight Regularization Improvement – an unintended and very slight side benefit.
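The first benefit in the list above, feature normalization, can be illustrated with a minimal sketch. It assumes plain z-score standardization (subtract the mean, divide by the standard deviation), which is only one of several possible ways to normalize features; the function name `standardize_columns` and the sample data are hypothetical and are not part of MyCaffe.

```python
import statistics

def standardize_columns(rows):
    """Rescale each feature (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical data: feature 0 is in the thousands, feature 1 is below 1.
samples = [[1000.0, 0.1], [2000.0, 0.2], [3000.0, 0.3]]
scaled = standardize_columns(samples)
```

After standardization both features span roughly the same range, so neither one dominates the other purely because of its original scale.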

According to “Difference between Local Response Normalization and Batch Normalization” by Aqeel Anwar, normalization is important as it compensates for the “unbounded nature of certain activation functions such as ReLU and ELU,” which can cause the model gradients to “grow as high as the training allows.”

Normalization Layers

MyCaffe supports the following normalization layers, which are usually placed after each convolution or pooling layer in a network model.

BatchNormLayer; Batch Normalization is trainable and is used to address “the issues of Internal Covariate Shift (ICS),” which “arises due to the changing distribution of the hidden neurons/activation.”[2] BatchNorm is one of the more popular normalization techniques.[3]
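As a rough illustration of the idea, the sketch below applies the training-time batch normalization formula to a single feature across a mini-batch. It follows the general formulation of the technique rather than MyCaffe's BatchNormLayer implementation: the function name and the `gamma`, `beta`, and `eps` parameters are assumptions, and the running statistics used at inference time are omitted.

```python
import math

def batch_norm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance of the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # gamma and beta stand in for the trainable scale and shift parameters.
    return [gamma * (x - mean) * inv_std + beta for x in batch]

# Hypothetical activations of one hidden unit over a mini-batch of four.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm_forward(activations)
```

Because the mean and variance are recomputed for every mini-batch, the distribution seen by the next layer stays roughly stable even as earlier weights change.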

LRNLayer; Local Response Normalization is non-trainable and performs a square-normalization of “the pixel values in a feature map within a local neighborhood.”[2]
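A minimal sketch of the across-channel form of this normalization is shown below, applied to the channel values at a single pixel location. It assumes the commonly used formulation in which each value is divided by (k + (alpha / n) * sum of squares over the neighborhood) raised to the power beta; the function name, parameter names, and default values are illustrative assumptions and are not taken from MyCaffe.

```python
def lrn_across_channels(values, local_size=5, alpha=1e-4, beta=0.75, k=1.0):
    """Apply across-channel LRN to the channel values at one pixel location."""
    half = local_size // 2
    out = []
    for i, v in enumerate(values):
        # Neighborhood of channels centered on channel i, clipped at the edges.
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        sum_sq = sum(x * x for x in values[lo:hi])
        scale = (k + (alpha / local_size) * sum_sq) ** beta
        out.append(v / scale)
    return out

# Hypothetical activations of six channels at the same pixel.
pixel_channels = [1.0, 50.0, 2.0, 0.5, 3.0, 1.5]
damped = lrn_across_channels(pixel_channels)
```

There are no learned parameters here: the output depends only on the input values and the fixed hyperparameters, which is why the layer is described as non-trainable.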

MVNLayer; Mean-Variance Normalization normalizes the input to have a mean of 0 and a variance of 1 (unit variance). This type of normalization has been used in speech recognition problems.[4][5]
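The computation can be sketched in a few lines. The example below assumes the statistics are taken over a flat list of values from a single input; the function name `mvn`, the `normalize_variance` switch, and the `eps` term are illustrative assumptions, and the axes that the MyCaffe layer actually normalizes over are not specified here.

```python
import math

def mvn(values, normalize_variance=True, eps=1e-9):
    """Shift values to zero mean and, optionally, scale them to unit variance."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    if not normalize_variance:
        return centered
    var = sum(c * c for c in centered) / len(values)
    # eps guards against division by zero for constant inputs.
    return [c / (math.sqrt(var) + eps) for c in centered]

# Hypothetical input values.
frame = [0.2, 1.4, -0.6, 3.0]
normalized_frame = mvn(frame)
```

Unlike the batch normalization sketch, there are no scale or shift parameters to learn; the output is fully determined by the input.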

Normalization1Layer; the Normalization1 performs an L2 normalization over the input data.

GRNLayer; the Global Response Normalization also performs an L2 normalization over the input data across each channel.
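Both of the last two layers are described in terms of L2 normalization, which divides a set of values by its Euclidean length. The sketch below shows that operation on a single vector only; the function name and the `eps` guard are assumptions, and the grouping of values used by Normalization1Layer or GRNLayer (for example, per channel) is not modeled.

```python
import math

def l2_normalize(values, eps=1e-12):
    """Scale a vector so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(v * v for v in values))
    # eps avoids dividing by zero when every value is 0.
    return [v / max(norm, eps) for v in values]

# Hypothetical input vector with L2 norm 5.
unit = l2_normalize([3.0, 4.0])
```

The direction of the vector is preserved while its magnitude is fixed, so downstream layers see inputs of a bounded size.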

[1] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry, How Does Batch Normalization Help Optimization?, 2018, arXiv:1805.11604

[2] Aqeel Anwar, Difference between Local Response Normalization and Batch Normalization, 2019, Towards Data Science

[3] Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning, 2021, arXiv:2106.05956

[4] Xu Li, Shansong Liu, Ying Shan, A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion, 2022, arXiv:2206.13762

[5] Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka, Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation, 2022, arXiv:2204.03232
