Implementing batch normalization in Python

We have already discussed the batch normalization algorithm and even implemented a batch normalization layer for our library in MQL5. Additionally, we have added the capability to use multithreading with OpenCL. Now let's look at the implementation of this method offered by the familiar Keras library for TensorFlow.

This library provides the tf.keras.layers.BatchNormalization layer.

tf.keras.layers.BatchNormalization(
    axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True,
    beta_initializer='zeros', gamma_initializer='ones',
    moving_mean_initializer='zeros',
    moving_variance_initializer='ones', beta_regularizer=None,
    gamma_regularizer=None, beta_constraint=None, gamma_constraint=None, **kwargs
)

Batch normalization applies a transformation that keeps the mean of the outputs close to zero and their standard deviation close to one.

It's important to note that the batch normalization layer behaves differently during training and during inference.

During training, the layer normalizes its output using the mean and standard deviation of the current batch of input data. For each channel being normalized, the layer returns:

gamma * (batch - mean(batch)) / sqrt(var(batch) + epsilon) + beta

where:

  • epsilon — small constant (configured as part of the constructor arguments) to avoid division by zero
  • gamma — trainable scale factor (initialized as 1), which can be disabled by setting scale=False in the object constructor
  • beta — trainable offset factor (initialized as 0), which can be disabled by setting center=False in the object constructor
  • batch — tensor of the current batch of input data
  • mean(batch) — mean of the current batch
  • var(batch) — variance of the current batch
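The training-time computation above can be sketched in plain Python. This is a simplified single-channel illustration, not the Keras implementation itself, which operates on tensors channel by channel:

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, epsilon=0.001):
    """Training-mode batch normalization for a 1-D list of values.

    Implements gamma * (x - mean) / sqrt(var + epsilon) + beta,
    using the mean and variance of the current batch.
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance, as used by batch normalization
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + epsilon) + beta
            for x in batch]

normalized = batch_norm_train([2.0, 4.0, 6.0, 8.0])
# With gamma=1 and beta=0, the output has mean ~0 and variance ~1
# (slightly below 1 because of the epsilon term in the denominator)
```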

During inference, the layer normalizes its output using a moving average of the mean and standard deviation of the batches it encountered during training.

Thus, the layer will normalize the input data during inference only after being trained on data with similar statistical characteristics.
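The moving statistics mentioned above are accumulated during training with an exponential moving average controlled by the momentum argument, and then used in place of the batch statistics at inference time. A sketch of this mechanism, again simplified to scalar statistics:

```python
import math

def update_moving(moving, batch_stat, momentum=0.99):
    """One moving-statistic update of the form used by Keras:
    moving = moving * momentum + batch_stat * (1 - momentum)."""
    return moving * momentum + batch_stat * (1 - momentum)

def batch_norm_infer(batch, moving_mean, moving_var,
                     gamma=1.0, beta=0.0, epsilon=0.001):
    """Inference-mode normalization: uses the accumulated moving
    statistics instead of the current batch's statistics."""
    return [gamma * (x - moving_mean) / math.sqrt(moving_var + epsilon) + beta
            for x in batch]

# Accumulate moving statistics over simulated training batches
moving_mean, moving_var = 0.0, 1.0   # Keras defaults: zeros / ones
for _ in range(1000):
    moving_mean = update_moving(moving_mean, 5.0)  # each batch has mean 5
    moving_var = update_moving(moving_var, 4.0)    # and variance 4
# moving_mean -> ~5.0, moving_var -> ~4.0
```

This also illustrates the point above: the moving statistics only become useful after the layer has seen enough training batches with similar statistical characteristics.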

You can pass the following arguments to the layer constructor:

  • axis — integer, the axis to be normalized (usually a feature axis)
  • momentum — momentum for the moving average
  • epsilon — small constant to avoid division by zero error
  • center — if True, add the beta offset to the normalized tensor; if False, beta is ignored
  • scale — if True, multiply by gamma; if False, gamma is not used; when the next layer is linear, scaling can be disabled because it will be performed by that layer
  • beta_initializer — beta weight initializer type
  • gamma_initializer — type of gamma weight initializer
  • moving_mean_initializer — type of initializer for moving average
  • moving_variance_initializer — type of initializer for the moving variance
  • beta_regularizer — additional regularizer for beta weight
  • gamma_regularizer — additional regularizer for gamma weight
  • beta_constraint — optional constraint for beta weight
  • gamma_constraint — optional constraint for gamma weight

The following parameters can be used when calling the layer:

  • inputs — input data tensor; a tensor of any rank is allowed
  • training — Boolean flag indicating the layer's mode of operation: training or inference (the difference between the modes is described above)
  • input_shape — describes the dimensionality of the input data when the layer is the first in the model

The layer outputs a tensor of results with the same dimensions as the input data.
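The effect of the axis argument and the shape preservation can be sketched with NumPy. With the default axis=-1, statistics are computed per feature channel over all the other axes, so each channel is normalized independently (a simplified sketch without gamma and beta):

```python
import numpy as np

def batch_norm_axis_last(x, epsilon=0.001):
    """Normalize over all axes except the last one (axis=-1 semantics):
    each feature channel gets its own mean and variance.
    Output shape matches input shape."""
    reduce_axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

x = np.random.default_rng(0).normal(size=(32, 4))  # 32 samples, 4 features
y = batch_norm_axis_last(x)
# y has the same shape as x; each of the 4 columns now has mean ~0, var ~1
```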

In addition, the layer supports the layer.trainable attribute, which freezes the layer's parameters so that they are not updated during training. Freezing is optional and is not the same as running the layer in inference mode, which is normally controlled by the training argument passed when the layer is called. Note that "frozen parameters" and "inference mode" are two different concepts.

However, in the case of the BatchNormalization layer, setting trainable = False means that the layer will subsequently run in inference mode: it will use the moving mean and moving variance to normalize the current batch, instead of the mean and variance of the batch itself.

This behavior was introduced in TensorFlow 2.0 so that layer.trainable = False produces the behavior most commonly expected when fine-tuning a model.
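The interaction between trainable and the training call argument can be modeled with a toy class. This is an illustrative sketch, not the real Keras class; it only shows which statistics are selected in each case:

```python
import math

class ToyBatchNorm:
    """Toy model of how `trainable` interacts with the `training`
    call argument for a batch normalization layer."""

    def __init__(self, epsilon=0.001):
        self.trainable = True
        self.epsilon = epsilon
        self.moving_mean, self.moving_var = 0.0, 1.0  # zeros / ones defaults

    def __call__(self, batch, training=False):
        # For batch normalization, trainable=False forces inference
        # behavior even when the layer is called with training=True
        if training and self.trainable:
            n = len(batch)
            mean = sum(batch) / n
            var = sum((x - mean) ** 2 for x in batch) / n
        else:
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.epsilon) for x in batch]

layer = ToyBatchNorm()
out_train = layer([10.0, 12.0], training=True)   # batch statistics: mean ~0
layer.trainable = False
out_frozen = layer([10.0, 12.0], training=True)  # moving statistics: large values
```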

Note that setting trainable on a model containing other layers recursively sets the trainable value of all inner layers.

If the value of the trainable attribute changes after a model has been compiled, the new value does not take effect for that model until the model is compiled again.