Chapter 9 – Back Propagation#

Data Science and Machine Learning for Geoscientists

The ultimate goal of a neural network, don’t forget, is to find the best weights and biases. So once we obtain the prediction \(\hat{y}\), we need to compare it with the actual result set \(y\) and adjust the weight and bias matrices \(W\) accordingly.

The thought process is identical to the two-layer neural network we introduced before. However, we have three layers in this case, and potentially many more. We need to adjust and update the weights in every layer, instead of just one layer as in the two-layer example.

So we need to obtain the gradient of the cost function in order to update the weights. Let’s take as an example the first weight in the input layer in figure 8.1 of chapter 8. We need a longer chain rule to obtain the gradient because we have one more layer:

(55)#\[ \frac{\delta C}{\delta w^{(1)}_{11}} = \frac{\delta C}{\delta\hat{y}}\frac{\delta\hat{y}}{\delta h_2} \frac{\delta h_2}{\delta h_1} \frac{\delta h_1}{\delta w^{(1)}_{11}} \label{real chain rule} \]

Note that the gradient of a weight is evaluated as a sum over all data points/students (\(y_i\)), as explained in the previous example.

(56)#\[ -\frac{\delta C}{\delta w^{(1)}_{11}} = \frac{1}{m} \frac{\delta}{\delta w^{(1)}_{11}} \sum_{i=1}^m [y_i*\ln(\sigma(h_2))+(1-y_i)*\ln(1-\sigma(h_2))] \]
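As a concrete illustration of the quantity being differentiated in equation (56), here is a minimal NumPy sketch of the cross-entropy cost averaged over \(m\) samples with an explicit for loop; the labels and output pre-activations are made-up values used purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical labels y_i and output pre-activations h_2 for m = 4 samples
y  = np.array([1.0, 0.0, 1.0, 1.0])
h2 = np.array([0.8, -0.5, 1.2, 0.1])

y_hat = sigmoid(h2)

# Cross-entropy cost summed over the samples and divided by m, as in equation (56)
m = len(y)
C = 0.0
for i in range(m):
    C += y[i] * np.log(y_hat[i]) + (1 - y[i]) * np.log(1 - y_hat[i])
C = -C / m
print(C)
```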

For simplicity, we just derive the gradient of weight \(w^{(1)}_{11}\) for a single \(y\), and then use a for loop to sum over the samples and divide by \(m\).

(57)#\[ -\frac{\delta C}{\delta w^{(1)}_{11}} = \frac{\delta}{\delta w^{(1)}_{11}} [y*\ln(\sigma(h_2))+(1-y)*\ln(1-\sigma(h_2))] \]

Again, \(y\) is just a constant, so we can move it outside of the derivative.

(58)#\[ -\frac{\delta C}{\delta w^{(1)}_{11}} = y*\frac{\delta}{\delta w^{(1)}_{11}} \ln(\sigma(h_2))+ (1-y)*\frac{\delta}{\delta w^{(1)}_{11}}\ln(1-\sigma(h_2)) \]

where \(\sigma(h_2) = \hat{y}\).

We also know that

(59)#\[ h_2 = \sigma(h_1)*w^{(2)}_{11} + \sigma(h^{'}_{1})*w^{(2)}_{21} + 1*w^{(2)}_{31} \]

where \(h^{'}_{1}\) is the pre-activation of the second hidden neuron, and

(60)#\[ h_1 = w^{(1)}_{11}*x_1 + w^{(1)}_{21}*x_2 + w^{(1)}_{31}*x_3 + w^{(1)}_{41}*1 \]
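To make the notation concrete, here is a minimal NumPy sketch of this forward pass for a single sample, assuming the layout of figure 8.1 (three inputs plus a bias feeding two hidden neurons plus a bias, which feed one output). All numerical values, and the weights \(w^{(1)}_{12}, \dots, w^{(1)}_{42}\) of the second hidden neuron, are made-up assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hypothetical sample with three features
x1, x2, x3 = 0.5, -1.2, 0.8

# Input -> hidden weights: the first set feeds h_1 (equation (60)),
# the assumed second set (w1_12, ...) feeds the second hidden neuron h'_1
w1_11, w1_21, w1_31, w1_41 = 0.1, -0.2, 0.3, 0.05
w1_12, w1_22, w1_32, w1_42 = 0.2, 0.1, -0.1, 0.0

# Hidden -> output weights (equation (59)); w2_31 multiplies the bias input 1
w2_11, w2_21, w2_31 = 0.4, -0.3, 0.1

# Forward pass
h1  = w1_11 * x1 + w1_21 * x2 + w1_31 * x3 + w1_41 * 1        # equation (60)
h1p = w1_12 * x1 + w1_22 * x2 + w1_32 * x3 + w1_42 * 1        # second hidden neuron
h2  = sigmoid(h1) * w2_11 + sigmoid(h1p) * w2_21 + 1 * w2_31  # equation (59)
y_hat = sigmoid(h2)
print(y_hat)
```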

So, for the parts that can be combined into the chain rule:

(61)#\[ \frac{\delta C}{\delta \hat{y}} = \frac{1-y}{1-\hat{y}} - \frac{y}{\hat{y}} \]
(62)#\[ \frac{\delta \hat{y}}{\delta h_2} = \sigma^{'}(h_2)= \sigma(h_2)[1-\sigma(h_2)] \]
(63)#\[ \frac{\delta h_2}{\delta h_1} = w^{(2)}_{11}\sigma^{'}(h_1) \]
(64)#\[ \frac{\delta h_1}{\delta w^{(1)}_{11}} = x_1 \]

As a result, we have the gradient with respect to \(w_{11}^{(1)}\), from the input layer to the hidden layer, as follows:

(65)#\[ \frac{\delta C}{\delta w^{(1)}_{11}} = (\frac{1-y}{1-\hat{y}} - \frac{y}{\hat{y}})*\sigma^{'}(h_2)*w^{(2)}_{11}*\sigma^{'}(h_1)*x_1 \label{inputToHidden} \]
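Equation (65) is easy to sanity-check numerically. The sketch below, using the same assumed layout and made-up numbers as before, evaluates the chain-rule product and compares it against a central finite difference of the single-sample cost with respect to \(w^{(1)}_{11}\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical single sample (three features and a binary label)
x1, x2, x3, y = 0.5, -1.2, 0.8, 1.0

# Made-up weights; w1_12 etc. feed the assumed second hidden neuron
w1_11, w1_21, w1_31, w1_41 = 0.1, -0.2, 0.3, 0.05
w1_12, w1_22, w1_32, w1_42 = 0.2, 0.1, -0.1, 0.0
w2_11, w2_21, w2_31 = 0.4, -0.3, 0.1

def forward(w):
    """Forward pass with w substituted for w1_11, everything else held fixed."""
    h1  = w * x1 + w1_21 * x2 + w1_31 * x3 + w1_41              # equation (60)
    h1p = w1_12 * x1 + w1_22 * x2 + w1_32 * x3 + w1_42          # second hidden neuron
    h2  = sigmoid(h1) * w2_11 + sigmoid(h1p) * w2_21 + w2_31    # equation (59)
    return h1, h2, sigmoid(h2)

def cost(y_hat):
    """Single-sample cross-entropy cost."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

h1, h2, y_hat = forward(w1_11)

# Chain rule of equation (65)
dC_dyhat   = (1 - y) / (1 - y_hat) - y / y_hat
grad_w1_11 = dC_dyhat * sigmoid_prime(h2) * w2_11 * sigmoid_prime(h1) * x1

# Central finite-difference approximation of the same derivative
eps = 1e-6
numeric = (cost(forward(w1_11 + eps)[2]) - cost(forward(w1_11 - eps)[2])) / (2 * eps)
print(grad_w1_11, numeric)   # the two numbers should agree very closely
```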

On the other hand, we also want the gradient of a weight from the hidden layer to the output layer (e.g. \(w_{11}^{(2)}\)):

(66)#\[ \frac{\delta C}{\delta w^{(2)}_{11}} = \frac{\delta C}{\delta\hat{y}}\frac{\delta\hat{y}}{\delta h_2} \frac{\delta h_2}{\delta w^{(2)}_{11}} \]
(67)#\[ \frac{\delta C}{\delta w^{(2)}_{11}} = (\frac{1-y}{1-\hat{y}} - \frac{y}{\hat{y}})*\sigma^{'}(h_2) * \sigma(h_1) \label{hiddenToOutput} \]
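A minimal sketch of equation (67) for a single sample; the label and pre-activations are made-up values standing in for the outputs of a forward pass like the one above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical label and pre-activations (h1 hidden, h2 output)
y, h1, h2 = 1.0, 0.37, 0.21
y_hat = sigmoid(h2)

# Equation (67): gradient of the cost with respect to w2_11
dC_dyhat   = (1 - y) / (1 - y_hat) - y / y_hat
grad_w2_11 = dC_dyhat * sigmoid_prime(h2) * sigmoid(h1)
print(grad_w2_11)
```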

We realise that the first two factors are the same in equations (65) and (67), so we define an “Error Term” \(\delta^{n-1}\) for simplicity (\(n\) is the number of layers in the network, i.e. \(n=3\)):

(68)#\[ \delta^{n-1} = \frac{\delta C}{\delta \hat{y}}\frac{\delta \hat{y}}{\delta h_2} = (\frac{1-y}{1-\hat{y}} - \frac{y}{\hat{y}})*\sigma^{'}(h_2) \]

So the gradient of the weight \(w_{11}^{(2)}\) from the hidden to the output (\(2nd\)) layer in equation (67) can be expressed with the error term \(\delta^2\) of the \(2nd\) layer as

(69)#\[ \frac{\delta C}{\delta w^{(2)}_{11}} = \delta^{3-1} *\sigma(h_1) \]
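In code, the error term is simply the shared factor of the two gradients; a minimal sketch with the same kind of made-up single-sample values as above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical label and pre-activations from a forward pass
y, h1, h2 = 1.0, 0.37, 0.21
y_hat = sigmoid(h2)

# Error term of the output layer, equation (68)
delta2 = ((1 - y) / (1 - y_hat) - y / y_hat) * sigmoid_prime(h2)

# Equation (69): the hidden-to-output gradient reuses delta2
grad_w2_11 = delta2 * sigmoid(h1)
print(grad_w2_11)
```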

And the gradient of the weight \(w_{11}^{(1)}\) from the input to the hidden layer in equation (65) can be written as

(70)#\[ \frac{\delta C}{\delta w^{(1)}_{11}} = \delta^{3-1} * w^{(2)}_{11} * \sigma^{'}(h_1)*x_1 \label{inputToHidden1} \]

To simplify further, we define another “Error Term” \(\delta^{n-2}\) (i.e. \(\delta^{3-2} = \delta^1\) for the \(1st\) layer):

(71)#\[ \delta^{n-2} = \delta^{n-1} * w^{(2)}_{11} * \sigma^{'}(h_1) \]

Here \(\delta^1\) belongs to the next layer to the left; in this three-layer network it is used for the weights between the input and the hidden layer. \(\delta^1\) is obtained from \(\delta^2\) by multiplying by the corresponding weight \(w^{(2)}_{11}\) of the current layer and the derivative of the sigmoid of the layer to the left, \(\sigma^{'}(h_1)\).

So the equation (70) for weights in the \(1st\) layer can be written as

(72)#\[ \frac{\delta C}{\delta w^{(1)}_{11}} = \delta^1 *x_1 \label{inputToHidden2} \]
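Propagating the error term one layer to the left and reusing it for the input-to-hidden gradient then looks like this (again with made-up single-sample values, including \(w^{(2)}_{11}\) and \(x_1\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical single-sample quantities
y, h1, h2, x1 = 1.0, 0.37, 0.21, 0.5
w2_11 = 0.4
y_hat = sigmoid(h2)

# Error terms: output layer (equation (68)), then hidden layer (equation (71))
delta2 = ((1 - y) / (1 - y_hat) - y / y_hat) * sigmoid_prime(h2)
delta1 = delta2 * w2_11 * sigmoid_prime(h1)

# Equation (72): gradient of the cost with respect to w1_11
grad_w1_11 = delta1 * x1
print(grad_w1_11)
```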

So, in back propagation, we update the weights in the layers closest to the output layer first and then work backwards towards the input layer. Every time we move back one layer, we compute its error term from the error term of the layer to its right (e.g. \(\delta^1\) from \(\delta^2\)).

For updating the weight from the hidden to the output (\(2nd\)) layer, we have

(73)#\[ w_i^{(2)'} = w_i^{(2)} - \eta \nabla C = w_i^{(2)} - \eta \frac{\delta C}{\delta w_i^{(2)}} = w_i^{(2)} - \frac{\eta}{m} \sum_{j=1}^m \delta^2\sigma(h_1) \]

Remember that the error term \(\delta^2\) is a function of \(y\), so both \(\delta^2\) and \(\sigma(h_1)\) are evaluated for each sample \(y_j\) and the results are summed over the whole dataset.

For updating the weight from the input to the hidden (\(1st\)) layer, we have

(74)#\[ w_i^{(1)'} = w_i^{(1)} - \eta \nabla C = w_i^{(1)} - \eta \frac{\delta C}{\delta w_i^{(1)}} = w_i^{(1)} - \frac{\eta}{m} \sum_{j=1}^m \delta^1 x_1 \]
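Putting the pieces together, here is a minimal sketch of one gradient-descent step for the two weights derived above, looping over the \(m\) samples as in equations (73) and (74). The dataset, learning rate and initial weights are all made-up assumptions, and only \(w^{(1)}_{11}\) and \(w^{(2)}_{11}\) are updated to keep the example short:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical dataset: m = 4 samples with three features and binary labels
X = np.array([[ 0.5, -1.2,  0.8],
              [ 1.0,  0.3, -0.4],
              [-0.7,  0.9,  0.2],
              [ 0.1, -0.5,  1.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])
m = len(y)
eta = 0.1   # assumed learning rate

# Made-up initial weights
w1_11, w1_21, w1_31, w1_41 = 0.1, -0.2, 0.3, 0.05
w1_12, w1_22, w1_32, w1_42 = 0.2, 0.1, -0.1, 0.0
w2_11, w2_21, w2_31 = 0.4, -0.3, 0.1

grad_w1_11 = 0.0
grad_w2_11 = 0.0
for j in range(m):
    x1, x2, x3 = X[j]
    # Forward pass, equations (59)-(60)
    h1  = w1_11 * x1 + w1_21 * x2 + w1_31 * x3 + w1_41
    h1p = w1_12 * x1 + w1_22 * x2 + w1_32 * x3 + w1_42
    h2  = sigmoid(h1) * w2_11 + sigmoid(h1p) * w2_21 + w2_31
    y_hat = sigmoid(h2)
    # Error terms, equations (68) and (71)
    delta2 = ((1 - y[j]) / (1 - y_hat) - y[j] / y_hat) * sigmoid_prime(h2)
    delta1 = delta2 * w2_11 * sigmoid_prime(h1)
    # Accumulate the per-sample gradients of equations (69) and (72)
    grad_w2_11 += delta2 * sigmoid(h1)
    grad_w1_11 += delta1 * x1

# Gradient-descent updates, equations (73) and (74)
w2_11 -= eta * grad_w2_11 / m
w1_11 -= eta * grad_w1_11 / m
print(w1_11, w2_11)
```

In a full implementation every weight in both layers would be updated in the same way, with one error term per neuron propagated from the output layer back towards the input layer.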