Chapter 9 – Back Propagation
The ultimate goal of a neural network, don’t forget, is to find the best weights and biases. So once we obtain the prediction \(\hat{y}\), we need to compare it with the actual result \(y\) and adjust the weight matrix \(W\) and the biases accordingly.
The thought process is identical to that of the two-layer neural network we introduced before. However, we now have three layers (and potentially many more), so we need to adjust and update the weights in every layer, instead of just the one layer of the two-layer example.
So we need to obtain the gradient of the cost function in order to update the weights. Let’s take the example of the first weight in the input layer in figure 8.1 in chapter 8. We need a longer chain rule to obtain its gradient because we have one more layer.
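Using \(h_1\) and \(h_2\) to denote the weighted inputs of the hidden and output neurons (the same symbols used later in this chapter), and keeping the cost \(C\) general, one way to write this longer chain is:

\[
\frac{\partial C}{\partial w^{(1)}_{11}} = \frac{\partial C}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial h_2}\,\frac{\partial h_2}{\partial \sigma(h_1)}\,\frac{\partial \sigma(h_1)}{\partial h_1}\,\frac{\partial h_1}{\partial w^{(1)}_{11}}
\]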
Note that the gradient of a weight is evaluated as a sum over all the data points/students (\(y_i\)), as explained in the previous example. For simplicity, we just derive the gradient of the weight \(w^{(1)}_{11}\) for a single \(y\), and then use a for loop to sum the contributions up and divide by \(m\).
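In symbols (writing \(C_i\) for the cost of the \(i\)-th data point, a notation assumed here just for this formula), the averaging amounts to

\[
\frac{\partial C}{\partial w^{(1)}_{11}} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial C_i}{\partial w^{(1)}_{11}}
\]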
Again, \(y\) is just a constant, so we can take it outside of the differentiation.
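The first two factors of the chain can then be written as follows (keeping \(\partial C/\partial \hat{y}\) general, since its exact form comes from the cost function introduced in the earlier chapters):

\[
\frac{\partial C}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial h_2} = \frac{\partial C}{\partial \hat{y}}\,\sigma^{'}(h_2),
\]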
where \(\sigma(h_2) = \hat{y}\).
We also know that \(h_2\) depends on the hidden-layer output \(\sigma(h_1)\) through the weight \(w^{(2)}_{11}\), so that \(\partial h_2/\partial \sigma(h_1) = w^{(2)}_{11}\), and that \(h_1\) depends on \(w^{(1)}_{11}\) only through the first input (call it \(x_1\)), so that \(\partial h_1/\partial w^{(1)}_{11} = x_1\); the remaining factor is simply \(\partial \sigma(h_1)/\partial h_1 = \sigma^{'}(h_1)\). Combining all these parts with the chain rule, we obtain the gradient of \(w_{11}^{(1)}\) from the input layer to the hidden layer as follows.
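For a single data point, and leaving \(\partial C/\partial \hat{y}\) unexpanded:

\[
\frac{\partial C}{\partial w^{(1)}_{11}} = \frac{\partial C}{\partial \hat{y}}\,\sigma^{'}(h_2)\,w^{(2)}_{11}\,\sigma^{'}(h_1)\,x_1
\]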
On the other hand, we also want the gradient of a weight from the hidden layer to the output layer, for example \(w_{11}^{(2)}\).
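Because \(h_2\) depends on \(w^{(2)}_{11}\) directly (assuming, as above, that \(h_2 = w^{(2)}_{11}\,\sigma(h_1) + \dots\)), this chain is one step shorter:

\[
\frac{\partial C}{\partial w^{(2)}_{11}} = \frac{\partial C}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial h_2}\,\frac{\partial h_2}{\partial w^{(2)}_{11}} = \frac{\partial C}{\partial \hat{y}}\,\sigma^{'}(h_2)\,\sigma(h_1)
\]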
We realise that the first two factors are the same in equations (65) and (67) (the two gradients derived above), so we define an ‘error term’ \(\delta^{n-1}\) for simplicity, where \(n\) is the number of layers in the network (i.e. \(n=3\)).
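In the notation above, this shared part is

\[
\delta^{2} = \frac{\partial C}{\partial \hat{y}}\,\sigma^{'}(h_2)
\]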
So the gradient of the weight \(w_{11}^{(2)}\) from the hidden to the output (\(2nd\)) layer in equation (67) can be expressed in terms of the \(\delta^2\) of the \(2nd\) layer.
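Written out in the notation above:

\[
\frac{\partial C}{\partial w^{(2)}_{11}} = \delta^{2}\,\sigma(h_1)
\]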
And the gradient of the weight \(w_{11}^{(1)}\) from the input to the hidden layer in equation (65) can be written in the same way.
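That is, substituting \(\delta^2\) into the gradient derived earlier (with \(x_1\) still denoting the first input):

\[
\frac{\partial C}{\partial w^{(1)}_{11}} = \delta^{2}\,w^{(2)}_{11}\,\sigma^{'}(h_1)\,x_1
\]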
To simplify further, we define another ‘error term’ \(\delta^{n-2}\) (i.e. \(\delta^{3-2} = \delta^{1}\) for the \(1st\) layer).
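In the notation above, this error term is

\[
\delta^{1} = \delta^{2}\,w^{(2)}_{11}\,\sigma^{'}(h_1),
\]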
where \(\delta^1\) is the error term for the next layer to the left; in this three-layer example it is used for the weight between the input and hidden layers. \(\delta^1\) is obtained by multiplying \(\delta^2\) by the corresponding weight \(w^{(2)}_{11}\) in the current layer and by the derivative of the sigmoid of the next layer to the left, \(\sigma^{'}(h_1)\).
So equation (70) for the weights in the \(1st\) layer can be written in terms of \(\delta^1\).
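Substituting \(\delta^1\) into the expression above gives

\[
\frac{\partial C}{\partial w^{(1)}_{11}} = \delta^{1}\,x_1
\]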
So, in a sense, in back propagation we update the weights in the layers closer to the output layer first and then work back towards the input layer. Every time we step back one layer, we compute that layer’s error term from the error term of the layer to its right, e.g. \(\delta^1\) from \(\delta^2\).
For updating the weight from the hidden to the output (\(2nd\)) layer, we have
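A sketch of the gradient-descent update, writing \(\alpha\) for the learning rate (a symbol assumed here) and indexing each per-sample quantity by \(i\):

\[
w^{(2)}_{11} \leftarrow w^{(2)}_{11} - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\delta^{2}_{i}\,\sigma(h_1)_{i}
\]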
Remember that \(\delta^{2}\) is a function of \(y\), which is why we need to sum over all the \(y_i\) in the dataset.
For updating the weight from the input to the hidden (\(1st\)) layer, we have
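Similarly, with the same assumed notation:

\[
w^{(1)}_{11} \leftarrow w^{(1)}_{11} - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\delta^{1}_{i}\,x_{1,i}
\]

The short Python sketch below puts these pieces together for a minimal network with one input, one hidden neuron and one output, so each layer has a single weight. It is only an illustration of the update loop described in this chapter: the toy data, the initial values and the squared-error cost \(C = \tfrac{1}{2}(\hat{y} - y)^2\) are assumptions made for concreteness, and the book’s own cost function can be substituted by changing `dC_dyhat`. Bias updates follow the same pattern and are omitted to keep the sketch short.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def sigmoid_prime(h):
    s = sigmoid(h)
    return s * (1.0 - s)

# Toy data: m samples with a single input feature x and a label y (assumed for illustration).
x = np.array([0.5, 1.5, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
m = len(x)

# One input -> one hidden neuron -> one output: a single weight (and bias) per layer.
w1, b1 = 0.1, 0.0   # w^(1)_11, input to hidden
w2, b2 = 0.1, 0.0   # w^(2)_11, hidden to output
alpha = 0.5         # learning rate (assumed value)

for epoch in range(1000):
    grad_w1 = 0.0
    grad_w2 = 0.0
    for i in range(m):                              # loop over the data points, as described above
        # Forward pass
        h1 = w1 * x[i] + b1
        a1 = sigmoid(h1)                            # sigma(h1)
        h2 = w2 * a1 + b2
        y_hat = sigmoid(h2)                         # sigma(h2) = y_hat

        # Backward pass: error terms
        dC_dyhat = y_hat - y[i]                     # derivative of the assumed cost 1/2 (y_hat - y)^2
        delta2 = dC_dyhat * sigmoid_prime(h2)       # output-layer error term
        delta1 = delta2 * w2 * sigmoid_prime(h1)    # hidden-layer error term

        # Accumulate the per-sample gradients
        grad_w2 += delta2 * a1                      # dC/dw2 = delta2 * sigma(h1)
        grad_w1 += delta1 * x[i]                    # dC/dw1 = delta1 * x1

    # Average over the m samples and take a gradient-descent step
    w2 -= alpha * grad_w2 / m
    w1 -= alpha * grad_w1 / m
```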