Chapter 6 – Gradient Descent 2

Data Science and Machine Learning for Geoscientists

Okay, it sounds good in theory so far. But how do we actually calculate \(\nabla C\)? Let’s compute \(\frac{\partial C(\vec{w},b)}{\partial w_1}\) in this two-layer (input layer and output layer) neural network example.

Figure 1.7: Two-layer neural network.
(29)\[ -\frac{\partial C(\vec{w},b)}{\partial w_1} = \frac{1}{m} \frac{\partial}{\partial w_1} \sum_{i=1}^m \left[ y_i \ln(\sigma(\vec{w}\cdot\vec{x}+b)) + (1-y_i) \ln(1-\sigma(\vec{w}\cdot\vec{x}+b)) \right] \]

It is very easy to write a for loop to compute the sum from \(i=1\) to \(m\), adding up the contribution of each individual prediction (a minimal sketch of such a loop follows equation (30) below). So let’s compute the term for a single \(y_i\) for the sake of simplicity. What’s more, \(y_i\) is just a constant, which we can move outside of the derivative, so we have

(30)\[ y_i \frac{\partial}{\partial w_1} \ln[\sigma(\vec{w}\cdot\vec{x}+b)] + (1-y_i) \frac{\partial}{\partial w_1} \ln[1-\sigma(\vec{w}\cdot\vec{x}+b)] \]
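As promised, the sum over the \(m\) samples is just a loop. Here is a minimal sketch in NumPy; the toy arrays `X`, `y`, `w`, and `b` are made-up placeholders, not data from the book:

```python
import numpy as np

def sigmoid(h):
    """Logistic function sigma(h) = 1 / (1 + e^(-h))."""
    return 1.0 / (1.0 + np.exp(-h))

# Toy placeholders: m = 4 samples with 2 features each.
X = np.array([[0.5, 1.2], [1.0, 0.3], [0.2, 0.8], [1.5, 1.1]])
y = np.array([1, 0, 0, 1])
w = np.array([0.1, -0.2])
b = 0.0

# Sum the per-sample terms from equation (29), then average.
m = len(y)
total = 0.0
for i in range(m):
    y_hat = sigmoid(np.dot(w, X[i]) + b)
    total += y[i] * np.log(y_hat) + (1 - y[i]) * np.log(1 - y_hat)
C = -total / m  # the cross-entropy cost
print(C)
```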

Let’s look at the first term in equation (30):

(31)\[ y_i \frac{\partial}{\partial w_1} \ln[\sigma(\vec{w}\cdot\vec{x}+b)] \]

We can break it down so that we can use the chain rule.

(32)\[ \ln(u), \qquad u = \sigma(h), \qquad h = \vec{w}\cdot\vec{x}+b \]

According to the chain rule, we have

(33)\[ \frac{\partial \ln(u)}{\partial w_1} = \frac{\partial \ln(u)}{\partial u} \frac{\partial u}{\partial h} \frac{\partial h}{\partial w_1} \]
(34)\[ \frac{\partial h}{\partial w_1} = \frac{\partial}{\partial w_1}(w_1x_1 + w_2x_2 + b) = x_1 \]
(35)\[ \frac{\partial u}{\partial h} = \frac{\partial}{\partial h}\left(\frac{1}{1+e^{-h}}\right) = \frac{1}{1+e^{-h}} \cdot \frac{e^{-h}}{1+e^{-h}} = \sigma(h)(1-\sigma(h)) \]
(36)\[ \frac{\partial \ln(u)}{\partial u} = \frac{1}{u} \]
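The identity in equation (35) is easy to sanity-check numerically. A quick finite-difference sketch; the test point `h = 0.7` and step size `eps` are arbitrary choices:

```python
import numpy as np

def sigmoid(h):
    """Logistic function sigma(h)."""
    return 1.0 / (1.0 + np.exp(-h))

h, eps = 0.7, 1e-6  # arbitrary test point and step size

# Central-difference approximation of d(sigma)/dh ...
numerical = (sigmoid(h + eps) - sigmoid(h - eps)) / (2 * eps)
# ... versus the closed form sigma(h) * (1 - sigma(h)).
analytical = sigmoid(h) * (1 - sigmoid(h))
print(numerical, analytical)  # the two values agree closely
```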

So equation (31) becomes

(37)\[ y_i \frac{\partial}{\partial w_1} \ln[\sigma(\vec{w}\cdot\vec{x}+b)] = y_i \cdot \frac{1}{\hat{y}_i} \cdot \hat{y}_i(1-\hat{y}_i) \cdot x_1 = y_i(1-\hat{y}_i)x_1 \]

where \(y_i\) is the actual result and \(\hat{y}_i\) is the result predicted by the network.
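We can also let a computer algebra system confirm equation (37) symbolically (up to the constant factor \(y_i\)). A short sketch using SymPy; the symbol names are our own choices:

```python
import sympy as sp

w1, w2, x1, x2, b = sp.symbols("w1 w2 x1 x2 b")
h = w1 * x1 + w2 * x2 + b      # h = w . x + b
sigma = 1 / (1 + sp.exp(-h))   # u = sigma(h)

# d/dw1 of ln(sigma(h)) should equal (1 - sigma(h)) * x1.
lhs = sp.diff(sp.log(sigma), w1)
rhs = (1 - sigma) * x1
print(sp.simplify(lhs - rhs))  # prints 0
```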

Similarly, applying the chain rule to the second term in equation (30) gives \(\frac{\partial}{\partial w_1} \ln[1-\sigma(\vec{w}\cdot\vec{x}+b)] = \frac{1}{1-\hat{y}_i} \cdot (-\hat{y}_i(1-\hat{y}_i)) \cdot x_1 = -\hat{y}_i x_1\). Combining the two terms, the per-sample result is

(38)\[ y_i(1-\hat{y}_i)x_1 - (1-y_i)\hat{y}_i x_1 = (y_i - \hat{y}_i)x_1 \]

Again, do not forget that we dropped the \(\frac{1}{m}\), the summation \(\sum\), and the leading minus sign of equation (29) for simplicity. Restoring them, the gradient of the loss/cost function is

(39)\[ \frac{\partial C}{\partial w_1} = -\frac{1}{m}\sum_{i=1}^m (y_i - \hat{y}_i)x_1^{(i)} = \frac{1}{m}\sum_{i=1}^m (\hat{y}_i - y_i)x_1^{(i)} \]

where \(x_1^{(i)}\) denotes the first input of the \(i\)-th sample.

This is actually a very neat result. How do we interpret it? The gradient of the cost function with respect to a weight is the average, over all students (i.e. all data points), of the error \(\hat{y}_i - y_i\) multiplied by the input feeding that weight. The larger the error, the larger the gradient, and the faster the neuron will learn. This is just what we’d intuitively expect. The gradient also scales with the input \(x\) values of that particular neuron.
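In vectorized form, equation (39) is a single line of NumPy. The sketch below reuses the toy arrays from the earlier loop example and checks the analytical gradient against a finite difference of the cost:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def cost(w, b, X, y):
    """Cross-entropy cost, as in equation (29)."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy placeholders, as before.
X = np.array([[0.5, 1.2], [1.0, 0.3], [0.2, 0.8], [1.5, 1.1]])
y = np.array([1, 0, 0, 1])
w = np.array([0.1, -0.2])
b = 0.0

# Equation (39): dC/dw_j = (1/m) * sum_i (y_hat_i - y_i) * x_j^(i)
y_hat = sigmoid(X @ w + b)
grad_w = X.T @ (y_hat - y) / len(y)

# Finite-difference check on w_1 (index 0).
eps = 1e-6
dw = np.array([eps, 0.0])
numerical = (cost(w + dw, b, X, y) - cost(w - dw, b, X, y)) / (2 * eps)
print(grad_w[0], numerical)  # the two values agree closely
```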

Anyway, in order to update each weight \(w_j\) as derived in the previous chapter in equation (27), we just need to plug in the \(\frac{\partial C}{\partial w_j}\) we have just obtained.

(40)\[ w_j^{'} = w_j - \eta \frac{\partial C}{\partial w_j} = w_j - \eta \frac{1}{m}\sum_{i=1}^m (\hat{y}_i - y_i)x_j^{(i)} \]

It is the same mathematical process for updating the bias term \(b\); we will just write down the result here:

(41)\[ b^{'} = b - \eta \frac{1}{m}\sum_{i=1}^m (\hat{y}_i - y_i) \]
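Putting equations (40) and (41) together gives a complete, if minimal, gradient-descent loop for this single neuron. In the sketch below the toy data, the learning rate `eta = 0.5`, and the iteration count are all arbitrary choices of ours:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Toy placeholders, as in the earlier sketches.
X = np.array([[0.5, 1.2], [1.0, 0.3], [0.2, 0.8], [1.5, 1.1]])
y = np.array([1, 0, 0, 1])

w = np.zeros(X.shape[1])  # initial weights
b = 0.0                   # initial bias
eta = 0.5                 # learning rate (arbitrary)
m = len(y)

for step in range(1000):
    y_hat = sigmoid(X @ w + b)
    grad_w = X.T @ (y_hat - y) / m  # gradient from equation (39)
    grad_b = np.mean(y_hat - y)     # gradient for the bias
    w -= eta * grad_w               # update rule, equation (40)
    b -= eta * grad_b               # update rule, equation (41)

print(w, b)
print(sigmoid(X @ w + b))  # predictions move toward y = [1, 0, 0, 1]
```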