Chapter 5 – Gradient Descent 1

Data Science and Machine Learning for Geoscientists

In multivariable calculus (the Khan Academy videos on the topic are a recommended introduction), the gradient is defined using the vector differential operator nabla:

\[ \vec{\nabla} = \frac{\partial}{\partial x_1} \hat{x}_1 + \frac{\partial}{\partial x_2} \hat{x}_2 + \dots \tag{22} \]

where the \(x_i\) are the variables and \(\hat{x}_i\) is the unit vector along \(x_i\). In this example, the variables are the weights \(w\) and the bias \(b\).

In this case, we have

\[ \nabla C = \frac{\partial C}{\partial w_1} \hat{w}_1 + \frac{\partial C}{\partial w_2} \hat{w}_2 + \dots \tag{23} \]

where \(C\) is the cost function and \(\hat{w}_i\) is the unit vector along \(w_i\).
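As a concrete illustration of the partial derivatives that make up \(\nabla C\), the sketch below estimates them numerically with central differences. The cost function `cost`, the helper `numerical_gradient`, and the evaluation point are all hypothetical choices made for this example; they do not come from the chapter.

```python
def cost(w):
    """Hypothetical cost C(w1, w2) = w1^2 + 3*w2^2, chosen only for illustration."""
    w1, w2 = w
    return w1 ** 2 + 3.0 * w2 ** 2


def numerical_gradient(f, w, h=1e-6):
    """Approximate each partial derivative of f at w with a central difference."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h   # nudge only the i-th variable upward
        w_minus[i] -= h  # and downward, keeping the others fixed
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * h))
    return grad


w = [1.0, 2.0]
grad_C = numerical_gradient(cost, w)
# Analytically, the gradient of this cost is (2*w1, 6*w2), i.e. (2, 12) here.
print(grad_C)
```

Each entry of `grad_C` is one component of the gradient: it measures how \(C\) responds when a single \(w_i\) is varied and the others are held fixed.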

For small changes \(\Delta w_1, \Delta w_2, \dots\) in the variables, the resulting change in \(C\) is

\[ \Delta C \approx \frac{\partial C}{\partial w_1} \Delta w_1 + \frac{\partial C}{\partial w_2} \Delta w_2 + \dots \tag{24} \]

The smaller the \(\Delta w\), the better the approximation.
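The claim that the approximation improves as \(\Delta w\) shrinks can be checked numerically. The following is a minimal sketch, assuming a hypothetical quadratic cost and an arbitrary direction of change (neither is from the chapter). It compares the exact change in \(C\) with the sum of partial derivatives times \(\Delta w_i\) for three step sizes.

```python
def cost(w):
    """Hypothetical cost C(w1, w2) = w1^2 + 3*w2^2 (illustration only)."""
    w1, w2 = w
    return w1 ** 2 + 3.0 * w2 ** 2


def grad_cost(w):
    """Analytic partial derivatives of the hypothetical cost above."""
    w1, w2 = w
    return [2.0 * w1, 6.0 * w2]


def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


w = [1.0, 2.0]
direction = [1.0, -0.5]  # arbitrary direction for the change in w
errors = []

for scale in (1e-1, 1e-2, 1e-3):
    delta_w = [scale * d for d in direction]
    w_new = [wi + dwi for wi, dwi in zip(w, delta_w)]
    exact = cost(w_new) - cost(w)         # true change in C
    approx = dot(grad_cost(w), delta_w)   # first-order estimate of the change
    errors.append(abs(exact - approx))
    print(scale, exact, approx, errors[-1])
```

For this particular cost, the gap between the exact and estimated change shrinks with the square of the step size, so a step ten times smaller gives an error about a hundred times smaller.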

With these definitions, and writing \(\Delta w = (\Delta w_1, \Delta w_2, \dots)\) for the vector of changes, the expression above for \(\Delta C\) can be rewritten as

\[ \Delta C \approx \nabla C \cdot \Delta w \tag{25} \]

This equation helps explain why \(\nabla C\) is called the gradient vector: \(\nabla C\) relates changes in \(w\) to changes in \(C\), just as we’d expect something called a gradient to do. But what’s really exciting about the equation is that it lets us see how to choose \(\Delta w\) so as to make \(\Delta C\) negative. In particular, suppose we choose

\[ \Delta w = -\eta \nabla C \tag{26} \]

where \(\eta\) is a small positive number; it is the step size or learning rate.

Then the approximation for \(\Delta C\) above tells us that \(\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2\). Because \(\|\nabla C\|^2 \geq 0\), this guarantees that \(\Delta C \leq 0\), i.e., \(C\) will always decrease, never increase, if we change \(w\) according to this prescription, provided \(\eta\) is small enough for the linear approximation to hold. This is exactly the property we wanted! We’ll use this property to compute a value for \(\Delta w\), then move the position \(w\) by that amount:

\[ w'_n = w_n - \eta (\nabla C)_n = w_n - \eta \frac{\partial C}{\partial w_n} \tag{27} \label{update weight} \]

Similarly, we can update the bias term \(b\) by

\[ b'_n = b_n - \eta \frac{\partial C}{\partial b_n} \tag{28} \label{update bias} \]
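The two update rules can be sketched as a short loop. The example below is only an illustration under stated assumptions: a single weight `w` and bias `b`, a hypothetical toy dataset lying on the line \(y = 2x + 1\), a mean squared error cost, and an assumed learning rate `eta = 0.05`. None of these choices come from the chapter.

```python
# Hypothetical toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def cost(w, b):
    """Mean squared error C(w, b) of the line y = w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Analytic partial derivatives dC/dw and dC/db of the cost above."""
    n = len(xs)
    dC_dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    dC_db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dC_dw, dC_db


eta = 0.05       # learning rate (assumed value, small enough for this problem)
w, b = 0.0, 0.0  # arbitrary starting point
history = [cost(w, b)]

for _ in range(500):
    dC_dw, dC_db = gradients(w, b)
    # Update rules: w' = w - eta * dC/dw and b' = b - eta * dC/db.
    w = w - eta * dC_dw
    b = b - eta * dC_db
    history.append(cost(w, b))

print(w, b, history[-1])
```

The list `history` records the cost after every update, so it can be inspected to see how \(C\) changes from step to step. With a learning rate that is too large for the problem, the linear approximation behind the update breaks down and the cost can grow instead of shrinking.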