Chapter 5 – Gradient Descent 1
In multi-variable calculus (a recommended introduction is this Khan Academy video: https://www.youtube.com/watch?v=TrcCbdWwCBc&list=PLSQl0a2vh4HC5feHa6Rc5c0wbRTx56nF7), the gradient of a function is the vector of its partial derivatives, written with the nabla operator \(\nabla\):

\[
\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)^T
\]

where \(x = (x_1, \dots, x_n)\) collects the variables; in this example the variables are the weight \(w\) and the bias \(b\).
In this case, we have

\[
\nabla C = \left( \frac{\partial C}{\partial w}, \frac{\partial C}{\partial b} \right)^T
\]

where \(C\) is the cost function.
A small change in \(w\) and \(b\) produces a small change in the cost \(C\):

\[
\Delta C \approx \frac{\partial C}{\partial w} \Delta w + \frac{\partial C}{\partial b} \Delta b
\]

The smaller \(\Delta w\) and \(\Delta b\) are, the better this first-order approximation is.
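To see this approximation in action, here is a small numerical check in Python. The cost \(C(w, b) = (2w + b - 1)^2\) is a made-up toy function chosen only for illustration; its partial derivatives are computed analytically and the linear estimate is compared against the exact change in \(C\) for a small step.

```python
import numpy as np

# Illustrative check of the first-order approximation
#   Delta C ≈ (dC/dw) * Δw + (dC/db) * Δb
# using an assumed toy cost C(w, b) = (2w + b - 1)^2.

def C(w, b):
    return (2.0 * w + b - 1.0) ** 2

def grad_C(w, b):
    # Analytic partial derivatives of the toy cost above
    residual = 2.0 * w + b - 1.0
    return np.array([4.0 * residual, 2.0 * residual])  # [dC/dw, dC/db]

w, b = 0.5, -0.3
dw, db = 1e-3, 2e-3                        # small perturbations

exact_change = C(w + dw, b + db) - C(w, b)
linear_estimate = grad_C(w, b) @ np.array([dw, db])

print(exact_change, linear_estimate)       # the two values agree closely
```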
With these definitions, the approximation for \(\Delta C\) can be rewritten compactly as

\[
\Delta C \approx \nabla C \cdot \Delta w
\]

where \(\Delta w\) here denotes the vector of parameter changes.
This equation helps explain why \(\nabla C\) is called the gradient vector: \(\nabla C\) relates changes in \(w\) to changes in \(C\), just as we’d expect something called a gradient to do. But what’s really exciting about the equation is that it lets us see how to choose \(\Delta w\) so as to make \(\Delta C\) negative. In particular, suppose we choose

\[
\Delta w = -\eta \nabla C
\]

where \(\eta\) is a small positive number known as the step size or learning rate.
Then the equation above tells us that \(\Delta C \approx \nabla C \cdot (-\eta \nabla C) = -\eta \|\nabla C\|^2\). Because \(\|\nabla C\|^2 \geq 0\), this guarantees that \(\Delta C \leq 0\), i.e. \(C\) will always decrease, never increase, if we change \(w\) according to the prescription above. This is exactly the property we wanted! We’ll use this property to compute a value for \(\Delta w\), then move the position \(w\) by that amount:

\[
w \rightarrow w' = w - \eta \frac{\partial C}{\partial w}
\]
Similarly, we can update the bias term \(b\) by

\[
b \rightarrow b' = b - \eta \frac{\partial C}{\partial b}
\]
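Putting the two update rules together gives the basic gradient descent loop. The sketch below is a minimal illustration, again assuming the toy cost \(C(w, b) = (2w + b - 1)^2\) from the earlier snippet rather than the cost function of an actual model; the point is only the repeated updates \(w \leftarrow w - \eta\,\partial C/\partial w\) and \(b \leftarrow b - \eta\,\partial C/\partial b\).

```python
# Minimal gradient-descent sketch, assuming the toy cost
# C(w, b) = (2w + b - 1)^2 used in the earlier example.

def cost(w, b):
    return (2.0 * w + b - 1.0) ** 2

def grad_cost(w, b):
    residual = 2.0 * w + b - 1.0
    return 4.0 * residual, 2.0 * residual   # dC/dw, dC/db

eta = 0.05            # learning rate (step size)
w, b = 3.0, -2.0      # arbitrary starting point

for step in range(200):
    dC_dw, dC_db = grad_cost(w, b)
    w = w - eta * dC_dw        # w -> w' = w - eta * dC/dw
    b = b - eta * dC_db        # b -> b' = b - eta * dC/db

print(w, b, cost(w, b))        # the cost ends up very close to zero
```

Because each step follows \(-\eta \nabla C\), the cost decreases at every iteration provided \(\eta\) is small enough; if \(\eta\) is too large, the linear approximation behind \(\Delta C \approx \nabla C \cdot \Delta w\) breaks down and the iteration can overshoot or diverge.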