# Chapter 5 – Gradient Descent 1

Data Science and Machine Learning for Geoscientists

In multi-variable calculus (see the recommended Khan Academy video: https://www.youtube.com/watch?v=TrcCbdWwCBc&list=PLSQl0a2vh4HC5feHa6Rc5c0wbRTx56nF7), the gradient operator is written with the nabla symbol:

(22)#$\vec{\nabla} = \frac{\partial}{\partial x_1} \hat{x}_1 + \frac{\partial}{\partial x_2} \hat{x}_2 + ...$

where the $$x_i$$ are the variables; in this example, the variables are the weights $$w$$ and the bias $$b$$.

In this case, we have

(23)#$\nabla C = \left( \frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, ... \right)$

where $$C$$ is the cost function.
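As a concrete illustration (using a made-up two-weight cost, not one from the text), we can form the gradient vector from the analytic partial derivatives and check each component with a finite difference:

```python
import numpy as np

# Hypothetical cost with two weights: C(w1, w2) = w1^2 + 3*w2^2
def C(w):
    return w[0]**2 + 3.0 * w[1]**2

def grad_C(w):
    # The gradient vector (dC/dw1, dC/dw2) from the analytic partials
    return np.array([2.0 * w[0], 6.0 * w[1]])

w = np.array([1.0, -2.0])
h = 1e-6
# Central finite-difference estimate of each partial derivative
fd = np.array([(C(w + h * e) - C(w - h * e)) / (2 * h)
               for e in np.eye(2)])
print(grad_C(w))  # analytic gradient
print(fd)         # numerical estimate agrees closely
```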

For every small change in the weights, the corresponding change in $$C$$ is approximately

(24)#$\Delta C \approx \frac{\partial C}{\partial w_1} \Delta w_1+ \frac{\partial C}{\partial w_2} \Delta w_2 + ...$

and the smaller each $$\Delta w_i$$, the better the approximation.
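To see this first-order approximation in action, here is a small sketch (again with a hypothetical cost $$C = w_1^2 + 3 w_2^2$$) comparing the exact change in $$C$$ against the sum of partial derivatives times weight changes:

```python
import numpy as np

def C(w):
    return w[0]**2 + 3.0 * w[1]**2   # hypothetical two-weight cost

def grad_C(w):
    return np.array([2.0 * w[0], 6.0 * w[1]])

w = np.array([1.0, -2.0])
dw = np.array([1e-4, 2e-4])          # a small change in the weights

exact = C(w + dw) - C(w)             # exact Delta C
approx = grad_C(w) @ dw              # sum_i (dC/dw_i) * Delta w_i
print(exact, approx)                 # nearly equal for small dw
```

Shrinking `dw` further makes the two values agree even more closely, as the text states.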

With these definitions, the expression above for $$\Delta C$$ can be rewritten as

(25)#$\Delta C \approx \nabla C \cdot \Delta w$

This equation helps explain why $$\nabla C$$ is called the gradient vector: it relates changes in $$w$$ to changes in $$C$$, just as we’d expect something called a gradient to do. But what’s really exciting about the equation is that it lets us see how to choose $$\Delta w$$ so as to make $$\Delta C$$ negative. In particular, suppose we choose

(26)#$\Delta w = -\eta \nabla C$

where $$\eta$$ is a small positive number known as the step size or learning rate.

Then the equation above tells us that $$\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta ||\nabla C||^2$$. Because $$||\nabla C||^2 \geq 0$$, this guarantees that $$\Delta C \leq 0$$, i.e., $$C$$ will always decrease, never increase, if we change $$w$$ according to the prescription above. This is exactly the property we wanted! We’ll use this property to compute a value for $$\Delta w$$, then move the position $$w$$ by that amount:

(27)#$w^{'}_n = w_n - \eta \frac{\partial C}{\partial w_n} \label{update weight}$

Similarly, we can update the bias term $$b$$ by

(28)#$b^{'}_n = b_n - \eta \frac{\partial C}{\partial b_n} \label{update bias}$
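Putting the two update rules together, here is a minimal sketch of gradient descent on a linear model $$y = wx + b$$, assuming a least-squares cost on synthetic 1-D data (this example setup is ours, not the book's):

```python
import numpy as np

# Synthetic data generated by a line with w = 2, b = 1
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0

def cost(w, b):
    # Least-squares cost C(w, b) = mean((w*x + b - y)^2)
    return np.mean((w * x + b - y) ** 2)

w, b = 0.0, 0.0
eta = 0.5                                  # learning rate
for _ in range(2000):
    err = w * x + b - y
    w = w - eta * 2.0 * np.mean(err * x)   # w' = w - eta * dC/dw
    b = b - eta * 2.0 * np.mean(err)       # b' = b - eta * dC/db

print(round(w, 3), round(b, 3))            # approaches w = 2, b = 1
```

Because each step follows $$-\eta \nabla C$$, the cost decreases toward its minimum, and the fitted parameters recover the line used to generate the data.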