Chapter 3 – Cross Entropy

Data Science and Machine Learning for Geoscientists

The problem with the Maximum Likelihood approach from the last chapter is that, if we have a huge dataset, the total Prob(Event) will be very low (even if the model is pretty good):

(13)#\[ 0.7*0.7*0.8*0.8*0.7*0.7*0.8*0.8 = 9.8\% \]

This is the maximum likelihood score for an 8-student prediction. The prediction is just as good as the previous one, but the total Prob(Event) is significantly lower. To solve this problem, we can use the natural log:

(14)#\[ ln(A*B) = ln(A) + ln(B) \]
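To see why this matters numerically, here is a minimal, hypothetical Python sketch (the 1000 predictions with probability 0.8 are illustrative, not data from the chapter): the product of many probabilities collapses towards zero, while the sum of their logs stays a manageable number.

```python
import numpy as np

# Hypothetical example: 1000 predictions, each with probability 0.8
probs = np.full(1000, 0.8)

product = np.prod(probs)          # direct product: vanishingly small
log_sum = np.sum(np.log(probs))   # sum of logs: easy to work with

print(product)   # ~ 1e-97, practically zero
print(log_sum)   # ~ -223.1
```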

So for the first AI, we have:

(15)#\[ ln(0.7*0.2*0.2*0.6) = ln(0.7) + ln(0.2) + ln(0.2) + ln(0.6) = -4.086 \]

By convention, we take the magnitude (i.e. the negative) of this result as the score: 4.086.

For the second AI prediction, we have:

(16)#\[ -ln( 0.7*0.7*0.8*0.8) = -[ln(0.7) + ln(0.7) + ln(0.8) + ln(0.8)] = 1.16 \]
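As a quick check of these two scores, here is a short NumPy sketch that reproduces them; the lists hold each AI's predicted probability of the actual outcome for each student, as in the worked examples above.

```python
import numpy as np

# Predicted probability of the actual outcome for each student
first_ai  = [0.7, 0.2, 0.2, 0.6]
second_ai = [0.7, 0.7, 0.8, 0.8]

score_1 = -np.sum(np.log(first_ai))   # ~ 4.086
score_2 = -np.sum(np.log(second_ai))  # ~ 1.16

print(score_1, score_2)  # the second AI has the lower (better) score
```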

The lower the score is, the better the prediction. This score is known as the cross-entropy, which can be expressed as follows:

(17)#\[ -\sum_{i=1}^m [y_i*ln(p_i)+(1-y_i)*ln(1-p_i)] \]

where \(y_i\) is the actual result (0 or 1), indicating whether the \(i^{th}\) student got accepted or rejected, and \(p_i\) is the predicted probability that the \(i^{th}\) student gets accepted.

The first term of the equation measures how well the AI predicts the students who got accepted; the second term evaluates how well the AI predicts the students who got rejected.
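A direct implementation of equation (17) might look like the sketch below. The 0/1 labels and acceptance probabilities for the four students are illustrative assumptions (the previous chapter defines the actual data), chosen so that the score matches the second AI's 1.16 above.

```python
import numpy as np

def cross_entropy(y, p):
    """Cross-entropy of equation (17): y holds the 0/1 labels,
    p the predicted probability of acceptance for each student."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Assumed labels: first two students rejected, last two accepted
y = [0, 0, 1, 1]
p = [0.3, 0.3, 0.8, 0.8]    # predicted probability of acceptance
print(cross_entropy(y, p))  # ~ 1.16, same score as before
```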

Another reason for using the cross-entropy is the slow learning caused by a flattened gradient (http://neuralnetworksanddeeplearning.com/chap3.html#eqtn6). The cross-entropy cost keeps a large gradient when the prediction is badly wrong, which solves this problem.

Notice that there is another form of cost: the quadratic cost.

(18)#\[ C(w,b) = \frac{1}{2n}\sum_x |\hat{y}-y|^2 \]

When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we’re setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input - that is, an output neuron will have saturated near 1 when it should be 0, or vice versa. If we’re using the quadratic cost, that will slow down learning. It won’t stop learning completely, since the weights will continue learning from other training inputs, but it’s obviously undesirable.
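The difference is easy to see for a single sigmoid output neuron with one input. Below is a minimal sketch assuming the standard gradients derived in the linked chapter: dC/dw = (a - y)·σ'(z)·x for the quadratic cost and dC/dw = (a - y)·x for the cross-entropy; the input, target, and weighted sum z are illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A sigmoid neuron that starts out "decisively wrong":
# large z, so its output a is saturated near 1, while the target is 0.
x, y = 1.0, 0.0
z = 5.0
a = sigmoid(z)

# Quadratic cost: dC/dw = (a - y) * sigmoid'(z) * x
# The sigmoid'(z) factor is tiny when the neuron is saturated.
quad_grad = (a - y) * a * (1 - a) * x

# Cross-entropy cost: dC/dw = (a - y) * x
# The sigmoid'(z) factor cancels, so the gradient stays large.
ce_grad = (a - y) * x

print(quad_grad)  # ~ 0.0066, learning crawls
print(ce_grad)    # ~ 0.993, learning proceeds quickly
```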