Chapter 18 – Softmax

The cross-entropy cost can be used to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. Softmax is worth understanding in part because it's intrinsically interesting, and in part because we'll use softmax layers in our discussion of deep neural networks.

The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as a sigmoid layer, by forming a weighted input for each neuron \(a\) in the output layer (layer \(n\))

\[ h^{(n)}_a = w_{a1}x_1 + w_{a2}x_2 + \dots + b_a \tag{107} \]

However, we don’t apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function

\[ \text{softmax}(h)^{(n)}_a = \frac{e^{h^{(n)}_a}}{\sum_b e^{h^{(n)}_b}} \tag{108} \]

where \(\text{softmax}(h)^{(n)}_a\) is the predicted probability, between 0 and 1, assigned to the \(a^{th}\) neuron of the output layer, and the sum in the denominator runs over all neurons \(b\) of that layer.
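To make the definition concrete, here is a minimal NumPy sketch of a softmax output layer. The weights `W`, biases `b` and input `x` below are random placeholders invented for the example, not values from any network discussed in this book.

```python
import numpy as np

def softmax(h):
    """Softmax of a vector of weighted inputs h.

    Subtracting max(h) before exponentiating leaves the result
    unchanged (numerator and denominator are scaled by the same
    factor) but avoids overflow for large weighted inputs.
    """
    e = np.exp(h - np.max(h))
    return e / e.sum()

# Hypothetical output layer: weights W (one row per output neuron),
# biases b and an input vector x, drawn at random for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))   # 10 output neurons, 4 inputs
b = rng.normal(size=10)
x = rng.normal(size=4)

h = W @ x + b       # weighted inputs, one per output neuron (Eq. 107)
a = softmax(h)      # output activations (Eq. 108)
```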

The \(\text{softmax}(h)^{(n)}_a\) always sum to 1: the denominator in Equation (108) is the same for every output neuron, so dividing by it normalises the exponentials. In other words, the output from the softmax layer can be thought of as a probability distribution. The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation \(a^{(n)}_a = \text{softmax}(h)^{(n)}_a\) as the network's estimate of the probability that the correct output is the \(a^{th}\) neuron in the output layer. So, for instance, in the MNIST classification problem, we can interpret \(a^{(n)}_a\) as the network's estimated probability that the correct digit classification is the one represented by the \(a^{th}\) output neuron.
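Continuing the sketch above, we can check this property directly: every activation lies between 0 and 1, the activations sum to 1 (up to floating-point rounding), and the index of the largest activation can be read as the predicted class, for example the predicted digit in MNIST.

```python
print(a)                   # ten values, each between 0 and 1
print(a.sum())             # 1.0, up to floating-point rounding
print(int(np.argmax(a)))   # index of the most probable output neuron
```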

By contrast, if the output layer were a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible: each sigmoid activation lies between 0 and 1, yet nothing constrains the activations to sum to 1, so in general they won't form a probability distribution. With a sigmoid output layer we therefore don't have such a simple interpretation of the output activations.
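As a quick illustration of the contrast, applying the sigmoid element-wise to some made-up weighted inputs gives values that each lie between 0 and 1 but do not sum to 1.

```python
import numpy as np

def sigmoid(h):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-h))

h = np.array([2.0, -1.0, 0.5, 3.0])   # made-up weighted inputs
s = sigmoid(h)
print(s)            # each entry lies between 0 and 1 ...
print(s.sum())      # ... but the sum is generally not 1
```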