Chapter 19 – Hyper-Parameters

Data Science and Machine Learning for Geoscientists

How do we choose a neural network’s hyper-parameters, such as the learning rate, \(\eta\), the regularization parameter, \(\lambda\), and so on? So far I’ve just been supplying values which work pretty well. In practice, when we’re using neural nets to attack a problem, it can be difficult to find good hyper-parameters. Imagine, for example, that we’ve just been introduced to the MNIST problem, and have begun working on it, knowing nothing at all about what hyper-parameters to use. Let’s suppose that by good fortune in our first experiments we choose many of the hyper-parameters in the same way as was done earlier in this chapter: 30 hidden neurons, a mini-batch size of 10, training for 30 epochs using the cross-entropy cost. But we choose a learning rate \(\eta=10.0\) and regularization parameter \(\lambda=1000.0\). Our classification accuracies are no better than chance! Our network is acting as a random noise generator!
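
To make this concrete, here is a minimal NumPy stand-in for the kind of network used in the text, set up with exactly the badly chosen values above. This is a sketch, not the course’s own implementation: the class name `TinyNetwork` and its interface are ours, and MNIST is fetched through scikit-learn purely for convenience. Run as written, the full 30-epoch training takes a while (which is exactly the problem the rest of this section addresses), and the validation accuracy should hover around chance, as described above.

```python
import numpy as np
from sklearn.datasets import fetch_openml


def sigmoid(z):
    # Clip the argument to avoid overflow warnings in np.exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


class TinyNetwork:
    """A bare-bones sigmoid network trained by mini-batch SGD with the
    cross-entropy cost and L2 regularization (an illustrative stand-in)."""

    def __init__(self, sizes, seed=0):
        self.sizes = sizes
        rng = np.random.default_rng(seed)
        # Gaussian weights scaled by 1/sqrt(fan-in); zero biases.
        self.weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def forward(self, X):
        """Return the list of layer activations for a batch of inputs X."""
        activations = [X]
        for W, b in zip(self.weights, self.biases):
            activations.append(sigmoid(activations[-1] @ W + b))
        return activations

    def accuracy(self, X, y):
        return float(np.mean(np.argmax(self.forward(X)[-1], axis=1) == y))

    def sgd(self, X, y, epochs, batch_size, eta, lmbda,
            X_val=None, y_val=None, monitor_every=None):
        n = len(X)
        Y = np.eye(self.sizes[-1])[y]                  # one-hot targets
        rng = np.random.default_rng(1)
        seen = 0
        for epoch in range(epochs):
            order = rng.permutation(n)
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                acts = self.forward(X[idx])
                delta = acts[-1] - Y[idx]              # cross-entropy output error
                for layer in range(len(self.weights) - 1, -1, -1):
                    grad_W = acts[layer].T @ delta / len(idx)
                    grad_b = delta.mean(axis=0)
                    if layer > 0:                      # back-propagate the error
                        delta = (delta @ self.weights[layer].T) \
                                * acts[layer] * (1.0 - acts[layer])
                    # L2-regularized update: w -> (1 - eta*lmbda/n) w - eta*grad
                    self.weights[layer] = ((1.0 - eta * lmbda / n) * self.weights[layer]
                                           - eta * grad_W)
                    self.biases[layer] -= eta * grad_b
                seen += len(idx)
                if monitor_every and X_val is not None and seen % monitor_every == 0:
                    print(f"  after {seen} images: validation accuracy "
                          f"{self.accuracy(X_val, y_val):.3f}")
            if X_val is not None and not monitor_every:
                print(f"epoch {epoch}: validation accuracy "
                      f"{self.accuracy(X_val, y_val):.3f}")


# MNIST, split as in the text: 50,000 training and 10,000 validation images.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X / 255.0, y.astype(int)
X_train, y_train = X[:50000], y[:50000]
X_val, y_val = X[50000:60000], y[50000:60000]

# The badly chosen hyper-parameters from the text.
net = TinyNetwork([784, 30, 10])
net.sgd(X_train, y_train, epochs=30, batch_size=10, eta=10.0, lmbda=1000.0,
        X_val=X_val, y_val=y_val)
# Expect the validation accuracy to hover around 10 percent, i.e. chance.
```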

“Well, that’s easy to fix,” we might say, “just decrease the learning rate and regularization hyper-parameters”. Unfortunately, we don’t know a priori that those are the hyper-parameters we need to adjust. Maybe the real problem is that our 30 hidden neuron network will never work well, no matter how the other hyper-parameters are chosen? Maybe we really need at least 100 hidden neurons? Or 300 hidden neurons? Or multiple hidden layers? Or a different approach to encoding the output? Maybe our network is learning, but we need to train for more epochs? Maybe the mini-batches are too small? Maybe we’d do better switching back to the quadratic cost function? Maybe we need to try a different approach to weight initialization? And so on, on and on and on. It’s easy to feel lost in hyper-parameter space. This can be particularly frustrating if our network is very large, or uses a lot of training data, since we may train for hours or days or weeks, only to get no result. If the situation persists, it damages our confidence. Maybe neural networks are the wrong approach to our problem? Maybe we should quit our job and take up beekeeping?

There are some heuristics which can be used to set the hyper-parameters in a neural network. Of course, I won’t cover everything about hyper-parameter optimization. That’s a huge subject, and it’s not, in any case, a problem that is ever completely solved, nor is there universal agreement amongst practitioners on the right strategies to use. There’s always one more trick we can try to eke out a bit more performance from our network. But the heuristics in this section should get us started.

Broad Strategy

When using neural networks to attack a new problem, the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance. This can be surprisingly difficult, especially when confronting a new class of problem. Let’s look at some strategies we can use if we’re having this kind of trouble.

Suppose, for example, that we’re attacking MNIST for the first time. We start out enthusiastic, but are a little discouraged when our first network fails completely. The way to go is to strip the problem down. Get rid of all the training and validation images except images which are 0s or 1s. Then try to train a network to distinguish 0s from 1s. Not only is that an inherently easier problem than distinguishing all ten digits, it also reduces the amount of training data by 80 percent, speeding up training by a factor of 5. That enables much more rapid experimentation, and so gives us more rapid insight into how to build a good network.
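
As a sketch of that stripping-down step, reusing the arrays from the snippet above (the variable names are ours): keep only the images labelled 0 or 1, which discards roughly 80 percent of the training data.

```python
# Keep only the 0s and 1s, reusing X_train, y_train, X_val, y_val from above.
mask_train = (y_train == 0) | (y_train == 1)
mask_val = (y_val == 0) | (y_val == 1)
X01_train, y01_train = X_train[mask_train], y_train[mask_train]
X01_val, y01_val = X_val[mask_val], y_val[mask_val]
print(f"kept {len(y01_train)} of {len(y_train)} training images")
```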

We can further speed up experimentation by stripping our network down to the simplest network likely to do meaningful learning. If we believe a [784, 10] network can likely do better-than-chance classification of MNIST digits, then we can begin our experimentation with such a network. It’ll be much faster than training a [784, 30, 10] network, and we can build back up to the latter later.
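
Continuing the sketch, the simplest network likely to show meaningful learning here is a [784, 10] net with no hidden layer at all, trained on the stripped-down 0-vs-1 data; the learning rate below is just an illustrative, moderate value.

```python
# A [784, 10] network (no hidden layer) on the 0-vs-1 problem.
small_net = TinyNetwork([784, 10])
small_net.sgd(X01_train, y01_train, epochs=2, batch_size=10, eta=0.5, lmbda=0.0,
              X_val=X01_val, y_val=y01_val)
```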

We can get another speed up in experimentation by increasing the frequency of monitoring. In many examples, we monitor performance at the end of each training epoch. With 50,000 images per epoch, that means waiting a little while - about ten seconds per epoch, on my laptop, when training a [784, 30, 10] network - before getting feedback on how well the network is learning. Of course, ten seconds isn’t very long, but if we want to trial dozens of hyper-parameter choices it’s annoying, and if we want to trial hundreds or thousands of choices it starts to get debilitating. We can get feedback more quickly by monitoring the validation accuracy more often, say, after every 1,000 training images.
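
One way to monitor more frequently, sketched with the stand-in network above (the `monitor_every` hook is our own, not a feature of the course’s code, and \(\eta=0.5\) is just an illustrative value): report the validation accuracy after every 1,000 training images instead of once per epoch.

```python
# Check progress every 1,000 training images rather than at the end of an epoch.
net = TinyNetwork([784, 30, 10])
net.sgd(X_train, y_train, epochs=1, batch_size=10, eta=0.5, lmbda=0.0,
        X_val=X_val, y_val=y_val, monitor_every=1000)
```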

Furthermore, instead of using the full 10,000 image validation set to monitor performance, we can get a much faster estimate using just 100 validation images. All that matters is that the network sees enough images to do real learning, and to get a pretty good rough estimate of performance. Of course, our program doesn’t currently do this kind of monitoring. But as a kludge to achieve a similar effect for the purposes of illustration, we’ll strip down our training data to just the first 1,000 MNIST training images. Let’s try it and see what happens. (To keep the code below simple I haven’t implemented the idea of using only 0 and 1 images. Of course, that can be done with just a little more work.)
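
Here is the kludge from the paragraph above in the same sketch: only the first 1,000 training images and the first 100 validation images, still with the over-large learning rate. Note one adjustment specific to this sketch: the weight-decay factor in the update is \(\eta\lambda/n\), so after shrinking the training set from 50,000 to 1,000 images, \(\lambda\) is scaled down to 20.0 to keep that factor the same; with \(\lambda=1000.0\) and \(n=1000\) the update would blow up.

```python
# Fast feedback: tiny training and validation sets, still with eta = 10.0.
X_tiny, y_tiny = X_train[:1000], y_train[:1000]
X_val_tiny, y_val_tiny = X_val[:100], y_val[:100]

net = TinyNetwork([784, 30, 10])
net.sgd(X_tiny, y_tiny, epochs=30, batch_size=10, eta=10.0,
        lmbda=20.0,   # lambda rescaled so that lambda/n matches 1000/50000
        X_val=X_val_tiny, y_val=y_val_tiny)
# Expect accuracy still near 10 percent: eta = 10.0 is far too large.
```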

We’re still getting pure noise! But there’s a big win: we’re now getting feedback in a fraction of a second, rather than once every ten seconds or so. That means we can more quickly experiment with other choices of hyper-parameter, or even conduct experiments trialling many different choices of hyper-parameter nearly simultaneously.

This all looks very promising as a broad strategy. However, I want to return to that initial stage of finding hyper-parameters that enable a network to learn anything at all. In fact, even the above discussion conveys too positive an outlook. It can be immensely frustrating to work with a network that’s learning nothing. We can tweak hyper-parameters for days, and still get no meaningful response. And so I’d like to re-emphasize that during the early stages we should make sure we can get quick feedback from experiments. Intuitively, it may seem as though simplifying the problem and the architecture will merely slow us down. In fact, it speeds things up, since we much more quickly find a network with a meaningful signal. Once we’ve got such a signal, we can often get rapid improvements by tweaking the hyper-parameters. As with many things in life, getting started can be the hardest thing to do.

Specific Recommendations

Okay, that’s the broad strategy. Let’s now look at some specific recommendations for setting hyper-parameters. As introduced before, the learning rate, \(\eta\), need not be held fixed: it can be adjusted dynamically as training proceeds, for example in response to the behaviour of the gradient or of the validation accuracy.
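
One simple way to make \(\eta\) dynamic, as a sketch (the helper below is hypothetical, not part of the course’s code): keep a history of validation accuracies and halve the learning rate whenever no improvement has been seen for a while.

```python
def update_learning_rate(eta, val_history, patience=10, factor=0.5, eta_min=1e-4):
    """Halve eta if the best accuracy in the last `patience` checks is no
    better than the best seen before that; never go below eta_min."""
    if (len(val_history) > patience
            and max(val_history[-patience:]) <= max(val_history[:-patience])):
        eta = max(eta * factor, eta_min)
    return eta
```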

As for the L2 regularization parameter, \(\lambda\), we can start with \(\lambda=0\) while we settle on a value of \(\eta\). Using that choice of \(\eta\), we can then use the validation data to select a good value for \(\lambda\). Start by trialling \(\lambda=1\) and then increase or decrease by factors of 10, as needed to improve performance on the validation data. Once we’ve found a good order of magnitude, we can fine-tune our value of \(\lambda\).
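
The \(\lambda\) search just described, sketched with the stand-in network on the stripped-down data: hold \(\eta\) fixed, sweep \(\lambda\) across orders of magnitude, and keep whichever value does best on the validation images (the particular grid of values here is illustrative only).

```python
best_lmbda, best_acc = None, 0.0
for lmbda in (0.01, 0.1, 1.0, 10.0, 100.0):
    net = TinyNetwork([784, 30, 10])
    net.sgd(X_tiny, y_tiny, epochs=10, batch_size=10, eta=0.5, lmbda=lmbda)
    acc = net.accuracy(X_val_tiny, y_val_tiny)
    if acc > best_acc:
        best_lmbda, best_acc = lmbda, acc
print("best order of magnitude for lambda:", best_lmbda)
```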

Next, how should we set the mini-batch size? Choosing the best mini-batch size is a compromise. If the size is too small, we won’t take full advantage of good matrix libraries optimized for fast hardware. If it is too large, then we’re simply not updating the weights often enough. What we need is a compromise value which maximizes the speed of learning. Fortunately, the mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so we don’t need to have optimized those hyper-parameters in order to find a good mini-batch size. The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes. Plot the validation accuracy versus time (as in real elapsed time, not epochs), and choose whichever mini-batch size gives the most rapid improvement in performance. With the mini-batch size chosen, we can then proceed to optimize the other hyper-parameters.
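
A sketch of that mini-batch trial, again using the stand-in network and the small data: in practice we would record the validation accuracy at several points during training and plot it against elapsed time, but even a single timed run per size gives a feel for the trade-off.

```python
import time

for batch_size in (1, 10, 100, 1000):
    net = TinyNetwork([784, 30, 10])
    start = time.time()
    net.sgd(X_tiny, y_tiny, epochs=5, batch_size=batch_size, eta=0.5, lmbda=0.0)
    print(f"batch size {batch_size:4d}: validation accuracy "
          f"{net.accuracy(X_val_tiny, y_val_tiny):.2f} "
          f"in {time.time() - start:.1f} s")
```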

Many of these remarks also apply to other hyper-parameters, including those associated with the network architecture, the number of epochs (determined by early stopping, as mentioned before), other forms of regularization, and hyper-parameters such as the momentum coefficient mentioned earlier.

Now, we can automate the choice of some hyper-parameters (the number of epochs and the learning rate), but others, such as the mini-batch size, we still need to pick by hand. There is a paper (http://dl.acm.org/citation.cfm?id=2188395) on automating the process of choosing hyper-parameters. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. The code from this paper (https://github.com/jaberg/hyperopt) has been used with some success by other researchers.
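
Here is a plain grid search over \(\eta\) and \(\lambda\), sketched with the stand-in network on the small data (the grid values are illustrative; the hyperopt code linked above automates a more sophisticated search than this).

```python
from itertools import product

results = {}
for eta, lmbda in product((0.05, 0.5, 5.0), (0.1, 1.0, 10.0)):
    net = TinyNetwork([784, 30, 10])
    net.sgd(X_tiny, y_tiny, epochs=10, batch_size=10, eta=eta, lmbda=lmbda)
    results[(eta, lmbda)] = net.accuracy(X_val_tiny, y_val_tiny)

best = max(results, key=results.get)
print("best (eta, lambda) on the grid:", best, "accuracy:", results[best])
```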