Testing different optimizers in Keras

We have defined and used a network; it is useful to start giving an intuition about how networks are trained. Let's focus on one popular training technique known as gradient descent (GD). Imagine a generic cost function C(w) in one single variable w like in the following graph:

The gradient descent can be seen as a hiker who aims at climbing down a mountain into a valley. The mountain represents the function C, while the valley represents the minimum Cmin. The hiker has a starting point w0. The hiker moves little by little. At each step r, the gradient is the direction of maximum increase. Mathematically, this direction is the value of the partial derivative  evaluated at point wr reached at step r. Therefore by taking the opposite direction, , the hiker can move towards the valley. At each step, the hiker can decide what the leg length is before the next step. This is the learning rate  in gradient descent jargon. Note that if  is too small, then the hiker will move slowly. However, if  is too high, then the hiker will possibly miss the valley.

Now you should remember that a sigmoid is a continuous function, and it is possible to compute the derivative. It can be proven that the sigmoid is shown as follows:

It has the following derivative:

  

ReLU is not differentiable in 0. We can, however, extend the first derivative in 0 to a function over the whole domain by choosing it to be either 0 or 1. The point-wise derivative of ReLU  is as follows:

Once we have the derivative, it is possible to optimize the nets with a gradient descent technique. Keras uses its backend (either TensorFlow or Theano) for computing the derivative on our behalf so we don't need to worry about implementing or computing it. We just choose the activation function, and Keras computes its derivative on our behalf.

A neural network is essentially a composition of multiple functions with thousands, and sometimes millions, of parameters. Each network layer computes a function whose error should be minimized in order to improve the accuracy observed during the learning phase. When we discuss backpropagation, we will discover that the minimization game is a bit more complex than our toy example. However, it is still based on the same intuition of descending a valley.

Keras implements a fast variant of gradient descent known as stochastic gradient descent (SGD) and two more advanced optimization techniques known as RMSprop and Adam. RMSprop and Adam include the concept of momentum (a velocity component) in addition to the acceleration component that SGD has. This allows faster convergence at the cost of more computation. A full list of Keras-supported optimizers is at https://keras.io/optimizers/. SGD was our default choice so far. So now let's try the other two. It is very simple, we just need to change few lines:

from keras.optimizers import RMSprop, Adam
...
OPTIMIZER = RMSprop() # optimizer,

That's it. Let's test it as shown in the following screenshot:

As you can see in the preceding screenshot, RMSprop is faster than SDG since we are able to achieve an accuracy of 97.97% on training, 97.59% on validation, and 97.84% on the test improving SDG with only 20 iterations. For the sake of completeness, let's see how the accuracy and loss change with the number of epochs, as shown in the following graphs:

OK, let's try the other optimizer, Adam(). It is pretty simple, as follows:

OPTIMIZER = Adam() # optimizer

As we have seen, Adam is slightly better. With Adam, we achieve 98.28% accuracy on training, 98.03% on validation, and 97.93% on the test with 20 iterations, as shown in the following graphs:

This is our fifth variant, and remember that our initial baseline was at 92.36%.

So far, we made progressive improvements; however, the gains are now more and more difficult. Note that we are optimizing with a dropout of 30%. For the sake of completeness, it could be useful to report the accuracy on the test only for other dropout values with Adam() chosen as optimizer, as shown in the following graph: