Regression with TensorFlow

We will dive into TensorFlow in a future chapter, but regularized linear regression can be implemented with it, so it's a good idea to get a feel for how TensorFlow works.

Details on how TensorFlow is structured will be tackled in Chapter 8, Artificial Neural Networks and Deep Learning. Some of its scaffolding may seem odd, and there will be lots of magic numbers. Still, we will progressively use more of it for some small examples.

Let's try to use the Boston dataset for this experiment.

import numpy as np
import tensorflow as tf

TensorFlow requires you to create symbols for all elements it works on. These can be variables or placeholders. The former are symbols that TensorFlow will change, whereas placeholders are imposed externally: we feed their values to TensorFlow.

For regression, we need two placeholders, one for the input features and one for the output we want to match. We will also require two variables, one for the slope and one for the intercept. Compared with scikit-learn's linear regression, we have to write far more code for the same functionality:

X = tf.placeholder(shape=[None, 1], dtype=tf.float32, name="X")
Y = tf.placeholder(shape=[None, 1], dtype=tf.float32, name="y")
A = tf.Variable(tf.random_normal(shape=[1, 1]), name="A")
b = tf.Variable(tf.random_normal(shape=[1, 1]), name="b")

The two placeholders have a shape of [None, 1]. This means that they have a dynamic size along the first axis and a size of 1 on the fastest dimension (in terms of memory layout). The two variables are fully static and have a shape of [1, 1], meaning a single element. Both will be initialized by TensorFlow with values drawn from a random variable (a Gaussian with a mean of 0 and a variance of 1).
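
To make the [None, 1] shape concrete, here is a minimal plain-Python sketch (no TensorFlow involved) of what the placeholders expect: any number of rows, each holding exactly one value. The helper names as_column and shape_of are hypothetical and exist only for this illustration; with NumPy arrays, reshape(-1, 1) plays the same role.

```python
def as_column(values):
    """Turn a flat sequence into rows of one element each, i.e. shape [n, 1]."""
    return [[float(v)] for v in values]


def shape_of(rows):
    """Return (number of rows, number of columns) of a rectangular list of lists."""
    return (len(rows), len(rows[0]))


# The first dimension is free (None in TensorFlow), the second is fixed at 1.
small_batch = as_column([6.5, 5.9, 7.1])
large_batch = as_column(range(50))

print(shape_of(small_batch))  # (3, 1)
print(shape_of(large_batch))  # (50, 1)
```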

The type of a symbol can be set by using dtype, or for variables it can be inferred from the type of the initial_value. In this example, it will always be a floating point value.

All symbols can have a name and many TensorFlow functions take a name argument. It is good practice to give clear names, as TensorFlow errors will display them. If they are not set, TensorFlow will create new default names that can be difficult to decipher.

All the symbols are now created, and we can now create the loss function. We first create the prediction, and then we will compare it to the ground truth value:

model_output = tf.matmul(X, A) + b
loss = tf.reduce_mean(tf.square(Y - model_output))

The multiplication for the prediction may look transposed compared with the usual notation, and this is due to the way X was defined: each sample is a row, so X is indeed transposed! This allows model_output to have a dynamic first dimension.
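
The prediction and the loss can be mirrored in plain Python to check the arithmetic by hand. This is only a sketch under the assumption of a single feature; predict and mse_loss are hypothetical helper names, not TensorFlow functions, and the data is made up.

```python
def predict(X, A, b):
    """Row-by-row equivalent of tf.matmul(X, A) + b for X of shape [n, 1]."""
    return [[row[0] * A[0][0] + b[0][0]] for row in X]


def mse_loss(Y, predictions):
    """Mean of the squared differences, like tf.reduce_mean(tf.square(...))."""
    squared = [(y[0] - p[0]) ** 2 for y, p in zip(Y, predictions)]
    return sum(squared) / len(squared)


# Made-up data that follows y = 2 * x + 1 exactly
X = [[1.0], [2.0], [3.0]]
Y = [[3.0], [5.0], [7.0]]

perfect = mse_loss(Y, predict(X, [[2.0]], [[1.0]]))
poor = mse_loss(Y, predict(X, [[1.0]], [[0.0]]))

print(perfect)  # 0.0
print(poor)     # (4 + 9 + 16) / 3, about 9.67
```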

We can now minimize this cost function with a gradient descent. First we create the TensorFlow objects:

grad_step = 5e-7
my_opt = tf.train.GradientDescentOptimizer(grad_step)
train_step = my_opt.minimize(loss)

The gradient step is a crucial parameter of all TensorFlow optimizers. We will explore this further later; the important thing to know is that this step depends on the data and the cost function used. There are other optimizers available in TensorFlow; gradient descent is the simplest and one of the best suited to this case.
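
To illustrate why the step size matters, here is a hedged plain-Python sketch of a single gradient descent update on the mean squared error of a one-feature model. It does not reproduce TensorFlow's internals; mse and gradient_step are hypothetical helpers, and the data and step values are made up.

```python
def mse(xs, ys, a, b):
    """Mean squared error of the line a * x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(xs, ys, a, b, step):
    """Move (a, b) against the gradient of the mean squared error."""
    n = len(xs)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    grad_a = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_b = -2.0 / n * sum(residuals)
    return a - step * grad_a, b - step * grad_b


xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

start = mse(xs, ys, 0.0, 0.0)
small = mse(xs, ys, *gradient_step(xs, ys, 0.0, 0.0, step=0.01))
large = mse(xs, ys, *gradient_step(xs, ys, 0.0, 0.0, step=1.0))

print(small < start)  # True: a small step lowers the loss
print(large > start)  # True: a step that is too large overshoots
```

The same data with a different scale would need a different step, which is why the value has to be tuned for each problem.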

We also need some variables:

batch_size = 50
n_epochs = 20000
steps = 100

The batch size indicates how many elements at a time we are going to compute the loss for. During the optimization, this is also the size of the dynamic dimension of the placeholders, for both the input data and the output we predict.

The number of epochs is how many times we go through all the training data to optimize our model. Finally, steps is how often (in epochs) we display the value of the loss function we optimize.

Now we can go to the last step and let TensorFlow loose on the function and data we have:

from sklearn.metrics import mean_squared_error, r2_score

# x and y are assumed to be float arrays of shape [n_samples, 1]
loss_vec = []
with tf.Session() as sess:
    # Initialize A and b with the random values declared earlier
    sess.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        # Shuffle the sample order at the start of every epoch
        permut = np.random.permutation(len(x))
        for j in range(0, len(x), batch_size):
            batch = permut[j:j+batch_size]
            Xs = x[batch]
            Ys = y[batch]

            # One gradient descent update on the current batch
            sess.run(train_step, feed_dict={X: Xs, Y: Ys})

        # Loss over the full training set, stored for display
        temp_loss = sess.run(loss, feed_dict={X: x, Y: y})
        loss_vec.append(temp_loss)
        if epoch % steps == 0:
            (A_, b_) = sess.run([A, b])
            print('Epoch #%i A = %s b = %s' % (epoch, np.transpose(A_), b_))
            print('Loss = %.8f' % temp_loss)
            print("")

    # Predict on the training data while the session is still open
    prediction = sess.run(model_output, feed_dict={X: x, Y: y})
    mse = mean_squared_error(y, prediction)
    print("Mean squared error (on training data): {:.3}".format(mse))
    rmse = np.sqrt(mse)
    print('RMSE (on training data): %f' % rmse)
    r2 = r2_score(y, prediction)
    print("R2 (on training data): %.2f" % r2)

We first create a TensorFlow session. This will enable us to evaluate the symbols with calls to sess.run. The first argument is a symbol to evaluate or a list of symbols to evaluate (and their values will be the return of this call), and we have to pass a dictionary, feed_dict. This dictionary maps placeholders to actual data, so dimensions must match.

The first call in the session initializes all the variables according to what we specified when they were declared. Then we have two loops, one over epochs and one over batches.

For each epoch, we draw a permutation of the training data to randomize its order. This is important, especially for neural networks, so that the model is not biased by the order of the samples and learns from all the data consistently. If the batch size is equal to the size of the training data, then we don't need to randomize the data; this is usually the case when we have only a handful of data samples. For large datasets, we have to use batches. Each batch is fed to the train_step operation, which updates the variables.
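
The index bookkeeping of one epoch can be sketched without TensorFlow. The following assumes a hypothetical dataset of 120 samples and uses Python's random module in place of np.random.permutation; iterate_batches is an illustrative helper, not part of any library.

```python
import random


def iterate_batches(n_samples, batch_size, rng):
    """Yield lists of shuffled sample indices, batch_size at a time."""
    permut = list(range(n_samples))
    rng.shuffle(permut)
    for j in range(0, n_samples, batch_size):
        # The last slice is simply shorter when n_samples is not a multiple
        yield permut[j:j + batch_size]


rng = random.Random(0)  # seeded only so the sketch is reproducible
batches = list(iterate_batches(120, 50, rng))

print([len(batch) for batch in batches])  # [50, 50, 20]

# Every sample is visited exactly once per epoch
seen = sorted(index for batch in batches for index in batch)
print(seen == list(range(120)))  # True
```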

After each epoch, we save the loss over all the training data for display purposes. We also display the state of the variables every few epochs to monitor and check the state of the optimization.

Finally, we display the mean squared error of the outputs predicted by our model, as well as the RMSE and the R2 score.
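
As a reminder of what these metrics measure, here is a plain-Python sketch using the standard definitions. plain_mse and plain_r2 are hypothetical stand-ins written for this illustration, not the scikit-learn functions, and the numbers are made up.

```python
import math


def plain_mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def plain_r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

print(plain_r2(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0: perfect predictions
print(plain_r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0: no better than the mean
print(plain_r2(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0: worse than the mean

rmse = math.sqrt(plain_mse(y_true, [4.0, 3.0, 2.0, 1.0]))
print(rmse)  # square root of 5, about 2.24
```

A negative R2 score therefore means that the model fits the data worse than a constant prediction equal to the mean of the targets.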

Of course, the solution for this loss function is analytically known, so let's modify it:

beta = 0.005
regularizer = tf.nn.l2_loss(A)
loss = loss + beta * regularizer
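
For reference, tf.nn.l2_loss computes half the sum of the squared elements of its argument, which is an L2 penalty. The plain-Python sketch below mirrors that formula and shows an L1 penalty, the one associated with Lasso, next to it for contrast. The weights and the data_loss value are hypothetical numbers chosen for illustration.

```python
def l2_loss(weights):
    """Half the sum of squared weights, the quantity tf.nn.l2_loss computes."""
    return sum(w ** 2 for row in weights for w in row) / 2.0


def l1_norm(weights):
    """Sum of absolute weights, the penalty used by Lasso."""
    return sum(abs(w) for row in weights for w in row)


beta = 0.005
weights = [[3.0], [-4.0]]  # hypothetical coefficients
data_loss = 10.0           # hypothetical mean squared error

print(l2_loss(weights))  # 12.5
print(l1_norm(weights))  # 7.0

penalized = data_loss + beta * l2_loss(weights)
print(penalized)  # about 10.06
```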

Then let's run the full optimization to get an L2-regularized (ridge-style) result. We can see that TensorFlow doesn't really shine here. It is very slow and requires a very large number of iterations to reach a result that is still far from what scikit-learn can retrieve.
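
For comparison, the unregularized single-feature problem has the well-known closed-form least-squares solution, which needs no iterations at all. The sketch below is not how scikit-learn is implemented; fit_line is a hypothetical helper and the data is made up.

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up noisy data, roughly following y = 2 * x + 1
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8])

print(round(slope, 2))      # 1.94
print(round(intercept, 2))  # 1.15
```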

Let's see a fraction of the run when using just feature 5 for this dataset:

Epoch #9400 A = [[ 8.60801601]] b = [[-31.74242401]]
Loss = 43.75216293

Epoch #9500 A = [[ 8.57831573]] b = [[-31.81438446]]
Loss = 43.92549133

Epoch #9600 A = [[ 8.67326164]] b = [[-31.88376808]]
Loss = 43.69957733

Epoch #9700 A = [[ 8.75835037]] b = [[-31.94364548]]
Loss = 43.97978973

Epoch #9800 A = [[ 8.70185089]] b = [[-32.03764343]]
Loss = 43.69329453

Epoch #9900 A = [[ 8.66107273]] b = [[-32.10965347]]
Loss = 43.74081802

Mean squared error (on training data): 1.17e+02
RMSE (on training data): 10.8221888258
R2 (on training data): -0.39

Here is how the loss function behaves:

Here is the result when using only the fifth feature: