How to do it…

Let's see how to extract learning curves:

  1. Add the following code to the same Python file as in the previous recipe, Extracting validation curves:
from sklearn.model_selection import learning_curve

classifier = RandomForestClassifier(random_state=7)

# Training-set sizes at which to evaluate the classifier
parameter_grid = np.array([200, 500, 800, 1100])
train_sizes, train_scores, validation_scores = learning_curve(
        classifier, X, y, train_sizes=parameter_grid, cv=5)
print("\n##### LEARNING CURVES #####")
print("\nTraining scores:\n", train_scores)
print("\nValidation scores:\n", validation_scores)

We want to evaluate the performance metrics for training sets of 200, 500, 800, and 1,100 samples. For each size, learning_curve trains the classifier and scores it with five-fold cross-validation, as specified by the cv parameter, so train_scores and validation_scores each contain one row per training-set size and one column per fold (a quick way to summarize these arrays is sketched below).
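Since each row of these arrays holds the five fold scores for one training-set size, it is often convenient to average them before plotting. Here is a minimal sketch, assuming the code above has already run; the mean_* and std_* names are purely illustrative, and the standard deviations are reused in the extended plot sketched at the end of this recipe:

import numpy as np

# Rows of the score arrays are training sizes, columns are folds,
# so averaging along axis=1 gives one score per training-set size.
mean_train = np.mean(train_scores, axis=1)
std_train = np.std(train_scores, axis=1)
mean_validation = np.mean(validation_scores, axis=1)
std_validation = np.std(validation_scores, axis=1)

for size, m_t, m_v in zip(train_sizes, mean_train, mean_validation):
    print(f"{size:5d} samples: train={m_t:.3f}, validation={m_v:.3f}")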

  2. If you run this code, the training and validation score arrays are printed on the Terminal.
  3. Let's plot it:
# Plot the average training accuracy for each training-set size
plt.figure()
plt.plot(train_sizes, 100*np.average(train_scores, axis=1), color='black')
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.show()
  4. Here is the output:

Although smaller training sets seem to give higher accuracy, this is largely a symptom of overfitting: with few samples, the model can fit the training data almost perfectly but generalizes poorly. Choosing a bigger training dataset improves generalization but consumes more resources. Therefore, we need to make a trade-off here to pick the right size for the training dataset.
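One way to see this trade-off directly is to plot the validation scores alongside the training scores, with shaded bands showing the spread across folds. A minimal sketch, assuming the mean_* and std_* arrays computed in the earlier snippet; this extended plot is an illustration, not part of the recipe's original output:

# Plot training and validation accuracy together, with one-standard-
# deviation bands across the five cross-validation folds.
plt.figure()
plt.plot(train_sizes, 100 * mean_train, color='black', label='Training')
plt.fill_between(train_sizes, 100 * (mean_train - std_train),
                 100 * (mean_train + std_train), color='gray', alpha=0.3)
plt.plot(train_sizes, 100 * mean_validation, color='red', label='Validation')
plt.fill_between(train_sizes, 100 * (mean_validation - std_validation),
                 100 * (mean_validation + std_validation), color='pink', alpha=0.3)
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Where the training and validation curves converge, adding more samples buys little extra accuracy; a persistent gap between them suggests the model would still benefit from more data.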