How to do it…

Let's see how to extract learning curves:

  1. Add the following code to the same Python file as in the previous recipe, Extracting validation curves:
from sklearn.model_selection import learning_curve

classifier = RandomForestClassifier(random_state=7)

# Training-set sizes at which to evaluate the classifier
parameter_grid = np.array([200, 500, 800, 1100])
train_sizes, train_scores, validation_scores = learning_curve(
        classifier, X, y, train_sizes=parameter_grid, cv=5)
print("\n##### LEARNING CURVES #####")
print("\nTraining scores:\n", train_scores)
print("\nValidation scores:\n", validation_scores)

We want to evaluate the performance metrics for training sets of 200, 500, 800, and 1,100 samples. For each size, learning_curve trains the classifier and scores it with five-fold cross-validation, as specified by the cv parameter, so train_scores and validation_scores each contain one row per training-set size and one column per fold (a quick way to summarize these arrays is sketched below).
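Since each row of these arrays holds the five fold scores for one training-set size, it is often convenient to average them before plotting. Here is a minimal sketch, assuming the code above has already run; the mean_* and std_* names are purely illustrative, and the standard deviations are reused in the extended plot sketched at the end of this recipe:

import numpy as np

# Rows of the score arrays are training sizes, columns are folds,
# so averaging along axis=1 gives one score per training-set size.
mean_train = np.mean(train_scores, axis=1)
std_train = np.std(train_scores, axis=1)
mean_validation = np.mean(validation_scores, axis=1)
std_validation = np.std(validation_scores, axis=1)

for size, m_t, m_v in zip(train_sizes, mean_train, mean_validation):
    print(f"{size:5d} samples: train={m_t:.3f}, validation={m_v:.3f}")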

  2. If you run this code, the training and validation score arrays are printed on the Terminal.
  3. Let's plot it:
# Plot the average training accuracy for each training-set size
plt.figure()
plt.plot(train_sizes, 100*np.average(train_scores, axis=1), color='black')
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.show()
  4. Here is the output:

Although smaller training sets seem to give higher accuracy, this is largely a symptom of overfitting: with few samples, the model can fit the training data almost perfectly but generalizes poorly. Choosing a bigger training dataset improves generalization but consumes more resources. Therefore, we need to make a trade-off here to pick the right size for the training dataset.
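One way to see this trade-off directly is to plot the validation scores alongside the training scores, with shaded bands showing the spread across folds. A minimal sketch, assuming the mean_* and std_* arrays computed in the earlier snippet; this extended plot is an illustration, not part of the recipe's original output:

# Plot training and validation accuracy together, with one-standard-
# deviation bands across the five cross-validation folds.
plt.figure()
plt.plot(train_sizes, 100 * mean_train, color='black', label='Training')
plt.fill_between(train_sizes, 100 * (mean_train - std_train),
                 100 * (mean_train + std_train), color='gray', alpha=0.3)
plt.plot(train_sizes, 100 * mean_validation, color='red', label='Validation')
plt.fill_between(train_sizes, 100 * (mean_validation - std_validation),
                 100 * (mean_validation + std_validation), color='pink', alpha=0.3)
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Where the training and validation curves converge, adding more samples buys little extra accuracy; a persistent gap between them suggests the model would still benefit from more data.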