Python Machine Learning Cookbook (Second Edition)
Giuseppe Ciaburro, Prateek Joshi
How to do it…
Let's see how to predict the quality of wine:
- We will use the wine.quality.py file, already provided to you as a reference. We start, as always, by importing the NumPy library and loading the data (wine.txt):
import numpy as np
input_file = 'wine.txt'
X = []
y = []
with open(input_file, 'r') as f:
    for line in f.readlines():
        data = [float(x) for x in line.split(',')]
        X.append(data[1:])
        y.append(data[0])
X = np.array(X)
y = np.array(y)
Two arrays are returned: X (the input data) and y (the target labels).
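As a quick alternative, the same file can be loaded more compactly with NumPy's loadtxt function. This is just an equivalent sketch, relying on the same layout as above (the class label in the first column of each comma-separated row):
import numpy as np
# Load the whole comma-separated file into a single 2D array
data = np.loadtxt('wine.txt', delimiter=',')
X = data[:, 1:]   # every column after the first holds an input feature
y = data[:, 0]    # the first column holds the class label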
- Now we need to separate our data into two groups: a training dataset and a testing dataset. The training dataset will be used to build the model, and the testing dataset will be used to see how this trained model performs on unknown data:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=5)
Four arrays are returned: X_train, X_test, y_train, and y_test. This data will be used to train and validate the model.
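If the quality classes are unevenly represented, it can help to keep their proportions similar in both splits. A minimal sketch, using the same X and y and simply passing the stratify argument to train_test_split (note that the 91.11% result reported below was obtained without stratification):
from sklearn import model_selection
# Stratified split: each class appears in roughly the same proportion
# in the training and testing sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=5, stratify=y)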
- Let's train the classifier:
from sklearn.tree import DecisionTreeClassifier
classifier_DecisionTree = DecisionTreeClassifier()
classifier_DecisionTree.fit(X_train, y_train)
To train the model, a decision tree algorithm has been used. A decision tree is a non-parametric supervised learning method that can be used for both classification and regression. The aim is to build a model that predicts the value of a target variable using simple decision rules inferred from the data features.
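To see what those decision rules look like, the fitted tree can be printed in text form. This is only an illustrative sketch, assuming scikit-learn 0.21 or later for export_text; max_depth here merely limits how many levels of the tree are shown:
from sklearn.tree import export_text
# Print the first few levels of the learned decision rules
rules = export_text(classifier_DecisionTree, max_depth=3)
print(rules)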
- Now it's time to compute the accuracy of the classifier:
y_test_pred = classifier_DecisionTree.predict(X_test)
accuracy = 100.0 * (y_test == y_test_pred).sum() / X_test.shape[0]
print("Accuracy of the classifier =", round(accuracy, 2), "%")
The following result is returned:
Accuracy of the classifier = 91.11 %
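The same figure can also be obtained with scikit-learn's built-in metric; this is just a cross-check of the manual calculation above, using the accuracy_score function:
from sklearn.metrics import accuracy_score
# accuracy_score returns a fraction in [0, 1]; multiply by 100 for a percentage
accuracy = 100.0 * accuracy_score(y_test, y_test_pred)
print("Accuracy of the classifier =", round(accuracy, 2), "%")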
- Finally, a confusion matrix will be calculated to evaluate the performance of the model:
from sklearn.metrics import confusion_matrix
confusion_mat = confusion_matrix(y_test, y_test_pred)
print(confusion_mat)
The following result is returned:
[[17  2  0]
 [ 1 12  1]
 [ 0  0 12]]
Values off the main diagonal represent classification errors; in this case, the classifier made only four errors.
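For a per-class breakdown of the same results, precision, recall, and F1-score can be printed as well. A minimal sketch using scikit-learn's classification_report on the predictions above:
from sklearn.metrics import classification_report
# Precision, recall, and F1-score for each wine quality class
print(classification_report(y_test, y_test_pred))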