- Python Machine Learning Cookbook(Second Edition)
- Giuseppe Ciaburro, Prateek Joshi
How to do it...
Let's see how to tackle class imbalance:
- Let's import the libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
import utilities
- Let's load the data (data_multivar_imbalance.txt):
input_file = 'data_multivar_imbalance.txt'
X, y = utilities.load_data(input_file)
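The utilities module ships with the book's code bundle and is not reproduced in this recipe. As a rough sketch only (assuming the data file holds comma-separated feature values with the class label in the last column), a load_data helper might look like this:

```python
import numpy as np

def load_data(input_file):
    # Hypothetical sketch: assumes each line holds comma-separated
    # feature values followed by the class label in the last column.
    X, y = [], []
    with open(input_file, 'r') as f:
        for line in f:
            data = [float(x) for x in line.split(',')]
            X.append(data[:-1])
            y.append(data[-1])
    return np.array(X), np.array(y)
```

The actual helper in the code bundle may differ in details such as the delimiter or the label column's position.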
- Let's visualize the data. The code for visualization is exactly the same as it was in the previous recipe. You can also find it in the file named svm_imbalance.py, already provided to you:
# Separate the data into classes based on 'y'
class_0 = np.array([X[i] for i in range(len(X)) if y[i]==0])
class_1 = np.array([X[i] for i in range(len(X)) if y[i]==1])
# Plot the input data
plt.figure()
plt.scatter(class_0[:,0], class_0[:,1], facecolors='black', edgecolors='black', marker='s')
plt.scatter(class_1[:,0], class_1[:,1], facecolors='none', edgecolors='black', marker='s')
plt.title('Input data')
plt.show()
- If you run it, you will see the following:
- Let's build an SVM with a linear kernel. The code is the same as it was in the previous recipe, Building a nonlinear classifier using SVMs:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=5)
params = {'kernel': 'linear'}
classifier = SVC(**params, gamma='auto')
classifier.fit(X_train, y_train)
utilities.plot_classifier(classifier, X_train, y_train, 'Training dataset')
plt.show()
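The plot_classifier function also comes from the book's utilities module. A minimal sketch of such a helper, assuming two-dimensional input data, evaluates the classifier on a dense grid and shades the predicted regions to reveal the decision boundary:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_classifier(classifier, X, y, title='Classifier boundaries'):
    # Hypothetical sketch of the book's helper: predict over a dense
    # grid covering the data, then shade each region by predicted class.
    x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.figure()
    plt.title(title)
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.gray, shading='auto')
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black',
                cmap=plt.cm.Paired)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
```

With an unweighted SVM on these data, the shaded plot is a single solid region, which is exactly the "no boundary" symptom discussed below.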
- Let's print a classification report:
from sklearn.metrics import classification_report
target_names = ['Class-' + str(int(i)) for i in sorted(set(y))]
print("\n" + "#"*30)
print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, classifier.predict(X_train), target_names=target_names))
print("#"*30 + "\n")
print("#"*30)
print("\nClassification report on test dataset\n")
y_test_pred = classifier.predict(X_test)
print(classification_report(y_test, y_test_pred, target_names=target_names))
print("#"*30 + "\n")
- If you run it, you will see the following:
- You might wonder why there's no boundary here! Well, this is because the classifier never predicts Class-0 at all, so the precision and recall for that class drop to 0%. You will also see a classification report printed on your Terminal, as shown in the following screenshot:
- As we expected, Class-0 has 0% precision, so let's go ahead and fix this! In the Python file, search for the following line:
params = {'kernel': 'linear'}
- Replace the preceding line with the following:
params = {'kernel': 'linear', 'class_weight': 'balanced'}
- The class_weight parameter counts the number of datapoints in each class and sets each class's weight inversely proportional to its frequency, so that the majority class doesn't dominate training and the imbalance doesn't adversely affect performance.
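The effect of 'balanced' can be reproduced directly with scikit-learn's compute_class_weight, which applies the formula n_samples / (n_classes * n_samples_in_class). A small illustration with made-up labels (8 samples of class 0, 2 of class 1, not taken from the recipe's data file):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 8 samples of class 0, 2 samples of class 1
y = np.array([0] * 8 + [1] * 2)

# 'balanced' assigns each class the weight
# n_samples / (n_classes * n_samples_in_class):
# class 0 -> 10 / (2 * 8) = 0.625, class 1 -> 10 / (2 * 2) = 2.5
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
```

The minority class receives the larger weight, so its misclassifications cost more during training.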
- You will get the following output once you run this code:
- Let's look at the classification report:
- As we can see, Class-0 is now detected with nonzero percentage accuracy.
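The whole recipe can be condensed into a self-contained before-and-after comparison. Since data_multivar_imbalance.txt is part of the book's code bundle, this sketch substitutes a synthetic imbalanced dataset from make_classification, so the exact numbers will differ from the screenshots:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# Synthetic imbalanced data standing in for data_multivar_imbalance.txt:
# roughly 90% of samples in class 0, 10% in class 1
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1],
                           random_state=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# Same linear kernel, with and without class weighting
plain = SVC(kernel='linear', gamma='auto').fit(X_train, y_train)
balanced = SVC(kernel='linear', gamma='auto',
               class_weight='balanced').fit(X_train, y_train)

# Recall on the minority class is where the two models tend to differ
print('plain:   ', recall_score(y_test, plain.predict(X_test)))
print('balanced:', recall_score(y_test, balanced.predict(X_test)))
```

On strongly imbalanced data, the unweighted model tends to sacrifice minority-class recall, while the balanced model trades some majority-class precision to recover it.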