How to do it...

Let's see how to tackle class imbalance:

  1. Let's import the libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
import utilities
  2. Let's load the data (data_multivar_imbalance.txt):
input_file = 'data_multivar_imbalance.txt' 
X, y = utilities.load_data(input_file) 

  3. Let's visualize the data. The visualization code is exactly the same as in the previous recipe; you can also find it in the file named svm_imbalance.py, already provided to you:
# Separate the data into classes based on 'y'
class_0 = np.array([X[i] for i in range(len(X)) if y[i]==0])
class_1 = np.array([X[i] for i in range(len(X)) if y[i]==1])
# Plot the input data
plt.figure()
plt.scatter(class_0[:,0], class_0[:,1], facecolors='black', edgecolors='black', marker='s')
plt.scatter(class_1[:,0], class_1[:,1], facecolors='None', edgecolors='black', marker='s')
plt.title('Input data')
plt.show()
  4. If you run it, you will see the following:
  5. Let's build an SVM with a linear kernel. The code is the same as it was in the previous recipe, Building a nonlinear classifier using SVMs:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=5)
params = {'kernel': 'linear'}
classifier = SVC(**params, gamma='auto')
classifier.fit(X_train, y_train)
utilities.plot_classifier(classifier, X_train, y_train, 'Training dataset')
plt.show()
  6. Let's print a classification report:
from sklearn.metrics import classification_report
target_names = ['Class-' + str(int(i)) for i in sorted(set(y))]
print("\n" + "#"*30)
print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, classifier.predict(X_train), target_names=target_names))
print("#"*30 + "\n")
print("#"*30)
print("\nClassifier performance on test dataset\n")
y_test_pred = classifier.predict(X_test)
print(classification_report(y_test, y_test_pred, target_names=target_names))
print("#"*30 + "\n")
  7. If you run it, you will see the following:
  8. You might wonder why there's no boundary here! This is because the classifier is unable to separate the two classes at all, resulting in zero precision and recall for Class-0. You will also see a classification report printed on your Terminal, as shown in the following screenshot:
  9. As we expected, Class-0 has 0% precision, so let's go ahead and fix this! In the Python file, search for the following line:
params = {'kernel': 'linear'}
  10. Replace the preceding line with the following:
params = {'kernel': 'linear', 'class_weight': 'balanced'}  
  11. The class_weight parameter counts the number of datapoints in each class and adjusts the weights in inverse proportion, so that the imbalance doesn't adversely affect the performance.
  12. You will get the following output once you run this code:
  13. Let's look at the classification report:
  14. As we can see, Class-0 is now detected with nonzero precision.
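To see what class_weight='balanced' actually does to the weights, scikit-learn exposes the same heuristic through its compute_class_weight utility. The following is a minimal sketch using invented labels with a 9-to-3 imbalance (the labels are for illustration only, not taken from data_multivar_imbalance.txt):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels for illustration: 9 samples of Class-0, 3 of Class-1
y = np.array([0] * 9 + [1] * 3)

# The 'balanced' heuristic: n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y), y=y)
print(weights)
```

With this 9:3 split, the weights come out to 12 / (2 * 9) ≈ 0.67 for Class-0 and 12 / (2 * 3) = 2.0 for Class-1, so misclassifying a minority-class point costs the SVM three times as much, which is what pulls the decision boundary back between the two classes.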