- Python Machine Learning Cookbook (Second Edition)
- Giuseppe Ciaburro, Prateek Joshi
How to do it…
Let's see how to estimate bicycle demand distribution:
- We first need to import a couple of new packages, as follows:
import csv
import numpy as np
- We are processing a CSV file, so the csv package is useful for handling these files. Let's import the data into the Python environment:
filename="bike_day.csv"
file_reader = csv.reader(open(filename, 'r'), delimiter=',')
X, y = [], []
for row in file_reader:
    X.append(row[2:13])
    y.append(row[-1])
This piece of code reads all the data from the CSV file. The csv.reader() function returns a reader object that iterates over the lines of the given CSV file. Each row read from the file is returned as a list of strings. In this way we obtain two lists: X, holding the input data, and y, holding the output values. Now we will extract the feature names:
feature_names = np.array(X[0])
The feature names are useful when we display them on a graph. We then have to remove the first row from X and y, because it contains the feature names rather than data:
X = np.array(X[1:]).astype(np.float32)
y = np.array(y[1:]).astype(np.float32)
We have also converted the two lists into two arrays.
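As a quick sanity check (a minimal sketch, not part of the original recipe), we can confirm the shapes and types of the resulting arrays; given the row[2:13] slicing above, each sample should have 11 features:
# Verify the shapes and dtypes of the converted arrays (sketch)
print("X shape:", X.shape)  # expected: (number of samples, 11)
print("y shape:", y.shape)
print("X dtype:", X.dtype)  # float32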
- Let's shuffle these two arrays to make them independent of the order in which the data is arranged in the file:
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=7)
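Fixing random_state makes the shuffle reproducible, so the experiment can be rerun with identical results. The same effect can be achieved with a plain NumPy permutation (a minimal sketch, shown only for illustration):
# Equivalent reproducible shuffle with a NumPy permutation (sketch)
rng = np.random.RandomState(7)
perm = rng.permutation(len(X))
X_alt, y_alt = X[perm], y[perm]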
- As we did earlier, we need to separate the data into training and testing data. This time, let's use 90% of the data for training and the remaining 10% for testing:
num_training = int(0.9 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]
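The same 90/10 split can also be obtained with scikit-learn's train_test_split function (a minimal sketch; shuffle=False preserves the order produced by the earlier shuffle step):
from sklearn.model_selection import train_test_split

# Equivalent 90/10 split (sketch); the data was already shuffled above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=False)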
- Let's go ahead and train the regressor:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=1000, max_depth=10, min_samples_split=2)
rf_regressor.fit(X_train, y_train)
The RandomForestRegressor() function builds a random forest regressor. Here, n_estimators refers to the number of estimators, which is the number of decision trees that we want to use in our random forest. The max_depth parameter refers to the maximum depth of each tree, and the min_samples_split parameter refers to the minimum number of data samples that are required to split an internal node of a tree.
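A convenient way to gauge generalization without touching the test set is the out-of-bag (OOB) estimate that bagged ensembles provide for free (a minimal sketch; oob_score is a standard RandomForestRegressor parameter, and the fitted oob_score_ attribute reports an R^2 score):
# Train a forest with out-of-bag scoring enabled (sketch)
rf_oob = RandomForestRegressor(n_estimators=1000, max_depth=10,
                               min_samples_split=2, oob_score=True)
rf_oob.fit(X_train, y_train)
print("OOB R^2 score:", round(rf_oob.oob_score_, 2))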
- Let's evaluate the performance of the random forest regressor:
y_pred = rf_regressor.predict(X_test)
from sklearn.metrics import mean_squared_error, explained_variance_score
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print( "#### Random Forest regressor performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))
The following results are returned:
#### Random Forest regressor performance ####
Mean squared error = 357864.36
Explained variance score = 0.89
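Mean squared error is scale-dependent, so it is often easier to interpret alongside other standard metrics (a minimal sketch using functions from sklearn.metrics):
from sklearn.metrics import mean_absolute_error, r2_score

# Additional evaluation metrics (sketch)
print("Root mean squared error =", round(np.sqrt(mse), 2))
print("Mean absolute error =", round(mean_absolute_error(y_test, y_pred), 2))
print("R^2 score =", round(r2_score(y_test, y_pred), 2))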
- Let's extract the relative importance of the features:
RFFImp = rf_regressor.feature_importances_
RFFImp = 100.0 * (RFFImp / max(RFFImp))
index_sorted = np.flipud(np.argsort(RFFImp))
pos = np.arange(index_sorted.shape[0]) + 0.5
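Before plotting, the ranking can also be printed as text (a minimal sketch using the arrays computed above):
# Print the features ranked by relative importance (sketch)
for idx in index_sorted:
    print("{:<12s} {:6.2f}%".format(feature_names[idx], RFFImp[idx]))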
To visualize the results, we will plot a bar graph:
import matplotlib.pyplot as plt
plt.figure()
plt.bar(pos, RFFImp[index_sorted], align='center')
plt.xticks(pos, feature_names[index_sorted])
plt.ylabel('Relative Importance')
plt.title("Random Forest regressor")
plt.show()
The following output is plotted:
Looks like the temperature is the most important factor controlling bicycle rentals.
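As a follow-up worth knowing (a sketch, not part of the original recipe): impurity-based importances can be biased toward high-cardinality features, and scikit-learn's permutation_importance (available from version 0.22) offers a model-agnostic cross-check:
from sklearn.inspection import permutation_importance

# Model-agnostic importance estimate on the test set (sketch)
result = permutation_importance(rf_regressor, X_test, y_test,
                                n_repeats=10, random_state=7)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(feature_names[idx], round(result.importances_mean[idx], 4))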