This tutorial is based on Yhat's 2013 tutorial on Random Forests in Python. If you want a good summary of the theory and uses of random forests, I suggest you check out their guide. In the tutorial below, I annotate, correct, and expand on a short code example of random forests they present at the end of the article. Specifically, I 1) update the code so it runs in the latest version of pandas and Python, 2) write detailed comments explaining what is happening in each step, and 3) expand the code in a number of ways.
Let's get started!
A Note About The Data
The data for this tutorial is famous. Called the iris dataset, it contains four variables measuring various parts of iris flowers of three related species, and then a fifth variable with the species name. It is so famous in the machine learning and statistics communities because the data requires very little preprocessing (i.e. no missing values, all features are floating-point numbers, etc.).
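If you want to check those claims about the data yourself, here is a minimal sketch. It assumes you are using the copy of the dataset bundled with scikit-learn (`load_iris`); the variable names ending in `_check` are my own and are not used elsewhere in the tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the bundled iris data under illustrative names
iris_check = load_iris()
X_check = iris_check.data
y_check = iris_check.target

# Number of flowers and number of measurements per flower
print('Shape:', X_check.shape)

# Data type of the feature values
print('Dtype:', X_check.dtype)

# Count of missing values across all features
print('Missing values:', int(np.isnan(X_check).sum()))

# Number of flowers per species
species_counts = dict(zip([str(name) for name in iris_check.target_names],
                          np.bincount(y_check).tolist()))
print('Flowers per species:', species_counts)
```

If the checks come back clean, there is nothing to impute, encode, or rescale before training.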
# Load the library with the iris dataset
from sklearn.datasets import load_iris

# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier

# Load pandas
import pandas as pd

# Load numpy
import numpy as np
# Create an object called iris with the iris data
iris = load_iris()

# Create a dataframe with the four feature variables
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# View the top 5 rows
df.head()
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
# Add a new column with the species names; this is what we are going to try to predict
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# View the top 5 rows
df.head()
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
Create Training And Test Data
# Create a new column that, for each row, generates a random number between 0 and 1, and
# if that value is less than or equal to .75, then sets the value of that cell as True
# and False otherwise. This is a quick and dirty way of randomly assigning some rows to
# be used as the training data and some as the test data.
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

# View the top 5 rows
df.head()
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | is_train |
# Create two new dataframes, one with the training rows, one with the test rows
train, test = df[df['is_train']], df[~df['is_train']]
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:', len(test))
Number of observations in the training data: 107
Number of observations in the test data: 43
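The random-number trick above gives a slightly different split every time it runs, and the split is only roughly 75/25. As an alternative (not what the original code does), here is a sketch using scikit-learn's `train_test_split`. The names `frame`, `train_part`, and `test_part` are my own, and `random_state=0` is an arbitrary choice that simply makes the split repeatable.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild the dataframe so this sketch runs on its own
iris_data = load_iris()
frame = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
frame['species'] = pd.Categorical.from_codes(iris_data.target, iris_data.target_names)

# Hold out 25% of the rows. random_state makes the split repeatable, and
# stratify keeps the species proportions similar in both parts.
train_part, test_part = train_test_split(
    frame, test_size=0.25, random_state=0, stratify=frame['species']
)

print('Number of observations in the training data:', len(train_part))
print('Number of observations in the test data:', len(test_part))
```

The trade-off is one extra import in exchange for a split you can reproduce and that will not accidentally leave one species underrepresented in the test data.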
# Create a list of the feature column's names
features = df.columns[:4]

# View features
features
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], dtype='object')
# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
# pd.factorize returns a (codes, uniques) tuple; [0] keeps only the codes.
y = pd.factorize(train['species'])[0]

# View target
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
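To see what `pd.factorize` actually hands back, here is a tiny sketch on a made-up series (`toy_species` is my own example, not part of the iris data). It returns two things: the integer codes and the unique labels those codes stand for.

```python
import pandas as pd

# A made-up series of species names, deliberately out of alphabetical order
toy_species = pd.Series(['setosa', 'virginica', 'setosa', 'versicolor'])

# factorize returns a pair: integer codes and the unique labels they map to
codes, uniques = pd.factorize(toy_species)

print(codes)
print(list(uniques))
```

Note that the codes are handed out in order of first appearance, not alphabetically. In the iris training data the rows happen to be ordered setosa, versicolor, virginica, which is why the codes 0, 1, and 2 line up with `iris.target_names`. On shuffled data that would not be guaranteed.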
Train The Random Forest Classifier
# Create a random forest classifier. By convention, clf means 'classifier'
clf = RandomForestClassifier(n_jobs=2)

# Train the classifier to take the training features and learn how they relate
# to the training y (the species)
clf.fit(train[features], y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2, oob_score=False, random_state=None, verbose=0, warm_start=False)
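Notice `random_state=None` in the parameter listing: each run grows a different forest, so your numbers will not match mine exactly. If you want repeatable results, one option is to fix the seed. The sketch below is my own addition, assuming an arbitrary seed of 0 and 10 trees (the `n_estimators` value shown in the listing above; more recent scikit-learn releases default to 100).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Use the full iris data so this sketch runs on its own
X_all, y_all = load_iris(return_X_y=True)

# Two forests built with the same seed and the same settings
forest_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_all, y_all)
forest_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_all, y_all)

# With identical seeds, both forests should give identical probabilities
same_output = np.array_equal(forest_a.predict_proba(X_all),
                             forest_b.predict_proba(X_all))
print('Identical predictions:', same_output)
```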
Huzzah! We have done it! We have officially trained our random forest classifier! Now let's play with it. The classifier model itself is stored in the
Apply Classifier To Test Data
If you have been following along, you will know we only trained our classifier on part of the data, leaving the rest out. This is, in my humble opinion, the most important part of machine learning. Why? Because by leaving out a portion of the data, we have a set of data to test the accuracy of our model!
Let's do that now.
# Apply the classifier we trained to the test data (which, remember, it has never seen before)
clf.predict(test[features])
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
What are you looking at above? Remember that we coded each of the three species of plant as 0, 1, or 2. The list of numbers above shows which species our model predicts for each plant, based on the sepal length, sepal width, petal length, and petal width. How confident is the classifier about each plant? We can see that too.
# View the predicted probabilities of the first 10 observations
clf.predict_proba(test[features])[0:10]
array([[ 1. , 0. , 0. ], [ 1. , 0. , 0. ], [ 0.9, 0.1, 0. ], [ 1. , 0. , 0. ], [ 1. , 0. , 0. ], [ 1. , 0. , 0. ], [ 1. , 0. , 0. ], [ 1. , 0. , 0. ], [ 1. , 0. , 0. ], [ 1. , 0. , 0. ]])
There are three species of plant, thus [ 1. , 0. , 0. ] tells us that the classifier is certain that the plant is the first class. Taking another example, [ 0.9, 0.1, 0. ] tells us that the classifier gives a 90% probability the plant belongs to the first class and a 10% probability the plant belongs to the second class. Because 90 is greater than 10, the classifier predicts the plant is the first class.
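Where do those probabilities come from? In scikit-learn's random forest, the forest's probability for each class is the average of the probabilities from its individual trees, and the predicted class is the one with the highest average. The sketch below checks both statements. It is my own addition: it trains on the full iris data with 10 trees and an arbitrary seed so it runs on its own, and the names `forest`, `proba`, and `tree_average` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# One row per flower, one column per class
proba = forest.predict_proba(X_demo)

# Each row of probabilities sums to 1 across the three classes
row_sums = proba.sum(axis=1)

# The predicted class is the column with the highest probability
argmax_classes = forest.classes_[np.argmax(proba, axis=1)]
matches_predict = np.array_equal(argmax_classes, forest.predict(X_demo))

# The forest's probabilities are the average over its individual trees
tree_average = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)
matches_average = np.allclose(tree_average, proba)

print('predict() agrees with argmax of predict_proba():', matches_predict)
print('predict_proba() equals the average over trees:', matches_average)
```

This also suggests why the probabilities above come in steps of 0.1: with 10 trees, and assuming each tree's leaves are pure so that it votes 0 or 1 for a class, a value of 0.9 means nine of the ten trees voted for that class.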
Now that we have predicted the species of all plants in the test data, we can compare our predicted species with each plant's actual species.
# Create actual English names for the plants for each predicted plant class
preds = iris.target_names[clf.predict(test[features])]
# View the PREDICTED species for the first five observations
preds[0:5]
array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa'], dtype='<U10')
# View the ACTUAL species for the first five observations
test['species'].head()
14    setosa
18    setosa
20    setosa
22    setosa
25    setosa
Name: species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]
That looks pretty good! At least for the first five observations. Now let's look at all the data.
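One quick way to summarize every test row in a single number is accuracy: the share of plants whose predicted species matches the actual species. Here is a minimal sketch on five made-up labels (`actual_labels` and `predicted_labels` are my own toy data, not the tutorial's results), computed by hand and with scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: four of the five predictions are correct
actual_labels = np.array(['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica'])
predicted_labels = np.array(['setosa', 'setosa', 'versicolor', 'virginica', 'virginica'])

# By hand: the fraction of positions where prediction and truth agree
manual_accuracy = (actual_labels == predicted_labels).mean()

# The same number from scikit-learn
sklearn_accuracy = accuracy_score(actual_labels, predicted_labels)

print('Accuracy by hand:', manual_accuracy)
print('Accuracy from scikit-learn:', sklearn_accuracy)
```

With the tutorial's own variables, I would expect `accuracy_score(test['species'], preds)` to give the equivalent figure for the real test data.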
Create A Confusion Matrix
A confusion matrix can be, no pun intended, a little confusing to interpret at first, but it is actually very straightforward. The columns are the species we predicted for the test data and the rows are the actual species for the test data. So, if we take the top row, we can see that we predicted all 20 setosa plants in the test data perfectly. However, in the next row, we predicted 17 of the versicolor plants correctly, but mis-predicted two of the versicolor plants as virginica.
The short explanation of how to interpret a confusion matrix is: anything on the diagonal was classified correctly and anything off the diagonal was classified incorrectly.
# Create confusion matrix
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])
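To make the "diagonal is correct, off-diagonal is wrong" rule concrete, here is a small sketch on six made-up plants (`actual_toy` and `predicted_toy` are my own toy data). One versicolor is deliberately mis-predicted as virginica.

```python
import numpy as np
import pandas as pd

# Made-up labels for six plants; the fifth one is predicted incorrectly
actual_toy = pd.Series(
    ['setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica'],
    name='Actual Species')
predicted_toy = pd.Series(
    ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica'],
    name='Predicted Species')

# Rows are actual species, columns are predicted species
matrix = pd.crosstab(actual_toy, predicted_toy)
print(matrix)

# Everything on the diagonal was classified correctly
correct = int(np.trace(matrix.values))
total = int(matrix.values.sum())
print('Correct:', correct, 'out of', total)
```

The single mistake shows up in the versicolor row under the virginica column. Reading the diagonal this way assumes the rows and columns list the same species in the same order, which holds here because every species appears in both the actual and the predicted labels.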
View Feature Importance
While we don't get regression coefficients like with OLS, we do get a score telling us how important each feature was in classifying. This is one of the most powerful parts of random forests, because we can clearly see that petal width was more important in classification than sepal width.
# View a list of the features and their importance scores
list(zip(train[features], clf.feature_importances_))
[('sepal length (cm)', 0.13356069065846765),
 ('sepal width (cm)', 0.04486948688226873),
 ('petal length (cm)', 0.37067096905488794),
 ('petal width (cm)', 0.45089885340437574)]
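Two things are worth knowing about these scores: they add up to 1, so each can be read as a share of the total importance, and they are easier to read once sorted. The sketch below is my own addition. It trains on the full iris data with an arbitrary seed so it runs on its own, so its exact scores will differ from the ones above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_imp = load_iris()
forest_imp = RandomForestClassifier(n_estimators=10, random_state=0)
forest_imp.fit(iris_imp.data, iris_imp.target)

# Pair each importance score with its feature name
importances = pd.Series(forest_imp.feature_importances_, index=iris_imp.feature_names)

# Sort from most to least important
ranked = importances.sort_values(ascending=False)
print(ranked)

# The scores are normalized, so they should sum to 1
print('Sum of importances:', importances.sum())
```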