There is no denying that Pokemon had a big influence on all the kids of my generation. I remember being 8 or 9 and looking forward to finishing school so I could catch up on the adventures of Ash and Pikachu. I also remember the fun I would have playing Pokemon Stadium on Nintendo 64 with my cousins on the weekend. The phenomenal popularity of Pokemon GO last year further confirmed that the nostalgia factor is still strong for a lot of people, even to this day.
I was browsing Kaggle for datasets to practice classification algorithms when I came across one describing the first 6 generations of Pokemon, 721 Pokemon in total, of which 46 are legendary. "Bingo!" I thought. This dataset is not only a fun way to experiment with classifiers that predict whether a Pokemon is legendary, but also a way to simulate an end-to-end machine learning project. Moreover, evaluating the performance of our models will require careful thinking since only a small fraction (6.4% to be exact) of the Pokemon are legendary.
With this dataset in hand, our goals are to
- Create interesting visualizations.
- Preprocess the data to make it suitable for classification algorithms.
- Evaluate and compare the following classification models:
  - logistic regression;
  - support vector machines (linear vs. RBF kernel);
  - K-nearest-neighbors;
  - perceptron;
  - decision tree;
  - random forest.
# standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import norm
# to have multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# plotting
%matplotlib inline
from radar import ComplexRadar
from pylab import rcParams
rcParams['figure.figsize'] = 8, 4
# import dataset
pokemon = pd.read_csv('pokemon_alopez247.csv')
pokemon.drop('Number', inplace=True, axis=1) # the Pokedex number is just an ID, not a predictive feature
pokemon.head()
1. Data visualization
Data visualizations help us understand the data better. Boxplots are great visualization tools: the boxes show the quartiles of the distribution and the whiskers extend to the rest of it, with the exception of some outliers. They let us see which Pokemon types have the best statistics; for instance, Flying Pokemon are typically faster than other types, although the distributions of the other types still contain plenty of very fast non-Flying Pokemon.
# boxplots of Pokemon's base stats
plt.rcParams['figure.figsize'] = (15, 8);
fig, axarr = plt.subplots(3, 2, sharex=True);
_ = sns.boxplot(x='Type_1', y='HP', data=pokemon, width=0.5, ax=axarr[0, 0]);
_ = sns.boxplot(x='Type_1', y='Speed', data=pokemon, width=0.5, ax=axarr[0, 1]);
_ = sns.boxplot(x='Type_1', y='Attack', data=pokemon, width=0.5, ax=axarr[1, 0]);
_ = sns.boxplot(x='Type_1', y='Defense', data=pokemon, width=0.5, ax=axarr[1, 1]);
_ = sns.boxplot(x='Type_1', y='Sp_Atk', data=pokemon, width=0.5, ax=axarr[2, 0]);
_ = sns.boxplot(x='Type_1', y='Sp_Def', data=pokemon, width=0.5, ax=axarr[2, 1]);
titles = [['Hit Points', 'Speed'], ['Attack', 'Defense'], ['Special Attack', 'Special Defense']]
for i in range(3):
    for j in range(2):
        _ = axarr[i, j].set(xlabel='', ylabel='', title=titles[i][j])
        for tick in axarr[i, j].get_xticklabels():
            tick.set_rotation(90)
It is also interesting to note which Pokemon types are the most common. For instance, only a few Pokemon have Flying as their primary type, whereas it is the most common secondary type. Given this abundance of Flying Pokemon, it would be unwise to use only the primary type for classification. As such, we should consider the primary and secondary types on an equal footing in our analysis.
# number of Pokemon per primary type
plt.rcParams['figure.figsize'] = (12, 12);
fig, axarr = plt.subplots(2, 1, sharex=False);
type_count1 = sns.countplot(x='Type_1', data=pokemon, order=pokemon.Type_1.value_counts().index, ax=axarr[0])
type_count1.set(xlabel='Primary type', ylabel='Count');
type_count2 = sns.countplot(x='Type_2', data=pokemon, order=pokemon.Type_2.value_counts().index, ax=axarr[1])
type_count2.set(xlabel='Secondary type', ylabel='Count');
Another interesting type of plot is the radar - or spider - chart. Using the code found on this forum, I created radar charts of the mean and median stats for a given body style (selected by the variable I below).
# computing data (mean and median of stats) for our variables
rad_variables = ['HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed']
bodystyle_mean = pokemon.groupby('Body_Style').mean()[rad_variables]
bodystyle_median = pokemon.groupby('Body_Style').median()[rad_variables]
stats = bodystyle_mean.iloc[0, :].index # Pokemon stats
body_index = bodystyle_mean.index # Pokemon body type
I = 7  # choose which body style to plot
stats_ranges = [(pokemon.groupby('Body_Style')[col].min().iloc[I],
                 0.75 * pokemon.groupby('Body_Style')[col].max().iloc[I]) for col in rad_variables]
# plotting
fig1 = plt.figure(figsize=(5, 5))
radar = ComplexRadar(fig1, stats, stats_ranges)
radar.plot(bodystyle_mean.iloc[I, :], color='b')
radar.fill(bodystyle_mean.iloc[I, :], color='b', alpha=0.2)
plt.title("Mean Stats: " + body_index[I], fontsize=20);
plt.show();
fig2 = plt.figure(figsize=(5, 5))
radar = ComplexRadar(fig2, stats, stats_ranges)
radar.plot(bodystyle_median.iloc[I, :], color='g')
radar.fill(bodystyle_median.iloc[I, :], color='g', alpha=0.2)
plt.title("Median Stats: " + body_index[I], fontsize=20);
plt.show();
2. Preprocessing the dataset
a) Numerical features
Most machine learning algorithms perform poorly when the numerical attributes are on very different scales, which is the case here as the histograms below show. We will therefore normalize/standardize the numerical features to speed up learning. To do so we build a small PandasSelector class on top of BaseEstimator (which allows for easy hyperparameter tuning) and TransformerMixin, so that column selection can be plugged directly into scikit-learn pipelines.
# extract numerical features and plot their histograms
numeric_pokemon = pokemon[pokemon.dtypes[(pokemon.dtypes=="float64")|(pokemon.dtypes=="int64")].index.values];
numeric_pokemon.hist(figsize=[11, 11], bins=12);
pokemon.Catch_Rate = pokemon.Catch_Rate.astype(float);
# build PandasSelector class with fit and transform methods
from sklearn.base import BaseEstimator, TransformerMixin

class PandasSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, selected_columns):
        self.selected_columns = selected_columns

    def fit(self, df, *args):
        return self

    def transform(self, df):
        return df[self.selected_columns].values
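As a quick illustration (not in the original analysis), the selector simply returns the requested columns as a NumPy array, which is exactly what the downstream scalers expect:
# e.g. pull two of the stat columns straight out of the DataFrame
PandasSelector(['HP', 'Attack']).fit_transform(pokemon)[:3]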
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
num_features = ['Total', 'Attack', 'Sp_Atk', 'Speed','HP', 'Defense', 'Sp_Def', 'Height_m', 'Weight_kg', 'Catch_Rate']
# standardize every numerical feature except Catch_Rate
standardize_pipeline = make_pipeline(
    PandasSelector(num_features[:-1]),
    StandardScaler()
)
# rescale Catch_Rate to [0, 1] with MinMaxScaler
minmax_pipeline = make_pipeline(
    PandasSelector([num_features[-1]]),
    MinMaxScaler()
)
# numerical pipeline
num_pipeline = make_union(standardize_pipeline, minmax_pipeline)
b) Categorical features
Categorical features also require some preprocessing, generally in the form of a one-hot encoding to prevent nominal features from being treated as ordinal ones. In what follows we give equal importance to Type_1 and Type_2. We also create a new numerical feature named Pr_Female to replace the categorical hasGender column.
# one-hot encoding of categorical variables
Type_dummies = pd.get_dummies(pokemon['Type_1']) + pd.get_dummies(pokemon['Type_2'])
pokemon[Type_dummies.columns] = Type_dummies
Color_dummies = pd.get_dummies(pokemon['Color'])
pokemon[Color_dummies.columns] = Color_dummies
Body_dummies = pd.get_dummies(pokemon['Body_Style'])
pokemon[Body_dummies.columns] = Body_dummies
# Pr_Female to replace hasGender
pokemon['Pr_Female'] = 1 - pokemon['Pr_Male']
pokemon.Pr_Female.fillna(value=0, inplace=True)
pokemon.Pr_Male.fillna(value=0, inplace=True)
cat_features = list(Type_dummies.columns.values) + list(Color_dummies.columns.values) + list(Body_dummies.columns.values) + ['hasMegaEvolution', 'Pr_Male', 'Pr_Female']
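As a quick sanity check (not part of the original analysis), we can assemble the numeric and categorical selectors exactly the way the classifier pipelines in the next section will, and confirm that the result is a single all-numeric matrix:
# combine the scaled numerical features with the one-hot encoded categorical ones
full_preprocessing = make_union(num_pipeline, PandasSelector(cat_features))
full_preprocessing.fit_transform(pokemon).shape  # one row per Pokemon, one column per feature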
3. Model evaluations and comparisons
We are now ready to evaluate and compare different classification models on our dataset. However we need to keep in mind that our data is heavily skewed; only 6.4% of the Pokemon are legendary. Therefore a simple classifier predicting all Pokemon to be non-legendary would have an accuracy of 93.6%. To help put things into perspective, we should use three new metrics:
- Precision: the precision $p$ tells us that when a classifier claims an instance to be positive, it is right with probability $p$.
- Recall: also called true positive rate, the recall $r$ is the ratio of positive instances correctly labeled by the model.
- F1 score: the F1 score is the harmonic mean of precision and recall and favours classifiers that have $p \approx r$.
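Writing TP, FP and FN for the numbers of true positives, false positives and false negatives, these three definitions translate into
$$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}, \qquad F_1 = 2\,\frac{p\,r}{p + r}.$$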
The most appropriate metric depends on the classifier's purpose. Imagine for instance a parental filter on a search engine that rejects inappropriate images. It is acceptable for this classifier to have low recall (i.e. it may reject some images even though they are appropriate) as long as it has high precision (only safe content gets shown). In our case, we should favour an algorithm that has both high precision and high recall, which translates into a high F1 score.
Note that a confusion matrix can help us understand our classifier's behaviour better as it clearly illustrates the true positives, false positives, false negatives and true negatives. We will plot one for each classifier to understand where each fails.
In addition to these three metrics, two curve-based scores help compare classifiers across decision thresholds:
- ROC (receiver operating characteristic) curve: a curve of the TP rate against the FP rate at various threshold settings.
- Precision-Recall curve: a curve of precision vs. recall at various threshold settings.
The areas under these curves (AUC) are good scalar metrics for comparing classifiers. However the ROC curve is insensitive to class skew, so a high ROC AUC can be misleading (just as classification accuracy was) when negative instances vastly outnumber positive ones. In contrast, the PR AUC makes it clear when there is still room for improvement on a heavily skewed class.
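To see the difference concretely, here is a small synthetic illustration (the numbers are made up and unrelated to the Pokemon data): a scorer that ranks most positives above most negatives gets a flattering ROC AUC, yet the PR AUC reveals how much precision is lost when negatives are a hundred times more numerous.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

rng = np.random.RandomState(0)
y_true = np.r_[np.zeros(1000), np.ones(10)]                    # only 1% positives
y_score = np.r_[rng.normal(0, 1, 1000), rng.normal(2, 1, 10)]  # overlapping scores

print(roc_auc_score(y_true, y_score))             # high despite the overlap
p, r, _ = precision_recall_curve(y_true, y_score)
print(auc(r, p))                                  # much lower: many false positives sneak in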
We will train multiple classifiers and tune their hyperparameters using the GridSearchCV class, which performs stratified cross-validation for classifiers and therefore keeps an appropriate ratio of positive examples in each fold. We also allow the algorithms to give a balanced weight to the minority class during CV (i.e. a harsher penalty for getting positive examples wrong), but we let the cross-validation process decide whether or not to use that option.
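For reference, scikit-learn's 'balanced' option weights each class inversely proportional to its frequency, namely n_samples / (n_classes * class_count). A quick look at what that means for our labels (a sanity check of my own, not part of the modeling):
# weights that class_weight='balanced' would assign to each class
counts = pokemon['isLegendary'].value_counts()
len(pokemon) / (2 * counts)  # legendary Pokemon receive a much larger weight than regular ones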
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
poke_train, poke_test = train_test_split(pokemon[num_features + cat_features + ['isLegendary']],
                                         test_size=0.2, random_state=42)
X_train, y_train = poke_train.drop('isLegendary', axis=1), poke_train['isLegendary']
X_test, y_test = poke_test.drop('isLegendary', axis=1), poke_test['isLegendary']
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, auc
import itertools
def skewed_metrics(model, test=True):
    """
    Compute the five scores we use to assess the quality
    of our classifiers on skewed data.
    """
    proba_clf = ['KNN', 'tree', 'forest']  # classifiers without a decision_function
    if test:
        X, y_true = X_test, y_test
    else:
        X, y_true = X_train, y_train
    y_predict = model.predict(X)
    if any(clf in list(model.named_steps) for clf in proba_clf):
        y_score = model.predict_proba(X)[:, 1]
    else:
        y_score = model.decision_function(X)
    ROC_AUC = roc_auc_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    PR_AUC = auc(recall, precision)
    F1 = f1_score(y_true, y_predict)
    P = precision_score(y_true, y_predict)
    R = recall_score(y_true, y_predict)
    return ROC_AUC, PR_AUC, F1, P, R
def plot_confusion_matrix(CM, classes=['non-legendary', 'legendary'], normalize=False, cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    (This function is adapted from the scikit docs.)
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=False, figsize=(10, 10))
    titles = ['Confusion matrix (training set)', 'Confusion matrix (test set)']
    for ax, cm, title in zip(fig.axes, CM, titles):
        plt.sca(ax)
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.grid(which='both')
        plt.title(title)
        # plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            # print(cm)
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j], horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        plt.tight_layout(pad=5)
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
a) Logistic regression
from sklearn.linear_model import LogisticRegression
# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('logreg', LogisticRegression())
])
# cross-validate
parameters = {'logreg__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'logreg__class_weight': [None, 'balanced']}
grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))
clf = grid_search.best_estimator_
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])
logreg_metrics_train = skewed_metrics(clf, test=False)
logreg_metrics_test = skewed_metrics(clf, test=True)
The confusion matrices show that our model misclassifies 2 non-legendary Pokemon as being legendary. However someone who knows Pokemon will immediately notice that the two misclassified examples are somewhat special. Celebi (just like Mew in the training set) is a mythical Pokemon and therefore quite unique, whereas Dragonite is one of the strongest non-legendary Pokemon one may recruit. We can therefore conclude that our logistic regression model generalizes well to the test set since it "knows" more than it was taught.
misclassified = clf.predict(X_test) != y_test
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
misclassified = clf.predict(X_train) != y_train
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
b) Support vector machines
from sklearn.svm import SVC
# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('svm', SVC())
])
# cross-validate
parameters = [{'svm__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
               'svm__gamma': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
               'svm__class_weight': [None, 'balanced'], 'svm__kernel': ['rbf']},
              {'svm__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
               'svm__class_weight': [None, 'balanced'], 'svm__kernel': ['linear']}]
grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))
clf = grid_search.best_estimator_
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])
svm_metrics_train = skewed_metrics(clf, test=False)
svm_metrics_test = skewed_metrics(clf, test=True)
SVM performs identically to logistic regression.
c) K-nearest-neighbors
from sklearn.neighbors import KNeighborsClassifier
# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('KNN', KNeighborsClassifier())
])
# cross-validate
parameters = {'KNN__n_neighbors': [3, 5, 7, 9, 11, 13, 15],
              'KNN__weights': ['uniform', 'distance']}
grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))
clf = grid_search.best_estimator_
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])
KNN_metrics_train = skewed_metrics(clf, test=False)
KNN_metrics_test = skewed_metrics(clf, test=True)
This KNN model has the poorest performance of all six models, on both the training and test sets.
misclassified = clf.predict(X_test) != y_test
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
d) Perceptron
from sklearn.linear_model import Perceptron
# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('perc', Perceptron(n_iter=30))  # n_iter was renamed max_iter in newer scikit-learn versions
])
# cross-validate
parameters = {'perc__alpha': [1e-5, 3e-5, 1e-3, 3e-3, 0.01, 0.03, 1, 3, 10, 30],
              'perc__penalty': ['l1', 'l2', 'elasticnet']}
grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))
clf = grid_search.best_estimator_
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])
perc_metrics_train = skewed_metrics(clf, test=False)
perc_metrics_test = skewed_metrics(clf, test=True)
The perceptron seems to overfit the training set and fails to generalize as well as logistic regression or SVM on the test set.
misclassified = clf.predict(X_test) != y_test
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
e) Decision tree
from sklearn.tree import DecisionTreeClassifier
# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('tree', DecisionTreeClassifier())
])
# cross-validate
parameters = {'tree__criterion': ['gini', 'entropy'], 'tree__splitter': ['best', 'random'],
              'tree__max_features': [None, 'auto'], 'tree__max_depth': [5, 10, 25, 50],
              'tree__class_weight': [None, 'balanced']}
grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))
clf = grid_search.best_estimator_
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])
tree_metrics_train = skewed_metrics(clf, test=False)
tree_metrics_test = skewed_metrics(clf, test=True)
The decision tree performs well and, based on running it multiple times with different parameters, it does not overfit the training set as badly when a max_depth is specified.
misclassified = clf.predict(X_test) != y_test
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
misclassified = clf.predict(X_train) != y_train
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
f) Random forest
from sklearn.ensemble import RandomForestClassifier
# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('forest', RandomForestClassifier())
])
# cross-validate
parameters = {'forest__criterion': ['gini', 'entropy'], 'forest__max_depth': [5, 10, 25, 50],
              'forest__n_estimators': [10, 25, 50, 75, 100],
              'forest__max_features': [None, 'auto', 'log2'], 'forest__bootstrap': [True, False],
              'forest__class_weight': [None, 'balanced']}
grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))
clf = grid_search.best_estimator_
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])
forest_metrics_train = skewed_metrics(clf, test=False)
forest_metrics_test = skewed_metrics(clf, test=True)
misclassified = clf.predict(X_test) != y_test
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
misclassified = clf.predict(X_train) != y_train
Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
This random forest performs as well as our decision tree, which gives both of them the best performance on unseen data. However, their random nature implies that we would get different classifiers, with different precision-recall tradeoffs, if we were to rerun the code above.
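If reproducible results are desired, one option (not used above) is to fix the estimator's random_state when building the pipeline, for example:
# same pipeline as above, but with a fixed seed so reruns train the same forest
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('forest', RandomForestClassifier(random_state=42))
])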
Summary of results
data = {
    'Logistic Regression': list(logreg_metrics_train + logreg_metrics_test),
    'Support Vector Machines': list(svm_metrics_train + svm_metrics_test),
    'KNN': list(KNN_metrics_train + KNN_metrics_test),
    'Perceptron': list(perc_metrics_train + perc_metrics_test),
    'Decision Tree': list(tree_metrics_train + tree_metrics_test),
    'Random Forest': list(forest_metrics_train + forest_metrics_test)
}
cols = [['Training Set']*5 + ['Test Set']*5,
        ['ROC AUC', 'PR AUC', 'F1 Score', 'Precision', 'Recall']*2]
models = pd.DataFrame(data, index=cols).transpose()
models.sort_values(by=[('Test Set', 'PR AUC')], ascending=False)
We now have 6 models, of which 4 have decent performance on unseen data and perfect recall. An interesting feature of our classifiers is their ability to detect when a Pokemon is special: Dragonite and the mythical Mew, Celebi and Meloetta make repeated appearances among the misclassified examples, suggesting that they are better than most other Pokemon.
It would be interesting to see how the models above perform on Pokemon from the 7th generation, as an indication of their generalization potential. A different but promising approach to the same problem would be anomaly detection, which works well on imbalanced datasets such as this one. It would involve fitting a multivariate Gaussian to a transformed (i.e. more normal-looking) dataset with exclusively numerical features and searching for a threshold $\epsilon$ that separates legendary Pokemon from regular ones. However, we leave this exercise to the reader!
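For the curious, here is a minimal sketch of that idea. It skips the normalizing transformation, reuses the num_features list defined above (dropping Total, which is the exact sum of the six base stats and would make the covariance matrix singular), and assumes those columns contain no missing values; the threshold below is only a crude placeholder.
import numpy as np
from scipy.stats import multivariate_normal
# fit a multivariate Gaussian to the numerical features of regular Pokemon only
cols = [c for c in num_features if c != 'Total']
X_num = pokemon[cols].values
X_reg = X_num[~pokemon['isLegendary'].values]
mvn = multivariate_normal(mean=X_reg.mean(axis=0), cov=np.cov(X_reg, rowvar=False))
log_density = mvn.logpdf(X_num)
# flag a Pokemon as an anomaly (i.e. a legendary candidate) when its log-density falls below epsilon;
# in practice epsilon would be tuned by maximizing the F1 score on a labeled validation set
epsilon = np.percentile(log_density, 6.4)  # placeholder: flag roughly the legendary rate
flagged = log_density < epsilon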