Legendary Pokemon Classifier

There is no denying that Pokemon had a big influence on all the kids of my generation. I remember being 8 or 9 and looking forward to finishing school so I could catch up on the adventures of Ash and Pikachu. I also remember the fun I would have playing Pokemon Stadium on Nintendo 64 with my cousins on the weekend. The phenomenal popularity of Pokemon GO last year further confirmed that the nostalgia factor is still strong for a lot of people, even to this day.

I was browsing Kaggle for datasets to practice classification algorithms when I came across one describing the first 6 generations of Pokemon: 721 in total, of which 46 are legendary. Bingo! I thought. Not only is this dataset a fun way to experiment with classifiers that predict whether a Pokemon is legendary, it also lets us simulate an end-to-end machine learning project. Moreover, evaluating the performance of our models will require careful thinking, since only a small fraction (6.4% to be exact) of the Pokemon are legendary.

With this dataset in hand, our goals are to

  1. Create interesting visualizations.
  2. Preprocess the data to make it suitable for classification algorithms.
  3. Evaluate and compare the following classification models:
    • logistic regression;
    • support vector machines (linear vs. RBF kernel);
    • K-nearest-neighbors;
    • perceptron;
    • decision tree;
    • random forest.
In [1]:
# standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import norm

# to have multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# plotting
%matplotlib inline
from radar import ComplexRadar
from pylab import rcParams
rcParams['figure.figsize'] = 8, 4
In [2]:
# import dataset
pokemon = pd.read_csv('pokemon_alopez247.csv')
pokemon.drop('Number', inplace=True, axis=1) # the Pokedex number is just an ID, not a predictive feature
pokemon.head()
Out[2]:
Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed ... Color hasGender Pr_Male Egg_Group_1 Egg_Group_2 hasMegaEvolution Height_m Weight_kg Catch_Rate Body_Style
0 Bulbasaur Grass Poison 318 45 49 49 65 65 45 ... Green True 0.875 Monster Grass False 0.71 6.9 45 quadruped
1 Ivysaur Grass Poison 405 60 62 63 80 80 60 ... Green True 0.875 Monster Grass False 0.99 13.0 45 quadruped
2 Venusaur Grass Poison 525 80 82 83 100 100 80 ... Green True 0.875 Monster Grass True 2.01 100.0 45 quadruped
3 Charmander Fire NaN 309 39 52 43 60 50 65 ... Red True 0.875 Monster Dragon False 0.61 8.5 45 bipedal_tailed
4 Charmeleon Fire NaN 405 58 64 58 80 65 80 ... Red True 0.875 Monster Dragon False 1.09 19.0 45 bipedal_tailed

5 rows × 22 columns

1. Data visualization

Data visualizations help us understand the data better. Boxplots are great tools for this: the boxes show the quartiles of the distribution and the whiskers extend to the rest of it, except for some outliers. They let us see which Pokemon types have the best stats; for instance, Flying Pokemon are typically much faster than other types, though the short top whisker of the Flying box shows that plenty of non-flying Pokemon are still faster than the fastest Flying ones.

In [3]:
# boxplots of Pokemon's base stats
plt.rcParams['figure.figsize'] = (15, 8);
fig, axarr = plt.subplots(3, 2, sharex=True);

_ = sns.boxplot(x='Type_1', y='HP', data=pokemon, width=0.5, ax=axarr[0, 0]);
_ = sns.boxplot(x='Type_1', y='Speed', data=pokemon, width=0.5, ax=axarr[0, 1]);
_ = sns.boxplot(x='Type_1', y='Attack', data=pokemon, width=0.5, ax=axarr[1, 0]);
_ = sns.boxplot(x='Type_1', y='Defense', data=pokemon, width=0.5, ax=axarr[1, 1]);
_ = sns.boxplot(x='Type_1', y='Sp_Atk', data=pokemon, width=0.5, ax=axarr[2, 0]);
_ = sns.boxplot(x='Type_1', y='Sp_Def', data=pokemon, width=0.5, ax=axarr[2, 1]);

titles = [['Hit Points', 'Speed'], ['Attack', 'Defense'], ['Special Attack', 'Special Defense']]
for i in range(3):
    for j in range(2):
        _ = axarr[i, j].set(xlabel='', ylabel='', title=titles[i][j]);
        for tick in axarr[i, j].get_xticklabels():
            tick.set_rotation(90);

It is also interesting to note which Pokemon types are the most common. For instance, only a few Pokemon have Flying as their primary type, whereas it is the most common secondary type. Given the abundance of Flying Pokemon, it would therefore be unwise to use only the primary type for classification; instead, we should consider primary and secondary types on an equal footing in our analysis.

In [4]:
# number of Pokemon per primary and secondary type
plt.rcParams['figure.figsize'] = (12, 12);
fig, axarr = plt.subplots(2, 1, sharex=False);

type_count1 = sns.countplot(x='Type_1', data=pokemon, order=pokemon.Type_1.value_counts().index, ax=axarr[0])
type_count1.set(xlabel='Primary type', ylabel='Count');

type_count2 = sns.countplot(x='Type_2', data=pokemon, order=pokemon.Type_2.value_counts().index, ax=axarr[1])
type_count2.set(xlabel='Secondary type', ylabel='Count');

Another interesting type of plot is the radar (or spider) chart. Using the code found on this forum, I created radar charts of the mean and median stats for a given body style, selected with the variable I below.

In [5]:
# computing data (mean and median of stats) for our variables
rad_variables = ['HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed']

bodystyle_mean = pokemon.groupby('Body_Style').mean()[rad_variables]
bodystyle_median = pokemon.groupby('Body_Style').median()[rad_variables]

stats = bodystyle_mean.columns     # the six base stats (note: shadows scipy's `stats` import above)
body_index = bodystyle_mean.index  # Pokemon body styles

I = 7  # index of the body style to plot
stats_ranges = [(pokemon.groupby('Body_Style')[col].min().iloc[I],
                 0.75*pokemon.groupby('Body_Style')[col].max().iloc[I]) for col in rad_variables]

# plotting
fig1 = plt.figure(figsize=(5, 5))
radar = ComplexRadar(fig1, stats, stats_ranges)
radar.plot(bodystyle_mean.iloc[I, :], color='b')
radar.fill(bodystyle_mean.iloc[I, :], color='b', alpha=0.2)
plt.title("Mean Stats: " + body_index[I], fontsize=20);
plt.show();

fig2 = plt.figure(figsize=(5, 5))
radar = ComplexRadar(fig2, stats, stats_ranges)
radar.plot(bodystyle_median.iloc[I, :], color='g')
radar.fill(bodystyle_median.iloc[I, :], color='g', alpha=0.2)
plt.title("Median Stats: " + body_index[I], fontsize=20);
plt.show();

2. Preprocessing the dataset

a) Numerical features

Most machine learning algorithms perform poorly when numerical attributes are on very different scales, which is the case here, as the histograms below show. We will therefore normalize/standardize the features to speed up learning. To do so, we build a custom PandasSelector transformer from scikit-learn's BaseEstimator (which enables easy hyperparameter tuning) and TransformerMixin classes.

In [6]:
# extract numerical features and plot their histograms
pokemon.Catch_Rate = pokemon.Catch_Rate.astype(float)  # treat Catch_Rate as continuous
numeric_pokemon = pokemon[pokemon.dtypes[(pokemon.dtypes == "float64") | (pokemon.dtypes == "int64")].index.values]
numeric_pokemon.hist(figsize=[11, 11], bins=12);
In [7]:
# build PandasSelector class with fit and transform methods
from sklearn.base import BaseEstimator, TransformerMixin

class PandasSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, selected_columns):
        self.selected_columns = selected_columns

    def fit(self, df, *args):
        return self  # stateless: nothing to learn

    def transform(self, df):
        return df[self.selected_columns].values
In [8]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union

num_features = ['Total', 'Attack', 'Sp_Atk', 'Speed', 'HP', 'Defense', 'Sp_Def', 'Height_m', 'Weight_kg', 'Catch_Rate']

# standardize every numeric feature except Catch_Rate
standardize_pipeline = make_pipeline(
    PandasSelector(num_features[:-1]),
    StandardScaler()
)

# scale Catch_Rate to [0, 1]
minmax_pipeline = make_pipeline(
    PandasSelector([num_features[-1]]),
    MinMaxScaler()
)

# numerical pipeline
num_pipeline = make_union(standardize_pipeline, minmax_pipeline)

b) Categorical features

Categorical features also require some preprocessing, generally in the form of a one-hot encoding to prevent nominal features from being treated as ordinal ones. In what follows we give equal importance to Type_1 and Type_2 by summing their one-hot encodings. We also create a new numerical feature, Pr_Female, to replace the categorical hasGender column.

In [9]:
# one-hot encoding of categorical variables
# (summing the two encodings marks a Pokemon's types regardless of slot;
#  this assumes every type occurs as both a primary and a secondary type)
Type_dummies = pd.get_dummies(pokemon['Type_1']) + pd.get_dummies(pokemon['Type_2'])
pokemon[Type_dummies.columns] = Type_dummies
Color_dummies = pd.get_dummies(pokemon['Color'])
pokemon[Color_dummies.columns] = Color_dummies
Body_dummies = pd.get_dummies(pokemon['Body_Style'])
pokemon[Body_dummies.columns] = Body_dummies

# Pr_Female to replace hasGender
pokemon['Pr_Female'] = 1 - pokemon['Pr_Male']
pokemon.Pr_Female.fillna(value=0, inplace=True)
pokemon.Pr_Male.fillna(value=0, inplace=True)

cat_features = list(Type_dummies.columns.values) + list(Color_dummies.columns.values) + list(Body_dummies.columns.values) + ['hasMegaEvolution', 'Pr_Male', 'Pr_Female']

3. Model evaluations and comparisons

We are now ready to evaluate and compare different classification models on our dataset. However, we need to keep in mind that our data is heavily skewed: only 6.4% of the Pokemon are legendary, so a trivial classifier predicting every Pokemon to be non-legendary would already reach 93.6% accuracy. To put things into perspective, we will use three more informative metrics (a tiny worked example follows this list):

  • Precision: the precision $p$ tells us that when the classifier claims an instance is positive, it is right with probability $p$.
  • Recall: also called the true positive rate, the recall $r$ is the fraction of positive instances that the classifier labels correctly.
  • F1 score: the F1 score is the harmonic mean of precision and recall and favours classifiers that have $p \approx r$.
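
For concreteness, here is a toy computation of the three metrics on hypothetical labels (3 true positives, 1 false negative, 1 false positive):

# toy illustration (hypothetical labels) of how the three metrics relate
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 positives, 6 negatives
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # 3 TP, 1 FN, 1 FP

precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall_score(y_true, y_pred)     # TP / (TP + FN) = 3/4
f1_score(y_true, y_pred)         # 2pr / (p + r)  = 3/4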

The most appropriate metric for a specific model depends on its goals. Imagine for instance a parental filter on a search engine that rejects inappropriate images. For such a classifier we would prefer high precision (only safe content gets shown) even at the cost of low recall (some appropriate images get rejected). In our case, we want an algorithm with both high precision and high recall, which translates into a high F1 score.

Note that a confusion matrix can help us understand our classifier's behaviour better as it clearly illustrates the true positives, false positives, false negatives and true negatives. We will plot one for each classifier to understand where each fails.

In addition to these three metrics, one can evaluate a classifier across decision thresholds with two curves:

  • ROC (receiver operating characteristic) curve: a curve of the TP rate against the FP rate at various threshold settings.
  • Precision-Recall curve: a curve of precision vs. recall at various threshold settings.

The area under each of these curves (AUC) is a good scalar metric for comparing classifiers. However, the ROC curve is agnostic to class skew, so a good ROC AUC score might be misleading (just as classification accuracy was) when negative instances vastly outnumber positive ones. In contrast, the PR AUC makes it clear when there is room for improvement on a heavily skewed class. A minimal sketch of how both are computed is given below.
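
Here is that sketch; it assumes a fitted classifier clf with a decision_function and the train/test split defined further down (use predict_proba(X_test)[:, 1] instead for purely probabilistic classifiers):

# sketch: compute both AUCs and plot the PR curve for a fitted classifier
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

y_score = clf.decision_function(X_test)            # continuous scores, not labels
print('ROC AUC:', roc_auc_score(y_test, y_score))
precision, recall, _ = precision_recall_curve(y_test, y_score)
print('PR AUC:', auc(recall, precision))
plt.plot(recall, precision); plt.xlabel('Recall'); plt.ylabel('Precision');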

We will train multiple classifiers and tune their hyperparameters using the GridSearchCV class, which performs stratified cross-validation in order to keep an appropriate ratio of positive examples in each fold. We also allow the algorithms to assign balanced weights to the classes during CV for fairer representation (i.e. a harsher penalty for getting positive examples wrong), but we let the cross-validation process decide whether or not to use that option.
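
For intuition, here is what 'balanced' weights amount to given our class counts (675 non-legendary vs. 46 legendary), using scikit-learn's helper:

# class_weight='balanced' sets each weight to n_samples / (n_classes * class_count)
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0]*675 + [1]*46)  # this dataset's class counts
compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# -> array([0.534..., 7.837...]): a legendary error costs ~15x more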

In [10]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

poke_train, poke_test = train_test_split(pokemon[num_features + cat_features + ['isLegendary']], 
                                         test_size=0.2, random_state=42)

X_train, y_train = poke_train.drop('isLegendary', axis=1), poke_train['isLegendary']
X_test, y_test = poke_test.drop('isLegendary', axis=1), poke_test['isLegendary']
In [11]:
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, auc
import itertools

def skewed_metrics(model, test=True):
    """
    Compute the five scores we use to assess classifiers
    on skewed data: ROC AUC, PR AUC, F1, precision and recall.
    """
    proba_clf = ['KNN', 'tree', 'forest']  # classifiers without a decision_function
    if test:
        X, y_true = X_test, y_test
    else:
        X, y_true = X_train, y_train

    y_predict = model.predict(X)
    if any(clf in list(model.named_steps) for clf in proba_clf):
        y_score = model.predict_proba(X)[:, 1]
    else:
        y_score = model.decision_function(X)

    ROC_AUC = roc_auc_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    PR_AUC = auc(recall, precision)

    F1 = f1_score(y_true, y_predict)
    P = precision_score(y_true, y_predict)
    R = recall_score(y_true, y_predict)

    return ROC_AUC, PR_AUC, F1, P, R

def plot_confusion_matrix(CM, classes=['non-legendary', 'legendary'], normalize=False, cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    (This function is adapted from the scikit docs.)
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=False, figsize=(10, 10))
    titles = ['Confusion matrix (training set)', 'Confusion matrix (test set)']
    
    for ax, cm, title in zip(fig.axes, CM, titles):
        plt.sca(ax)
        
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.grid(which='both')
        plt.title(title)
        # plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
        
        if normalize:
            cm = cm.astype('float')/cm.sum(axis=1)[:, np.newaxis]
        # print(cm)
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

        plt.tight_layout(pad=5)
        plt.ylabel('True label')
        plt.xlabel('Predicted label')

a) Logistic regression

In [12]:
from sklearn.linear_model import LogisticRegression

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('logreg', LogisticRegression())
])

# cross-validate
parameters = {'logreg__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'logreg__class_weight': [None, 'balanced']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'logreg__C': 3, 'logreg__class_weight': 'balanced'}
In [13]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

logreg_metrics_train = skewed_metrics(clf, test=False)
logreg_metrics_test = skewed_metrics(clf, test=True)

The confusion matrices show that our model misclassifies 2 non-legendary Pokemon as legendary. However, anyone who knows Pokemon will immediately notice that these two examples are somewhat special: Celebi (just like Mew in the training set) is a mythical Pokemon and therefore quite unique, whereas Dragonite is one of the strongest non-legendary Pokemon one can recruit. We can therefore conclude that our logistic regression model generalizes well to the test set, since it "knows" more than it was taught.

In [14]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[14]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
148 149 Dragonite Dragon Flying 600 91 134 95 100 100 80 1 False
In [15]:
misclassified = clf.predict(X_train) != y_train

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[15]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
375 376 Metagross Steel Psychic 600 80 135 130 95 90 70 3 False
647 648 Meloetta Normal Psychic 600 100 77 77 128 128 90 5 False
150 151 Mew Psychic NaN 600 100 100 100 100 100 100 1 False
648 649 Genesect Bug Steel 600 71 120 95 120 95 99 5 False
489 490 Manaphy Water NaN 600 100 100 100 100 100 100 4 False
461 462 Magnezone Electric Steel 535 70 70 115 130 90 60 4 False
646 647 Keldeo Water Fighting 580 91 72 90 129 90 108 5 False
566 567 Archeops Rock Flying 567 75 140 65 112 65 110 5 False
58 59 Arcanine Fire NaN 555 90 110 80 100 80 95 1 False
372 373 Salamence Dragon Flying 600 95 135 80 110 80 100 3 False

b) Support vector machines

In [16]:
from sklearn.svm import SVC

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('svm', SVC())
])

# cross-validate
parameters = [{'svm__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'svm__gamma': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'svm__class_weight': [None, 'balanced'], 'svm__kernel': ['rbf']},
              {'svm__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'svm__class_weight': [None, 'balanced'], 'svm__kernel': ['linear']}]

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'svm__C': 10, 'svm__class_weight': 'balanced', 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}
In [17]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

svm_metrics_train = skewed_metrics(clf, test=False)
svm_metrics_test = skewed_metrics(clf, test=True)

The SVM performs identically to logistic regression on the test set.

c) K-nearest-neighbors

In [18]:
from sklearn.neighbors import KNeighborsClassifier

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('KNN', KNeighborsClassifier())
])

# cross-validate
parameters = {'KNN__n_neighbors': [3, 5, 7, 9, 11, 13, 15],
              'KNN__weights': ['uniform', 'distance']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'KNN__n_neighbors': 3, 'KNN__weights': 'uniform'}
In [19]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

KNN_metrics_train = skewed_metrics(clf, test=False)
KNN_metrics_test = skewed_metrics(clf, test=True)

This KNN model has the poorest performance of all six models, on both the training and test sets.

In [20]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[20]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
719 720 Hoopa Psychic Ghost 600 80 110 60 150 130 70 6 True
717 718 Zygarde Dragon Ground 600 108 100 121 81 95 95 6 True
244 245 Suicune Water NaN 580 100 75 115 90 115 85 2 True
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False

d) Perceptron

In [21]:
from sklearn.linear_model import Perceptron

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('perc', Perceptron(n_iter=30))  # note: n_iter was renamed max_iter in later scikit-learn versions
])

# cross-validate
parameters = {'perc__alpha': [1e-5, 3e-5, 1e-3, 3e-3, 0.01, 0.03, 1, 3, 10, 30],
              'perc__penalty': ['l1', 'l2', 'elasticnet']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'perc__alpha': 1e-05, 'perc__penalty': 'l1'}
In [22]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

perc_metrics_train = skewed_metrics(clf, test=False)
perc_metrics_test = skewed_metrics(clf, test=True)

The perceptron seems to overfit the training set and fails to generalize as well as logistic regression or SVM on the test set.

In [23]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[23]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
719 720 Hoopa Psychic Ghost 600 80 110 60 150 130 70 6 True
244 245 Suicune Water NaN 580 100 75 115 90 115 85 2 True
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
148 149 Dragonite Dragon Flying 600 91 134 95 100 100 80 1 False

e) Decision tree

In [24]:
from sklearn.tree import DecisionTreeClassifier

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('tree', DecisionTreeClassifier())
])

# cross-validate
parameters = {'tree__criterion': ['gini', 'entropy'], 'tree__splitter': ['best', 'random'],
              'tree__max_features': [None, 'auto'], 'tree__max_depth': [5, 10, 25, 50],
              'tree__class_weight': [None, 'balanced']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'tree__class_weight': 'balanced', 'tree__criterion': 'entropy', 'tree__max_depth': 5, 'tree__max_features': None, 'tree__splitter': 'best'}
In [25]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

tree_metrics_train = skewed_metrics(clf, test=False)
tree_metrics_test = skewed_metrics(clf, test=True)

The model performs well and, based on multiple runs with different parameters, overfits the training set much less when a max_depth is specified.
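
To see what the capped tree actually learned, we can inspect it directly. Here is a small sketch; it assumes clf is still the fitted pipeline above and that the feature union emits num_features first, then cat_features, in the order they were assembled:

# inspect the fitted tree inside the pipeline
tree_clf = clf.named_steps['tree']
feat_names = num_features + cat_features  # assumed FeatureUnion output order
print('actual depth:', tree_clf.tree_.max_depth)
print(pd.Series(tree_clf.feature_importances_, index=feat_names).nlargest(5))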

In [26]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[26]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
In [27]:
misclassified = clf.predict(X_train) != y_train

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[27]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
647 648 Meloetta Normal Psychic 600 100 77 77 128 128 90 5 False
150 151 Mew Psychic NaN 600 100 100 100 100 100 100 1 False

f) Random forest

In [28]:
from sklearn.ensemble import RandomForestClassifier

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('forest', RandomForestClassifier())
])

# cross-validate
parameters = {'forest__criterion': ['gini', 'entropy'], 'forest__max_depth': [5, 10, 25, 50],
              'forest__n_estimators': [10, 25, 50, 75, 100],
              'forest__max_features': [None, 'auto', 'log2'], 'forest__bootstrap': [True, False],
              'forest__class_weight': [None, 'balanced']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'forest__bootstrap': True, 'forest__class_weight': 'balanced', 'forest__criterion': 'gini', 'forest__max_depth': 5, 'forest__max_features': 'log2', 'forest__n_estimators': 100}
In [29]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

forest_metrics_train = skewed_metrics(clf, test=False)
forest_metrics_test = skewed_metrics(clf, test=True)
In [30]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[30]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
In [31]:
misclassified = clf.predict(X_train) != y_train

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[31]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
375 376 Metagross Steel Psychic 600 80 135 130 95 90 70 3 False
647 648 Meloetta Normal Psychic 600 100 77 77 128 128 90 5 False
487 488 Cresselia Psychic NaN 600 120 70 120 75 130 85 4 False
150 151 Mew Psychic NaN 600 100 100 100 100 100 100 1 False
648 649 Genesect Bug Steel 600 71 120 95 120 95 99 5 False
489 490 Manaphy Water NaN 600 100 100 100 100 100 100 4 False
646 647 Keldeo Water Fighting 580 91 72 90 129 90 108 5 False

This random forest performs as well as our decision tree, which gives both of them the best performance on unseen data. However, their random nature means we would get different classifiers, with different precision-recall tradeoffs, were we to rerun the code above.
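
If reproducibility matters, the randomness can be pinned down by fixing the seed when building the pipeline (a sketch; the value 42 is arbitrary):

# fixing random_state makes the forest (and hence its metrics) reproducible
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('forest', RandomForestClassifier(random_state=42))
])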

Summary of results

In [32]:
data = {
'Logistic Regression': list(logreg_metrics_train + logreg_metrics_test),
'Support Vector Machines': list(svm_metrics_train + svm_metrics_test),
'KNN': list(KNN_metrics_train + KNN_metrics_test),
'Perceptron': list(perc_metrics_train + perc_metrics_test),
'Decision Tree': list(tree_metrics_train + tree_metrics_test),
'Random Forest': list(forest_metrics_train + forest_metrics_test)
}
cols = [ ['Training Set']*5 + ['Test Set']*5,
         ['ROC AUC', 'PR AUC', 'F1 Score', 'Precision', 'Recall']*2 ]

models = pd.DataFrame(data, index=cols).transpose()

models.sort_values(by=[('Test Set', 'PR AUC')], ascending=False)
Out[32]:
Training Set Test Set
ROC AUC PR AUC F1 Score Precision Recall ROC AUC PR AUC F1 Score Precision Recall
Decision Tree 0.999462 0.992763 0.974359 0.950000 1.000000 0.999088 0.986111 0.941176 0.888889 1.000
Random Forest 0.999853 0.997893 0.915663 0.844444 1.000000 0.999088 0.985243 0.941176 0.888889 1.000
Support Vector Machines 0.998826 0.974847 0.883721 0.791667 1.000000 0.995438 0.898115 0.888889 0.800000 1.000
Logistic Regression 0.999951 0.999316 0.883721 0.791667 1.000000 0.993613 0.873710 0.888889 0.800000 1.000
Perceptron 1.000000 1.000000 1.000000 1.000000 1.000000 0.992701 0.859573 0.750000 0.750000 0.750
KNN 0.993543 0.919928 0.831169 0.820513 0.842105 0.930657 0.779490 0.714286 0.833333 0.625

We now have 6 models, 4 of which have decent performance on unseen data and perfect recall. An interesting feature of our classifiers is their ability to detect when a Pokemon is special: Dragonite, Mew, Meloetta and the mythical Celebi appear repeatedly among the misclassified examples, suggesting that their stats place them closer to legendary Pokemon than to ordinary ones.

It would be interesting to see how the models above perform on Pokemon from the 7th generation, as an indication of their generalization potential. A different but promising approach to the same problem would be to perform anomaly detection, which works well on imbalanced datasets such as this one. Doing so would require fitting a multivariate Gaussian to a transformed (i.e. more normal) dataset with exclusively numerical features, and searching for a threshold $\epsilon$ that separates legendary Pokemon from regular ones. However we leave this exercise to the reader!
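
As a starting point nonetheless, here is a rough sketch of that idea; it skips the normalizing transformation and uses a purely hypothetical threshold:

# sketch of the anomaly-detection approach: fit a multivariate Gaussian to the
# numeric features of non-legendary Pokemon and flag low-density points
from scipy.stats import multivariate_normal

X_num = pokemon.loc[~pokemon.isLegendary, num_features].values
mu, cov = X_num.mean(axis=0), np.cov(X_num, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov).pdf(pokemon[num_features].values)
epsilon = 1e-18  # hypothetical threshold; tune it on a validation set (e.g. max F1)
predicted_legendary = density < epsilon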
