Legendary Pokemon Classifier

There is no denying that Pokemon had a big influence on all the kids of my generation. I remember being 8 or 9 and looking forward to finishing school so I could catch up on the adventures of Ash and Pikachu. I also remember the fun I would have playing Pokemon Stadium on Nintendo 64 with my cousins on the weekend. The phenomenal popularity of Pokemon GO last year further confirmed that the nostalgia factor is still strong for a lot of people, even to this day.

I was browsing Kaggle for datasets to practice classification algorithms when I came across one describing the first 6 generations of Pokemon: 721 in total, of which 46 are legendary. Bingo! I thought. Not only is this dataset a fun way to experiment with classifiers that predict whether a Pokemon is legendary, it also lets us simulate an end-to-end machine learning project. Moreover, evaluating the performance of our models will require careful thinking, since only a small fraction (6.4% to be exact) of the Pokemon are legendary.

With this dataset in hand, our goals are to

  1. Create interesting visualizations.
  2. Preprocess the data to make it suitable for classification algorithms.
  3. Evaluate and compare the following classification models:
    • logistic regression;
    • support vector machines (linear vs. RBF kernel);
    • K-nearest-neighbors;
    • perceptron;
    • decision tree;
    • random forest.
In [1]:
# standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import norm

# to have multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# plotting
%matplotlib inline
from radar import ComplexRadar
from pylab import rcParams
rcParams['figure.figsize'] = 8, 4
In [2]:
# import dataset
pokemon = pd.read_csv('pokemon_alopez247.csv')
pokemon.drop('Number', inplace=True, axis=1) # the Pokedex number is just an ID, not a predictive feature
pokemon.head()
Out[2]:
Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed ... Color hasGender Pr_Male Egg_Group_1 Egg_Group_2 hasMegaEvolution Height_m Weight_kg Catch_Rate Body_Style
0 Bulbasaur Grass Poison 318 45 49 49 65 65 45 ... Green True 0.875 Monster Grass False 0.71 6.9 45 quadruped
1 Ivysaur Grass Poison 405 60 62 63 80 80 60 ... Green True 0.875 Monster Grass False 0.99 13.0 45 quadruped
2 Venusaur Grass Poison 525 80 82 83 100 100 80 ... Green True 0.875 Monster Grass True 2.01 100.0 45 quadruped
3 Charmander Fire NaN 309 39 52 43 60 50 65 ... Red True 0.875 Monster Dragon False 0.61 8.5 45 bipedal_tailed
4 Charmeleon Fire NaN 405 58 64 58 80 65 80 ... Red True 0.875 Monster Dragon False 1.09 19.0 45 bipedal_tailed

5 rows × 22 columns

1. Data visualization

Data visualizations help us understand the data better. Boxplots are great tools for this: the boxes show the quartiles of the distribution and the whiskers extend to the rest of it, except for some outliers. They let us see which Pokemon types have the best stats; for instance, Flying Pokemon are typically much faster than other types, though the short top whisker of the Flying box shows that plenty of non-flying Pokemon are still faster than the fastest Flying ones.

In [3]:
# boxplots of Pokemon's base stats
plt.rcParams['figure.figsize'] = (15, 8);
fig, axarr = plt.subplots(3, 2, sharex=True);

_ = sns.boxplot(x='Type_1', y='HP', data=pokemon, width=0.5, ax=axarr[0, 0]);
_ = sns.boxplot(x='Type_1', y='Speed', data=pokemon, width=0.5, ax=axarr[0, 1]);
_ = sns.boxplot(x='Type_1', y='Attack', data=pokemon, width=0.5, ax=axarr[1, 0]);
_ = sns.boxplot(x='Type_1', y='Defense', data=pokemon, width=0.5, ax=axarr[1, 1]);
_ = sns.boxplot(x='Type_1', y='Sp_Atk', data=pokemon, width=0.5, ax=axarr[2, 0]);
_ = sns.boxplot(x='Type_1', y='Sp_Def', data=pokemon, width=0.5, ax=axarr[2, 1]);

titles = [['Hit Points', 'Speed'], ['Attack', 'Defense'], ['Special Attack', 'Special Defense']]
for i in range(3):
    for j in range(2):
        _ = axarr[i, j].set(xlabel='', ylabel='', title=titles[i][j]);
        for tick in axarr[i, j].get_xticklabels():
            tick.set_rotation(90);

It is also interesting to note which Pokemon types are the most common. For instance, only a few Pokemon have Flying as their primary type, whereas it is the most common secondary type. Given the abundance of Flying Pokemon, it would therefore be unwise to use only the primary type for classification; instead, we should consider primary and secondary types on an equal footing in our analysis.

In [4]:
# number of Pokemon per primary and secondary type
plt.rcParams['figure.figsize'] = (12, 12);
fig, axarr = plt.subplots(2, 1, sharex=False);

type_count1 = sns.countplot(x='Type_1', data=pokemon, order=pokemon.Type_1.value_counts().index, ax=axarr[0])
type_count1.set(xlabel='Primary type', ylabel='Count');

type_count2 = sns.countplot(x='Type_2', data=pokemon, order=pokemon.Type_2.value_counts().index, ax=axarr[1])
type_count2.set(xlabel='Secondary type', ylabel='Count');

Another interesting type of plot is the radar (or spider) chart. Using the code found on this forum, I created radar charts of the mean and median stats for a given body style, selected with the variable I below.

In [5]:
# computing data (mean and median of stats) for our variables
rad_variables = ['HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed']

bodystyle_mean = pokemon.groupby('Body_Style').mean()[rad_variables]
bodystyle_median = pokemon.groupby('Body_Style').median()[rad_variables]

stats = bodystyle_mean.columns     # the six base stats (note: shadows scipy's `stats` import above)
body_index = bodystyle_mean.index  # Pokemon body styles

I = 7  # index of the body style to plot
stats_ranges = [(pokemon.groupby('Body_Style')[col].min().iloc[I],
                 0.75*pokemon.groupby('Body_Style')[col].max().iloc[I]) for col in rad_variables]

# plotting
fig1 = plt.figure(figsize=(5, 5))
radar = ComplexRadar(fig1, stats, stats_ranges)
radar.plot(bodystyle_mean.iloc[I, :], color='b')
radar.fill(bodystyle_mean.iloc[I, :], color='b', alpha=0.2)
plt.title("Mean Stats: " + body_index[I], fontsize=20);
plt.show();

fig2 = plt.figure(figsize=(5, 5))
radar = ComplexRadar(fig2, stats, stats_ranges)
radar.plot(bodystyle_median.iloc[I, :], color='g')
radar.fill(bodystyle_median.iloc[I, :], color='g', alpha=0.2)
plt.title("Median Stats: " + body_index[I], fontsize=20);
plt.show();

2. Preprocessing the dataset

a) Numerical features

Most machine learning algorithms perform poorly when numerical attributes are on very different scales, which is the case here, as the histograms below show. We will therefore normalize/standardize the features to speed up learning. To do so, we build a custom PandasSelector transformer from scikit-learn's BaseEstimator (which enables easy hyperparameter tuning) and TransformerMixin classes.

In [6]:
# extract numerical features and plot their histograms
pokemon.Catch_Rate = pokemon.Catch_Rate.astype(float)  # treat Catch_Rate as continuous
numeric_pokemon = pokemon[pokemon.dtypes[(pokemon.dtypes == "float64") | (pokemon.dtypes == "int64")].index.values]
numeric_pokemon.hist(figsize=[11, 11], bins=12);
In [7]:
# build PandasSelector class with fit and transform methods
from sklearn.base import BaseEstimator, TransformerMixin

class PandasSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, selected_columns):
        self.selected_columns = selected_columns

    def fit(self, df, *args):
        return self  # stateless: nothing to learn

    def transform(self, df):
        return df[self.selected_columns].values
In [8]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union

num_features = ['Total', 'Attack', 'Sp_Atk', 'Speed', 'HP', 'Defense', 'Sp_Def', 'Height_m', 'Weight_kg', 'Catch_Rate']

# standardize every numeric feature except Catch_Rate
standardize_pipeline = make_pipeline(
    PandasSelector(num_features[:-1]),
    StandardScaler()
)

# scale Catch_Rate to [0, 1]
minmax_pipeline = make_pipeline(
    PandasSelector([num_features[-1]]),
    MinMaxScaler()
)

# numerical pipeline
num_pipeline = make_union(standardize_pipeline, minmax_pipeline)

b) Categorical features

Categorical features also require some preprocessing, generally in the form of a one-hot encoding to prevent nominal features from being treated as ordinal ones. In what follows we give equal importance to Type_1 and Type_2 by summing their one-hot encodings. We also create a new numerical feature, Pr_Female, to replace the categorical hasGender column.

In [9]:
# one-hot encoding of categorical variables
# (summing the two encodings marks a Pokemon's types regardless of slot;
#  this assumes every type occurs as both a primary and a secondary type)
Type_dummies = pd.get_dummies(pokemon['Type_1']) + pd.get_dummies(pokemon['Type_2'])
pokemon[Type_dummies.columns] = Type_dummies
Color_dummies = pd.get_dummies(pokemon['Color'])
pokemon[Color_dummies.columns] = Color_dummies
Body_dummies = pd.get_dummies(pokemon['Body_Style'])
pokemon[Body_dummies.columns] = Body_dummies

# Pr_Female to replace hasGender
pokemon['Pr_Female'] = 1 - pokemon['Pr_Male']
pokemon.Pr_Female.fillna(value=0, inplace=True)
pokemon.Pr_Male.fillna(value=0, inplace=True)

cat_features = list(Type_dummies.columns.values) + list(Color_dummies.columns.values) + list(Body_dummies.columns.values) + ['hasMegaEvolution', 'Pr_Male', 'Pr_Female']

3. Model evaluations and comparisons

We are now ready to evaluate and compare different classification models on our dataset. However, we need to keep in mind that our data is heavily skewed: only 6.4% of the Pokemon are legendary, so a trivial classifier predicting every Pokemon to be non-legendary would already reach 93.6% accuracy. To put things into perspective, we will use three more informative metrics (a tiny worked example follows this list):

  • Precision: the precision $p$ tells us that when the classifier claims an instance is positive, it is right with probability $p$.
  • Recall: also called the true positive rate, the recall $r$ is the fraction of positive instances that the classifier labels correctly.
  • F1 score: the F1 score is the harmonic mean of precision and recall and favours classifiers that have $p \approx r$.
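
For concreteness, here is a toy computation of the three metrics on hypothetical labels (3 true positives, 1 false negative, 1 false positive):

# toy illustration (hypothetical labels) of how the three metrics relate
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 positives, 6 negatives
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # 3 TP, 1 FN, 1 FP

precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall_score(y_true, y_pred)     # TP / (TP + FN) = 3/4
f1_score(y_true, y_pred)         # 2pr / (p + r)  = 3/4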

The most appropriate metric for a specific model depends on its goals. Imagine for instance a parental filter on a search engine that rejects inappropriate images. For such a classifier we would prefer high precision (only safe content gets shown) even at the cost of low recall (some appropriate images get rejected). In our case, we want an algorithm with both high precision and high recall, which translates into a high F1 score.

Note that a confusion matrix can help us understand our classifier's behaviour better as it clearly illustrates the true positives, false positives, false negatives and true negatives. We will plot one for each classifier to understand where each fails.

In addition to these three metrics, one can evaluate a classifier across decision thresholds with two curves:

  • ROC (receiver operating characteristic) curve: a curve of the TP rate against the FP rate at various threshold settings.
  • Precision-Recall curve: a curve of precision vs. recall at various threshold settings.

The area under each of these curves (AUC) is a good scalar metric for comparing classifiers. However, the ROC curve is agnostic to class skew, so a good ROC AUC score might be misleading (just as classification accuracy was) when negative instances vastly outnumber positive ones. In contrast, the PR AUC makes it clear when there is room for improvement on a heavily skewed class. A minimal sketch of how both are computed is given below.
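
Here is that sketch; it assumes a fitted classifier clf with a decision_function and the train/test split defined further down (use predict_proba(X_test)[:, 1] instead for purely probabilistic classifiers):

# sketch: compute both AUCs and plot the PR curve for a fitted classifier
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

y_score = clf.decision_function(X_test)            # continuous scores, not labels
print('ROC AUC:', roc_auc_score(y_test, y_score))
precision, recall, _ = precision_recall_curve(y_test, y_score)
print('PR AUC:', auc(recall, precision))
plt.plot(recall, precision); plt.xlabel('Recall'); plt.ylabel('Precision');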

We will train multiple classifiers and tune their hyperparameters using the GridSearchCV class, which performs stratified cross-validation in order to keep an appropriate ratio of positive examples in each fold. We also allow the algorithms to assign balanced weights to the classes during CV for fairer representation (i.e. a harsher penalty for getting positive examples wrong), but we let the cross-validation process decide whether or not to use that option.
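
For intuition, here is what 'balanced' weights amount to given our class counts (675 non-legendary vs. 46 legendary), using scikit-learn's helper:

# class_weight='balanced' sets each weight to n_samples / (n_classes * class_count)
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0]*675 + [1]*46)  # this dataset's class counts
compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# -> array([0.534..., 7.837...]): a legendary error costs ~15x more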

In [10]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

poke_train, poke_test = train_test_split(pokemon[num_features + cat_features + ['isLegendary']], 
                                         test_size=0.2, random_state=42)

X_train, y_train = poke_train.drop('isLegendary', axis=1), poke_train['isLegendary']
X_test, y_test = poke_test.drop('isLegendary', axis=1), poke_test['isLegendary']
In [11]:
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score, auc
import itertools

def skewed_metrics(model, test=True):
    """
    Compute the five scores we use to assess classifiers
    on skewed data: ROC AUC, PR AUC, F1, precision and recall.
    """
    proba_clf = ['KNN', 'tree', 'forest']  # classifiers without a decision_function
    if test:
        X, y_true = X_test, y_test
    else:
        X, y_true = X_train, y_train

    y_predict = model.predict(X)
    if any(clf in list(model.named_steps) for clf in proba_clf):
        y_score = model.predict_proba(X)[:, 1]
    else:
        y_score = model.decision_function(X)

    ROC_AUC = roc_auc_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    PR_AUC = auc(recall, precision)

    F1 = f1_score(y_true, y_predict)
    P = precision_score(y_true, y_predict)
    R = recall_score(y_true, y_predict)

    return ROC_AUC, PR_AUC, F1, P, R

def plot_confusion_matrix(CM, classes=['non-legendary', 'legendary'], normalize=False, cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    (This function is adapted from the scikit docs.)
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=False, figsize=(10, 10))
    titles = ['Confusion matrix (training set)', 'Confusion matrix (test set)']
    
    for ax, cm, title in zip(fig.axes, CM, titles):
        plt.sca(ax)
        
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.grid(which='both')
        plt.title(title)
        # plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
        
        if normalize:
            cm = cm.astype('float')/cm.sum(axis=1)[:, np.newaxis]
        # print(cm)
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

        plt.tight_layout(pad=5)
        plt.ylabel('True label')
        plt.xlabel('Predicted label')

a) Logistic regression

In [12]:
from sklearn.linear_model import LogisticRegression

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('logreg', LogisticRegression())
])

# cross-validate
parameters = {'logreg__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'logreg__class_weight': [None, 'balanced']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'logreg__C': 3, 'logreg__class_weight': 'balanced'}
In [13]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

logreg_metrics_train = skewed_metrics(clf, test=False)
logreg_metrics_test = skewed_metrics(clf, test=True)

The confusion matrices show that our model misclassifies 2 non-legendary Pokemon as legendary. However, anyone who knows Pokemon will immediately notice that these two examples are somewhat special: Celebi (just like Mew in the training set) is a mythical Pokemon and therefore quite unique, whereas Dragonite is one of the strongest non-legendary Pokemon one can recruit. We can therefore conclude that our logistic regression model generalizes well to the test set, since it "knows" more than it was taught.

In [14]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[14]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
148 149 Dragonite Dragon Flying 600 91 134 95 100 100 80 1 False
In [15]:
misclassified = clf.predict(X_train) != y_train

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[15]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
375 376 Metagross Steel Psychic 600 80 135 130 95 90 70 3 False
647 648 Meloetta Normal Psychic 600 100 77 77 128 128 90 5 False
150 151 Mew Psychic NaN 600 100 100 100 100 100 100 1 False
648 649 Genesect Bug Steel 600 71 120 95 120 95 99 5 False
489 490 Manaphy Water NaN 600 100 100 100 100 100 100 4 False
461 462 Magnezone Electric Steel 535 70 70 115 130 90 60 4 False
646 647 Keldeo Water Fighting 580 91 72 90 129 90 108 5 False
566 567 Archeops Rock Flying 567 75 140 65 112 65 110 5 False
58 59 Arcanine Fire NaN 555 90 110 80 100 80 95 1 False
372 373 Salamence Dragon Flying 600 95 135 80 110 80 100 3 False

b) Support vector machines

In [16]:
from sklearn.svm import SVC

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('svm', SVC())
])

# cross-validate
parameters = [{'svm__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'svm__gamma': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'svm__class_weight': [None, 'balanced'], 'svm__kernel': ['rbf']},
              {'svm__C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'svm__class_weight': [None, 'balanced'], 'svm__kernel': ['linear']}]

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'svm__C': 10, 'svm__class_weight': 'balanced', 'svm__gamma': 0.01, 'svm__kernel': 'rbf'}
In [17]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

svm_metrics_train = skewed_metrics(clf, test=False)
svm_metrics_test = skewed_metrics(clf, test=True)

The SVM performs identically to logistic regression on the test set.

c) K-nearest-neighbors

In [18]:
from sklearn.neighbors import KNeighborsClassifier

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('KNN', KNeighborsClassifier())
])

# cross-validate
parameters = {'KNN__n_neighbors': [3, 5, 7, 9, 11, 13, 15],
              'KNN__weights': ['uniform', 'distance']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'KNN__n_neighbors': 3, 'KNN__weights': 'uniform'}
In [19]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

KNN_metrics_train = skewed_metrics(clf, test=False)
KNN_metrics_test = skewed_metrics(clf, test=True)

This KNN model has the poorest performance of all six models, on both the training and test sets.

In [20]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[20]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
719 720 Hoopa Psychic Ghost 600 80 110 60 150 130 70 6 True
717 718 Zygarde Dragon Ground 600 108 100 121 81 95 95 6 True
244 245 Suicune Water NaN 580 100 75 115 90 115 85 2 True
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False

d) Perceptron

In [21]:
from sklearn.linear_model import Perceptron

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('perc', Perceptron(n_iter=30))  # note: n_iter was renamed max_iter in later scikit-learn versions
])

# cross-validate
parameters = {'perc__alpha': [1e-5, 3e-5, 1e-3, 3e-3, 0.01, 0.03, 1, 3, 10, 30],
              'perc__penalty': ['l1', 'l2', 'elasticnet']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'perc__alpha': 1e-05, 'perc__penalty': 'l1'}
In [22]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

perc_metrics_train = skewed_metrics(clf, test=False)
perc_metrics_test = skewed_metrics(clf, test=True)

The perceptron seems to overfit the training set and fails to generalize as well as logistic regression or SVM on the test set.

In [23]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[23]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
719 720 Hoopa Psychic Ghost 600 80 110 60 150 130 70 6 True
244 245 Suicune Water NaN 580 100 75 115 90 115 85 2 True
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
148 149 Dragonite Dragon Flying 600 91 134 95 100 100 80 1 False

e) Decision tree

In [24]:
from sklearn.tree import DecisionTreeClassifier

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('tree', DecisionTreeClassifier())
])

# cross-validate
parameters = {'tree__criterion': ['gini', 'entropy'], 'tree__splitter': ['best', 'random'],
              'tree__max_features': [None, 'auto'], 'tree__max_depth': [5, 10, 25, 50],
              'tree__class_weight': [None, 'balanced']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'tree__class_weight': 'balanced', 'tree__criterion': 'entropy', 'tree__max_depth': 5, 'tree__max_features': None, 'tree__splitter': 'best'}
In [25]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

tree_metrics_train = skewed_metrics(clf, test=False)
tree_metrics_test = skewed_metrics(clf, test=True)

The model performs well and, based on multiple runs with different parameters, overfits the training set much less when a max_depth is specified.
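
To see what the capped tree actually learned, we can inspect it directly. Here is a small sketch; it assumes clf is still the fitted pipeline above and that the feature union emits num_features first, then cat_features, in the order they were assembled:

# inspect the fitted tree inside the pipeline
tree_clf = clf.named_steps['tree']
feat_names = num_features + cat_features  # assumed FeatureUnion output order
print('actual depth:', tree_clf.tree_.max_depth)
print(pd.Series(tree_clf.feature_importances_, index=feat_names).nlargest(5))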

In [26]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[26]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
In [27]:
misclassified = clf.predict(X_train) != y_train

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[27]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
647 648 Meloetta Normal Psychic 600 100 77 77 128 128 90 5 False
150 151 Mew Psychic NaN 600 100 100 100 100 100 100 1 False

f) Random forest

In [28]:
from sklearn.ensemble import RandomForestClassifier

# pipeline to preprocess, reassemble, and fit the model
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('forest', RandomForestClassifier())
])

# cross-validate
parameters = {'forest__criterion': ['gini', 'entropy'], 'forest__max_depth': [5, 10, 25, 50],
              'forest__n_estimators': [10, 25, 50, 75, 100],
              'forest__max_features': [None, 'auto', 'log2'], 'forest__bootstrap': [True, False],
              'forest__class_weight': [None, 'balanced']}

grid_search = GridSearchCV(fit_pipeline, param_grid=parameters, cv=5).fit(X_train, y_train)
print('Best params: {}'.format(grid_search.best_params_))

clf = grid_search.best_estimator_
Best params: {'forest__bootstrap': True, 'forest__class_weight': 'balanced', 'forest__criterion': 'gini', 'forest__max_depth': 5, 'forest__max_features': 'log2', 'forest__n_estimators': 100}
In [29]:
cm_train = confusion_matrix(y_train, clf.predict(X_train))
cm_test = confusion_matrix(y_test, clf.predict(X_test))
plot_confusion_matrix([cm_train, cm_test])

forest_metrics_train = skewed_metrics(clf, test=False)
forest_metrics_test = skewed_metrics(clf, test=True)
In [30]:
misclassified = clf.predict(X_test) != y_test

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[30]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
250 251 Celebi Psychic Grass 600 100 100 100 100 100 100 2 False
In [31]:
misclassified = clf.predict(X_train) != y_train

Pokemon = pd.read_csv('pokemon_alopez247.csv') # because we got rid of names
Pokemon.iloc[misclassified[misclassified==True].index, :13]
Out[31]:
Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk Sp_Def Speed Generation isLegendary
375 376 Metagross Steel Psychic 600 80 135 130 95 90 70 3 False
647 648 Meloetta Normal Psychic 600 100 77 77 128 128 90 5 False
487 488 Cresselia Psychic NaN 600 120 70 120 75 130 85 4 False
150 151 Mew Psychic NaN 600 100 100 100 100 100 100 1 False
648 649 Genesect Bug Steel 600 71 120 95 120 95 99 5 False
489 490 Manaphy Water NaN 600 100 100 100 100 100 100 4 False
646 647 Keldeo Water Fighting 580 91 72 90 129 90 108 5 False

This random forest performs as well as our decision tree, which gives both of them the best performance on unseen data. However, their random nature means we would get different classifiers, with different precision-recall tradeoffs, were we to rerun the code above.
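
If reproducibility matters, the randomness can be pinned down by fixing the seed when building the pipeline (a sketch; the value 42 is arbitrary):

# fixing random_state makes the forest (and hence its metrics) reproducible
fit_pipeline = Pipeline([
    ('preprocessing', make_union(num_pipeline, PandasSelector(cat_features))),
    ('forest', RandomForestClassifier(random_state=42))
])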

Summary of results

In [32]:
data = {
'Logistic Regression': list(logreg_metrics_train + logreg_metrics_test),
'Support Vector Machines': list(svm_metrics_train + svm_metrics_test),
'KNN': list(KNN_metrics_train + KNN_metrics_test),
'Perceptron': list(perc_metrics_train + perc_metrics_test),
'Decision Tree': list(tree_metrics_train + tree_metrics_test),
'Random Forest': list(forest_metrics_train + forest_metrics_test)
}
cols = [ ['Training Set']*5 + ['Test Set']*5,
         ['ROC AUC', 'PR AUC', 'F1 Score', 'Precision', 'Recall']*2 ]

models = pd.DataFrame(data, index=cols).transpose()

models.sort_values(by=[('Test Set', 'PR AUC')], ascending=False)
Out[32]:
Training Set Test Set
ROC AUC PR AUC F1 Score Precision Recall ROC AUC PR AUC F1 Score Precision Recall
Decision Tree 0.999462 0.992763 0.974359 0.950000 1.000000 0.999088 0.986111 0.941176 0.888889 1.000
Random Forest 0.999853 0.997893 0.915663 0.844444 1.000000 0.999088 0.985243 0.941176 0.888889 1.000
Support Vector Machines 0.998826 0.974847 0.883721 0.791667 1.000000 0.995438 0.898115 0.888889 0.800000 1.000
Logistic Regression 0.999951 0.999316 0.883721 0.791667 1.000000 0.993613 0.873710 0.888889 0.800000 1.000
Perceptron 1.000000 1.000000 1.000000 1.000000 1.000000 0.992701 0.859573 0.750000 0.750000 0.750
KNN 0.993543 0.919928 0.831169 0.820513 0.842105 0.930657 0.779490 0.714286 0.833333 0.625

We now have 6 models, 4 of which have decent performance on unseen data and perfect recall. An interesting feature of our classifiers is their ability to detect when a Pokemon is special: Dragonite, Mew, Meloetta and the mythical Celebi appear repeatedly among the misclassified examples, suggesting that their stats place them closer to legendary Pokemon than to ordinary ones.

It would be interesting to see how the models above perform on Pokemon from the 7th generation, as an indication of their generalization potential. A different but promising approach to the same problem would be to perform anomaly detection, which works well on imbalanced datasets such as this one. Doing so would require fitting a multivariate Gaussian to a transformed (i.e. more normal) dataset with exclusively numerical features, and searching for a threshold $\epsilon$ that separates legendary Pokemon from regular ones. However we leave this exercise to the reader!
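
As a starting point nonetheless, here is a rough sketch of that idea; it skips the normalizing transformation and uses a purely hypothetical threshold:

# sketch of the anomaly-detection approach: fit a multivariate Gaussian to the
# numeric features of non-legendary Pokemon and flag low-density points
from scipy.stats import multivariate_normal

X_num = pokemon.loc[~pokemon.isLegendary, num_features].values
mu, cov = X_num.mean(axis=0), np.cov(X_num, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov).pdf(pokemon[num_features].values)
epsilon = 1e-18  # hypothetical threshold; tune it on a validation set (e.g. max F1)
predicted_legendary = density < epsilon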
