In my previous post, Beer Crawl, I deployed a web crawler to extract information describing over 300 craft beers in British Columbia. However, after further investigation I realized that the dataset I had collected unfortunately has a few drawbacks. Perhaps the main one is that it describes the best beers in BC, and as a result their descriptions, ratings and reviews are more or less homogeneous, as this short visualization exercise will soon show.
Initialization and data pre-processing¶
Let's first load the standard libraries that we will use throughout and then proceed to clean up the raw data:
# import standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# load dataset
Beers = pd.read_csv('beer_info.csv')
# pre-processing numerical data
for col in ['Appearance', 'Aroma', 'Taste', 'Palate']:
    Beers[col] = Beers[col].fillna(Beers[col].mean())
Beers['Total'] = Beers[['Appearance', 'Aroma', 'Taste', 'Palate']].mean(axis=1)
# missing breweries (done manually)
Beers.loc[47, 'Brewery'] = 'Wheelhouse and Yellow Dog'
Beers.loc[63, 'Brewery'] = 'Bomber & Doans'
Beers.loc[65, 'Brewery'] = 'Four Winds & Le Trou du Diable'
Beers.loc[72, 'Brewery'] = 'R&B'
Beers.loc[236, 'Brewery'] = 'Parallel 49'
Beers.loc[263, 'Brewery'] = 'Parallel 49'
# remove % sign in Alcohol and convert to float
Beers.loc[52, 'Alcohol'] = '4.5%'
Beers['Alcohol'] = Beers['Alcohol'].str.strip('%').astype(np.float64)
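With the missing ratings imputed and the Alcohol column converted, it is worth glancing at the dtypes and summary statistics of the cleaned columns (an optional check, using only the columns defined above):
# optional sanity check: dtypes, missing values and summary statistics of the cleaned columns
cols = ['Appearance', 'Aroma', 'Taste', 'Palate', 'Total', 'Alcohol']
print(Beers[cols].dtypes)
print(Beers[cols].isnull().sum())
print(Beers[cols].describe().round(2))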
Visualization of text data¶
One way to see that the beer descriptions are somewhat homogeneous is to create a wordcloud of the Pros and Cons columns and see which descriptive words come up the most. To do so we use a Python-based word cloud library, which automates the process.
We create the word clouds by concatenating all the Pros and all the Cons into two long strings, from which we then remove some common but non-descriptive words, such as Ale or Big. The WordCloud object handles tokenization automatically, picking out the most frequently occurring key words and sizing them by relative frequency. Additionally, one of the options lets us supply a black-and-white "thumbs up" image as a mask over which the words are displayed, adding a nice touch to the presentation.
from wordcloud import WordCloud
from os import path, getcwd
from PIL import Image
d = getcwd()
# define thumb mask
thumb_mask = np.array(Image.open(path.join(d, "thumbs-up.jpg")))
thumb_up = thumb_mask[:, :, 0]
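# rotate the mask 180 degrees so the Cons cloud is drawn inside a thumbs-down shape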
thumb_down = np.rot90(thumb_mask[:, :, 0], 2)
# wordcloud for Pros
textPros = Beers['Pros'].str.cat(sep=' ')
removewords_pros = ['flavour', 'Nice', 'Hop', 'hop', 'Ale', 'ale', 'ed', 'well', 'BC',
'Great', 'great', 'big', 'Big', 'ful', 'ag', 'bodi', 'ler', 'beer']
for w in removewords_pros:
    textPros = textPros.replace(w, '')
# restore 'balanced', which was mangled by stripping 'ed' above
textPros = textPros.replace('balanc ', 'balance ')
textPros = textPros.replace('Balanc ', 'balance ')
textPros = textPros.replace('balance', 'balanced')
wordcloud_pros = WordCloud(max_words=80, width=2000, height=1200,
mask=thumb_up, background_color='white').generate(textPros)
# wordcloud for Cons
textCons = Beers['Cons'].str.cat(sep=' ')
removewords_cons = ['flavour', 'may', 'Quite', 'quite', 'Rather', 'hop', 'Ale', 'ale',
'find', 'Fairly', 'BC', 'beer', 'Big', 'big', 'ger', 'much', 'py']
for w in removewords_cons:
    textCons = textCons.replace(w, '')
wordcloud_cons = WordCloud(max_words=80, width=2000, height=1200,
mask=thumb_down, background_color='white').generate(textCons)
# plot the two wordclouds
_, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 12))
ax1.imshow(wordcloud_pros, interpolation="bilinear");
ax1.axis("off");
ax2.imshow(wordcloud_cons, interpolation="bilinear");
ax2.axis("off");
We see that many words, such as IPA, light, strong, tone and sour, appear as major contributors to both wordclouds. This observation, together with the low variance of ratings across all beers, suggests that any analysis relying on these two columns to predict total scores would struggle to discern a bad beer from a better one.
Analysis of the beer ratings¶
The beers are rated according to four criteria - Appearance, Aroma, Palate, and Taste - all of which are closely distributed around 4.2/5. However, the vast majority of beers end up with a combined total higher than 4/5, which further undermines the predictive power of the Pros and Cons columns. In fact, a preliminary analysis (not shown here) that fed $n$-grams with $n \leq 3$ from the Pros and Cons columns into random forest regressors produced predictions with very large residual variance.
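A quick look at the means and standard deviations of the rating columns makes this concrete:
# mean and spread of each rating column; the standard deviations are small
print(Beers[['Appearance', 'Aroma', 'Palate', 'Taste', 'Total']].agg(['mean', 'std']).round(3))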
import matplotlib.gridspec as gridspec
f = plt.figure(figsize=(15, 10))
gridspec.GridSpec(4, 8)
# plot individual plots
plt.subplot2grid((4, 8), (0, 0), colspan=2)
sns.distplot(Beers['Appearance'], kde=False);
plt.subplot2grid((4, 8), (0, 2), colspan=2)
sns.distplot(Beers['Aroma'], kde=False);
plt.subplot2grid((4, 8), (1, 0), colspan=2)
sns.distplot(Beers['Palate'], kde=False);
plt.subplot2grid((4, 8), (1, 2), colspan=2)
sns.distplot(Beers['Taste'], kde=False);
plt.subplot2grid((4, 8), (0, 4), colspan=4, rowspan=2)
sns.distplot(Beers['Total'], kde=False);
plt.tight_layout()
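For reference, the sketch below shows what such an $n$-gram baseline could look like. It assumes scikit-learn and a plain bag-of-$n$-grams representation of the Pros and Cons text; the preliminary analysis mentioned above was not necessarily set up this way.
# rough sketch of an n-gram (n <= 3) random forest baseline for the Total score
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
# bag of 1- to 3-grams built from the concatenated Pros and Cons text
text = Beers['Pros'].fillna('') + ' ' + Beers['Cons'].fillna('')
X = CountVectorizer(ngram_range=(1, 3), min_df=2).fit_transform(text)
y = Beers['Total']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
# inspect the spread of the residuals on held-out beers
residuals = y_test - rf.predict(X_test)
print(residuals.describe())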
Most reviewed breweries and beer types¶
With the Brewery and Type columns in hand, we can get an idea of the most popular BC craft breweries and the most commonly reviewed beer types. We do need to keep in mind that this dataset is not a thorough collection of all craft beers produced in BC, but only a subset of the best ones according to a small group of people.
To extract the top 20 breweries from this dataset, we first need to standardize the names, since many breweries appear under two or more different names, e.g. 'Howe Sound Brewery' and 'Howe Sound Brewing Co.'. In fact, essentially all of the discrepancies come down to using 'Brewery', 'Brewing Co.' or 'Brewing Company' interchangeably, so an effective way to normalize the Brewery column is to keep only the first word of each entry.
# breweries by first word in name
for idx, brewery in enumerate(Beers['Brewery']):
    if ' s' in brewery:
        brewery = brewery.replace(' s', 's')
    Beers.loc[idx, 'Brewery (one word)'] = brewery.split()[0]
# extract top 20 breweries
top20Breweries = Beers['Brewery (one word)'].value_counts().index[:20]
top20Count = Beers['Brewery (one word)'].value_counts()[:20]
top20 = sns.barplot(x=top20Breweries, y=top20Count);
_ = top20.set(ylabel='Count', title='Top 20 Most Productive Breweries in BC');
for xlabel in top20.get_xticklabels():
    xlabel.set_rotation(90)
As for the most reviewed beer types, no such post-processing is necessary. Unsurprisingly (given the content of the word clouds), the top spot is taken by India Pale Ales, but it remains unclear whether this tells us that BC produces a lot of IPAs or simply that the reviewers have a personal preference for them.
# most reviewed beers
most_reviewed = Beers.groupby('Type')['Type'].count().nlargest(20).plot(
kind='bar', title='Most Reviewed Beer Types', colormap='summer')
most_reviewed.set_xlabel('');
most_reviewed.set_ylabel('Number of Beers Reviewed');