Beer Crawl

Many websites like Kaggle and the UC Irvine Machine Learning Repository offer a vast selection of clean data sets on which experienced and aspiring data scientists alike can experiment. However, the web is full of interesting data that has not been organized into neat tables ready for extraction. For instance, one can think of plane ticket prices, real estate listings, stock prices, product reviews, social media feeds, etc. The list goes on!

In such cases, web scraping becomes a very handy skill for a data scientist to have. I've always thought that the idea of writing a script to crawl webpages and extract information sounded fun, but I didn't know how to approach the problem until I came across this book, which pedagogically outlines how to use the Python library BeautifulSoup to pull data out of HTML and XML files. I won't get into the specifics of my implementation here, but there are many free tutorials available online that can help you get started.

In order to start scraping, I first had to find an interesting website on which to get my hands dirty. I eventually stumbled upon Beer Me BC, a website dedicated to reviewing craft beers in British Columbia. Bingo! Not only were most review pages structured the same way, but the quantity of information I could extract could also lead to a nice analysis of what makes a great craft beer in BC. Challenge accepted!

The Golden Rules of Web Scraping

There are a few things to be aware of before scraping a website:

  1. Be polite. A computer can send thousands of web requests per second, which can harm a website by overloading its servers. It is easy to implement a time delay before each request so that a website's performance doesn't degrade for other users (see the sketch after this list).

  2. Respect a website's rules about scraping, which are usually spelled out in its robots.txt file (typically located at www.website.com/robots.txt). For instance, some websites may ask that scraping be done at night only in order to keep daytime traffic at reasonable levels.

  3. Scrapers break. There is no guarantee that the layout your web scraper is based upon will stay the same, so keep things as general as possible and be prepared to rewrite your code if changes occur.

  4. There's no avoiding inconsistencies. Most webpages are well-structured, but do not expect every page to be exactly the same. Manual clean-up will very likely be required after acquiring the data.

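For instance, rules 1 and 2 can be covered with a few lines of Python's standard library. The sketch below is only an illustration: the two-second delay and the example URLs are placeholder values, not ones taken from the site.

import time
from urllib import robotparser

# check the site's crawling rules before scraping anything
rp = robotparser.RobotFileParser()
rp.set_url("https://beermebc.com/robots.txt")
rp.read()

url = "https://beermebc.com/some-review/"   # hypothetical page
if rp.can_fetch("*", url):
    time.sleep(2)   # pause before each request so the server is not overloaded
    # ... perform the request here ...
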
Beer Crawl

With these rules in mind, I wrote two scraping functions. The first one, getBeerLinks, is a crawler that extracts individual beer review URLs from the site's paginated list of reviews.

import requests
from bs4 import BeautifulSoup

def getBeerLinks(pageUrl):
    """ Links to beer descriptions are contained in a list of pages """

    # create session with custom HTTP header
    session = requests.Session()
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5", 
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}
    beerLinks = []

    # access url; raise exception if something goes wrong
    try:
        req = session.get(pageUrl, headers=headers)
        req.raise_for_status()
    except Exception:
        print("Error encountered when accessing url; return empty 'beerLinks' list.")
        return beerLinks

    # beer urls are all under the h3 header of the blog-list style-1 division 
    try:
        bsObj = BeautifulSoup(req.text, 'lxml')
        beers = bsObj.find("div", {"class": "blog-list style-1"}).findAll("h3")
    except AttributeError:
        print("Attribute error encountered in bsObj; return empty 'beerLinks' list.")
        return beerLinks

    # extract individual beer urls
    for beer in beers:
        beerLinks.append(beer.a["href"])

    return beerLinks

The requests library is useful to create a web session that mimics a human user so that the website administrators don't flag our script as a bot. This is achieved by setting the User-Agent field of our session's headers to the one a typical web browser would send. We then use BeautifulSoup to parse the HTML text stored in req and extract the information we need, which in our case is the URL contained in the href attribute of the link inside each h3 header on the page.

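As an illustration of how getBeerLinks might be driven over the paginated review listing, here is a hedged sketch: the /page/N/ URL pattern and the page count are assumptions made for the example, not values taken from the site.

import time

baseUrl = "https://beermebc.com/category/beer-reviews/page/{}/"   # hypothetical pattern

allLinks = []
for page in range(1, 6):    # assume five listing pages for illustration
    links = getBeerLinks(baseUrl.format(page))
    if not links:           # an empty list signals an error or a missing page
        break
    allLinks.extend(links)
    time.sleep(2)           # polite delay between requests

print(len(allLinks), "beer review URLs collected")
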
The second function, called getBeerInfo, takes each of these URLs as input and returns a dictionary that contains information such as the beer's name, brewery, ratings, alcohol percentage, review text, and so on. Note that this function is very specific to the particular layout of the review pages.

import requests
from bs4 import BeautifulSoup
import numpy as np
import re

def getBeerInfo(pageUrl):
    """ Retrieve beer information and store in a Python dictionary """

    # create session with custom HTTP header
    session = requests.Session()
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5", 
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}

    beerInfo = {}

    # access url; raise exception if something goes wrong
    try:
        req = session.get(pageUrl, headers=headers)
        req.raise_for_status()
    except Exception:
        print("Error encountered when accessing url; return.")
        return

    bsObj = BeautifulSoup(req.text, 'lxml')

    # Name and Brewery
    tempName = bsObj.h1.get_text()
    try:
        I = tempName.index('–')
        beerInfo['Name'] = tempName[I+2:]
        beerInfo['Brewery'] = tempName[:I-1]
    except ValueError:
        # if '–' is not in the title, the information may have to be parsed manually later
        beerInfo['Name'] = tempName

    # Ratings (does not include total)
    for rating in bsObj.findAll("div", {"class": "rate-item"}):
        tempRating = rating.findAll("strong")    # rating and categories within <strong> tag
        beerInfo[tempRating[1].get_text().strip().title()] = np.float64(tempRating[0].get_text())
    # Now treat inconsistencies in ratings systems
    if 'Flavour' in beerInfo:
        beerInfo['Taste'] = beerInfo['Flavour']
    if 'Mouthfeel' in beerInfo:
        beerInfo['Palate'] = beerInfo['Mouthfeel']

    # Total score is left empty because some beers are rated with <4 criteria
    # Pandas can handle NaN easily in postprocessing

    # Categories found in second panel...
    panel = bsObj.findAll("div", {"class": "panel"})[1]

    # Type (located in possibly many <a> tags, so we need to concatenate them)
    beerInfo['Type'] = ", ".join([a.get_text() for a in panel.findAll("li")[0].findAll("a")])

    # Pros, Cons, and Conclusion
    panel_list = panel.findAll("li", {"class": "graytext"})
    try:
        beerInfo['Pros'] = panel_list[0].p.get_text()
        beerInfo['Cons'] = panel_list[1].p.get_text()
        beerInfo['Conclusion'] = panel_list[2].p.get_text()
    except IndexError:
        # these categories do not exist
        pass

    # Alcohol, Size, and IBU (if available); will need post-processing
    # If value not available, will show up as NaN in .csv file since not in dictionary
    paragraph_row = bsObj.find("div", {"class": "paragraph-row"})

    pASI = paragraph_row.parent.findAll("p") # so we can loop through <p> tags

    for p in pASI:
        paragraph = p.get_text().title().split()
        # check each field independently since they may not all appear in the same paragraph
        if 'Alcohol' in paragraph:
            I = paragraph.index('Alcohol')
            beerInfo['Alcohol'] = paragraph[I+2]
        if 'Size' in paragraph:
            I = paragraph.index('Size')
            beerInfo['Size'] = re.sub(r'\D', '', paragraph[I+2])
        if 'Ibu' in paragraph:
            I = paragraph.index('Ibu')
            beerInfo['IBU'] = paragraph[I+2]

    # Date reviewed
    beerInfo['Date Reviewed'] = bsObj.find("span", {"class": "dtreviewed"}).span["title"]

    # Reviewer (take last element of the title attribute "Posts by *reviewer*")
    beerInfo['Reviewer'] = bsObj.find("span", {"class": "reviewer"}).a["title"].split().pop()

    # Categories and Tags
    articleFoot = bsObj.find("div", {"class": "article-foot"})

    categoryLinks = articleFoot.find("div", {"class": "left"}).findAll("a")
    categoryList = []
    for category in categoryLinks:
        categoryList.append(category.get_text())
    beerInfo['Categories'] = ", ".join(categoryList)

    tagLinks = articleFoot.find("div", {"class": "right"}).findAll("a")
    tagList = []
    for tag in tagLinks:
        tagList.append(tag.get_text())
    beerInfo['Tags'] = ", ".join(tagList)

    # Review text and URL
    overview_header = bsObj.find("div", {"class": "paragraph-row"}).findNext("h2").findNext("h2")
    beerInfo['Review Text'] = overview_header.findNext("p").get_text()
    beerInfo['URL'] = pageUrl    # for identification of duplicates

    return beerInfo

Notice the many if and try ... except statements in both functions above. These statements are critical to the proper functioning of our web scraper. It turns out that even if most beer reviews follow a similar HTML template, exceptions are bound to happen: the 'Taste' rating may be called 'Flavour' in a handful of reviews; the 'Pros' and 'Cons' categories might be absent; the URL may not be accessible for some reason; etc. By giving the scraper an alternative (e.g. if 'Pros' doesn't exist, then keep on going), we allow it to retrieve as much information as it can instead of stopping it in its tracks whenever an exception occurs.

However, no matter how many precautions one takes, additional measures will need to be implemented. For instance, in this case many strings contained in the beerInfo dictionary resulted in a UnicodeDecodeError when saved to a .csv file. The culprit was often a rogue slanted apostrophe that the ascii codec could not decode. I thus had to post-process all strings to strip non-ASCII characters using the regular expression library re:

for key in beerInfo.keys():
    if isinstance(beerInfo[key], str):
        beerInfo[key] = re.sub(r'[^\x00-\x7f]', r' ', beerInfo[key])

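Putting the pieces together, a hedged end-to-end sketch might look like the following. It assumes the allLinks list from the earlier pagination sketch and uses pandas to write the results; the output file name is a placeholder.

import time
import re
import pandas as pd

records = []
for url in allLinks:
    info = getBeerInfo(url)
    if info is None:    # skip pages that could not be retrieved
        continue
    # strip non-ASCII characters to avoid encoding errors when writing the file
    for key in info:
        if isinstance(info[key], str):
            info[key] = re.sub(r'[^\x00-\x7f]', r' ', info[key])
    records.append(info)
    time.sleep(2)       # polite delay between requests

# pandas fills missing keys (e.g. absent 'Pros' or 'IBU') with NaN automatically
pd.DataFrame(records).to_csv("beer_reviews.csv", index=False)
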
Getting a scraper to work properly requires patience and quite a bit of trial and error, but the rewards are usually worth it, and it is a fantastic addition to a data scientist's skill set. Note that there are additional Python tools that can broaden the scope of your web scraping. One of these is Scrapy, a powerful crawling framework that handles downloading, cleaning, and saving data, as opposed to simply parsing HTML files like BeautifulSoup. Another useful library is Selenium, originally developed for web testing; it can handle JavaScript execution and redirects, so it lets you extract information that only appears dynamically in a browser, e.g. when hovering over an element.

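To give an idea of what that looks like, here is a minimal Selenium sketch. It assumes a Chrome driver is available, and the URL and CSS selector are hypothetical placeholders rather than anything from Beer Me BC.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()                      # assumes a Chrome driver is installed
driver.get("https://example.com/dynamic-page")   # placeholder URL

# hover over an element to trigger content that only appears dynamically
element = driver.find_element(By.CSS_SELECTOR, ".rating-widget")   # hypothetical selector
ActionChains(driver).move_to_element(element).perform()

html = driver.page_source    # the rendered HTML can now be parsed with BeautifulSoup
driver.quit()
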
For the interested, the raw beer dataset obtained with the above web crawler is available here, and I will soon post an analysis of the data it contains.
