Many websites, like Kaggle and the UC Irvine Machine Learning Repository, offer a vast selection of clean data sets on which experienced and aspiring data scientists alike can experiment. However, the web is full of interesting data that has not been organized into neat tables ready for extraction. Think, for instance, of plane ticket prices, real estate listings, stock prices, product reviews, social media feeds, etc. The list goes on!
In such cases, web scraping becomes a very handy skill for a data scientist to have. I had always thought that the idea of writing a script to crawl webpages and extract information sounded fun, but I didn't know how to approach the problem until I came across this book, which pedagogically outlines how to use the Python library BeautifulSoup to pull data out of HTML and XML files. I won't get into the specifics of my implementation here, but there are many free tutorials available online that can help you get started.
To start scraping, I first had to find an interesting website on which to get my hands dirty. I eventually stumbled upon Beer Me BC, a website dedicated to reviewing craft beers in British Columbia. Bingo! Not only were most review pages structured the same way, but the quantity of information I could extract could also lead to a nice analysis of what makes a great craft beer in BC. Challenge accepted!
The Golden Rules of Web Scraping
There are a few things to be aware of before scraping a website:
- Be polite. A computer can send thousands of web requests per second, which has the potential to harm a website by overloading its servers. It is easy to implement a time delay before each request so that a website's performance doesn't degrade for other users (see the sketch after this list).
- Respect a website's terms and conditions about scraping, which are usually found in the website's robots.txt file (typically located at www.website.com/robots.txt). For instance, some websites may ask that scraping be done at night only in order to keep daytime traffic at reasonable levels.
- Scrapers break. There is no guarantee that the layout your web scraper is based upon will stay the same, so keep things as general as possible and be prepared to rewrite your code if changes occur.
- There's no avoiding inconsistencies. Most webpages are well-structured, but do not expect every page to be exactly the same. Manual clean-up will very likely be required after acquiring the data.
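As an illustration of the first two rules, here is a minimal sketch of a polite request loop. The site URL, page list, and delay value are placeholders for illustration; the robots.txt check uses Python's built-in urllib.robotparser module.

```python
import time
import requests
from urllib import robotparser

# check the site's robots.txt before scraping (URL is a placeholder)
robots = robotparser.RobotFileParser()
robots.set_url("https://www.website.com/robots.txt")
robots.read()

urls = ["https://www.website.com/page/1/", "https://www.website.com/page/2/"]
for url in urls:
    if not robots.can_fetch("*", url):
        continue  # skip pages the site asks crawlers not to fetch
    response = requests.get(url)
    time.sleep(2)  # polite delay (in seconds) between requests
```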
Beer Crawl
With these rules in mind, I wrote two scraping functions. The first one, getBeerLinks, is a crawler that extracts individual beer review URLs from a list of pages.
```python
import requests
from bs4 import BeautifulSoup


def getBeerLinks(pageUrl):
    """
    Links to beer descriptions are contained in a list of pages
    """
    # create session with custom HTTP header
    session = requests.Session()
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}
    beerLinks = []

    # access url; raise exception if something goes wrong
    try:
        req = session.get(pageUrl, headers=headers)
        req.raise_for_status()
    except Exception:
        print("Error encountered when accessing url; returning empty 'beerLinks' list.")
        return beerLinks

    # beer urls are all under the h3 headers of the "blog-list style-1" division
    try:
        bsObj = BeautifulSoup(req.text, 'lxml')
        beers = bsObj.find("div", {"class": "blog-list style-1"}).findAll("h3")
    except AttributeError:
        print("Attribute error encountered in bsObj; returning empty 'beerLinks' list.")
        return beerLinks

    # extract individual beer urls
    for beer in beers:
        beerLinks.append(beer.a["href"])

    return beerLinks
```
The requests library is used to create a web session that mimics a human user so that the website administrators don't flag our script as a bot. This is achieved by setting the user-agent in our session's headers to be the same as that of a typical web browser. We then use BeautifulSoup to parse the HTML text stored in req and extract the information we need, which in our case is the set of URLs contained in the href attribute of the links under each h3 header on the page.
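As a usage sketch, one could loop getBeerLinks over the paginated review list while respecting the politeness rule above. Note that the page URL pattern below is a hypothetical placeholder for illustration, not necessarily the site's actual structure.

```python
import time

# hypothetical URL pattern for the paginated review list
allLinks = []
for page in range(1, 4):
    pageUrl = "https://beermebc.com/page/%d/" % page
    allLinks.extend(getBeerLinks(pageUrl))
    time.sleep(1)  # polite delay between list pages

print(len(allLinks), "beer review links collected")
```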
The second function, getBeerInfo, takes these URLs as input and returns a dictionary containing information such as the beer's name, brewery, ratings, alcohol percentage, review text, and so on. Note that this function is very specific to the particular layout of the review pages.
```python
import requests
from bs4 import BeautifulSoup
import numpy as np
import re


def getBeerInfo(pageUrl):
    """
    Retrieve beer information and store it in a Python dictionary
    """
    # create session with custom HTTP header
    session = requests.Session()
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}
    beerInfo = {}

    # access url; raise exception if something goes wrong
    try:
        req = session.get(pageUrl, headers=headers)
        req.raise_for_status()
    except Exception:
        print("Error encountered when accessing url; return.")
        return

    bsObj = BeautifulSoup(req.text, 'lxml')

    # Name and Brewery
    tempName = bsObj.h1.get_text()
    try:
        I = tempName.index('–')
        beerInfo['Name'] = tempName[I+2:]
        beerInfo['Brewery'] = tempName[:I-1]
    except ValueError:
        # if – not in title, may have to parse information manually later
        beerInfo['Name'] = tempName

    # Ratings (does not include total)
    for rating in bsObj.findAll("div", {"class": "rate-item"}):
        # rating and categories within <strong> tags
        tempRating = rating.findAll("strong")
        beerInfo[tempRating[1].get_text().strip().title()] = np.float64(tempRating[0].get_text())

    # Now treat inconsistencies in rating systems
    if 'Flavour' in beerInfo:
        beerInfo['Taste'] = beerInfo['Flavour']
    if 'Mouthfeel' in beerInfo:
        beerInfo['Palate'] = beerInfo['Mouthfeel']
    # Total score is left empty because some beers are rated with <4 criteria;
    # pandas can handle NaN easily in postprocessing

    # Categories found in second panel...
    panel = bsObj.findAll("div", {"class": "panel"})[1]

    # Type (located in possibly many <a> tags, so we need to concatenate them)
    beerInfo['Type'] = ", ".join([a.get_text() for a in panel.findAll("li")[0].findAll("a")])

    # Pros, Cons, and Conclusion
    panel_list = panel.findAll("li", {"class": "graytext"})
    try:
        beerInfo['Pros'] = panel_list[0].p.get_text()
        beerInfo['Cons'] = panel_list[1].p.get_text()
        beerInfo['Conclusion'] = panel_list[2].p.get_text()
    except IndexError:
        # these categories do not exist
        pass

    # Alcohol, Size, and IBU (if available); will need post-processing.
    # If a value is not available, it will show up as NaN in the .csv file
    # since it is not in the dictionary
    paragraph_row = bsObj.find("div", {"class": "paragraph-row"})
    pASI = paragraph_row.parent.findAll("p")  # so we can loop through <p> tags
    for p in pASI:
        paragraph = p.get_text().title().split()
        if 'Alcohol' in paragraph:
            I = paragraph.index('Alcohol')
            beerInfo['Alcohol'] = paragraph[I+2]
        if 'Size' in paragraph:
            I = paragraph.index('Size')
            beerInfo['Size'] = re.sub(r'\D', '', paragraph[I+2])
        if 'Ibu' in paragraph:
            I = paragraph.index('Ibu')
            beerInfo['IBU'] = paragraph[I+2]

    # Date reviewed
    beerInfo['Date Reviewed'] = bsObj.find("span", {"class": "dtreviewed"}).span["title"]

    # Reviewer (take last element of the title attribute "Posts by *reviewer*")
    beerInfo['Reviewer'] = bsObj.find("span", {"class": "reviewer"}).a["title"].split().pop()

    # Categories and Tags
    articleFoot = bsObj.find("div", {"class": "article-foot"})
    categoryLinks = articleFoot.find("div", {"class": "left"}).findAll("a")
    categoryList = []
    for category in categoryLinks:
        categoryList.append(category.get_text())
    beerInfo['Categories'] = ", ".join(categoryList)
    tagLinks = articleFoot.find("div", {"class": "right"}).findAll("a")
    tagList = []
    for tag in tagLinks:
        tagList.append(tag.get_text())
    beerInfo['Tags'] = ", ".join(tagList)

    # Review text and URL
    overview_header = bsObj.find("div", {"class": "paragraph-row"}).findNext("h2").findNext("h2")
    beerInfo['Review Text'] = overview_header.findNext("p").get_text()
    beerInfo['URL'] = pageUrl  # for identification of duplicates

    return beerInfo
```
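To tie the two functions together, a driver along the following lines could scrape every collected link and dump the results to a .csv file with pandas. This is a sketch of one possible workflow rather than the exact script I used; it assumes the allLinks list gathered earlier.

```python
import time
import pandas as pd

reviews = []
for link in allLinks:  # links gathered by getBeerLinks
    info = getBeerInfo(link)
    if info:  # getBeerInfo returns None when the request fails
        reviews.append(info)
    time.sleep(1)  # polite delay between review pages

# pandas fills missing dictionary keys with NaN automatically
df = pd.DataFrame(reviews)
df.drop_duplicates(subset="URL", inplace=True)
df.to_csv("beer_reviews.csv", index=False)
```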
Notice the many if and try ... except statements in both functions above. These statements are critical to the proper functioning of our web scraper. It turns out that even though most beer reviews follow a similar HTML template, exceptions are bound to happen: the 'Taste' rating may be called 'Flavour' in a handful of reviews; the 'Pros' and 'Cons' categories might be absent; the URL may not be accessible for some reason; etc. By giving the scraper an alternative (e.g. if 'Pros' doesn't exist, then keep going), we allow it to retrieve as much information as it can instead of stopping it in its tracks whenever an exception occurs.
However, no matter how many precautions one takes, additional measures will need to be implemented. For instance, in this case many strings contained in the beerInfo dictionary resulted in a UnicodeDecodeError when saved to a .csv file. The culprit was often a rogue slanted apostrophe in the text that the ascii codec could not decode. I thus had to post-process all strings and strip out non-ASCII characters using the regular expression library re:
```python
for key in beerInfo.keys():
    if isinstance(beerInfo[key], str):
        beerInfo[key] = re.sub(r'[^\x00-\x7f]', r' ', beerInfo[key])
```
Getting a scraper to work properly requires patience and quite a bit of trial and error, but the rewards are usually worth it, and web scraping is a fantastic addition to a data scientist's skill set. Note that there are additional Python tools one may use to broaden the scope of one's web scraping. One of these is Scrapy, a powerful crawling framework that handles downloading, cleaning, and saving data, as opposed to simply parsing HTML files like BeautifulSoup. Another useful library is Selenium, originally developed for web testing, which is valuable because it can handle JavaScript execution and redirects, so that one can extract information that appears dynamically in a browser, e.g. when hovering over an element.
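For instance, a minimal Selenium sketch might look like the following. It assumes a Chrome driver is installed on the machine, and the URL is a placeholder; the rendered page source can then be handed to BeautifulSoup as before.

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver on the PATH
driver.get("https://www.website.com/dynamic-page")  # placeholder URL
# page_source holds the rendered page, including JavaScript-generated content
bsObj = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
```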
For those interested, the raw beer dataset obtained from the above web crawler is available here, and I will soon post an analysis of the data it contains.