In my previous post, Beer Crawl, I deployed a web crawler to extract information on over 300 craft beers in British Columbia. However, after further investigation I realized that the dataset I had collected unfortunately had a few drawbacks. Perhaps the main one is that it describes the best beers in BC, and as a result their descriptions, ratings and reviews are more or less homogeneous, as this short visualization exercise will soon show.
Initialization and data pre-processing
Let's first load the standard libraries that we will use throughout and then proceed to clean up the raw data:
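A minimal sketch of what such a set-up step might look like, using pandas. The file name, column names and example values here are hypothetical stand-ins for the crawled data, not the actual dataset:

```python
import pandas as pd

# Hypothetical raw rows, standing in for the crawled beer data
raw = pd.DataFrame({
    "name": ["  Fat Tug IPA ", "Red Racer ISA", None],
    "rating": ["4.21", "3.95", "4.05"],
    "abv": ["7%", "4%", "6.5%"],
})

# Basic clean-up: drop rows with no name, trim whitespace, cast types
beers = (
    raw.dropna(subset=["name"])
       .assign(
           name=lambda d: d["name"].str.strip(),
           rating=lambda d: d["rating"].astype(float),
           abv=lambda d: d["abv"].str.rstrip("%").astype(float),
       )
)
print(beers)
```

Chaining `dropna` and `assign` keeps the clean-up readable and leaves the raw frame untouched, which makes it easy to re-run the pre-processing from scratch.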
Beer Crawl
Many websites like Kaggle and the UC Irvine Machine Learning Repository offer a vast selection of clean datasets on which experienced and aspiring data scientists alike can experiment. However, the web is full of interesting data that has not been organized in neat tables ready for extraction. For instance, one can think of plane ticket prices, real estate listings, stock prices, product reviews, social media feeds, etc. The list goes on!
In such cases, web scraping becomes a very handy skill for a data scientist. I had always thought that writing a script to crawl web pages and extract information sounded fun, but I didn't know how to approach the problem until I came across this book, which pedagogically outlines how to use the Python library BeautifulSoup to pull data out of HTML and XML files. I won't get into the specifics of my …
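To give a flavour of what BeautifulSoup does, here is a small sketch that parses a toy HTML fragment. The markup and class names are made up for illustration; they are not the structure of the site I actually crawled:

```python
from bs4 import BeautifulSoup

# A small hypothetical HTML snippet, shaped like a beer listing page
html = """
<div class="beer">
  <h2 class="name">Fat Tug IPA</h2>
  <span class="rating">4.21</span>
</div>
<div class="beer">
  <h2 class="name">Red Racer ISA</h2>
  <span class="rating">3.95</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every tag matching the name/attribute filters;
# find returns the first match within a tag's subtree
beers = [
    {
        "name": div.find("h2", class_="name").get_text(strip=True),
        "rating": float(div.find("span", class_="rating").get_text()),
    }
    for div in soup.find_all("div", class_="beer")
]
print(beers)
```

In a real crawler the `html` string would come from an HTTP response rather than a literal, but the parsing pattern is the same: locate repeating container tags, then extract the fields inside each one.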