In my previous post Beer Crawl, I deployed a web crawler to extract information describing over 300 craft beers in British Columbia. However, after further investigation I realized that the dataset I had collected unfortunately has a few drawbacks. Perhaps the main one is that it describes the best beers in BC, and as a result their descriptions, ratings and reviews are more or less homogeneous, as this short visualization exercise will soon show.
Initialization and data pre-processing
Let's first load the standard libraries that we will use throughout and then proceed to clean up the raw data:
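As a sketch of what this clean-up step might look like (the column names and sample rows below are hypothetical stand-ins for the crawled data, which would normally be loaded from disk):

```python
import pandas as pd

# A tiny stand-in for the crawled data -- the real crawl output would be
# read from a file; these column names and rows are invented for illustration.
raw = pd.DataFrame({
    "name": ["  Fat Tug IPA", "Fat Tug IPA", "Red Racer ESB  ", "Back Hand of God"],
    "rating": ["4.3", "4.3", "not rated", "4.1"],
})

# Basic clean-up: strip stray whitespace from names, coerce ratings to
# numbers (invalid entries become NaN), drop rows without a valid rating,
# then drop exact duplicates left over from the crawl.
clean = raw.assign(
    name=raw["name"].str.strip(),
    rating=pd.to_numeric(raw["rating"], errors="coerce"),
).dropna(subset=["rating"]).drop_duplicates()
```

The `errors="coerce"` flag is what lets us treat unparseable ratings uniformly: they become `NaN` and are removed in a single `dropna` pass.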
Beer Crawl
Many websites like Kaggle and the UC Irvine Machine Learning Repository offer a vast selection of clean datasets on which experienced and aspiring data scientists alike can experiment. However, the web is full of interesting data that has not been organized into neat tables ready for extraction. For instance, one can think of plane ticket prices, real estate listings, stock prices, product reviews, social media feeds, etc. The list goes on!
In such cases web scraping becomes a very handy skill to have as a data scientist. I've always thought that the idea of creating a script to crawl webpages in order to extract information sounded fun, but I didn't know how to approach this problem until I came across this book, which pedagogically outlines how to use the Python library BeautifulSoup
to pull data out of HTML and XML files. I won't get into the specifics of my …
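To give a flavour of what BeautifulSoup makes possible, here is a minimal sketch; the HTML fragment, tag structure, and class names are invented for illustration, not taken from the actual site I crawled:

```python
from bs4 import BeautifulSoup

# An invented HTML fragment standing in for a real page listing beers.
html = """
<div class="beer"><h2>Fat Tug IPA</h2><span class="abv">7.0%</span></div>
<div class="beer"><h2>Red Racer ESB</h2><span class="abv">5.6%</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (name, abv) pairs by walking every div with class "beer".
beers = [
    (div.h2.get_text(), div.find("span", class_="abv").get_text())
    for div in soup.find_all("div", class_="beer")
]
```

The same `find_all` / `find` pattern scales to real pages: locate a repeating container element, then pull the fields you need out of each occurrence.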
Stochastic Optimization
In a world where data can be collected continuously and storage costs are cheap, the growing size of interesting datasets can pose a problem unless we have the right tools for the task. Indeed, with streaming data it might be impossible to wait until the "end" before fitting our model, since that end may never come. It might even be problematic to store all of the data, scattered across many different servers, in memory before using it. Instead it would be preferable to do an update each time some new data (or a small batch of it) arrives. Similarly, we might find ourselves in an offline situation where the number of training examples is very large and traditional approaches, such as gradient descent, start to become too slow for our needs.
Stochastic gradient descent (SGD) offers an easy solution to all of these problems.
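The core idea can be sketched in a few lines; this is a minimal toy example, assuming a least-squares objective on synthetic streaming data (variable names and the learning rate are my own choices, not from any particular post):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_example():
    """Simulate a stream: one (x, y) pair at a time, y = 2*x + 1 + noise."""
    x = rng.uniform(-1, 1)
    return x, 2.0 * x + 1.0 + 0.01 * rng.normal()

w, b = 0.0, 0.0   # model parameters
lr = 0.1          # learning rate

# One SGD step per incoming example: nudge the parameters against the
# gradient of the squared error on that single example, then discard it.
for _ in range(5000):
    x, y = next_example()
    err = (w * x + b) - y
    w -= lr * err * x
    b -= lr * err
```

Because each update touches only one example, nothing beyond the current parameters needs to stay in memory, which is exactly what the streaming and large-scale settings above call for.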
Hello World
After a week of fiddling and learning HTML and CSS on the fly, I finally managed to create a blog using Pelican, a static site generator powered by Python that supports Markdown and reST syntax.
Now that I have successfully defended my PhD thesis (which I verbosely titled "Numerical Investigation of Spatial Inhomogeneities in Gravity and Quantum Field Theory"), I have decided to take some time to blog about my progress as I transition towards a career in data science. This blog will be my platform to discuss all things related to
- Machine Learning;
- the ethics of Big Data;
- coding in Python;
- physics;
- interesting books I've read;
and more. Cheers!