1. Legendary Pokemon Classifier

    There is no denying that Pokemon had a big influence on all the kids of my generation. I remember being 8 or 9 and looking forward to finishing school so I could catch up on the adventures of Ash and Pikachu. I also remember the fun I would have playing Pokemon Stadium on Nintendo 64 with my cousins on the weekend. The phenomenal popularity of Pokemon GO last year further confirmed that the nostalgia factor is still strong for a lot of people, even to this day.

    I was browsing Kaggle for datasets to practice classification algorithms when I came across one describing the first 6 generations of Pokemon with a total of 721 Pokemon, of which 46 are legendary. Bingo! I thought. This dataset is not only a fun way to experiment with classifiers to predict whether a Pokemon is legendary or not, but also provides a way to simulate an end-to-end machine learning project. Moreover, evaluating the performance of our models will require careful thinking since only a small fraction (6.4% to be exact) of the Pokemon are legendary.
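    To give a flavor of why this matters, here is a minimal scikit-learn sketch (an assumption on my part, not necessarily the models used in the post) of handling such an imbalanced problem: a stratified split, balanced class weights, and precision/recall rather than raw accuracy, with synthetic data standing in for the real Pokemon table.

    ```python
    # Hypothetical sketch: synthetic data mimics the ~6% legendary rate.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Stand-in for the Pokemon stats table: ~6% positive (legendary) class.
    X, y = make_classification(n_samples=721, n_features=8, weights=[0.94],
                               random_state=0)

    # A stratified split keeps the legendary ratio identical in train and test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=0)

    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)

    # Accuracy alone is misleading at 6.4% prevalence; precision and recall
    # on the legendary class tell the real story.
    print(classification_report(y_test, clf.predict(X_test)))
    ```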

    Read more →
  2. Dogs vs. Cats - Classification with VGG16

    Convolutional neural networks (CNNs) are the state of the art when it comes to computer vision. As such, we will build a CNN model to distinguish images of cats from those of dogs using the Dogs vs. Cats Redux: Kernels Edition dataset.

    Pre-trained deep CNNs typically generalize easily to different but similar datasets with the help of transfer learning. The reason is simple: the filters in the earlier convolutional layers of a CNN usually capture low-level features such as straight lines, whereas the higher-level filters that recognize complex objects such as faces are activated deeper in the network. As such, it is possible to reuse the trained weights associated with shape recognition and retrain only the deepest layers of the network - a procedure called fine-tuning or transfer learning - to perform classification tasks on different types of images.
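    As a rough illustration of the idea (using Keras here as an assumed tool - the post itself follows the fast.ai workflow), one can load VGG16 with its ImageNet weights, freeze the convolutional base, and train only a small new classification head:

    ```python
    # Minimal transfer-learning sketch with Keras (not the post's fast.ai code).
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras import layers, models

    # Reuse VGG16's convolutional filters trained on ImageNet; drop its classifier.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # keep the low-level edge/texture filters frozen

    # New head: a single sigmoid unit for the binary cats-vs-dogs decision.
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    ```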

    For this competition we follow the process described in the fast.ai deep learning course.

    Read more →

  3. Beer Crawl - Preliminary Analysis

    In my previous post Beer Crawl, I deployed a web crawler to extract information describing over 300 craft beers in British Columbia. However, after further investigation I realized that the dataset I had collected unfortunately had a few drawbacks. Perhaps the main one is that it describes the best beers in BC, and as a result their descriptions, ratings and reviews are more or less homogeneous, as this short visualization exercise will soon show.

    Initialization and data pre-processing

    Let's first load the standard libraries that we will use throughout and then proceed to clean up the raw data:
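    A sketch of the kind of setup cell this refers to might look as follows (the file name and column names are assumptions, not the notebook's actual code):

    ```python
    # Hypothetical first cell: load libraries, read the crawl output, light cleaning.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    beers = pd.read_csv("bc_beers.csv")          # assumed file name
    beers = beers.drop_duplicates().dropna(subset=["name"])
    beers["rating"] = pd.to_numeric(beers["rating"], errors="coerce")
    print(beers.describe())
    ```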

  4. AI Fearmongering

    The future of AI is a topic of contention at the moment, with many prestigious names bringing their conflicting opinions to the table. On one hand there are the pessimists, supported by Elon Musk and Stephen Hawking, who warn about the potential threat AI poses to mankind's existence. On the other, we find optimists like Mark Zuckerberg, who thinks that such fearmongering is not only negative but also irresponsible.

    The idea that artificial intelligence could be dangerous is not new. A particularly widespread hypothesis of an AI doomsday scenario is that of the singularity, the idea that an upgradable superintelligent agent could enter an ever-accelerating cycle of self-improvement until its problem-solving and inventive skills far surpass those of humanity. This agent could then proceed to build an even more intelligent machine and the cycle would repeat itself indefinitely, leaving no room for mankind in the process.

    We are obviously …

    Read more →
  5. Beer Crawl

    Many websites like Kaggle and the UC Irvine Machine Learning Repository offer a vast selection of clean datasets on which experienced and aspiring data scientists alike can experiment. However, the web is full of interesting data that has not been organized into neat tables ready for extraction. For instance, one can think of plane ticket prices, real estate listings, stock prices, product reviews, social media feeds, etc. The list goes on!

    In such cases, web scraping becomes a very handy skill for a data scientist to have. I've always thought that the idea of creating a script to crawl webpages in order to extract information sounded fun, but I didn't know how to approach this problem until I came across this book, which pedagogically outlines how to use the Python library BeautifulSoup to pull data out of HTML and XML files. I won't get into the specifics of my …
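    For a taste of what this looks like in practice, here is a tiny, purely illustrative BeautifulSoup snippet (the URL and tag names are hypothetical, not taken from the actual crawler):

    ```python
    # Hypothetical example: pull beer names out of an HTML page.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/bc-craft-beers").text
    soup = BeautifulSoup(html, "html.parser")

    # Assume each beer name lives in an <h2 class="beer-name"> tag.
    names = [tag.get_text(strip=True)
             for tag in soup.find_all("h2", class_="beer-name")]
    print(names)
    ```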

    Read more →
  6. Stochastic Optimization

    In a world where data can be collected continuously and storage is cheap, the growing size of interesting datasets can become a problem unless we have the right tools for the task. Indeed, with streaming data it might be impossible to wait until the "end" of the stream before fitting our model, since that end may never come. It might even be impractical to hold all of the data, scattered across many different servers, in memory before using it. Instead it would be preferable to perform an update each time a new data point (or a small batch of them) arrives. Similarly, we might find ourselves in an offline setting where the number of training examples is so large that traditional approaches such as batch gradient descent become too slow for our needs.

    Stochastic gradient descent (SGD) offers an easy solution to all of these problems.
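    The core idea is that each parameter update uses only one example (or a small mini-batch): w ← w − η ∇ℓ(w; x_i, y_i). A bare-bones sketch (illustrative only, not code from the post) for a streaming least-squares problem:

    ```python
    # Minimal SGD sketch: the weights are nudged after every single example,
    # so nothing needs to wait for (or store) the full dataset.
    import numpy as np

    def sgd_step(w, x, y, lr=0.01):
        """One update from a single (x, y) pair under squared-error loss."""
        grad = 2 * (w @ x - y) * x   # gradient of (w.x - y)^2 with respect to w
        return w - lr * grad

    w = np.zeros(3)
    stream = [(np.array([1.0, 2.0, 3.0]), 1.0),   # stand-in for streaming data
              (np.array([0.5, 1.0, 0.0]), 0.0)]
    for x, y in stream:
        w = sgd_step(w, x, y)
    ```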

    Read more →
  7. Hello World

    After a week of fiddling and learning HTML and CSS on the fly, I finally managed to create a blog using Pelican, a static site generator powered by Python that supports Markdown and reST syntax.

    Now that I have successfully defended my PhD thesis (which I verbosely titled "Numerical Investigation of Spatial Inhomogeneities in Gravity and Quantum Field Theory"), I have decided to take some time to blog about my progress as I transition towards a career in data science. This blog will be my platform to discuss all things related to

    • Machine Learning;
    • the ethics of Big Data;
    • coding in Python;
    • physics;
    • interesting books I've read;

    and more. Cheers!

    Read more →