Machine learning (ML) is one of the hottest fields in data science. Ever since ML entered the mainstream through Amazon, Netflix, and Facebook, people have been giddy about what they can learn from their data. However, modern machine learning (i.e., not the theoretical statistical learning that emerged in the 70s) is very much an evolving field, and despite its many successes we are still learning what exactly ML can do for data practitioners. I gave a talk on this topic earlier this fall at Northwestern University, and I wanted to share these cautionary tales with a wider audience.
Machine learning is a field of computer science in which algorithms improve their performance at a certain task as more data are observed. To do so, an algorithm selects the hypothesis that best explains the data at hand, in the hope that the hypothesis will generalize to future (unseen) data. Take the left panel of the figure in the header: the crosses denote the observed data projected in a two-dimensional space (in this case, house prices and their corresponding sizes in square meters). The blue line is the algorithm’s best hypothesis to explain the observed data. It states: “There is a linear relationship between the price and size of a house. As the house’s size increases, so does its price in linear increments.” Now, using this hypothesis, I can predict the price of an unseen data point based on its size.
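To make the house-price example concrete, here is a minimal sketch of fitting that kind of linear hypothesis with NumPy. The sizes and prices below are made-up numbers for illustration only (they are not the data behind the figure), and `predict_price` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical observations (invented for illustration):
# house sizes in square meters and prices in thousands.
sizes = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
prices = np.array([155.0, 205.0, 265.0, 310.0, 365.0])

# Fit a degree-1 polynomial by least squares:
# price ≈ slope * size + intercept. This is the "blue line".
slope, intercept = np.polyfit(sizes, prices, deg=1)


def predict_price(size_m2):
    """Apply the learned linear hypothesis to a house size."""
    return slope * size_m2 + intercept


# Predict the price of an unseen house of 100 square meters.
predicted = predict_price(100.0)
print(f"slope={slope:.3f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

The fitted line is only as trustworthy as the assumption that the relationship really is linear and that new houses resemble the observed ones.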
As the dimensionality of the data increases, the hypotheses that explain the data become more complex. However, because we are using a finite sample of observations to learn our hypothesis, finding an adequate hypothesis that generalizes to unseen data is nontrivial. There are three major pitfalls that can prevent you from having a generalizable model, and hence put the conclusions drawn from your hypothesis in doubt.
Occam’s razor is a principle attributed to William of Occam, a 14th-century philosopher. It advocates choosing the simplest hypothesis that explains your data, yet no simpler. While this notion is simple and elegant, it is often misunderstood to mean that we must select the simplest hypothesis possible regardless of performance.
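One way to read "simplest, yet no simpler" in practice is sketched below: fit polynomials of increasing degree, measure error on held-out data, and pick the lowest degree whose error is close to the best one. Everything here is an assumption made for illustration: the synthetic data, the even/odd train/validation split, and the 10% tolerance are arbitrary choices, not a rule from the talk.

```python
import numpy as np

# Synthetic data (an assumption): a truly linear relationship plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Hold out every other point to estimate performance on unseen data.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

# Candidate hypotheses, ordered from simplest (constant) to most complex.
degrees = [0, 1, 2, 3, 4, 5]
val_errors = {}
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)
    residuals = np.polyval(coeffs, x_val) - y_val
    val_errors[d] = float(np.mean(residuals ** 2))

# The degree with the lowest validation error.
best_degree = min(val_errors, key=val_errors.get)

# Occam-style choice: the simplest degree whose validation error is
# within a tolerance (10% here, an arbitrary value) of the best error.
tolerance = 0.10
threshold = (1.0 + tolerance) * val_errors[best_degree]
chosen_degree = min(d for d in degrees if val_errors[d] <= threshold)

print("validation errors:", val_errors)
print("best degree:", best_degree, "chosen degree:", chosen_degree)
```

The constant model (degree 0) is the simplest candidate, but it performs far worse than the others on this data and so is not selected. Simplicity only breaks ties among hypotheses that explain the data comparably well.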