Seven Simple Machine Learning Concepts

Machine Learning | Published June 13, 2015

It’s hard not to be fascinated by machine learning; it is overwhelmingly powerful. As former Microsoft CEO Steve Ballmer put it, it is going to be the next era of computer science. There is a lot of excitement and buzz around machine learning, so I thought of sharing seven simple machine learning concepts. Let’s get started.

1. Ensemble Methods

Ensemble methods are really cool. A single decision tree does not perform particularly well on its own, but it is super fast.
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (possibly weighted) vote of their predictions. In this post, I will talk about some popular ensemble methods such as bagging, boosting, and random forests.
If I were to define ensembles in one line, I would define them as follows:
Aggregation of predictions of multiple classifiers with the goal of improving accuracy.
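To make that one-line definition concrete, here is a minimal sketch of a weighted vote with NumPy; the three classifiers, their weights, and their predictions are all made up for illustration.

```python
import numpy as np

# Hypothetical predictions from three classifiers for five data points (classes 0/1).
predictions = np.array([
    [1, 0, 1, 1, 0],   # classifier A
    [1, 1, 1, 0, 0],   # classifier B
    [0, 0, 1, 1, 1],   # classifier C
])

# Made-up weights, e.g. proportional to each classifier's validation accuracy.
weights = np.array([0.5, 0.3, 0.2])

# Weighted vote: add up the weights of the classifiers voting for class 1 at each point.
score_for_class_1 = weights @ predictions               # shape: (5,)
ensemble_prediction = (score_for_class_1 >= 0.5).astype(int)
print(ensemble_prediction)
```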
But if you want to train multiple classifiers, you need more data. The more degrees of freedom your model has, the more data it tries to fit and the more data points you need. This is where bootstrapping comes in to help you.

2. Bootstrapping

Bootstrapping uses the data you have and tries to make more out of it. Take your N data points and draw N points from them with replacement; repeat this, and you end up with multiple data sets of the same size. Because you sample with replacement, each of these samples is slightly different, and you can treat each one as a data set in its own right.
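Here is a minimal bootstrapping sketch with NumPy; the toy data set of N = 10 points and the number of samples drawn are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)            # pretend these are your N = 10 data points

# Draw 3 bootstrap samples: each one samples N indices with replacement.
for b in range(3):
    idx = rng.integers(0, len(data), size=len(data))   # sampling with replacement
    bootstrap_sample = data[idx]
    print(f"bootstrap sample {b}: {bootstrap_sample}")
```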

Remember, you cannot do cross-validation on these data sets: if you have bootstrap samples x1 and x2, there are going to be points that appear in both x1 and x2, so the samples are not independent of each other.
To further understand bootstrapping, please read this thread on Stack Exchange.

3. Bagging

The first ensemble method I am going to talk about is bagging. One disadvantage of decision trees is that they look very different if there are slight variations in the data; ensembles convert this disadvantage into an advantage.
Bagging follows these steps:
1) Sample with replacement from your data set (bootstrapping).
2) Learn a classifier for each bootstrap sample.
3) Average the results of each classifier.
Because each bootstrap sample looks slightly different, you get different decision trees. Averaging them gives you a nice boundary, and bagging reduces overfitting significantly.
The basic idea of bagging is to average noisy but unbiased models to create a model with low variance in terms of classification. Watch this video on bagging to understand more about it.
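As a rough illustration, here is a sketch of bagging decision trees with scikit-learn's BaggingClassifier; the synthetic data set, the number of trees, and the random seeds are all made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bag 100 trees: each is trained on its own bootstrap sample, and the ensemble
# prediction is a majority vote over the individual trees.
# BaggingClassifier uses a decision tree as its base learner by default.
bagging = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
print("bagged trees accuracy:", bagging.score(X_test, y_test))
```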

4. Random Forest

Random forests use a lot of decision trees to create a classification. The main parameters are the number of features considered at each split and the number of trees. Random forests are called an off-the-shelf classifier because most of the time you need not even bother about these parameters.

If you have understood the concept of decision trees, then understanding random forests is really easy. Go through Edwin Chen's layman's introduction to random forests, and also watch Thales's video on random forests to understand them better.

The random forest error rate depends on two things: the correlation between the trees and the strength of the single trees. If you increase the number of features considered at each split, you increase the correlation among the trees, but you also increase the strength of the single trees. If your trees are not that different, then averaging doesn't give you as much improvement as you want; at the same time, the single trees should be really strong, because you want each tree to actually learn something. So there is a tradeoff here.
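Here is a minimal sketch of training a random forest with scikit-learn, just to show the two knobs mentioned above (number of trees and number of features per split); the synthetic data and the specific values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_features="sqrt",  # features tried at each split; raising this raises both
                          # tree strength and the correlation between trees
    random_state=0,
)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```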

a. Out-of-Bag Error:

Out-of-bag error is a built-in way of estimating the test error. It is very nice because you don't have to set a data set aside at the beginning. When you take a bootstrap sample, it covers roughly 63% of the original data points, which leaves roughly 37% that you didn't use; the idea is that you can use these left-out points to estimate the test error. Say I make bootstrap samples and end up with 500 of them: for each single tree (not for the whole forest), I have my bootstrap sample and the left-over training points that didn't make it into that sample. Those I can use to estimate the test performance for that single tree, again not for the whole forest, just for that single tree. The idea is that you can do this for each single tree, and that gives you an overall picture of the performance of the random forest, because you know how strong the single trees are. Still, I would not count the out-of-bag error as a test error; if you plot one against the other, most of the time you will see a positive bias, because it is not the true test error.
You can use the out-of-bag error for validation, for example to tune the number of trees. You can read Leo Breiman's article on out-of-bag estimation to know more about it.
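As a rough sketch, scikit-learn exposes the out-of-bag estimate through the oob_score flag; the synthetic data below is only meant to illustrate the idea.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# With oob_score=True, each tree is evaluated on the points left out of its
# own bootstrap sample, and the results are combined into one estimate.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
print("out-of-bag error:", 1 - forest.oob_score_)
```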

b. Variable Importance:

Random forests allow you to compute a heuristic for determining how “important” a feature is in predicting the target. This heuristic measures the change in prediction accuracy when you take a given feature and permute (scramble) it across the data points. The more the accuracy drops when the feature is permuted, the more important we can conclude the feature is.
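Here is a minimal do-it-yourself sketch of that permutation heuristic (scikit-learn also ships a ready-made version, but the hand-rolled loop makes the idea clearer); the data, the model, and the choice to permute on a held-out set are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
baseline = forest.score(X_test, y_test)

rng = np.random.default_rng(0)
for feature in range(X_test.shape[1]):
    X_permuted = X_test.copy()
    # Scramble one feature across the data points and see how much accuracy drops.
    X_permuted[:, feature] = rng.permutation(X_permuted[:, feature])
    drop = baseline - forest.score(X_permuted, y_test)
    print(f"feature {feature}: accuracy drop {drop:.3f}")
```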

5. Unbalanced Classes

Unbalanced classes means you have, say, 8,000 class-A points and 2,000 class-B points, and your job is to make the classifier understand how important the class-B points are. One way to handle unbalanced classes is to oversample the minority class (keep sampling until the two classes are the same size). The other option is to downsample the majority class. Random forests give a nice solution here: you can subsample for each tree, so each tree sees a balanced problem, while in the end the algorithm has still seen everything from the majority class.
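As a rough illustration, scikit-learn's RandomForestClassifier does not literally down-sample per tree out of the box, but its class_weight="balanced_subsample" option gets at the same idea by rebalancing the classes inside each tree's bootstrap sample; the 80/20 synthetic data below is made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 80/20 imbalance: ~8,000 class-A points vs ~2,000 class-B points.
X, y = make_classification(n_samples=10000, weights=[0.8, 0.2], random_state=0)
print("class counts:", np.bincount(y))

# 'balanced_subsample' reweights the classes inside each tree's bootstrap sample,
# so every tree effectively sees a balanced problem while the forest as a whole
# still makes use of all the majority-class points.
forest = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced_subsample",
    random_state=0,
)
forest.fit(X, y)
```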

6. Precision and Recall

Precision and recall are evaluation metrics for classification problems with skewed classes. For many applications you have to control the tradeoff between precision and recall. Consider the cancer prediction example (watch the Coursera machine learning course Week 6 videos to understand it): if you raise the threshold on your hypothesis h(x) to 0.7 so that you predict cancer only when you are confident, you end up with a classifier that has high precision and low recall; in this case, some patients who do have cancer will not be told that they have it. If you set the threshold to 0.3, you get high recall and low precision: your classifier may predict that most patients have cancer. For most classifiers there is going to be a tradeoff between precision and recall.
Is there a way to choose the threshold automatically? How do you decide which of these thresholds is best? This is where the F score comes in: the F score is the harmonic mean of precision and recall. To understand more about precision and recall, please watch Professor Andrew Ng's Week 6 system design videos.
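Here is a minimal sketch of the precision/recall tradeoff at different thresholds; the labels and predicted probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up ground truth and predicted probabilities for the positive (cancer) class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.45, 0.2, 0.75, 0.55])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)   # raise the threshold -> fewer positives
    print(
        f"threshold={threshold}: "
        f"precision={precision_score(y_true, y_pred):.2f}, "
        f"recall={recall_score(y_true, y_pred):.2f}, "
        f"F1={f1_score(y_true, y_pred):.2f}"
    )
```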

7. Boosting

Boosting is a kind of ensemble method, but it does something more than averaging. You train one classifier, look at the results, and make corrections; then you train a second one, look at the results, and make corrections again; finally you average them with weights. A classifier that performed really well gets a lot of weight, while one that did not perform well gets less weight.
Boosting is better than bagging for many applications. Read more about boosting here.

a. AdaBoost

In AdaBoost, you draw one decision boundary and look at the classification performance. A few of the points are going to be wrongly classified. You then put a lot of weight on those wrongly classified points and tell the next classifier you train: please pay attention and get these points right. What happens is that you get a different decision boundary the next time (line 2 in the figure). Instead of introducing randomness, you look at the performance of the previous classifier and try to correct the mistakes it made.

[Figure: AdaBoost]
Most of the time, AdaBoost works with decision stumps. A decision stump is a tree that makes only one split; a single axis-aligned split on one feature is your whole classifier. In practice, unlike the diagonal line in the AdaBoost figure, the boundary from a stump will be an axis-aligned split.
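As a rough sketch, scikit-learn's AdaBoostClassifier uses depth-1 trees (decision stumps) as its default base learner; the synthetic data and parameter values below are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base learner is a depth-1 decision tree (a stump): each stump makes
# a single axis-aligned split, and later stumps focus on the points that the
# earlier stumps got wrong.
boosted_stumps = AdaBoostClassifier(n_estimators=100, random_state=0)
boosted_stumps.fit(X_train, y_train)
print("AdaBoost accuracy:", boosted_stumps.score(X_test, y_test))
```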
Let me conclude this post with a quote that depicts the concept of ensemble: “The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting.” – James Surowiecki
Ensembles were also the method of choice for winning the Netflix Prize; you can watch the video here.