Top 10 machine learning algorithms you should know in 2018

Published February 1, 2018

arvindl

The term “big data” was everywhere in 2017, and it is going to stay that way in the coming years. In our previous post, I introduced some concepts in big data, machine learning, and data mining (see post: Understanding Big data, Data mining, and Machine Learning in 5 Minutes). Now let’s dig deeper into machine learning with a brief walk-through of some of the most commonly used ML algorithms: no abstract theory, mostly pictures and examples of how they are used.
The algorithms covered in this article are:

Decision tree
Random forest
Logistic regression
Support vector machine
Naive Bayes
k-nearest neighbors
k-means
AdaBoost
Neural network
Markov chain

1. Decision Tree

Classify a set of data into different groups using its attributes. Each node runs a test on one attribute, and the outcome of the test decides which branch the data follows, splitting it into two distinct groups; the process then repeats at the next node. The tests (questions) are learned from existing data, so when new data comes in, the computer can follow the branches and place it in the right leaf.
decision-tree-1
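To make the idea concrete, here is a minimal sketch in plain Python of how the tests can be learned from existing data. It assumes numeric features and uses Gini impurity to choose each test; the article does not say which criterion its figure uses, so treat that choice, the function names, and the toy data as illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_test(rows, labels):
    """Return the (feature, threshold) test that lowers impurity the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build(rows, labels):
    """A leaf is a label; an inner node is (feature, threshold, left, right)."""
    test = best_test(rows, labels)
    if test is None:  # the group is pure, or no test improves it
        return Counter(labels).most_common(1)[0][0]
    f, t = test
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t, build(*zip(*left)), build(*zip(*right)))

def classify(tree, row):
    """Follow the branches until a leaf is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical existing data: two numeric attributes per row, three groups.
rows = [(2, 1), (3, 1), (2, 5), (3, 6), (7, 1), (8, 2), (7, 6), (8, 5)]
labels = ["A", "A", "B", "B", "C", "C", "C", "C"]
tree = build(rows, labels)
print(tree)
print(classify(tree, (2, 4)))  # a new data point
```

The learned tree first tests attribute 0 and then attribute 1, which is the "one question per node" structure described above.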

2. Random Forest

Randomly sample from the original data to form different subsets.
random-forest-1
Matrix S is the original data with rows 1 to N; A, B, and C are the features, and the final column C holds the category.
random-forest-2
Create random subsets from S; say we end up with M subsets.
random-forest-3
Train one decision tree on each subset to get M decision trees:
Feed new data into all M trees to get M results, then count the votes: the result that appears most often is taken as the final result.
random-forest-4
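The subset-then-vote procedure can be sketched in a few lines of Python. This is a simplified illustration under several assumptions that are not in the article: the data has a single numeric feature, each "tree" is a one-test tree (a stump), and the random subsets are drawn with replacement. A real random forest usually also samples the features.

```python
import random
from collections import Counter

def train_stump(sample):
    """Learn a one-test tree: predict `low` if x <= threshold, else `high`."""
    best = None
    for t in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= t]
        right = [y for x, y in sample if x > t]
        low = Counter(left).most_common(1)[0][0]
        high = Counter(right).most_common(1)[0][0] if right else low
        errors = sum(y != low for y in left) + sum(y != high for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, low, high)
    _, t, low, high = best
    return lambda x: low if x <= t else high

def train_forest(data, n_trees, rng):
    """Build M trees, each trained on its own random subset of the rows."""
    forest = []
    for _ in range(n_trees):
        subset = [rng.choice(data) for _ in data]  # drawn with replacement
        forest.append(train_stump(subset))
    return forest

def predict(forest, x):
    """Collect one result per tree and return the most common one."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data set S: (feature value, category).
data = [(x, "small") for x in range(1, 11)] + [(x, "large") for x in range(21, 31)]
forest = train_forest(data, n_trees=25, rng=random.Random(0))
print(predict(forest, 0), predict(forest, 40))
```

Individual trees may disagree near the boundary between the two categories, which is exactly where the majority vote helps.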

3. Logistic Regression

The target we are predicting is a probability, so it must be greater than or equal to 0 and less than or equal to 1. A simple linear model cannot guarantee this: unless the input domain is restricted, its output will fall outside that interval.
logistic-regression1
A model of this kind works better.
logistic-regression2
So how can we get this model?
The model needs to fulfill two conditions: its output must be greater than or equal to 0, and less than or equal to 1.
logistic-regression3
Transforming the formula gives the logistic regression model:
logistic-regression4
Fitting the model to the original data gives the corresponding coefficients.
With those coefficients we can plot the logistic model.
logistic-regression5
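As a rough sketch of the two steps above (squash a linear score into the 0 to 1 range, then fit the coefficients from data), the Python below uses the standard sigmoid function and plain gradient descent. The data set (hours studied versus pass/fail), the learning rate, and the number of steps are made-up values for illustration only.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
w, b = fit(hours, passed)

for h in (1, 6):
    print(h, round(sigmoid(w * h + b), 3))
```

Whatever the coefficients turn out to be, the output is always a valid probability, which is the property the linear model lacked.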

4. Support Vector Machine

To separate two classes with a hyperplane, the best choice is the hyperplane that leaves the maximum margin to both classes. Since Z2 > Z1, the green one is better.
svm1
Express the hyperplane with a linear equation: for the class above the line the expression is greater than or equal to 1, and for the other class it is less than or equal to -1.
svm2
Calculate the distance from a point to the hyperplane using the equation in the figure:
svm3
This gives the expression for the total margin shown below. The aim is to maximize the margin, which means minimizing the denominator.
svm4
For example, use 3 points to find the optimal hyperplane, and define the weight vector as (2, 3) – (1, 1).
svm5
This gives the weight vector (a, 2a). Substitute the two points into the equation:
svm6
Once a is solved, substituting it into (a, 2a) gives the weight vector, and the points that lie exactly on the margin are the support vectors.
The equation with a and w0 substituted in is the support vector machine.
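The arithmetic of this example can be checked with a short Python sketch. It assumes, based on the text, that (1, 1) sits on the "-1" side of the margin and (2, 3) on the "+1" side, so that w·x + w0 equals -1 and +1 at those points; the equations themselves live in the figures, so this reading is an assumption. The variable names are illustrative.

```python
from fractions import Fraction
import math

neg, pos = (1, 1), (2, 3)                   # closest points of the two classes
d = (pos[0] - neg[0], pos[1] - neg[1])      # (1, 2), so w = (a, 2a)

# w.neg + w0 = -1 and w.pos + w0 = +1; subtracting gives a * (d . d) = 2.
a = Fraction(2, d[0] * d[0] + d[1] * d[1])
w = (a * d[0], a * d[1])
w0 = -1 - (w[0] * neg[0] + w[1] * neg[1])

def decision(x):
    """Value of w.x + w0: >= 1 for one class, <= -1 for the other."""
    return w[0] * x[0] + w[1] * x[1] + w0

margin = 2 / math.hypot(float(w[0]), float(w[1]))  # total margin = 2 / ||w||

print("a =", a, " w =", w, " w0 =", w0)
print("decision(neg) =", decision(neg), " decision(pos) =", decision(pos))
print("margin =", margin)
```

Under these assumptions a = 2/5 and w0 = -11/5, and the margin equals the distance between the two points, as expected when they are the closest pair.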

5. Naive Bayes

Here’s an example from NLP:
Given a piece of text, determine whether its attitude is positive or negative.
naive-bayes1
To solve the problem, we can look at only some of the words:
naive-bayes2
The text is then represented by only those words and their counts.
naive-bayes3
The original question is: given a sentence, which category does it belong to?
With Bayes’ rule, this becomes an easy question.
naive-bayes4
The question becomes: within a given class, what is the probability that this sentence occurs? Don’t forget the other two probabilities in the equation.
Example: the probability of occurrence of the word “love” is 0.1 in the positive class, and 0.001 in the negative class.
naive-bayes5
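Here is a small Python sketch of that calculation. Only the two probabilities for “love” come from the example above; the other words, their probabilities, and the equal class priors are made up for illustration. The probability of the sentence itself is the same for both classes, so it is left out when comparing them.

```python
import math

# P(class): assumed equal priors (hypothetical).
prior = {"positive": 0.5, "negative": 0.5}

# P(word | class): "love" is from the article, the rest is hypothetical.
likelihood = {
    "positive": {"love": 0.1, "great": 0.05, "boring": 0.002},
    "negative": {"love": 0.001, "great": 0.005, "boring": 0.06},
}

def log_score(words, cls):
    """log P(class) + sum of log P(word | class), ignoring unknown words."""
    score = math.log(prior[cls])
    for word in words:
        if word in likelihood[cls]:
            score += math.log(likelihood[cls][word])
    return score

def classify(sentence):
    """Pick the class with the higher score for the words in the sentence."""
    words = sentence.lower().split()
    return max(prior, key=lambda cls: log_score(words, cls))

print(classify("I love this great film"))
print(classify("so boring"))
```

Working with logarithms avoids multiplying many tiny probabilities together, which is a common practical choice rather than something the article prescribes.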

6. k-Nearest Neighbors

When a new data point arrives, it is assigned to the category that has the most points among its nearest neighbors.
For example, to distinguish “dog” from “cat”, we judge by two features, “claws” and “sound”. Circles and triangles are the known categories; what about the “star”?
knn1
When K=3, the three lines connect the “star” to its 3 nearest points. Circles are in the majority, so the “star” belongs to “cat”.
knn2
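The same vote can be written as a short Python sketch. The coordinates below are invented stand-ins for the “claws” and “sound” features, and the function name is illustrative; `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(known, query, k=3):
    """Return the most common label among the k known points nearest to query."""
    nearest = sorted(known, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical known points: ((claws, sound), category).
known = [
    ((1.0, 1.0), "cat"), ((1.5, 2.0), "cat"), ((2.0, 1.5), "cat"),
    ((5.0, 5.0), "dog"), ((6.0, 5.5), "dog"), ((5.5, 6.5), "dog"),
]
star = (2.5, 2.5)
print(knn_predict(known, star, k=3))
```

Using an odd K for two categories avoids tied votes.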

7. k-means

Separate the data into 3 classes; the pink part is the biggest, while the yellow is the smallest.
Pick 3, 2, and 1 as the initial centers, then calculate the distance from each remaining data point to every center and assign the point to the class whose center is closest.
kmeans1
After classification, calculate the mean of each class and set it as that class’s new center.
kmeans2
Repeat these two steps, and stop once the classes no longer change.
kmeans3
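The assign-then-recompute loop can be sketched in Python as follows. To keep it short, the sketch works on one-dimensional points; the data and the initial centers are made up and are not the values from the figures.

```python
def kmeans(points, centers, max_rounds=100):
    """Alternate between assigning points and recomputing centers until stable."""
    for _ in range(max_rounds):
        # Step 1: put every point in the class of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Step 2: the mean of each class becomes its new center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the classes are stable
            break
        centers = new_centers
    return centers, clusters

points = [1, 2, 3, 10, 11, 12, 20, 21, 22]   # hypothetical data
centers, clusters = kmeans(points, centers=[1.0, 10.0, 12.0])
print(centers)
print(clusters)
```

With these starting centers the loop settles after three rounds. The result depends on the initial centers, so a poor starting choice can leave k-means stuck in a worse grouping.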

8. AdaBoost

AdaBoost is one method of boosting.
Boosting combines several classifiers whose individual results are unsatisfactory into one classifier that may perform better.
As shown below, tree 1 and tree 2 do not perform well individually, but if we feed both the same data and sum up their results, the final result is more convincing.
adaboost1
As an example of AdaBoost in handwriting recognition, the panel can extract many features, such as the starting direction of a stroke and the distance between the starting point and the ending point.
adaboost2
During training, the machine learns a weight for each feature. For example, 2 and 3 start in very similar ways, so this feature contributes little to classification and gets a small weight.
adaboost3
But the alpha angle is highly distinctive, so this feature gets a large weight. The final outcome takes all of these features into account.
adaboost4
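The weighted vote at the heart of AdaBoost can be sketched in Python as below. It uses the standard AdaBoost formula for a classifier's weight, 0.5 * ln((1 - error) / error), and skips the part of training that re-weights the examples after each round. The three weak classifiers and their error rates are hypothetical.

```python
import math

def classifier_weight(error_rate):
    """AdaBoost vote weight: the lower the error, the larger the weight."""
    return 0.5 * math.log((1 - error_rate) / error_rate)

def boosted_predict(ensemble, x):
    """Sum the weighted votes (+1 or -1) and return the sign of the total."""
    total = sum(alpha * weak(x) for alpha, weak in ensemble)
    return 1 if total >= 0 else -1

# Hypothetical weak classifiers (simple threshold rules) and their training errors.
weak_rules = [
    (lambda x: 1 if x > 5 else -1, 0.1),   # fairly accurate
    (lambda x: 1 if x > 8 else -1, 0.4),   # barely better than guessing
    (lambda x: 1 if x > 7 else -1, 0.4),   # barely better than guessing
]
ensemble = [(classifier_weight(err), rule) for rule, err in weak_rules]

for alpha, _ in ensemble:
    print(round(alpha, 3))
print(boosted_predict(ensemble, 6), boosted_predict(ensemble, 3))
```

For x = 6 the two weak rules vote -1 but the accurate rule votes +1 with a larger weight, so the combined answer is +1: a feature that separates the classes well counts for more than ones that do not.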

9. Neural Network

In a neural network (NN), an input can end up in one of at least two classes.
A neural network is made up of neurons and the connections between them.
The first layer is the input layer, and the last layer is the output layer.
The hidden layers and the output layer each have their own classifiers.
nn1
When an input enters the network and is activated, the calculated score is passed on to the next layer. The scores in the output layer are the scores for each class. The example below yields class 1:
nn2
The same input passed to different nodes produces different scores, because each node has its own weights and bias. This is forward propagation.
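A forward pass through a tiny network can be sketched in Python like this. The layer sizes, weights, biases, and the sigmoid activation are all assumptions chosen for illustration; a trained network would learn its weights from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each node computes activation(weights . inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 2 output nodes (classes).
hidden_w = [[0.5, -0.6], [-0.3, 0.8]]
hidden_b = [0.1, -0.1]
output_w = [[1.2, -0.7], [-1.0, 0.9]]
output_b = [0.0, 0.0]

x = [1.0, 0.5]
hidden = layer(x, hidden_w, hidden_b)       # same input, different score per node
scores = layer(hidden, output_w, output_b)  # one score per class
predicted = scores.index(max(scores)) + 1   # classes are numbered from 1

print([round(h, 3) for h in hidden])
print([round(s, 3) for s in scores])
print("class", predicted)
```

The two hidden nodes receive the same input but produce different scores because their weights and biases differ.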

10. Markov Chain

A Markov chain consists of states and transitions.
For example, build a Markov chain from “the quick brown fox jumps over the lazy dog”.
First, treat every word as a state, then calculate the probability of each state transition.
markov1
These probabilities are calculated from a single sentence. When you train the computer on a massive amount of text, you get a much bigger state transition matrix, for example the words that can follow “the” and their corresponding probabilities.
markov2
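Counting the transitions for the example sentence takes only a few lines of Python. The function name is illustrative, and splitting on whitespace is a simplifying assumption that would need refinement for real text with punctuation.

```python
from collections import Counter, defaultdict

def transition_probabilities(text):
    """Map each word (state) to the probabilities of the words that follow it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

probs = transition_probabilities("the quick brown fox jumps over the lazy dog")
print(probs["the"])
print(probs["fox"])
```

In this sentence “the” is followed once by “quick” and once by “lazy”, so each transition gets probability 0.5, while every other word has a single possible successor.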