R is one of the most efficient and effective tools for analysing and manipulating data for statistical purposes. On top of that, R is both inexpensive and elegant, which makes programming in it a pleasure and broadens the programmer’s skill set. How R adds value is the question this article deals with.
Though R can also be used as a general-purpose programming language, this article deals with the most widely used R packages in the field of machine learning. These packages are what make R simple, and therefore handy, for developing machine learning (ML) algorithms that crack business problems.
As I mentioned, R is inexpensive (it is open source software), and most of the algorithms required for machine learning are not included in the base installation. Over time, experts have contributed these algorithms as free packages (groups of functions made freely available to users) that extend base R. It is this simple beauty of R which makes it so attractive and coveted!
The comprehensive list of R packages can be viewed on the CRAN (Comprehensive R Archive Network) website.
But, to make things simpler, I have chosen 10 packages which make machine learning attractive using R.
If the data is stored in a SQL database (Oracle, MySQL) or any other source reachable through ODBC (Open Database Connectivity) and needs to be converted into an R data frame, then nothing is as effective as the RODBC package for importing it.
The most direct way to install a package is the install.packages() function.
So to install the RODBC package, one needs to input install.packages("RODBC").
In order to load the RODBC package, we use library(RODBC).
Importing data is the first requirement of any statistical modelling approach. Data from almost any ODBC source can be loaded into a compatible R format and, if your database is protected, you will have to provide the password and that’s it! Simple, isn’t it?
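Here is a minimal sketch of such an import, assuming the RODBC package is installed and an ODBC data source has already been configured on your machine. The helper name read_table_via_odbc, the data source name, the user name, the table name and the DB_PASSWORD environment variable are all hypothetical placeholders of mine.

```r
# Sketch: pull one table from an ODBC source into an R data frame.
# Assumes RODBC is installed and a DSN has been set up beforehand.
read_table_via_odbc <- function(dsn, uid, pwd, table) {
  con <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  # odbcConnect() returns -1 (with a warning) when the connection fails
  if (!inherits(con, "RODBC")) stop("could not connect to DSN: ", dsn)
  # Always close the connection, even if the query errors out
  on.exit(RODBC::odbcClose(con), add = TRUE)
  RODBC::sqlQuery(con, paste("SELECT * FROM", table))
}

# Hypothetical usage (needs a real DSN, so it is left commented out):
# orders <- read_table_via_odbc("sales_dsn", "analyst",
#                               Sys.getenv("DB_PASSWORD"), "orders")
# str(orders)
```

Reading the password from an environment variable, rather than typing it into the script, is just one reasonable choice here.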
During statistical analysis, we often want to examine the relationship between two nominal variables. To explain this, let’s consider two nominal variables, one being ‘Income group’ (levels = High, Medium, Low) and the other being ‘Highest level of education’ (levels = Undergraduate, Graduate, Postgraduate). We might be interested in finding out whether income has a significant relationship with the level of education a person can afford. Such analysis can be done using the CrossTable() function available in the gmodels package, where the results are presented in a tabular format with rows indicating the levels of one variable and columns indicating the levels of the other.
Install gmodels using install.packages("gmodels") and load it with library(gmodels). That’s it! You are all set to experience the usability of CrossTable().
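As a minimal sketch of the income versus education example, the snippet below builds a tiny made-up survey (the data frame survey and its values are purely illustrative) and cross-tabulates it.

```r
library(gmodels)

# Made-up survey responses, for illustration only
survey <- data.frame(
  income = factor(
    c("High", "High", "High", "Medium", "Medium", "Medium",
      "Low", "Low", "Low", "Low", "High", "Medium"),
    levels = c("Low", "Medium", "High")
  ),
  education = factor(
    c("Postgraduate", "Postgraduate", "Graduate", "Graduate",
      "Graduate", "Undergraduate", "Undergraduate", "Undergraduate",
      "Graduate", "Undergraduate", "Postgraduate", "Postgraduate"),
    levels = c("Undergraduate", "Graduate", "Postgraduate")
  )
)

# Rows are income groups, columns are education levels.
# chisq = TRUE adds a chi-squared test of independence to the output.
ct <- CrossTable(survey$income, survey$education,
                 chisq = TRUE, prop.t = FALSE,
                 dnn = c("Income", "Education"))
```

With so few rows, R will warn that the chi-squared approximation may be incorrect; on a real data set with healthy cell counts the reported p-value is what tells you whether the relationship is significant.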
The two packages discussed earlier pertain to simple data applications. The ‘class’ package contains the knn() function, which provides the basis for the k-nearest neighbours algorithm, an easy machine learning algorithm. The knn() function uses Euclidean distance to identify the k nearest neighbours, where k is a user-specified number.
An example use of the knn() function: predicting whether a person will enjoy the videos suggested by YouTube.
And there you go!
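I do not have YouTube viewing data to share, so here is a small sketch that uses R’s built-in iris data set as a stand-in. The 100/50 split and k = 5 are arbitrary choices of mine.

```r
library(class)

set.seed(42)                          # make the random split repeatable
train_idx <- sample(nrow(iris), 100)  # 100 rows to train on, 50 to test

train_x <- iris[train_idx, 1:4]       # the four numeric measurements
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]    # known class labels for training rows
test_y  <- iris$Species[-train_idx]

# Each test row gets the majority label of its 5 nearest
# training rows, measured by Euclidean distance
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

accuracy <- mean(pred == test_y)
table(Predicted = pred, Actual = test_y)
```

Because the method relies on distances, variables measured on very different scales are usually rescaled before calling knn().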
These days a lot of statistical analysis requires thorough processing of text data, be it SMS messages or emails, which involves a lot of tedious effort. This kind of analysis might require removing punctuation marks, numbers and certain unwanted words like ‘but’ and ‘or’, depending upon the business requirement. The tm package contains flexible functions such as Corpus(), which can read from PDFs and Word documents and convert the text into a corpus object that R can work with, and tm_map(), which helps in cleaning the text data (removing blanks, converting upper case to lower case and so on), thereby making the data ready for analysis.
The tm text mining package can be installed using install.packages("tm") and loaded with library(tm).
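The cleaning steps described above can be sketched as follows. The two messages are made up, and I use VCorpus() with VectorSource() to build the corpus from a character vector rather than from files.

```r
library(tm)

# Two made-up messages, for illustration only
msgs <- c("WINNER!! Claim your 1000 prize now, but hurry.",
          "Are we meeting at 5 or 6 today?")

corpus <- VCorpus(VectorSource(msgs))

# Each tm_map() call applies one cleaning step to every document
corpus <- tm_map(corpus, content_transformer(tolower))       # upper to lower case
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation marks
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop 'but', 'or' etc.
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra blanks

# Pull the cleaned text back out as a plain character vector
cleaned <- sapply(corpus, as.character)
print(cleaned)
```

Lower-casing first matters, since the stop word list is lower case.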
A single picture speaks a thousand words! We have all heard this, and R puts the belief into practice. The ‘wordcloud’ package helps to create a diagrammatic representation of words, and a user can customize the word cloud, for example by placing the high-frequency words closer together in the centre, arranging the words in random order or specifying the frequency of particular words, thereby leaving a long-lasting impression on anyone’s mind.
The wordcloud package can be installed using install.packages("wordcloud") and loaded with library(wordcloud).
See how it looks? Now you too can create this using R. Isn’t this superb?
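A minimal sketch of a word cloud, assuming you already have words and their counts. The words and frequencies below are made up.

```r
library(wordcloud)

# Made-up words and counts, for illustration only
words <- c("data", "model", "package", "analysis", "text",
           "cloud", "plot", "vector", "function", "statistics")
freq  <- c(40, 32, 25, 20, 15, 12, 9, 7, 5, 3)

set.seed(1)  # the layout is partly random, so fix the seed
wordcloud(words = words, freq = freq,
          min.freq = 1,          # show every word that occurs at least once
          random.order = FALSE,  # most frequent words are placed first, near the centre
          scale = c(3, 0.5))     # font size range, largest to smallest
```

Setting random.order = TRUE instead arranges the words in a random fashion.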
Does the name e1071 look like some garbage value? Hang on. This package provides us with the function naiveBayes(), which is based on a simple application of conditional probability.
Let us analyse a situation where a retailer needs to find the probability that a customer buys bread given that they have already bought butter. This type of analysis requires conditional probability, which the e1071 package makes available and which in turn helps in finding effective business solutions. In our example, if the probability of buying bread is high, the retailer may formulate new strategies, such as keeping bread and butter together or giving a discount when the two are bought together, to augment the store’s revenue.
The e1071 package can be installed using install.packages("e1071") and loaded with library(e1071).
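Here is a small sketch of the bread and butter example on eight made-up shopping baskets. With a single predictor and no smoothing, the naive Bayes posterior should agree with the conditional probability counted directly from the data.

```r
library(e1071)

# Eight made-up baskets, for illustration only
baskets <- data.frame(
  butter = factor(c("yes", "yes", "yes", "no", "no", "yes", "no", "yes")),
  bread  = factor(c("yes", "yes", "no",  "no", "yes", "yes", "no", "yes"))
)

# Learn P(bread) and P(butter | bread) from the data
model <- naiveBayes(bread ~ butter, data = baskets)

# Ask: given that butter was bought, how likely is bread?
new_customer <- data.frame(butter = factor("yes", levels = c("no", "yes")))
posterior <- predict(model, new_customer, type = "raw")
print(posterior)

# Same quantity counted by hand: 4 of the 5 butter baskets contain bread
direct <- mean(baskets$bread[baskets$butter == "yes"] == "yes")
print(direct)
```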
The C50 package contains the function C5.0(), which is used for building decision tree models. Decision tree models have a structure similar to a flowchart, with each decision node indicating a decision to be made on a particular attribute. The algorithm has widespread application in processes which need to maintain transparency at all levels. For instance, Airtel may try to predict the set of customers who are likely to churn from its network. Such analysis may help Airtel understand the reasons for the churn, so that it can win those customers back with lucrative offers or act upon the causes of their dissatisfaction. C5.0 does such predictions fair justice.
Install the C50 package using install.packages("C50") and load it using library(C50).
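A minimal sketch of a churn tree on simulated customers. The column names and the rule that generates churn (more than two complaints) are assumptions I made up so that the tree has something clear to find.

```r
library(C50)

set.seed(10)
n <- 300

# Simulated customer attributes, for illustration only
customers <- data.frame(
  complaints    = sample(0:5, n, replace = TRUE),
  monthly_bill  = round(runif(n, 200, 1500)),
  years_with_us = sample(1:10, n, replace = TRUE)
)

# Hypothetical ground truth: customers with more than 2 complaints churn
churn <- factor(ifelse(customers$complaints > 2, "yes", "no"))

# Fit the decision tree; summary() prints it as readable if/else rules
tree <- C5.0(x = customers, y = churn)
summary(tree)

pred <- predict(tree, customers)
train_accuracy <- mean(pred == churn)
```

The printed tree is what gives the transparency mentioned above: you can read off which attribute drives the churn prediction.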
The rpart package typically finds application in building regression trees. Regression involves establishing a relationship between a single dependent variable and one or more independent variables. Suppose a product company needs to determine how its sales have been affected by promotions on TV, Out of Home (OOH), newspapers, magazines etc. The rpart() function in the rpart package helps explain the variance in the dependent variable (e.g. sales) in terms of the independent variables (TV ads, newspaper ads, magazine ads).
Install the rpart package using install.packages("rpart") and load it using library(rpart).
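The advertising example can be sketched on simulated data as below. The spend figures and the formula that generates sales are made up, with TV deliberately given the biggest effect and magazines none.

```r
library(rpart)

set.seed(7)
n <- 200

# Simulated advertising spend per channel, for illustration only
ads <- data.frame(
  tv        = runif(n, 0, 100),
  newspaper = runif(n, 0, 50),
  magazine  = runif(n, 0, 30)
)

# Hypothetical relationship: TV matters most, magazines not at all
ads$sales <- 50 + 2 * ads$tv + 0.5 * ads$newspaper + rnorm(n, sd = 10)

# method = "anova" asks rpart for a regression tree
fit <- rpart(sales ~ tv + newspaper + magazine, data = ads, method = "anova")

# Which channels did the tree rely on most?
print(fit$variable.importance)

pred <- predict(fit, ads)
```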
Artificial Neural Networks (ANNs), the family of models behind what is often referred to as ‘deep learning’, can be practised through the ‘neuralnet’ package. An ANN builds a model loosely based on our understanding of how the human brain works, establishing a relationship between the input and the output signals.
In the aviation industry, safety is of utmost concern. A lot of importance is given to the materials used for building aeroplanes. Ingredients such as aluminium alloy, carbon fibre, fibreglass and graphite-epoxy are used in complex combinations and, as a result, it becomes difficult to predict the strength of the final product accurately. In such situations, an ANN is the right fit: the neuralnet() function can be used to predict the strength of each component, which in turn would enable safer aeroplanes.
Install the neuralnet package using install.packages("neuralnet") and load it using library(neuralnet).
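A minimal sketch of the strength prediction idea on simulated mixtures. The ingredient proportions, the formula that generates strength and the choice of three hidden units are all assumptions of mine, and predict() on a fitted network needs a reasonably recent version of neuralnet.

```r
library(neuralnet)

set.seed(123)
n <- 200

# Simulated ingredient proportions, already on a 0 to 1 scale
mix <- data.frame(
  aluminium    = runif(n),
  carbon_fibre = runif(n)
)

# Hypothetical strength: depends on both ingredients, plus a little noise
mix$strength <- 0.3 * mix$aluminium + 0.6 * mix$carbon_fibre +
  rnorm(n, sd = 0.02)

# One hidden layer with 3 units; linear.output = TRUE for a numeric target
nn <- neuralnet(strength ~ aluminium + carbon_fibre, data = mix,
                hidden = 3, linear.output = TRUE, stepmax = 1e6)

pred <- predict(nn, mix[, c("aluminium", "carbon_fibre")])
fit_quality <- cor(as.vector(pred), mix$strength)
```

Training starts from random weights, so on harder problems neuralnet() can fail to converge within stepmax; rescaling the inputs or raising stepmax are the usual first things to try.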
OCR (optical character recognition) reads characters using key dimensions, and the machine has to be able to distinguish the letters accurately. Image processing is perhaps one of the most difficult tasks involved, considering the amount of noise present, the positioning and orientation, and how the image gets captured. Support Vector Machine (SVM) models find extensive application in pattern recognition, as they are highly adept at learning complex patterns efficiently.
The kernlab package may be installed using install.packages("kernlab") and loaded using library(kernlab). The function ksvm() may then be used along with a user-specified kernel.
Pattern recognition is one of the most demanding tasks, and SVM eases it to a considerable extent. It also happens to be a field of growing interest, and so ksvm() provides a range of kernels (vanilladot, rbfdot, polydot and tanhdot, to name a few) to perform pattern recognition based on the user’s needs.
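Since I cannot ship an OCR data set with this article, here is a small sketch of ksvm() on R’s built-in iris data as a stand-in. The 100/50 split, the rbfdot kernel and C = 1 are arbitrary choices.

```r
library(kernlab)

set.seed(1)
train_idx <- sample(nrow(iris), 100)   # 100 rows to train on, 50 to test
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# SVM classifier with a Gaussian (RBF) kernel; swap in "vanilladot",
# "polydot" or "tanhdot" to try the other kernels
model <- ksvm(Species ~ ., data = train, kernel = "rbfdot", C = 1)

pred <- predict(model, test)
accuracy <- mean(pred == test$Species)
table(Predicted = pred, Actual = test$Species)
```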
There are lots of other packages available on CRAN and you might find them useful as well. These ten packages find usage depending on the needs of various industries, and I have personally used them to build models whose performance has been pretty good. One thing to keep in mind is that the functions in these packages prove effective only if the data set is large enough and a sufficient amount of massaging is done to the data.