A road map to become a Data Scientist

Published September 4, 2015

I receive a lot of questions about which books one should read to become a Data Miner/Data Scientist. Here is a suggested reading list, along with a proposed road map (apart from the requirement of having an appropriate university degree) to become a Data Scientist.

Road map to become a Data Scientist

Before going further, it appears that a Data Scientist should possess a lot of skills: Statistics, Programming, Databases, Presentation Skills, Knowledge of Data Cleaning and Transformations.
The skills that you should ideally acquire are as follows:

  • Sound Statistical Understanding and Data Pre-Processing.
  • Know the Pitfalls: You must be aware of the Biases that could affect you as an analyst, and also of the common mistakes made during Statistical Analysis.
  • Understand how several Machine Learning / Statistical Techniques work.
  • Time Series Forecasting.
  • Computer Programming (R, Java, Python, Scala).
  • Databases (SQL and NoSQL Databases).
  • Web Scraping (Apache Nutch, Scrapy, JSoup).
  • Text Data.

Statistical Understanding

A good introductory book is Fundamental Statistics for the Behavioral Sciences by Howell. Also consider IBM SPSS for Introductory Statistics – Use and Interpretation and IBM SPSS for Intermediate Statistics by Morgan et al. Although all of these books (especially the latter two) lean heavily on the IBM SPSS software, they provide a good introduction to key statistical concepts, and the books by Morgan et al. also give a methodology illustrated with a practical example: analyzing the High School and Beyond dataset.

Data Pre-Processing

I must reiterate the importance of thoroughly checking for and identifying problems within your Data. Data Pre-processing not only guards against feeding erroneous data to a Machine Learning / Statistical Algorithm, but also transforms the data so that an algorithm can extract and identify patterns more easily. Suggested books:

  • Data Preparation for Data Mining by Dorian Pyle.
  • Mining Imperfect Data: Dealing with Contamination and Incomplete Records by Pearson.
  • Exploratory Data Mining and Data Cleaning by Johnson and Dasu.
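
As a small illustration of the kind of checks these books discuss, here is a minimal sketch that counts missing values and flags suspicious entries in a single numeric column. The column (`ages`), the function name `profile_column` and the 1.5 × IQR rule are assumptions chosen for this example; they are only one of many possible checks.

```python
import math
import statistics


def profile_column(values):
    """Summarize missing values and IQR-based outliers for one numeric column."""
    # Treat None and NaN as missing; everything else is assumed to be numeric.
    present = [v for v in values
               if v is not None and not (isinstance(v, float) and math.isnan(v))]
    missing = len(values) - len(present)

    # Quartiles from the standard library (needs at least two data points).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "outliers": outliers, "bounds": (low, high)}


# Hypothetical column with two missing entries and one implausible age.
ages = [34, 29, None, 41, 38, 27, 530, 33, float("nan"), 36]
report = profile_column(ages)
print(report)
```

A flagged value is only a candidate for inspection: whether 530 is a typo for 53 or a genuinely wrong record is something you have to decide from the context of the data.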

Know the Pitfalls

There are many cases of Statistical Misuse and of biases that may affect your work even if, at times, you are not consciously aware of them. This has happened to me on various occasions. In fact, this blog contains a couple of examples of Statistical Misuse, even though I tried (and keep trying) to highlight limitations due to the nature of the Data as much as I can. Big Data is another area where caution is warranted. For example, see: Statistical Truisms in the Age of Big Data and The Hidden Biases of Big Data.
Some more examples:

  • Quora question: What are common fallacies or mistakes made by beginners in Statistics / Machine Learning / Data Analysis?
  • Identifying and Overcoming Common Data Mining Mistakes by SAS Institute.

The following book is suggested:

  • Common Errors in Statistics (and How to Avoid Them) by P. Good and J. Hardin.

If you are into Financial Forecasting, I strongly suggest reading Evidence-Based Technical Analysis by David Aronson, which focuses heavily on how Data Mining Bias (and several cognitive biases) may affect your analysis.
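
To make the idea of Data Mining Bias concrete, here is a minimal simulation sketch. It assumes 200 hypothetical trading rules that have no real edge (their daily returns are pure noise) and shows that the best of them still looks profitable in a backtest. The numbers of rules and days are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

N_STRATEGIES = 200   # hypothetical number of trading rules tried
N_DAYS = 50          # length of each backtest


def random_returns(n):
    """Daily returns of a rule with no real edge: pure zero-mean noise."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# "Backtest" every rule and keep the best-looking one.
in_sample = [statistics.mean(random_returns(N_DAYS)) for _ in range(N_STRATEGIES)]
best_in_sample = max(in_sample)

# The winning rule has no edge either, so on fresh data it is just another noise draw.
out_of_sample = statistics.mean(random_returns(N_DAYS))

print(f"best in-sample mean return : {best_in_sample:.3f}")
print(f"same rule, out of sample   : {out_of_sample:.3f}")
print(f"average over all rules     : {statistics.mean(in_sample):.3f}")
```

The best in-sample figure is inflated simply because it is the maximum of many noisy estimates, which is why results selected after trying many alternatives should be checked on data that played no part in the selection.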

Understand how several Machine Learning / Statistical Algorithms work

You must be able to understand the pros and cons of each algorithm. Does the algorithm that you are about to try handle noise well? How does it scale? What kind of optimizations can be performed? Which Data transformations are necessary? Here is an example for fine-tuning Regression SVMs: Practical Selection of SVM Parameters and Noise Estimation for SVM Regression.
Another book that deserves attention is Applied Predictive Modeling by Kuhn and Johnson, which also gives numerous examples of using the caret R package, which, among other things, has extensive parameter optimization capabilities.
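
Parameter optimization usually means trying a grid of candidate values and scoring each one on data that was not used for fitting. The sketch below is not caret and not an SVM: it is a plain Python stand-in that tunes the number of neighbours `k` of a simple nearest-neighbour regressor with 5-fold cross-validation. The synthetic data, the candidate grid and all function names are assumptions made for this example.

```python
import random
import statistics

random.seed(0)

# Synthetic data (an assumption for illustration): y = 2x + noise.
xs = [i / 10 for i in range(60)]
data = [(x, 2.0 * x + random.gauss(0.0, 1.0)) for x in xs]
random.shuffle(data)


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.mean(y for _, y in nearest)


def cv_error(data, k, folds=5):
    """Mean squared error of k-NN regression, estimated by cross-validation."""
    errors = []
    for f in range(folds):
        test = data[f::folds]  # every folds-th point is held out
        train = [p for i, p in enumerate(data) if i % folds != f]
        errors += [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return statistics.mean(errors)


grid = [1, 3, 5, 9, 15, 25]
scores = {k: cv_error(data, k) for k in grid}
best_k = min(scores, key=scores.get)
print(scores, "best k:", best_k)
```

The same pattern (a grid of candidates, a resampling scheme, an error measure) is what tuning tools automate for real models such as Regression SVMs.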
When it comes to getting to know Machine Learning / Statistical Algorithms, I’d suggest the following books:

  • Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank.
  • The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.

Time Series Forecasting

In many situations you might have to identify and predict trends from Time Series Data. A very good introductory book is Forecasting: Principles and Practice by Hyndman and Athanasopoulos, which contains sections on Time Series Forecasting. Time Series Analysis and Its Applications: With R Examples by Shumway and Stoffer is another book with practical examples and R code, as the title suggests.
If you are interested in learning more about Time Series Forecasting, I would also suggest the ForeCA (Forecastable Component Analysis) R package written by Georg Goerg (working at Google at the time of writing), which tells you how forecastable a Time Series is (Ω = 0: white noise, therefore not forecastable; Ω = 100: sinusoid, perfectly forecastable).
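
The intuition behind such a score can be sketched with a spectral entropy: a series whose power sits at a single frequency is easy to predict, while one whose power is spread evenly over all frequencies is not. The code below is a rough illustration of that idea only. It is not the ForeCA implementation; the function name `forecastability` and the use of a raw, unsmoothed periodogram are assumptions, so white noise scores low but not exactly 0.

```python
import cmath
import math
import random


def forecastability(series):
    """Rough Omega-like score in [0, 100] from the entropy of the raw periodogram."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]

    # Raw periodogram at the positive Fourier frequencies (plain O(n^2) DFT).
    power = []
    for k in range(1, n // 2 + 1):
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centered))
        power.append(abs(coef) ** 2)

    total = sum(power)
    probs = [p / total for p in power]

    # Shannon entropy of the normalized spectrum; 0 * log(0) is treated as 0.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 100.0 * (1.0 - entropy / math.log(len(probs)))


random.seed(1)
n = 256
sinusoid = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]

print(f"sinusoid: {forecastability(sinusoid):.1f}")
print(f"noise   : {forecastability(noise):.1f}")
```

The sinusoid lands at essentially 100, and the noise series lands far below it, which matches the two ends of the scale described above.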

Computer Programming Knowledge

This is another essential skill. It allows you to use several Data Science tools/APIs that mainly require Java and Python skills. Scala also appears to be becoming an important programming language for Data Science. R knowledge is considered a “must”. Having prior programming knowledge gives you an edge when you wish to learn a new programming language. You should also keep an eye on trends in programming language requirements (see Finding the right Skillset for Big Data Jobs). It appears that Java is currently the most sought-after language, followed by Python and SQL. It is also useful to look at Google Trends, but interestingly “Python” is not available as a Programming Language topic at the time of writing.

Database Knowledge

In my experience this is a very important skill to have. More often than not, the Database Administrators (or other IT Engineers) who are supposed to extract Data for you are just too busy to do so. That means you must know how to connect to a Database, optimize a Query and perform several Queries/Transformations to get the Data that you want in the format that you want.
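
The three steps mentioned above (connect, optimize, query/transform) can be sketched with the SQLite module that ships with Python. The `orders` table, its columns and its rows are made up for this example, and a production database would of course be reached through its own driver and credentials.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, day TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount, day) VALUES (?, ?, ?)",
    [
        ("alice", 120.0, "2015-08-01"),
        ("bob", 35.5, "2015-08-01"),
        ("alice", 80.0, "2015-08-02"),
        ("carol", 210.0, "2015-08-03"),
        ("bob", 14.5, "2015-08-03"),
        ("alice", 40.0, "2015-08-03"),
    ],
)

# An index on the filter column may let the engine avoid a full table scan.
conn.execute("CREATE INDEX idx_orders_day ON orders (day)")

query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    WHERE day >= ?
    GROUP BY customer
    ORDER BY total DESC
"""

# Ask SQLite how it intends to run the query before relying on it.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("2015-08-02",)):
    print(row)

# Run the aggregation and fetch the result in the shape needed for analysis.
rows = conn.execute(query, ("2015-08-02",)).fetchall()
print(rows)
conn.close()
```

Doing the filtering and aggregation inside the database, rather than pulling every row and summarizing afterwards, is usually the first optimization worth making.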

Web Scraping

This is a useful skill to have. There are tons of useful Data that you can access if you know how to write code to access and extract information from the Web. You should get to know HTML elements and XPath. Some examples of software that can be used for this purpose:

  • Scrapy
  • Apache Nutch
  • JSoup
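
To show what selecting HTML elements with XPath looks like, here is a minimal sketch that uses only the Python standard library instead of the tools listed above. The page content is made up, and it is deliberately well-formed: `xml.etree.ElementTree` supports only a subset of XPath and cannot parse the messy HTML found on real sites, which is where a dedicated scraping library comes in.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page standing in for a downloaded document (made up for this example).
page = """
<html>
  <body>
    <div class="product"><h2>Laptop</h2><span class="price">899.00</span></div>
    <div class="product"><h2>Phone</h2><span class="price">499.50</span></div>
    <div class="ad"><h2>Buy now!</h2></div>
  </body>
</html>
"""

root = ET.fromstring(page)

# XPath: every <div> whose class attribute is "product", anywhere in the tree.
products = []
for div in root.findall(".//div[@class='product']"):
    name = div.find("h2").text
    price = float(div.find("span[@class='price']").text)
    products.append((name, price))

print(products)
```

The expression skips the advertisement block because its class attribute does not match, which is the kind of targeting that makes XPath worth learning.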

Text Data

Text Data contain valuable information: consumer opinions, sentiment and intentions, to name just a few. Information Extraction and Text Analytics are important technologies that a Data Scientist should ideally know.

Information Extraction:

  • GATE
  • UIMA

Text Analytics:

  • The “tm” R Package
  • LingPipe
  • NLTK
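
Most Text Analytics workflows start by turning raw text into word counts. The sketch below does this with the Python standard library only, rather than with the packages listed above. The reviews and the tiny stop-word list are invented for this example.

```python
import re
from collections import Counter

# Hypothetical customer reviews and a tiny hand-made stop-word list.
reviews = [
    "The battery life is great, really great!",
    "Terrible battery. The screen is great though.",
    "Great screen, great price.",
]
STOP_WORDS = {"the", "is", "a", "though", "really"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


# Term frequency: how often each word occurs across all reviews.
term_freq = Counter(t for r in reviews for t in tokenize(r))

# Document frequency: in how many reviews each word occurs at least once.
doc_freq = Counter(t for r in reviews for t in set(tokenize(r)))

print(term_freq.most_common(3))
print(doc_freq["battery"], doc_freq["great"])
```

Counts like these are the raw material for the term-document matrices that text mining packages build, and for simple sentiment or topic summaries.
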
The following books are suggested:

  • Introduction to Information Retrieval by Manning, Raghavan and Schütze.
  • Handbook of Natural Language Processing by Indurkhya, Damerau (Editors).
  • The Text Mining HandBook – Advanced Approaches in Analyzing Unstructured Data by Feldman and Sanger.

Finally, here are some books that no Data Scientist should miss:

  • Data Mining and Statistics for Decision Making by Stéphane Tufféry (A personal favorite).
  • Introduction to Data Mining by Tan, Steinbach, Kumar.
  • Applied Predictive Modeling by Kuhn and Johnson.
  • Data Mining with R – Learning with Case Studies by Torgo.
  • Principles of Data Mining by Bramer.

This article originally appeared here. Republished with permission.