Data mining is often difficult and time-consuming, so starting without a clear plan will severely blur the project's focus. Decide up front what data will be mined, how, and when. Do not just accumulate a big pile of data and toss it into the data mining engine.
2. Do plan for the data to be messy and disorganized
Data now arrives in many formats, and its volume keeps growing. It is therefore essential to organize the data and load it into a data mining model before drawing any conclusions from it. The variety of formats means a lot of time will be spent on the ETL (Extract, Transform, Load) process. Never underestimate the importance of good data preparation.
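As a rough illustration of the preparation work described above, the sketch below normalizes a few messy records using only the Python standard library. The field names, the sample rows, and the list of date layouts are assumptions made up for this example; real sources will need their own rules.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from different sources.
RAW_ROWS = [
    {"customer": " Alice ", "signup": "2021-03-05", "spend": "120.50"},
    {"customer": "BOB", "signup": "05/03/2021", "spend": "80"},
    {"customer": "carol", "signup": "March 5, 2021", "spend": "n/a"},
]

# Date layouts we assume the sources use; extend this for real data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def parse_date(text):
    """Try each known layout and return a date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Return the amount as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def transform(row):
    """Bring one raw record into a single, consistent shape."""
    return {
        "customer": row["customer"].strip().title(),
        "signup": parse_date(row["signup"]),
        "spend": parse_amount(row["spend"]),
    }


clean_rows = [transform(r) for r in RAW_ROWS]

# Keep incomplete records aside for review instead of silently dropping them.
rejected = [r for r in clean_rows if r["signup"] is None or r["spend"] is None]
```

Setting rejected rows aside, rather than guessing their values, keeps the questionable records visible for later review.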
3. Do ask the right questions
Often it is the right questions, rather than the right algorithms, that make a big data project succeed, and it is very easy to get this wrong. Handle anomalies and mistakes during the ETL process, before it is too late to fix them. Do crosscheck the extracted data against the original sources and with the project stakeholders.
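One possible way to crosscheck extracted data against its source is to compare a few simple summaries of both. The sketch below compares row counts, record keys, and a column total. The function names and the `id` and `amount` fields are hypothetical, chosen only for illustration.

```python
def summarize(rows, key, amount):
    """Collect simple summaries that should survive extraction unchanged."""
    return {
        "row_count": len(rows),
        "keys": {r[key] for r in rows},
        "total": round(sum(r[amount] for r in rows), 2),
    }


def crosscheck(source_rows, extracted_rows, key="id", amount="amount"):
    """Return a list of human-readable discrepancies (empty if none)."""
    src = summarize(source_rows, key, amount)
    ext = summarize(extracted_rows, key, amount)
    problems = []
    if src["row_count"] != ext["row_count"]:
        problems.append(
            f"row count differs: {src['row_count']} vs {ext['row_count']}"
        )
    missing = src["keys"] - ext["keys"]
    if missing:
        problems.append(f"keys missing after extraction: {sorted(missing)}")
    if src["total"] != ext["total"]:
        problems.append(f"total differs: {src['total']} vs {ext['total']}")
    return problems


# Made-up sample data: the extraction lost the record with id 3.
source = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.5},
    {"id": 3, "amount": 5.0},
]
extracted = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.5},
]

for problem in crosscheck(source, extracted):
    print(problem)
```

A list of discrepancies in plain words is also something that can be shown to stakeholders directly.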
4. Do not ascribe mystical powers to the algorithms
In big data it is easy to find patterns in what is really random noise. Use checks such as randomization tests to tell genuine patterns from chance. Do not over-focus on the software; it is no substitute for your own insight and abilities. Do use more than one algorithm. Do crosscheck each algorithm's results against the data obtained from the ETL process.
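A randomization test of the kind mentioned above can be sketched in a few lines. This version is a permutation test on the difference in means between two groups: it shuffles the pooled values many times and counts how often a random split looks at least as extreme as the observed one. The function name, the sample numbers, and the number of permutations are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often chance alone gives a gap in means this large."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The add-one correction keeps the estimate from being exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up measurements for two customer segments.
segment_a = [21, 23, 25, 26, 27, 28, 30, 31]
segment_b = [11, 12, 14, 15, 16, 17, 18, 19]

p_value = permutation_test(segment_a, segment_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small estimate suggests the gap between the groups is unlikely to be a chance pattern; a large one is a warning that the "pattern" may be noise.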
5. Do simplify your solution
It is easy to engineer a complex solution with many variables, but a simpler solution is often the better choice even if it is slightly less accurate. A model a client cannot grasp will not be trusted as much as one that “makes sense.”
6. Do not use the default model accuracy metric
The default model accuracy metric is useful for models where the errors are not too big compared to the average. More often, some errors are worse than others, and a model should be selected for more than good accuracy on average. So do not rely on the default accuracy metric; prefer one that reflects which errors matter most.
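To illustrate why plain accuracy can mislead, the sketch below scores two hypothetical fraud models twice: once by accuracy and once by an average cost in which a missed fraud counts ten times as much as a false alarm. The labels, predictions, and cost values are invented for the example; real costs would have to come from the project.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_cost(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per case; the two cost values are assumptions."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += fn_cost  # missed a real fraud case
        elif t == 0 and p == 1:
            total += fp_cost  # raised a false alarm
    return total / len(y_true)


# 1 = fraud, 0 = legitimate (made-up labels).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # never flags anything
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # catches both, 3 false alarms

for name, pred in (("A", model_a), ("B", model_b)):
    print(
        f"model {name}: accuracy={accuracy(y_true, pred):.2f}, "
        f"average cost={average_cost(y_true, pred):.2f}"
    )
```

Model A has the higher accuracy but misses every fraud case, so it does worse once the errors are weighted by how much they hurt.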
7. Do not forget to document all the modelling steps and the underlying data
It is important to document all the data for future reference. The record is also useful when you move on to the next modelling step and need to check the data used at each earlier step.
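One lightweight way to keep such a record is to log every modelling step together with its parameters and a fingerprint of the data it produced. The sketch below is only one possible approach; the helper names and the sample rows are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash the data so that later changes to it can be detected."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record_step(log, name, params, rows):
    """Append one entry describing a modelling step and its output data."""
    log.append(
        {
            "step": name,
            "params": params,
            "row_count": len(rows),
            "data_sha256": fingerprint(rows),
        }
    )
    return rows


run_log = []

# Made-up data and steps, for illustration only.
rows = [{"id": 1, "spend": 120.5}, {"id": 2, "spend": 8.0}]
rows = record_step(run_log, "load", {"source": "example"}, rows)

big_spenders = [r for r in rows if r["spend"] >= 50]
big_spenders = record_step(
    run_log, "filter", {"min_spend": 50}, big_spenders
)

print(json.dumps(run_log, indent=2))
```

Comparing fingerprints between runs shows at a glance whether the data going into a step has changed.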
8. Do not use the same algorithm for two similar-looking data sets
Chances are the two data sets are driven by different underlying factors, and reusing the same algorithm unchanged would lead to a badly inaccurate conclusion.