Here is another interesting use case that came up while I was working with one of our clients in the insurance industry. The client had an enormous amount of claims data residing in multiple SQL Server databases that needed to be consolidated into one. Some of the queries on this data took days to run, so we were looking for an alternative solution that could process the data in a distributed fashion and save time. Since the company was already using Hadoop, we started looking into a Hadoop-based solution.
We had a few options on the table, such as Hive, Pig, and HBase, and after some brainstorming decided to go with HBase for the following reasons:
- It is an open source distributed database that yields high performance while remaining cost effective.
- We do not have to worry about distributing the data for faster processing, since Hadoop (HDFS) takes care of it.
- Support for batch processing without the need to maintain secondary indexes; data is accessed directly by row key.
- Data integrity, as HBase confirms a write only after its write-ahead log entry has been acknowledged by all three HDFS replicas.
- Easily scalable, fault tolerant, and highly available.
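To give a feel for the row-key access model mentioned above, here is a minimal HBase shell session. The table name, column family, and row keys are hypothetical examples, not the client's actual schema:

```
hbase> create 'claims', 'cf'                       # table with one column family
hbase> put 'claims', 'claim-00001', 'cf:amount', '1250.00'
hbase> get 'claims', 'claim-00001'                 # direct lookup by row key, no index needed
hbase> scan 'claims', {STARTROW => 'claim-00001', STOPROW => 'claim-00010'}
```

Lookups and range scans are driven entirely by the row key, which is why choosing a good row-key design matters far more in HBase than defining indexes does in a relational database.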
Now the next step was to move the data from SQL Server into HDFS, for which we used Sqoop.
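A Sqoop import along these lines gets the job done. This is a sketch, not the exact command we ran: the connection string, database, table, and column names below are placeholders, and the real import ran once per source database:

```
# Import a claims table from SQL Server into HDFS.
# Host, database, credentials, and table/column names are placeholders.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=claims_db" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --username etl_user -P \
  --table claims \
  --split-by claim_id \
  --num-mappers 8 \
  --target-dir /data/claims
```

The `--split-by` column is what lets Sqoop partition the source table across parallel mappers, so it should be a roughly uniformly distributed key; `-P` prompts for the password instead of putting it on the command line.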