Store and ETL Big Data in the Cloud with Apache Hive

NoSQL | Tech and Tools | Published December 26, 2013 | arvindl

Big Data and cloud storage, paired with the processing capabilities of Apache Hadoop and Hive as a service, can be an excellent complement to expensive data warehouses. The ever-increasing storage demands of big data applications put pressure on in-house storage solutions. This data is generally not stored in database systems because of its size; in fact, it is commonly the precursor to a data mining or aggregation process, which writes the distilled information into a database. For this process, however, the constantly growing data collected from logs, transactions, sensors, and other sources has to be stored somewhere inexpensively, safely, and accessibly.

Cloud Storage

Most companies can achieve two of these three attributes: safe and inexpensive (tapes), safe and accessible (multi-location data servers), or inexpensive and accessible (a non-redundant server or network-attached drive). Combining all three requires economies of scale beyond the reach of most companies. Cloud providers like Amazon Web Services have taken on the challenge and offer excellent solutions like Amazon S3, which provides a virtually limitless data sink with tremendous safety (99.999999999% durability), instant access, and reasonable pricing.
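Shipping collected files into S3 needs very little code. As a minimal sketch, assuming Python with the boto3 AWS SDK and credentials configured in the environment (the bucket name and key layout below are hypothetical):

    import boto3  # AWS SDK for Python; reads credentials from the environment

    # Ship a local log file into S3, the single collection point.
    # The dt= prefix in the key is a common convention that anticipates
    # date-partitioned processing later (for example, by Hive).
    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="/var/log/app/events.log",
        Bucket="my-data-sink",
        Key="logs/dt=2013-12-26/events.log",
    )

The same call works from any number of producers, so logs, transaction dumps, or sensor exports can all converge on one bucket.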

Utilizing Cloud Data

Big data scenarios in the cloud should consequently consider S3 as a data sink that collects data in a single place. Once collected, the data has to be processed. One option is to write custom software to load the data, parse it, and aggregate and export the contained information to another storage format or data store. This common ETL (Extract, Transform, Load) processing is encountered in data warehouse situations, so reinventing the wheel with a custom solution is undesirable. At the same time, building or scaling a data warehouse for Big Data is an expensive proposition.
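This is where Hive closes the gap: an external table can be mapped directly onto the S3 location, and the transform and load steps become plain queries, with no custom loader to write. The following is a minimal sketch using the PyHive client against a HiveServer2 endpoint; the host, schema, and S3 paths are assumptions for illustration, and the s3:// scheme presumes an S3-aware Hadoop setup such as Amazon EMR:

    from pyhive import hive  # any HiveServer2 client would do

    # Connect to a HiveServer2 endpoint (host and port are placeholders).
    conn = hive.Connection(host="hive.example.com", port=10000, username="etl")
    cur = conn.cursor()

    # Extract: expose the raw S3 files as a table without moving the data.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (
            ts BIGINT, user_id STRING, action STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION 's3://my-data-sink/logs/'
    """)

    # Transform and Load: aggregate the raw events into a compact table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS daily_actions (
            day STRING, action STRING, cnt BIGINT
        )
    """)
    cur.execute("""
        INSERT OVERWRITE TABLE daily_actions
        SELECT to_date(from_unixtime(ts)), action, COUNT(*)
        FROM raw_logs
        GROUP BY to_date(from_unixtime(ts)), action
    """)

Because the table is external, Hive reads the files in place and only the distilled aggregate is written out, which is exactly the ETL pattern described above, without building or scaling a warehouse.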
