Big data has become a relevant topic in many companies this year. Although there is no standard definition of the term "big data", Hadoop is the de facto standard for processing big data. Almost all large software vendors, such as IBM, Oracle, SAP, and even Microsoft, use it. However, once you have decided to use Hadoop, the first question is how to start and which product to choose for your big data processes. Several alternatives exist for installing a version of Hadoop and implementing big data processes. This article discusses these alternatives and recommends when to use each one.
Alternatives for Hadoop Platforms
The following picture shows different alternatives for Hadoop platforms. You can install just the Apache release, choose one of several distributions from different vendors, or use a big data suite. It is important to understand that every distribution contains Apache Hadoop, and that almost every big data suite contains or uses a distribution.
Let’s now take a closer look at the different alternatives, beginning with Apache Hadoop in the next section.
The current Apache Hadoop project (version 2.0) includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
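
To make the MapReduce module a bit more concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python, using the classic word count. It does not use any Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and everything runs in a single process instead of on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially (illustration only)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

In a real Hadoop job, the framework distributes the map and reduce calls across the nodes of the cluster and performs the grouping step itself; the sketch only shows the data flow.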