In a recent conversation with project team members at a client, one of them shared an internal slide deck used to promote the benefits of big data in general, and Hadoop in particular, to both key management decision makers and the development and implementation groups in IT. One interesting aspect of the presentation was its comparison of Hadoop to earlier computing ecosystems and its casting of the open source distributed processing framework in the role of “operating system” for a big data environment.
At the time the slide deck was assembled, that characterization was something of a stretch. The core components of the initial Hadoop release were the Hadoop Distributed File System (HDFS) for storing and managing data and an implementation of the MapReduce programming model. The latter included application programming interfaces, runtime support for processing MapReduce jobs and an execution environment that provided the infrastructure for allocating resources in Hadoop clusters and then scheduling and monitoring jobs.
While those components acted as proxies for aspects of an operating system, the framework’s processing capabilities were limited by its architecture, with the JobTracker resource manager and the application logic and data processing layers all combined in MapReduce.
So what did that mean for running business intelligence and analytics applications? It was a significant constraint: Although the task scheduling capabilities allowed for parallel execution of MapReduce applications, typically only one batch job could execute at a time. That effectively prevented the interleaving of different types of analysis in Hadoop systems. Batch analytics applications would have to run on a set of cluster nodes separate from those used by a front-end query engine accessing data in HDFS.