Hadoop Glossary: 20 most important terms

Published August 12, 2014 | Baiju NT

This is a list of the most important Hadoop terms you need to know and understand before diving into the Hadoop ecosystem.

Most important Hadoop terms

Apache or Apache Software Foundation (ASF): A non-profit foundation set up to support open source software projects. Apache projects are released under the Apache License, and the foundation provides legal protection to volunteers who work on Apache projects.

Apache Hadoop: An open source platform that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The platform is particularly suited to large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs.

Apache Spark: An open-source cluster computing framework for data analytics, originally developed in the AMPLab at UC Berkeley. It can run on top of the Hadoop Distributed File System and, by keeping working data in memory, is often much faster than MapReduce. It provides high-level APIs in Scala, Python and Java.

Flume: A service for collecting, aggregating, and moving large amounts of log and event data into Hadoop.

Hadoop Common: Usually only referred to by programmers, Hadoop Common is a common utilities library that contains code to support some of the other modules within the Hadoop ecosystem. When Hive and HBase want to access HDFS, for example, they do so using JARs (Java archives), which are libraries of Java code stored in Hadoop Common.

HBase: An open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable, as described in the paper “Bigtable: A Distributed Storage System for Structured Data”.
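
To make "versioned" concrete, here is a minimal in-memory sketch of the idea that each cell can hold several timestamped values. The class name `VersionedTable`, its methods, and the `max_versions` parameter are hypothetical and chosen for illustration; this is not the HBase client API, only a toy model of the data layout.

```python
class VersionedTable:
    """Toy model of a versioned cell store (illustrative only, not HBase's API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, column) -> list of (timestamp, value), newest first
        self._cells = {}

    def put(self, row, column, value, timestamp):
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        # Keep the newest versions first and drop any beyond max_versions.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest value at or before `timestamp`."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = VersionedTable(max_versions=2)
table.put("user1", "info:email", "a@example.com", timestamp=1)
table.put("user1", "info:email", "b@example.com", timestamp=2)
table.put("user1", "info:email", "c@example.com", timestamp=3)

latest = table.get("user1", "info:email")                 # newest version
as_of_2 = table.get("user1", "info:email", timestamp=2)   # older version still kept
as_of_1 = table.get("user1", "info:email", timestamp=1)   # evicted, so None
```

Writing a new value does not overwrite the old one; it adds a newer version, and the oldest versions fall away once the assumed limit is reached.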

HDFS: An acronym for “Hadoop Distributed File System”, which breaks large files into smaller data blocks that are replicated and distributed across a cluster of commodity hardware for fault tolerance and faster parallel processing.
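
The splitting and replication described above can be sketched in a few lines. This is a simplified illustration, not how HDFS is implemented: the function names, the tiny block size, and the round-robin placement are all assumptions made for the example (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A 10-byte "file" with a 4-byte block size yields blocks of 4, 4 and 2 bytes.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(
    len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3
)
```

Because every block lives on several nodes, losing one machine does not lose data, and different blocks of the same file can be read in parallel.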

Hive: A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It allows you to query data using a SQL-like language called HiveQL (HQL).

HiveQL (HQL): A SQL-like query language for Hive. Queries are compiled into MapReduce jobs that run over data stored in HDFS.

HUE: A browser-based desktop interface for interacting with Hadoop.

Impala: An SQL query engine with massively parallel processing (MPP) power, running natively on the Apache Hadoop framework. It shares the same flexible file system (HDFS), metadata, resource management and security frameworks as used by other Hadoop ecosystem components.

JobTracker: The service within Hadoop that distributes MapReduce tasks to specific nodes in the cluster.

MapReduce: A software framework for easily writing applications that process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop acts as a platform for executing MapReduce.
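
The programming model itself is small enough to sketch on a single machine. The word count below walks through the three steps (map, shuffle, reduce) in plain Python; the function names are illustrative and this is not Hadoop's Java API, which additionally distributes the work across the cluster and handles failures.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a framework run many of them in parallel on different machines.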

NameNode: The core of HDFS. The NameNode maintains the directory tree of all files stored on the Hadoop cluster and tracks where each file’s blocks are kept; it does not store the file data itself.

Oozie: A workflow engine for Hadoop.

Pig: A platform for creating MapReduce programs used within Hadoop, built around a high-level language called Pig Latin.

Sqoop: A tool designed to transfer data between Hadoop and relational databases.

Whirr: A set of libraries for running cloud services. It’s ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs.

YARN: A resource manager for Hadoop 2. YARN is short for “Yet Another Resource Negotiator”.

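ZooKeeper: A centralized service that lets distributed applications coordinate with one another, for example by maintaining configuration information, naming, and synchronization. Hadoop administrators use it to track and coordinate distributed applications.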