How-to: Create a Simple Hadoop Cluster with VirtualBox

I wanted to get familiar with the big data world and decided to test Hadoop. I started with Cloudera’s pre-built virtual machine, the Cloudera QuickStart VM, which ships with the full Apache Hadoop suite pre-configured. It was an interesting and informative experience: the QuickStart VM is fully functional and lets you test many Hadoop services, even though it runs as a single-node cluster.

I wondered what it would take to install a small four-node cluster…

I did some research and found an excellent video on YouTube that gives a step-by-step explanation of how to set up a cluster with VMware and Cloudera. I adapted that tutorial to use VirtualBox instead, and this article describes the steps I used.

The overall approach is simple. We create a virtual machine and configure it with the parameters and settings it needs to act as a cluster node (especially the network settings). This reference virtual machine is then cloned as many times as there are nodes in the Hadoop cluster. After that, each clone needs only a small set of changes to become operational: its hostname and IP address.
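The cloning step can be sketched in a few lines of Python. This is a minimal sketch, not the exact procedure from the tutorial: the reference VM name `hadoop-base`, the node names `hadoop1` to `hadoop4`, and the 192.168.56.x addresses are all assumptions made for illustration. It only builds `VBoxManage clonevm` command lines and a hostname-to-IP plan; the hostname and IP address still have to be set inside each guest afterwards.

```python
import shlex

BASE_VM = "hadoop-base"   # hypothetical name of the reference VM
NODE_COUNT = 4            # the article builds a four-node cluster


def clone_commands(base_vm, node_count, prefix="hadoop"):
    """Return one `VBoxManage clonevm` argument list per cluster node."""
    commands = []
    for i in range(1, node_count + 1):
        commands.append(
            ["VBoxManage", "clonevm", base_vm,
             "--name", f"{prefix}{i}", "--register"]
        )
    return commands


def node_identities(node_count, prefix="hadoop",
                    subnet="192.168.56.", first_host=101):
    """Map each node's hostname to the IP address planned for it.

    The subnet and starting address are assumptions; use whatever
    matches your own VirtualBox network settings.
    """
    return {
        f"{prefix}{i}": f"{subnet}{first_host + i - 1}"
        for i in range(1, node_count + 1)
    }


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in clone_commands(BASE_VM, NODE_COUNT):
        print(shlex.join(cmd))
    # Per-clone changes that remain: hostname and IP address.
    for host, ip in node_identities(NODE_COUNT).items():
        print(f"{host}\t{ip}")
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without VirtualBox installed; passing each list to `subprocess.run` would perform the actual cloning.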

In this article, I create a four-node cluster. The first node, which runs most of the cluster services, requires more memory (8 GB) than the other three nodes (2 GB each). Overall we allocate 14 GB of memory to the virtual machines, so make sure the host machine has enough memory to cover this on top of its own needs; otherwise, performance will suffer.
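The memory budget is easy to check before creating anything. The sketch below is illustrative only: the node names and the 2 GB reserved for the host operating system are assumptions, not values from the tutorial. It adds up the per-node allocations and builds the matching `VBoxManage modifyvm --memory` commands, which take the size in megabytes.

```python
# Per-node memory in GB, as described above: one 8 GB node and three 2 GB nodes.
# The node names are hypothetical.
NODE_MEMORY_GB = {"hadoop1": 8, "hadoop2": 2, "hadoop3": 2, "hadoop4": 2}


def total_memory_gb(plan):
    """Total memory allocated to all virtual machines, in GB."""
    return sum(plan.values())


def fits_on_host(plan, host_memory_gb, host_reserve_gb=2):
    """Check that the host keeps `host_reserve_gb` for itself.

    The 2 GB default reserve is an assumption; adjust it to your host.
    """
    return total_memory_gb(plan) + host_reserve_gb <= host_memory_gb


def modifyvm_commands(plan):
    """Build one `VBoxManage modifyvm` argument list per node (memory in MB)."""
    return [
        ["VBoxManage", "modifyvm", name, "--memory", str(gb * 1024)]
        for name, gb in plan.items()
    ]


if __name__ == "__main__":
    print("Total VM memory (GB):", total_memory_gb(NODE_MEMORY_GB))
    print("Fits on a 16 GB host:", fits_on_host(NODE_MEMORY_GB, 16))
    for cmd in modifyvm_commands(NODE_MEMORY_GB):
        print(" ".join(cmd))
```

With these numbers the total is 8 + 3 × 2 = 14 GB, so a 16 GB host passes the check under the assumed reserve, while a 14 GB host would not.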


Arvind Lakshminarayanan

Arvind is the editor-in-chief of Big Data Made Simple. He is also a content specialist at Crayon Data.