Our data platform with Docker

Published February 10, 2016

This article has been co-authored by Rafi Syed and Sree Pratheep.

As a continuation of our earlier article on Docker, here is a brief account of how we started our Docker journey.

Our goal was to have a completely self-sufficient development-cum-integration environment, to make the development experience smoother and to reduce the ramp-up time for any newcomer to the team. We also wanted to break the dependency on AWS infrastructure at development time and reduce costs substantially.

To achieve this, we had to run multiple nodes with different configurations together, such as the data ingestion node, the development and production nodes, and so on.

We could have gone for a multi-VM solution, with VMs of different flavours/configurations, but that approach has a few problems:

  • We would need to install and configure each VM manually with the appropriate services
  • Setting up inter-communication across VMs would be hard, especially given the dependencies involved across nodes
  • Performance would be really poor when running multiple VMs on a single machine

Based on these, we went for a Docker-based approach.

Approach

We started by creating different Docker images, each with the set of services pre-installed for the nodes mentioned above. Then we had to tackle the problem of running them together and establishing the required dependencies among them. For example, the processed data needs to move from one node to another. This had to be achieved with logical references between the nodes, i.e. without referring to any node by its IP address.
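As a rough illustration of this kind of logical reference with plain Docker (the image and container names here are hypothetical placeholders), container links let one node address another by an alias instead of an IP:

    # hypothetical image/container names, shown only to illustrate logical linking
    docker run -d --name ingestion crayon/ingestion
    docker run -d --name processing --link ingestion:ingestion crayon/processing
    # inside "processing", the ingestion node is now reachable by the host name "ingestion"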

To achieve this (i.e. to run multiple instances together with coordination), we had two options:

  • A pure Docker-only solution with Docker Compose.
  • A Docker plus Vagrant based solution.

We decided to go for the Vagrant plus Docker approach over the Docker-only one.

Vagrant and Docker

Vagrant helps us create and configure lightweight, reproducible, and portable development environments. This was exactly what we were looking for, i.e. a reproducible development environment.

The reasons we didn't go for Docker Compose were:

  • It lacked some functionality for provisioning instances when compared to Vagrant.
  • Vagrant had provisions to express dependencies among the instances, like the order in which they need to be brought up, as well as other forms of dependencies.
  • We could also specify whether the instances need to be brought up sequentially or in parallel.
  • Vagrant also lets us halt/resume the entire environment with a single command, "vagrant up/halt" (similarly "vagrant suspend/resume"), as shown below.
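For example, the whole environment can be managed with the standard Vagrant commands:

    vagrant up         # bring up all the nodes (add --no-parallel for sequential startup)
    vagrant halt       # stop the entire environment
    vagrant suspend    # pause the environment
    vagrant resume     # resume it from where it left off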

We started writing Vagrantfiles to stitch together all our images the way we required:

  1. We had to make our images start in the appropriate sequence.
  2. We added provisioning scripts for all the instances, with commands to start up the appropriate services for each node/instance.
  3. We customized the create-cluster and startup scripts of the SequenceIQ image according to our requirements.
  4. We added the port mapping and linking configuration for the relevant instances using the Docker provider options in the Vagrantfile (see the sketch after this list).
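A minimal sketch of what such a Vagrantfile can look like with the Docker provider (the image names, container names, and ports here are placeholders, not our actual configuration):

    # minimal sketch, not our exact Vagrantfile
    Vagrant.configure("2") do |config|

      # data ingestion node
      config.vm.define "ingestion" do |node|
        node.vm.provider "docker" do |d|
          d.image = "crayon/ingestion"        # hypothetical custom image
          d.name  = "ingestion"
          d.ports = ["8080:8080"]             # host:container port mapping
        end
      end

      # processing node, linked to the ingestion node by name
      config.vm.define "processing" do |node|
        node.vm.provider "docker" do |d|
          d.image = "crayon/processing"       # hypothetical custom image
          d.name  = "processing"
          d.link("ingestion:ingestion")       # reachable as "ingestion", no IP needed
        end
      end
    end

Running "vagrant up --no-parallel" brings the containers up one by one, in the order they are defined in the Vagrantfile.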

Our achievement

With one "vagrant up" command, developers get a consistent development environment to work on (on any OS, be it Linux, Mac, or Windows). No extra step other than installing Vagrant is required.

We tweaked the SequenceIQ base image to bring up the Hadoop cluster with the needed services pre-installed and created a custom image for Crayon. We defined our own blueprints and used Vagrant to trigger them instead of using the Ambari shell.
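As a rough sketch of how such a customization can look (the base image tag, file names, and scripts below are assumptions, not our exact setup):

    # hypothetical Dockerfile for the customised cluster image
    FROM sequenceiq/hadoop-docker:2.7.0
    # bake in our own blueprint and cluster-creation script
    ADD blueprints/crayon-blueprint.json /blueprints/crayon-blueprint.json
    ADD scripts/create-cluster.sh /usr/local/bin/create-cluster.sh
    RUN chmod +x /usr/local/bin/create-cluster.sh
    CMD ["/usr/local/bin/create-cluster.sh"]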

We could completely avoid using AWS infrastructure at development time, except on rare occasions. We used it only for production and deployment purposes.

We also set up our own private Docker registry for storing our custom images, using the official registry Docker image.
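For instance, a private registry can be run and used along these lines (the image being pushed is a placeholder):

    # run the official registry image as a private registry on port 5000
    docker run -d -p 5000:5000 --name registry registry:2
    # tag a custom image against the private registry and push it
    docker tag crayon/processing localhost:5000/crayon/processing
    docker push localhost:5000/crayon/processing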

Best Practices

Some of the best practices that can be followed while using/adopting Docker are:

  • It is better to have a hierarchy of images when creating your own Docker images with the required services pre-installed.
  • Define a base image that contains components that rarely change during the life cycle of the product.
  • On top of it, build another image (referencing it in the FROM line of your Dockerfile) with the components you foresee changing often; see the sketch after this list.
  • Hook the latter image into your product's build system so that the Docker image is created with your latest artifacts. Referring to the Apache Spark source code can be a nice start to integrating the Docker build with your product's build system.
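A minimal sketch of such a hierarchy (the package choices and image names are only illustrative):

    # base/Dockerfile – components that rarely change
    FROM ubuntu:14.04
    RUN apt-get update && apt-get install -y openjdk-7-jdk

    # app/Dockerfile – components that change often, built on top of the base image
    FROM crayon/base:latest
    COPY target/app.jar /opt/app/app.jar
    CMD ["java", "-jar", "/opt/app/app.jar"]

When only the application artifacts change, just the upper image needs to be rebuilt, which keeps builds fast and the base layers cached.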