How to Escape the Dark Valley of Your Hadoop Journey

Hadoop | Tech and Tools
Published September 30, 2013
Gary Nakamura

It happens to the best of us. You know your business is bursting with useful data, and you’ve only begun to scratch the surface. So you strike out to build an analytical platform, using all the great open-source tools you’ve been hearing so much about. First, you have to capture all the data that’s coming in, before you even know what you’re going to do with it. So you build a butterfly net to catch it all, using Hadoop. But as soon as the net is cast, everything goes dark. You know the data’s there, but you can’t get at it, or if you can, it comes out in unusable formats. Your current systems won’t talk to it, and you don’t have a staff of PhDs in programming, the budget for a United Nations’ worth of translators, or the means to hire an army of consultants. A chill runs down the back of your neck. What have you done? You’ve entered the Dark Valley of Hadoop. That’s the bad news. The good news is that you’re not alone, and there’s a way out.

Warning Signs That You’re in the Dark Valley

Many data-rich companies fall into the Dark Valley for a time. You have the data, but you’re not getting the value you expect from it. You have problems testing and deploying the applications that are supposed to extract that value. You struggle to translate business requirements into code that can make the great Leviathan that is the Hadoop Distributed File System even halfway manageable. The project all of this effort was meant to serve is delayed for months, and cost overruns make stakeholders nervous. When you finally get the chance to test, you don’t get the results you were expecting. More delays ensue.

One of the cruelest tricks of the Dark Valley is the illusion that you got it right on the first try. Agile design philosophy tells us to work on small projects, test them as quickly as possible, and iterate forward. But Hadoop tends to reveal its weaknesses around manageability the deeper you get into the adoption cycle. If you’re using tools made for programmers, such as Pig and Hive, you’re betting that the programmer who built the first iteration will still be around for the second. In today’s competitive job market, there is no guarantee of that. Then there is the fact that MapReduce, Hadoop’s native programming model, is already on its second version, with a third, an entirely new computation engine built from the ground up, rapidly on its way. In the Hadoop ecosystem, low-level moving parts have the nasty habit of changing every 90 to 120 days, so you end up tracking numerous release cycles, which takes your focus off the business at hand.
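To make the "tools made for programmers" point concrete, consider what even the simplest MapReduce job looks like in Java. The sketch below is a minimal, illustrative version of the classic word-count example; the class names and paths are assumptions for illustration, and it is meant only to show how much programmer-oriented boilerplate sits between a business question and an answer, not to document any particular vendor's approach.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job: counts how often each word appears in text stored in HDFS.
public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper and reducer into a job and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If this is what it takes to count words, it is easy to see why a business analyst cannot simply pick up where a departed Java programmer left off, and why each new version of the underlying engine puts code like this at risk.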
