For anyone entering the Big Data world, the terms Big Data and Hadoop seem synonymous at first. As people learn the ecosystem, its tools and how they work, they become more aware of what big data actually means and what role Hadoop plays in the big data ecosystem.
According to Wikipedia, “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate”.
To put it in simple terms, as the size of data increases, the usual processing methods take too long or prove too costly.
Hadoop was created in 2005 by Doug Cutting, who was inspired by Google’s white papers on GFS and MapReduce. Hadoop is an open source software framework for distributed storage and distributed processing of very large data sets. In other words, it is designed to reduce the cost and time of processing large data sets.
Hadoop, with its distributed file system (HDFS) and distributed processing model (MapReduce), became the de facto standard in big data computing. The term ‘Hadoop’ refers not only to the base modules, but also to the ecosystem of other software packages that can be used along with Hadoop.
As time went on, data generation exploded, and so did the need to process large amounts of data. This eventually generated a variety of needs in big data computing, not all of which could be satisfied by Hadoop.
Most of the analysis done on data is iterative in nature. While iterative processing can be done in MapReduce, the data has to be read again for each iteration of the process. Under normal circumstances this would be fine, but reading hundreds of gigabytes or a few terabytes of data takes time, and people are not patient.
Many people consider data analytics to be an art rather than a science. In any art, the creator creates a small piece of the puzzle and attaches it to the bigger one to witness its growth. Loosely translated, data analysts want to see the results of each step before proceeding to the next one. In other words, a lot of data analytics is interactive in nature. Traditionally, interactive analytics is done through SQL: analysts write queries that operate on data in databases. Although Hadoop had equivalents (Hive and Pig), they proved to be time consuming, as each query takes a long time to process the data.
Both these hurdles led to the birth of Spark, a new processing model that facilitates iterative programming and interactive analytics. Spark provides in-memory primitives that load the data into memory and query it repeatedly. This makes Spark well suited for a lot of data analytics and machine learning algorithms.
Note that Spark only defines the distributed processing model. It does not address data storage, and it typically still relies on Hadoop (HDFS) to store the data efficiently in a distributed way.
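The difference between re-reading data on every iteration and loading it into memory once can be sketched in a toy simulation. This is plain Python, not Spark's or Hadoop's API; the SlowSource class, the function names and the averaging step are hypothetical and exist only to count how often the full dataset is read.

```python
class SlowSource:
    """Hypothetical stand-in for a dataset on disk; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read_all(self):
        self.reads += 1
        return list(self._records)


def iterate_rereading(source, iterations):
    """Read the whole dataset again on every pass, like a chain of batch jobs."""
    estimate = 0.0
    for _ in range(iterations):
        data = source.read_all()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)  # move halfway toward the mean
    return estimate


def iterate_cached(source, iterations):
    """Read the dataset once, keep it in memory, and reuse it on every pass."""
    data = source.read_all()
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)
    return estimate


rereading_source = SlowSource(range(1, 101))
cached_source = SlowSource(range(1, 101))

rereading_result = iterate_rereading(rereading_source, iterations=10)
cached_result = iterate_cached(cached_source, iterations=10)

print("full reads when re-reading:", rereading_source.reads)
print("full reads when cached:", cached_source.reads)
```

Both functions compute the same estimate; only the number of full reads differs (one per iteration versus one in total). With hundreds of gigabytes behind each read, that count is what dominates the running time.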
Spark is setting the big data ecosystem on hyperdrive. It promises to be 10-100 times faster than MapReduce. Many think this could be the end of MapReduce.
Ease of Use
Spark is easy to use compared to MapReduce. Very easy. Even simple logic can take hundreds of lines of code in MapReduce; with Spark, the same logic can be written in a few lines. This leads to a crucial factor called versatility. Many advanced machine learning algorithms and graph problems, which were impractical in MapReduce, can now be handled in Spark. This is driving Spark adoption strongly.
MapReduce doesn’t have an interactive mode. Although Hive and Pig include command-line interfaces, the performance of these systems still depends on MapReduce. MapReduce remains great for batch processing.
Spark processes data in memory, while MapReduce writes the data back to disk after processing it. As a result, Spark will generally outperform MapReduce.
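The contrast in verbosity can be felt even in a toy setting. The sketch below writes a word count twice: once with explicit map, shuffle and reduce phases, and once as a short chain of transformations in the style Spark encourages. It is plain Python and uses no Hadoop or Spark API; all function names and the sample lines are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import chain

lines = ["spark is fast", "hadoop is reliable", "spark is easy"]


def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count_explicit(lines):
    """Word count with each phase spelled out separately."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


def word_count_chained(lines):
    """The same word count as one short chain: split into words, then count."""
    return dict(Counter(word for line in lines for word in line.split()))


explicit_counts = word_count_explicit(lines)
chained_counts = word_count_chained(lines)
print(explicit_counts == chained_counts)
```

A real MapReduce job adds job configuration, input and output formats and serialization on top of the explicit version, which is where most of the extra lines come from.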
In 2014, Spark entered the Daytona GraySort contest and won it. For the uninitiated, Daytona GraySort is a third-party benchmark that measures how fast a system can sort 100 TB of data (1 trillion records).
Spark used 206 AWS EC2 machines and sorted 100 TB of data on disk in just 23 minutes. The previous record was held by MapReduce, which used 2100 machines and took 72 minutes. Spark did the same job as MapReduce, but about 3 times faster and on about one-tenth of the hardware. For more details, please refer to this article.
Spark needs a lot of memory. If we run Spark alongside other memory-demanding services, its performance could degrade. However, we can safely say that Spark has the upper hand in iterative processing (workloads that need to pass over the same data several times).
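The "3 times faster on 10 times fewer machines" comparison follows directly from the figures quoted above, and a few lines of arithmetic make it explicit. The sketch uses only those four numbers; the machine-minutes measure is an added illustration and ignores any differences between the machines used in the two runs.

```python
# Figures quoted in the text above.
SPARK_MACHINES, SPARK_MINUTES = 206, 23
MAPREDUCE_MACHINES, MAPREDUCE_MINUTES = 2100, 72

# How much faster the sort finished, and how much less hardware it used.
speedup = MAPREDUCE_MINUTES / SPARK_MINUTES
hardware_ratio = MAPREDUCE_MACHINES / SPARK_MACHINES

# Machine-minutes: total machine time spent on the whole sort.
spark_machine_minutes = SPARK_MACHINES * SPARK_MINUTES
mapreduce_machine_minutes = MAPREDUCE_MACHINES * MAPREDUCE_MINUTES
efficiency_ratio = mapreduce_machine_minutes / spark_machine_minutes

print(f"speedup: {speedup:.2f}x")
print(f"hardware ratio: {hardware_ratio:.1f}x")
print(f"machine-minutes ratio: {efficiency_ratio:.1f}x")
```

Taken together, the two ratios mean the Spark run spent roughly 32 times fewer machine-minutes on the same amount of data.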
The hardware requirements are very similar in terms of compute power, disk and network. Spark needs more memory to perform well. Both run on commodity servers.
Programming in MapReduce is an arduous task, and not many experts are available in the market. There are very few Spark experts too, but only because Spark is still young. It is easier to learn and code in Spark than in MapReduce.
Spark without Hadoop
Spark doesn’t actually require Hadoop to run. If we are not reading the data from HDFS, Spark can run on its own. There are many other storage systems, such as S3 and Cassandra, from which Spark can read and to which it can write data. Under this architecture, Spark runs in stand-alone mode and does not require Hadoop components in any way.
Recent studies have shown a surge in the adoption of Spark in production. Many are running Spark with Cassandra, Spark with Hadoop and Spark on Apache Mesos. Although Spark adoption has increased, it hasn’t caused any panic in the big data community. MapReduce usage may decline, but the rate at which it will decline is yet to be seen.
Many predict that Spark would facilitate the growth of another stack, which could be much more powerful. But this new stack would be very similar to that of Hadoop and its ecosystem of software packages.
Simplicity is the biggest advantage of Spark. But it is not going to eradicate MapReduce, as there are still use cases for it. Even if Spark is a big winner, unless there is a new distributed file system, we will be using Hadoop alongside Spark for a full big data package.