Get ignited with Apache Spark – Part 1

Published November 9, 2015

Spark began life in 2009 as a project within the AMPLab at the University of California, Berkeley. More specifically, it was born out of the necessity to prove out the concept of Mesos, which was also created in the AMPLab. Spark was first discussed in the Mesos white paper titled Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, written most notably by Benjamin Hindman and Matei Zaharia.

It emerged as a fast and convenient solution for performing complex analysis of large-scale data, and it evolved into a new processing framework for big data that addresses many of the shortcomings of the MapReduce model. It supports large-scale data analysis over data from different sources, both real-time and batch, and in various formats such as images, text, graphs, and many more. In addition to the Spark core, it provides a useful set of libraries for big data analytics, such as Spark SQL, Spark Streaming, MLlib, and GraphX.

Overview of Spark Components

The driver is the code that includes the main function and defines the resilient distributed datasets (RDDs) and their transformations. RDDs are the main data structures used in Spark programs, as in the sketch below.
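As a rough sketch (Scala assumed; the application name, master URL, and data are illustrative), a driver program defines RDDs and transformations and then triggers them with an action:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // Configure and create the SparkContext (app name and master are illustrative).
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Define an RDD and some transformations; nothing executes yet (they are lazy).
    val numbers = sc.parallelize(1 to 100)
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n.toLong * n)

    // An action triggers execution and returns a result to the driver.
    println(squares.count())

    sc.stop()
  }
}
```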
Parallel operations on the RDDs are sent to the DAG scheduler, which optimizes the code and arrives at an efficient DAG representing the data-processing steps of the application.
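For illustration, an RDD's toDebugString prints the lineage that the DAG scheduler works from, with new stages introduced at shuffle boundaries. This small sketch reuses the SparkContext sc from the example above:

```scala
// Chained transformations build up a lineage; the DAG scheduler splits it
// into stages at shuffle boundaries (sketch, reusing the sc created above).
val grouped = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ > 50)
  .groupBy(_ % 10)   // shuffle boundary: the DAG scheduler starts a new stage here

// toDebugString prints the RDD lineage the DAG scheduler works from.
println(grouped.toDebugString)
```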
The resulting DAG is sent to the cluster manager. The cluster manager has information about the workers, assigned threads, and the location of data blocks, and it is responsible for assigning specific processing tasks to workers. It also handles replaying work if a worker fails. The cluster manager can be YARN, Mesos, or Spark's standalone cluster manager.
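The choice of cluster manager is typically expressed through the master URL when the application is configured or submitted. The sketch below shows the standard forms (host names and ports are placeholders):

```scala
import org.apache.spark.SparkConf

// The master URL selects the cluster manager (host names and ports are placeholders).
val localConf      = new SparkConf().setMaster("local[*]")             // run locally with all cores
val standaloneConf = new SparkConf().setMaster("spark://master:7077")  // Spark's standalone cluster manager
val mesosConf      = new SparkConf().setMaster("mesos://master:5050")  // Apache Mesos
val yarnConf       = new SparkConf().setMaster("yarn-client")          // Hadoop YARN (client mode; later versions use just "yarn")
```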
The worker receives units of work and data to manage. It executes its specific task without knowledge of the entire DAG, and its results are sent back to the driver application.
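To make that last point concrete, it is the action that pulls results from the workers back to the driver. A tiny sketch, again reusing the sc from the first example:

```scala
// Workers run their tasks on partitions of the data; an action such as take()
// gathers results back to the driver (sketch, reusing the sc created above).
val words    = sc.parallelize(Seq("spark", "mesos", "yarn", "hadoop"))
val upper    = words.map(_.toUpperCase)   // executed on the workers
val firstTwo = upper.take(2)              // results are returned to the driver
firstTwo.foreach(println)
```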
Spark, like other big data tools, is powerful, capable, and well-suited to tackling a range of data challenges. Like other big data technologies, though, it is not necessarily the best choice for every data processing task.
In Part 2, we will discuss basic Spark concepts such as resilient distributed datasets, shared variables, SparkContext, transformations, and actions, along with examples, the advantages of using Spark, and when to use it.
Reference:
Learn Spark in a Day by Acodemy, and Hadoop Application Architectures.