Framework of an Apache Spark Job Run!

NoSQL | Published November 30, 2015

The big data analytics community has started to use Apache Spark in full swing for big data processing. The processing could be for ad-hoc queries, prebuilt queries, graph processing, machine learning, and even data streaming.

Hence an understanding of Spark job submission is vital for the community. To that end, I am happy to share the learnings on the steps involved in an Apache Spark job submission.

Basically, the job run involves the following steps: job submission, DAG construction, task scheduling, and task execution.

Job Submission

A Spark job is submitted automatically when an action such as count() is performed on an RDD, as shown in the sketch below.
Internally, this calls runJob() on the SparkContext, which then passes the call on to the scheduler that runs as part of the driver.
The scheduler is made up of two parts: the DAG scheduler and the task scheduler.
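
As a rough sketch of this flow (assuming Spark's Scala API and a local master; the application and variable names are made up for illustration), the count() action below is what implicitly drives runJob() on the SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobSubmissionSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; the application name is arbitrary.
    val conf = new SparkConf().setAppName("job-submission-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000)

    // count() is an action: internally it calls sc.runJob(...) on this RDD,
    // which hands the job to the scheduler running as part of the driver.
    val n = rdd.count()
    println(s"Counted $n elements")

    sc.stop()
  }
}
```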

DAG Construction

When it comes to DAG construction, there are two types of Spark jobs:

  • A simple Spark job is one that does not need a shuffle and therefore has just one single stage composed of result tasks, like a map-only job in MapReduce.
  • A complex Spark job involves grouping operations and requires one or more shuffle stages (see the sketch after this list).
  • For such a job, Spark’s DAG scheduler turns the job into two stages: a shuffle map stage followed by a result stage.
  • The DAG scheduler is responsible for splitting a stage into tasks for submission to the task scheduler.
  • Each task is given a placement preference by the DAG scheduler to allow the task scheduler to take advantage of data locality.
  • Child stages are only submitted once their parents have completed successfully.
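
To make the simple/complex distinction concrete, here is a hedged sketch (the dataset and names are invented for illustration): a map-only job that stays in a single result stage, and a grouping job whose shuffle causes the DAG scheduler to build two stages:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DagConstructionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dag-construction-sketch").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

    // Simple job: map + count needs no shuffle, so the DAG scheduler
    // builds one single stage made up of result tasks.
    val upperCount = words.map(_.toUpperCase).count()

    // Complex job: reduceByKey needs a shuffle, so the DAG scheduler
    // builds two stages: a shuffle map stage followed by a result stage.
    val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _).collect()

    println(s"upperCount=$upperCount, wordCounts=${wordCounts.toSeq}")
    sc.stop()
  }
}
```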

Task Scheduling

  • When the task scheduler is sent a set of tasks, it uses its list of executors that are running for the application and constructs a mapping of tasks to executors that takes placement preferences into account.
  • The task scheduler assigns tasks to executors that have free cores. Each task is allocated one core by default; this can be changed with the spark.task.cpus parameter (see the sketch after this list).
  • Spark uses Akka, an actor-based platform for building highly scalable, event-driven distributed applications.
  • Spark does not use Hadoop RPC for remote calls.
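
As referenced in the list above, here is a small sketch of adjusting the per-task core allocation via spark.task.cpus (the specific values are illustrative only, not recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TaskCpusSketch {
  def main(args: Array[String]): Unit = {
    // spark.task.cpus defaults to 1; raising it reserves more cores per task,
    // so fewer tasks run concurrently on each executor.
    val conf = new SparkConf()
      .setAppName("task-cpus-sketch")
      .setMaster("local[4]")        // illustrative: a local master with 4 cores
      .set("spark.task.cpus", "2")  // each task now occupies 2 cores

    val sc = new SparkContext(conf)
    val total = sc.parallelize(1 to 100, numSlices = 4).reduce(_ + _)
    println(s"total = $total")
    sc.stop()
  }
}
```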

Task Execution

An executor runs a task as follows:

  • It makes sure that the JAR and file dependencies for the task are up to date.
  • It deserializes the task code.
  • The task code is executed.
  • The task returns results to the driver, which assembles them into a final result to return to the user (see the sketch after this list).
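
As a small sketch of the last point (the names are made up, and runJob() is normally invoked indirectly by actions rather than called by user code), this shows the per-task results arriving at the driver and being assembled into the final value:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ResultAssemblySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("result-assembly-sketch").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 12, numSlices = 3)

    // Each task returns the size of its own partition to the driver...
    val perTask: Array[Long] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)

    // ...and the driver assembles the per-task results into the final answer,
    // which is essentially what count() does internally.
    val total = perTask.sum
    println(s"per-task results: ${perTask.toSeq}, assembled total: $total")

    sc.stop()
  }
}
```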

Reference

  • Hadoop: The Definitive Guide
  • Analytics & Big Data Open Source Community
