Now our the big data analytics community has started to use Apache Spark in full-swing for big data processing. The processing could for ad-hoc queries, prebuilt queries, graph processing, machine learning, and even for the data streaming.
Hence the understanding of Spark Job Submission is very vital for community. Extend to that happy to share with you the learnings of the steps involved in the Apache Spark Job Submission.
Basically it has two steps,
Spark job is submitted automatically when an actions like count () is performed on an RDD.
Internally runJob() to be called on the SparkContext and then call on to the scheduler that runs as a part of the deriver.
The scheduler is made up of 2 parts – DAG Scheduler and Task Scheduler.
There are two types of DAG constructions,
- Simple Spark job is one that does not need a shuffle and therefor has just one single stage composed of result tasks, like map-only job in MapReduce
- Complex Spark job involves grouping operations and require one or more shuffle stages.
- Spark’s DAG scheduler turns job into two stages.
- DAG scheduler is responsible for splitting a stage into tasks for submission to the task scheduler.
- Each task is given a placement preference by the DAG scheduler to allow the task scheduler to take advantage of data locality.
- Child stages are only submitted once their parents have completed successfully.
- Task scheduler will sent a set of tasks; it uses its list of executors that are running for the application and constructs a mapping of tasks to executors that takes placement preferences into account.
- Task scheduler assigns to executors that have free cores, each task is allocated one core by default. It can be changed by spark.task.cpus parameter.
- Spark uses Akka, which is actor based platform for building highly-scalable event-driven distributed applications.
- Spark doesn’t use the Hadoop RPC for remote calls.
An executor runs a task as follows,
- It makes sure that the JAR and file dependences for the task are up-to-date.
- De-serializes the task code.
- Task code is executed.
- Task returns results to the driver, which assembles into a final result to return to the user.
- The Hadoop Definitive Guide
- Analytics & Big Data Open Source Community