This blog introduces Spark’s core abstraction for working with data, the RDD (Resilient Distributed Dataset). An RDD is simply a distributed collection of elements or objects (Java, Scala, Python, and user defined functions) across the Spark cluster. In Spark, all work is expressed in three ways as follows,
- Creating new RDDs
- Transforming existing RDDs
- Calling operations on RDDs to compute a result
RDD in Spark is simple an immutable distributed collection of objects, each split into multiple partitions. We create RDDs in two ways as like,
- By loading an external dataset
- By distributing a collection of objects in their driver program
Once the RDDs are created it offers two types of operations such as,
Transformations construct a new RDD from a previous one [filter, map, groupBy] and Actions on other hands compute result based on an RDD either it return to driver program or save it to an external storage system (HDFS, S3, Cassandra, HBase, etc.,) [first, count, collect, save].
Transformations and actions are different because of the way Spark computes RDDs, as we can able to define the new RDDs any time. Spark computes them only in a lazy fashion that is nothing but when used first time in action.
Finally, the RDDs are by default recomputed each time when we run an action on them, if we want to use multiple times then we can ask Spark to persist by using RDD.persist().
Listed are the number of ways and options to use for persisting RDD in Spark and if we wanted to replicate the data on two machines then we need to add _2 at the end of storage level. In production practices, we will often use persist () to load the subset of the data into memory which could be query frequently. And, the cache() is the same calling persists() with default storage level.
Just to summarize every Spark program will work as follows,
- Create some input RDDs from external data
- Transform RDDs to define new RDDs using transformations like filter()
- Use persist () to persist an intermediate RDDs which will be reused
- Launch actions such as count(), first() to kick-start the parallel computation
And to conclude RDDs are Immutable, portioned collections of objects spread across a cluster, stored in RAM or on disk, built through lazy parallel transformations, and automatically rebuilt on failure. In Part 2 – we will be sharing internal details of Transformations & Actions of RDDs and benefits of Lazy Evaluation.
Reference – Big Data Analytics Community, Learning Spark: Karau, Konwinski, Wendell, Zaharia.