Get ignited by Apache Spark – Part 2

Published November 16, 2015

Thanks for your time; I will try to make good use of it. In Part 1, we discussed the Apache Spark libraries and Spark components such as the Driver, DAG Scheduler, Task Scheduler, and Worker. In Part 2, we cover the basic Spark concepts: Resilient Distributed Datasets, shared variables, SparkContext, transformations, actions, and the advantages of using Spark, along with examples and when to use Spark.

RDD – Resilient distributed datasets

An RDD is a collection of serializable elements. The collection may be partitioned, in which case it is stored across multiple nodes.

An RDD may reside in memory or on disk.

Spark uses RDDs to reduce I/O by keeping processed data in memory.

RDDs help tolerate node failures, so the whole process or computation need not be restarted.

Typically, an RDD is created from a Hadoop input format or from a transformation applied to an existing RDD.

Each RDD stores its data lineage; if data is lost, Spark replays the lineage to rebuild the lost data.

RDDs are immutable.
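
The lineage and immutability ideas above can be illustrated with a small sketch. This is a toy model in plain Python, not Spark's API: the `ToyRDD` class, its fields, and its `compute` method are hypothetical names chosen for illustration. Each dataset is frozen after creation and only remembers its parent and the function that derived it, so a lost result can be rebuilt by replaying that chain.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class ToyRDD:
    """Hypothetical stand-in for an RDD; NOT part of Spark's API."""
    source: Tuple[int, ...] = ()                 # base data, only set on the root
    parent: Optional["ToyRDD"] = None            # lineage: where the data came from
    fn: Optional[Callable[[int], int]] = None    # lineage: how it was derived

    def map(self, fn):
        # Return a NEW dataset that records its parent and the function.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # Replay the lineage from the base data up to this dataset.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]


base = ToyRDD(source=(1, 2, 3))
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

cached = plus_one.compute()    # first materialization
cached = None                  # simulate losing the in-memory result
rebuilt = plus_one.compute()   # rebuilt purely from the recorded lineage
```

Real Spark tracks lineage per partition and recomputes only what was lost; the sketch recomputes everything to keep the idea visible.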

Shared variables

Spark has two types of variables that allow sharing information between the execution nodes.

The two types are broadcast variables and accumulators.

Broadcast variables are read-only values sent to all remote execution nodes, similar to MapReduce Configuration objects.

Accumulators are also sent to all remote execution nodes, with the limitation that tasks can only add to them, similar to MapReduce counters.
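
As a rough illustration of the two shared-variable types, here is a toy model in plain Python. It does not use Spark: `ToyAccumulator`, `run_task`, and the sample data are hypothetical. A read-only mapping stands in for a broadcast variable, and an add-only counter stands in for an accumulator. In PySpark, the corresponding objects are created with `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
from types import MappingProxyType


class ToyAccumulator:
    """Hypothetical add-only counter; models the restriction, not Spark's class."""

    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):
        # Tasks may only add; there is deliberately no setter.
        self._value += amount

    @property
    def value(self):
        # Read back on the "driver" side.
        return self._value


def run_task(partition, lookup, bad_records):
    # 'lookup' plays the role of a broadcast variable: shared and read-only.
    out = []
    for code in partition:
        if code in lookup:
            out.append(lookup[code])
        else:
            bad_records.add(1)  # count unknown codes, like a MapReduce counter
    return out


# MappingProxyType gives a read-only view, mimicking broadcast semantics.
country_names = MappingProxyType({"IN": "India", "US": "United States"})
bad_records = ToyAccumulator()

partitions = [["IN", "US"], ["XX", "IN"]]  # two pretend partitions
results = [run_task(p, country_names, bad_records) for p in partitions]
```

After the tasks run, `results` holds the translated names per partition and `bad_records.value` reports how many codes had no match.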

Spark context

SparkContext is an object that represents the connection to a Spark cluster.

It is used to create RDDs, broadcast data, and initialize accumulators.

Transformations

Transformations are functions that take one RDD and return another.

Transformations never modify their input; they return a new, modified RDD.

Transformations are always lazy: they do not compute their results immediately. Instead, calling a transformation function only creates a new RDD.

The whole chain of transformations is executed when an action is called.

Spark offers many transformations, including map(), filter(), keyBy(), join(), groupByKey(), and sortBy().

Action

Actions are methods that take an RDD, perform a computation, and return the result to the driver application.

Actions trigger the computation of transformations; the result can be a collection, values printed to the screen, or values saved to a file.

An action never returns an RDD.
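
The interplay between lazy transformations and actions can be sketched with a toy model in plain Python. The `LazyDataset` class and the `calls` list are hypothetical and only imitate the behavior described here; they are not Spark's API. Transformations merely record a step and return a new dataset, while the action `collect()` runs the recorded steps and returns an ordinary list.

```python
class LazyDataset:
    """Hypothetical toy model of lazy transformations; NOT Spark's API."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # recorded, not yet executed

    # --- transformations: record a step, return a NEW dataset ---
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    # --- actions: run every recorded step, return a plain value ---
    def collect(self):
        items = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())


calls = []  # records every time the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # nothing has executed yet
result = evens_squared.collect()   # the action triggers the whole chain
```

Here `calls_before_action` stays at zero because `square` only runs once `collect()` is called, and `numbers` itself is left unchanged by the chained transformations.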

Benefits

Simplicity
Versatility
Reduced disk I/O
Storage
Multilanguage
Resource manager independence
Interactive shell (REPL)

Spark, like other big data tools, is powerful, capable, and well suited to tackling a range of analytics and big data challenges.

This article originally appeared here. Republished with permission.
