Top ten pointers in the new Apache Spark release (version 1.6)

Hadoop | Published January 11, 2016

To kick off 2016, the Apache Spark community has launched Apache Spark 1.6.

Contributors – Apache Spark now has around 1,000 contributors, roughly double the figure from the previous year.

Patches – The Apache Spark 1.6 release includes more than 1,000 patches.

Run SQL queries on files – This feature lets users and applications run SQL queries on files directly, without first registering a table, similar to what Apache Drill offers. For example: SELECT id FROM json.`path/to/json/files` AS j.
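A minimal sketch of how this might look from Scala, assuming an existing sqlContext and a hypothetical directory of JSON files at path/to/json/files:

// The `json` prefix names the data source; the backticked path replaces a table name.
val df = sqlContext.sql("SELECT id FROM json.`path/to/json/files` AS j")
df.show()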

Star (*) expansion for StructTypes – This feature makes it easier to nest and unnest arbitrary numbers of columns. It is common to do regular extractions of updated data from an external source (e.g. MySQL or Postgres) and then want only the most recent record for each key. Spark 1.6 adds small improvements to the analyzer so that the star can be expanded inside a struct: you can pack all of a row's columns into a struct in a group-by query to pick the latest record per key, and then unnest that struct back into top-level columns, in both SQL and the DataFrame API.
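A hedged sketch of this pattern, assuming a hypothetical updates table with a key column and an update_time column (the table and column names are illustrative only):

// Pick the most recent record per key by packing every column into a struct
// led by update_time, then taking the max of that struct.
val latest = sqlContext.sql(
  """SELECT key, max(struct(update_time, *)) AS latest
    |FROM updates
    |GROUP BY key""".stripMargin)
latest.registerTempTable("latest_per_key")

// Unnest the struct back into top-level columns with star expansion.
val unnested = sqlContext.sql("SELECT key, latest.* FROM latest_per_key")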

Parquet Performance – Parquet has been one of the most commonly used data formats with Apache Spark, so Parquet scan performance has a significant impact on many large applications. Before this version, Spark relied on parquet-mr to read and decode Parquet files, and much of the time was spent in record assembly, the process that reconstructs records from Parquet columns. Spark 1.6 introduces a new Parquet reader that bypasses parquet-mr's record assembly and uses a more optimized code path for flat schemas. In benchmarks, the new reader delivered roughly a 50% improvement in scan throughput.

Automatic Memory Management – In versions of Apache Spark older than 1.6, available memory is split into two static regions: execution memory and cache memory. Execution memory is used for sorting, hashing, and shuffling, while cache memory is used to cache hot data. Spark 1.6 introduces a new memory manager that automatically tunes the size of these regions: the runtime grows and shrinks them according to the needs of the executing application. As a result, many applications will see benefits for operators like joins and aggregations without any user optimization or tuning.
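For reference, a hedged sketch of the knobs around the new unified memory manager; the property names match the Spark 1.6 configuration documentation, but the values here are purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// The unified manager shares one pool between execution and storage.
val conf = new SparkConf()
  .setAppName("unified-memory-demo")
  // Fraction of the heap used for execution + storage (defaults to 0.75 in 1.6).
  .set("spark.memory.fraction", "0.75")
  // Portion of that pool protected for storage; execution can borrow the rest.
  .set("spark.memory.storageFraction", "0.5")
  // Set to true to fall back to the pre-1.6 static split.
  .set("spark.memory.useLegacyMode", "false")

val sc = new SparkContext(conf)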

Streaming State Management – State management is a vital function in streaming applications, often used to maintain aggregations or session information. Apache Spark 1.6 introduces a new mapWithState API that scales linearly with the number of updates rather than the total number of records, because it works with deltas instead of always requiring a full scan over the state data. For many stateful workloads this translates into a significant performance improvement.
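A minimal sketch of a running word count using mapWithState, assuming a socket text stream on localhost:9999 and a local checkpoint directory (both hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("MapWithStateDemo")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("/tmp/mapwithstate-checkpoint")   // required for stateful streams

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Only keys with new data in a batch invoke this function, which is what
// lets mapWithState scale with the number of updates rather than total state size.
def updateCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val runningCounts = pairs.mapWithState(StateSpec.function(updateCount _))
runningCounts.print()

ssc.start()
ssc.awaitTermination()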

Spark Datasets – One shortcoming of Apache Spark versions before 1.6 is the lack of support for compile-time type safety. To address this, the Spark 1.6 team introduced a typed extension of the DataFrame API called Datasets. The Dataset API extends the DataFrame API to support static typing and user functions that run directly on existing Scala or Java types. Compared with the traditional RDD API, Datasets provide better memory management and, over the long term, better performance.
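A hedged sketch of the 1.6 Dataset API in Scala, assuming an existing sqlContext and a hypothetical people.json file with name and age fields:

import sqlContext.implicits._   // brings in the encoders that Datasets need

case class Person(name: String, age: Long)

// Convert an untyped DataFrame into a strongly typed Dataset[Person].
val people = sqlContext.read.json("people.json").as[Person]

// Lambdas operate on Person objects, so field typos are caught at compile time.
val adultNames = people.filter(_.age >= 18).map(_.name)
adultNames.show()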

Machine Learning Pipeline Persistence – Many machine learning applications leverage Spark's ML pipeline feature to construct learning pipelines. Before Spark 1.6, users had to write custom persistence code to store a pipeline externally. In Spark 1.6, the pipeline API offers functionality to save a pipeline, reload it from its previous state, and apply previously built models to new data later.
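A hedged sketch of saving and reloading a fitted pipeline, assuming hypothetical training and testData DataFrames with text and label columns, and an illustrative output path:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Persist the fitted model, then load it back later, even in a different job.
model.save("/tmp/text-classifier-model")
val reloaded = PipelineModel.load("/tmp/text-classifier-model")
val predictions = reloaded.transform(testData)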

Addition of New Algorithms – The Apache Spark 1.6 release increases algorithm coverage in machine learning, adding univariate and bivariate statistics in DataFrames, survival analysis, the normal equation for least squares, bisecting k-means clustering, online hypothesis testing, latent Dirichlet allocation (LDA), R-like statistics, feature interactions in R formulas, instance weights, a LIBSVM data source, and better handling of non-standard JSON data.
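As one small example, a hedged sketch combining the new LIBSVM data source with bisecting k-means, assuming an existing sqlContext and a hypothetical sample_libsvm_data.txt file:

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vector

// Load LIBSVM-formatted data through the new DataFrame data source.
val data = sqlContext.read.format("libsvm").load("sample_libsvm_data.txt")

// Bisecting k-means (in mllib) works on RDDs of vectors.
val vectors = data.select("features").rdd.map(row => row.getAs[Vector](0))
val model = new BisectingKMeans().setK(4).run(vectors)
model.clusterCenters.foreach(println)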

Reference – databricks.com, issues.apache.org, Big Data Analytics Community.
