10 new exciting features of Apache Hive 2.0.0

Published April 25, 2016

The Apache Hive community has announced the availability of Apache Hive 2.0.0, its largest release so far. It brings exciting improvements in new functionality, performance, optimization, security, and usability. Let us explore the features in detail below.

HBase to store Hive Metadata

The current metastore implementation is slow when tables have thousands of partitions or more. With the Tez and Spark engines we are pushing Hive to a point where queries take only a few seconds to run, yet planning a query can take as long as running it, and much of that time is spent in metadata operations. Because of scale limitations, we have never allowed tasks to communicate directly with the metastore. With the development of LLAP, however, this restriction will have to be relaxed, and once it is relaxed other use cases could benefit as well. There is also the matter of eating our own dog food: rather than using external systems to store our metadata, there are benefits to using other components in the Hadoop ecosystem, such as HBase.

Long-lived daemons for query fragment execution, I/O and caching

LLAP (Live Long and Process) is the new hybrid execution model that enables efficiencies across queries, such as caching of columnar data, JIT-friendly operator pipelines, and reduced overhead for multiple queries (including concurrent queries), as well as new performance features like asynchronous I/O, pre-fetching, and multi-threaded processing. The hybrid model consists of a long-lived service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. The first version of LLAP ships in the Hive 2.0 release. The component has been extensively exercised on test and live clusters, but it is expected to have rough edges in this initial release. The current limitations are: it is supported with Tez only; it does not support ACID tables; and the I/O elevator and cache support only the ORC format and vectorized execution.

HPL/SQL – Implementing Procedural SQL in Hive

This release includes the PL/HQL tool (www.plhql.org), which implements procedural SQL for Hive (and, in fact, for any SQL-on-Hadoop implementation and any JDBC source). Alan Gates offered to contribute it to Hive under the HPL/SQL name (org.apache.hive.hplsql package). It should be very helpful for the SQL community.

Hive on Spark Container

When a Hive job is launched by Oozie, a Hive session is created and the job script is executed. The session is closed when the Hive job completes. Thus, a Hive session is not shared among Hive jobs, either within an Oozie workflow or across workflows. Since the parallelism of a Hive job executed on Spark depends on the available executors, such Hive jobs suffer from executor ramp-up overhead. The idea here is to wait a bit so that enough executors can come up before a job is executed.
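
The "wait a bit" idea can be pictured as a bounded polling loop. The Python sketch below is only an illustration, not Hive's implementation: the function name, the get_executor_count callback, and the timeout values are all hypothetical.

```python
import time


def wait_for_executors(get_executor_count, wanted, timeout_s=5.0, poll_s=0.1):
    """Poll until `wanted` executors are up or `timeout_s` has elapsed.

    Returns the last executor count seen. `get_executor_count` is a
    hypothetical callback standing in for whatever the cluster exposes.
    """
    deadline = time.monotonic() + timeout_s
    count = get_executor_count()
    while count < wanted and time.monotonic() < deadline:
        time.sleep(poll_s)
        count = get_executor_count()
    return count


# Simulated ramp-up: each poll sees more executors than the last.
ramp = iter([1, 2, 4, 8])
ready = wait_for_executors(lambda: next(ramp), wanted=4, poll_s=0.0)
print(ready)  # 4
```

The timeout matters: if the cluster never grants enough executors, the job still starts with whatever is available instead of blocking forever.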

Hive-on-Spark parallel ORDER BY

This is one of the greatest features. Previously, to get globally sorted output we had to manually force the reducer count to 1 so that the result ended up in a single file. With this feature we no longer have to force the number of reducers to 1, because Spark supports parallel sorting.
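
A common way to sort in parallel while still producing a total order is range partitioning: sample the keys, pick boundaries, send each record to the partition that covers its key range, and sort the partitions independently. The Python sketch below illustrates that general technique under simplified assumptions (in-memory lists, threads standing in for reducers). The function name and parameters are made up for the example, and this is not the Hive or Spark code itself.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(records, num_partitions=4, sample_size=100, seed=0):
    """Total-order sort: sample, range-partition, then sort each partition."""
    if not records:
        return []
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Pick num_partitions - 1 boundary keys at evenly spaced sample positions.
    bounds = [sample[len(sample) * i // num_partitions]
              for i in range(1, num_partitions)]
    # Route each record to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(bounds, record)].append(record)
    # Each partition is sorted independently, like one reducer per range.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        sorted_parts = list(pool.map(sorted, partitions))
    # Every key in partition i is <= every key in partition i + 1,
    # so concatenating the sorted partitions yields a fully sorted result.
    return [record for part in sorted_parts for record in part]


print(parallel_sort([5, 3, 9, 1, 7, 2, 8], num_partitions=3))
```

Because the partitions cover disjoint, ordered key ranges, the output files can simply be read in partition order; no single reducer has to see all the data.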

Dynamic Partition Pruning

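Tez already implements dynamic partition pruning, which skips partitions of one table at runtime based on the values produced by the other side of a join. It is a nice optimization, and the same is implemented for Hive on Spark (HOS).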
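
To make the idea concrete, here is a minimal Python sketch of dynamic partition pruning for a join between a partitioned fact table and a small filtered dimension table. The data layout (a dict of partitions), the function name, and the sample tables are assumptions made for illustration, not Hive internals.

```python
def dynamic_partition_prune(fact_partitions, dim_rows, dim_filter, join_key):
    """Return only the fact partitions that can match the filtered dimension rows.

    fact_partitions: dict mapping a partition value to the rows stored in it
    dim_rows: list of dicts, the small side of the join
    """
    # Step 1: evaluate the small side first and collect the surviving join keys.
    live_keys = {row[join_key] for row in dim_rows if dim_filter(row)}
    # Step 2: scan only the partitions whose value appears in that set.
    return {p: rows for p, rows in fact_partitions.items() if p in live_keys}


# Hypothetical tables: sales is partitioned by date_id.
sales = {1: ["s1", "s2"], 2: ["s3"], 3: ["s4"]}
dates = [
    {"date_id": 1, "year": 2015},
    {"date_id": 2, "year": 2016},
    {"date_id": 3, "year": 2016},
]
pruned = dynamic_partition_prune(sales, dates, lambda d: d["year"] == 2016, "date_id")
print(sorted(pruned))  # [2, 3]
```

The set of partitions to read is not known when the query is compiled, only once the dimension filter has run, which is why the pruning is called dynamic.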

Hive-on-Spark Self Union/Join

A Hive query may scan the same table multiple times, as in a self-join or self-union, or may even reuse the same subquery. As you may know, Spark supports caching RDD data: Spark keeps the computed RDD in memory and reads it from there the next time it is needed. This avoids the cost of recomputing the RDD (and all of its dependencies) at the price of more memory usage. By analyzing the query context, we should be able to work out which parts of a query can be shared, so that the cached RDD can be reused in the generated Spark job. Likewise, a repeated subquery is translated into separate but equivalent Works in the SparkWork; combining these equivalent Works into a single one lets the query benefit from the dynamic RDD caching optimization.
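
The effect of merging equivalent Works and caching their result can be sketched in a few lines of Python. Here a Work is represented as a hashable tuple, so two equivalent Works compare equal; that representation, the function names, and the sample table are assumptions for illustration only.

```python
def run_works(works, compute):
    """Run each distinct Work once and reuse the cached result for duplicates."""
    cache = {}
    results = []
    for work in works:
        if work not in cache:  # equivalent Works share one cache entry
            cache[work] = compute(work)
        results.append(cache[work])
    return results


TABLES = {"src": [1, 2, 3, 4]}  # hypothetical table contents
scans = []  # records every real table scan


def compute(work):
    _, table, threshold = work
    scans.append(table)
    return [v for v in TABLES[table] if v > threshold]


# A self-union: both branches are the same filtered scan of src.
branches = [("filter_scan", "src", 2), ("filter_scan", "src", 2)]
left, right = run_works(branches, compute)
union_all = left + right
print(union_all, len(scans))  # [3, 4, 3, 4] 1
```

Both branches of the union get their rows, but the table is scanned only once; the trade-off, as noted above, is the memory held by the cached result.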

CBO enhancements

This is another nice enhancement. Previously, TPC-DS query 51 failed with the error "Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies." This release fixes that failure.

Apache Parquet predicate pushdown

Previously, filtering a Parquet table on a partition column made the query fail with an error saying that the column does not exist. The message is technically accurate, since a partition column is not part of the Parquet schema. The fix is to remove such expressions from the Parquet predicate pushdown (PPD). This helps a lot with performance and optimization.
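
The fix amounts to filtering the predicate before it is handed to the Parquet reader. The Python sketch below shows the idea for a flat list of AND-ed comparisons; the tuple representation, function name, and column names are hypothetical and do not reflect Hive's actual expression classes.

```python
def strip_partition_predicates(conjuncts, parquet_schema):
    """Keep only the AND-ed predicates whose column exists in the file schema."""
    return [(col, op, val) for (col, op, val) in conjuncts if col in parquet_schema]


# Hypothetical table: columns id and name live in the Parquet files,
# while ds is a partition column that exists only in the directory layout.
file_schema = {"id", "name"}
where_clause = [("ds", "=", "2016-01-01"), ("id", ">", 10)]

pushed_down = strip_partition_predicates(where_clause, file_schema)
print(pushed_down)  # [('id', '>', 10)]
```

Dropping a term from a conjunction only makes the pushed-down filter less selective, so the reader may return extra rows but never misses any; the partition condition is still honored by partition pruning. The sketch deliberately handles only AND, because removing one side of an OR would not be safe in the same way.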

Codahale-based metrics

Hive already has a metrics system that hooks up to JMX reporting, but all of its measurements and models are custom. This feature adds another metrics system based on Codahale (also known as Yammer or Dropwizard Metrics), which has the following advantages: a well-defined metric model for frequently needed metrics (e.g. JVM metrics), well-defined measurements for all metrics (e.g. max, mean, stddev, mean_rate), and built-in reporting frameworks such as JMX, console, log, and a JSON web server. It is used by many projects, including several Apache projects such as Oozie. Overall, monitoring tools should find these common metric, measurement, and reporting models easier to understand. The existing metrics subsystem is kept and can be enabled if backward compatibility is desired.
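
To show what a well-defined metric model with standard measurements looks like, here is a toy registry in Python. It only mirrors the concept: the class and metric names are invented for the example, and it is not the Dropwizard Metrics Java API or Hive's integration with it.

```python
import statistics


class Histogram:
    """Collects values and reports the same measurements for every metric."""

    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(value)

    def snapshot(self):
        values = self._values
        return {
            "count": len(values),
            "max": max(values) if values else 0,
            "mean": statistics.mean(values) if values else 0,
            # Sample standard deviation needs at least two data points.
            "stddev": statistics.stdev(values) if len(values) > 1 else 0,
        }


class MetricRegistry:
    """Named metrics in one place, so any reporter can dump them uniformly."""

    def __init__(self):
        self._histograms = {}

    def histogram(self, name):
        if name not in self._histograms:
            self._histograms[name] = Histogram()
        return self._histograms[name]

    def report(self):
        return {name: h.snapshot() for name, h in self._histograms.items()}


registry = MetricRegistry()
for millis in (10, 20, 30):
    registry.histogram("query_compile_ms").update(millis)  # hypothetical metric name
print(registry.report())
```

Because every metric exposes the same fields, a reporter (JMX, console, JSON) can be written once against the registry rather than once per custom measurement.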

This article originally appeared here. Republished with permission.