Eight breakthrough changes in Apache Flink 1.0.0

Published May 2, 2016

Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.

The Apache Flink community is pleased to announce the availability of the 1.0.0 release. The community put significant effort into improving and extending Apache Flink since the last release, focusing on improving the experience of writing and executing data stream processing pipelines in production.

The 1.0.0 release introduces several breaking API changes as well as many new and exciting features. The breaking changes are summarized below.

DataStream API

  • partitionByHash has been removed. Use keyBy instead.
  • Scala API: the parameter order of fold has changed.
  • The hash partitioner now scrambles key hashes with a murmur hash. This may break programs that rely on the previous hash output.
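To illustrate why scrambling matters, the sketch below applies a murmur-style bit mixer (the murmur3 32-bit finalizer) to a raw Java hash code and shows how the selected partition can change. This is an illustration of the technique only, assuming the standard fmix32 constants; it is not Flink's exact internal code, and `MurmurScramble`, `user-42`, and the parallelism of 4 are made-up example values.

```java
// Sketch of murmur-style hash scrambling (murmur3 fmix32 finalizer).
// Not Flink's actual implementation; for illustration only.
public class MurmurScramble {

    // fmix32: spreads the bits of a raw hash code so that similar
    // inputs land on well-distributed outputs.
    static int scramble(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        String key = "user-42";          // hypothetical record key
        int raw = key.hashCode();
        int scrambled = scramble(raw);
        int parallelism = 4;             // hypothetical operator parallelism

        // The chosen channel can differ between raw and scrambled hashes,
        // which is why programs relying on the old partitioning may break.
        System.out.println("raw partition:       " + Math.floorMod(raw, parallelism));
        System.out.println("scrambled partition: " + Math.floorMod(scrambled, parallelism));
    }
}
```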

DataSet API

  • The Combinable annotation has been removed. To make a GroupReduceFunction<IN, OUT> combinable, implement the CombineFunction<IN, IN> or GroupCombineFunction<IN, IN> interface in the GroupReduceFunction.
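The shape of the new pattern can be sketched as follows. Note that the interfaces below are simplified stand-ins written for this example; Flink's real GroupReduceFunction and GroupCombineFunction live in org.apache.flink.api.common.functions and use Iterable and Collector parameters. The point is only that the reduce function itself now implements the combine interface instead of carrying an annotation.

```java
import java.util.List;

// Simplified stand-ins for Flink's interfaces, to show the pattern only.
interface GroupReduceFunction<IN, OUT> {
    OUT reduce(List<IN> values);
}

interface GroupCombineFunction<IN, OUT> {
    OUT combine(List<IN> values);
}

// Instead of annotating the class as combinable, the reduce function
// now also implements the combine interface directly.
class SumReducer implements GroupReduceFunction<Integer, Integer>,
                            GroupCombineFunction<Integer, Integer> {

    @Override
    public Integer reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    @Override
    public Integer combine(List<Integer> values) {
        // Pre-aggregation step: for a sum, it is the same logic as reduce.
        return reduce(values);
    }
}
```

The combine step lets the runtime pre-aggregate groups locally before shuffling, which is why it takes and produces the input type (IN, IN) rather than the final output type.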

Gelly

  • The LabelPropagation library method now supports any Comparable type of label. It used to expect a Long value, so now users have to specify one extra type parameter when calling the method.
  • Gelly vertex-centric model has been renamed to scatter-gather. Graph’s runVertexCentricIteration() methods have been renamed to runScatterGatherIteration() and VertexCentricConfiguration has been renamed to ScatterGatherConfiguration.

Start/Stop scripts

The ./bin/start-webclient.sh and ./bin/stop-webclient.sh scripts have been removed. The webclient is now included in Flink’s web dashboard and activated by default. It can be disabled by configuring jobmanager.web.submit.enable: false in ./conf/flink-conf.yaml.

Backwards compatibility:

Flink 1.0 removes the hurdle of changing application code when new Flink versions are released: the public APIs are now stable across the 1.x series. This is huge for production users who want to maintain their business logic and applications while seamlessly benefiting from new patches in Flink.

Operational features:

Flink by now boasts very advanced monitoring capabilities (this release adds backpressure monitoring, checkpoint statistics, and the ability to submit jobs via the web interface). This release also adds savepoints, an essential feature (and unique in the open source world) that allows users to pause and resume applications without compromising result correctness and continuity.
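The savepoint workflow can be sketched with the 1.0 command-line client as follows. The job ID, savepoint path, and jar file are placeholders to be filled in from a running cluster; this assumes a deployed Flink 1.0 installation.

```
# Trigger a savepoint for a running job
bin/flink savepoint <jobID>

# Cancel the job, upgrade or redeploy as needed, then resume
# the application from the savepoint it left off at
bin/flink run -s <savepointPath> <jarFile>
```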

Battle-tested:

Flink is by now in production use at both large tech and Fortune Global 500 companies. A team at Twitter recently clocked Flink at 15 million events per second on a moderately sized cluster.

Integrated:

Flink has always been integrated with the most popular open source tools, such as Hadoop (HDFS, YARN), Kafka (this release adds full support for Kafka 0.9), HBase, and others. Flink also features compatibility packages and runners, so that it can be used as an execution engine for programs written in MapReduce, Apache Storm, Cascading, and Apache Beam (incubating).

This article originally appeared here. Republished with permission.