This blog introduces the convergence of complementary technologies – Spark, Mesos, Akka, Cassandra and Kafka (SMACK) stack. And we will see how Apache Kafka can help us to get data under control and what is it role in our data pipeline, how Spark & Akka help us to process the data, and how Cassandra to store data. Also we will look what is Mesos a cluster manager.
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
Speed, Ease of use, and a unified engine are three core benefits of Apache Spark. Keep Reading…
Apache Mesos: It is a distributed systems kernel and Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elastic Search) with APIs for resource management and scheduling across entire datacenter and cloud environments.
It has rich features like,
- Scalability to 10,000s of nodes
- Fault-tolerant replicated master and slaves using ZooKeeper
- Support for Docker containers
- Native isolation between tasks with Linux Containers
- Multi-resource scheduling (memory, CPU, disk, and ports)
- Java, Python and C++ APIs for developing new parallel application
- Web UI for viewing cluster state. Keep Reading…
Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. Akka was designed to enable developers to easily build reactive applications using a high level of abstraction. It does so in a very natural and simple way, without having to deal with low-level concepts like thread pools, mutexes, and deadlocks.
It does so by leveraging the Actor Model of concurrency and fault-tolerance. This is a powerful model that allows the behavior and state of the application to be encapsulated and modeled as an actor. The key principle behind an actor is that the application only interacts with it through messages and never talks with it directly. This isolation allows Akka to manage the currency of the actor.
It has rich set of features like,
Simple Concurrency & Distribution (Asynchronous and Distributed by Design. High-level abstractions like Actors, Streams and Futures).
Resilient by Design (Write systems that self-heal. Remote and local supervisor hierarchies).
High Performance (50 million msg/sec on a single machine. Small memory footprint; ~2.5 million actors per GB of heap).
Elastic & Decentralized (Adaptive cluster management, load balancing, routing, partitioning and sharding).
Extensible(Use Akka Extensions to adapt Akka to fit your needs).
It is a top-level Apache project born at Facebook and built on Amazons Dynamo and Googles BigTable, is a distributed database for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure. Cassandra offers capabilities that relational databases and other NoSQL databases simply cannot match such as continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones. Cassandra’s architecture is responsible for its ability to scale, perform, and offer continuous uptime. Rather than using a legacy master-slave or a manual and difficult-to-maintain sharded architecture, Cassandra has a masterless “ring” design that is elegant, easy to set up, and easy to maintain.
In Cassandra, all nodes play an identical role; there is no concept of a master node, with all nodes communicating with each other equally. Cassandra’s built-for-scale architecture means that it is capable of handling large amounts of data and thousands of concurrent users or operations per second— even across multiple data centers— as easily as it can manage much smaller amounts of data and user traffic. Cassandra’s architecture also means that, unlike other master-slave or sharded systems, it has no single point of failure and therefore is capable of offering true continuous availability and uptime — simply add new nodes to an existing cluster without having to take it down. Keep Reading…
Kafka is one of those systems that is very simple to describe at a high level, but has an incredible depth of technical detail when you dig deeper. Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable.
Like many publish-subscribe messaging systems, Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from topics. Since Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.
Messages are simply byte arrays and the developers can use them to store any object in any format – with String, JSON, and Avro the most common. It is possible to attach a key to each message, in which case the producer guarantees that all messages with the same key will arrive to the same partition. When consuming from a topic, it is possible to configure a consumer group with multiple consumers. Each consumer in a consumer group will read messages from a unique subset of partitions in each topic they subscribe to, so each message is delivered to one consumer in the group, and all messages with the same key arrive at the same consumer.
What makes Kafka unique is that Kafka treats each topic partition as a log (an ordered set of messages). Each message in a partition is assigned a unique offset. Kafka does not attempt to track which messages were read by each consumer and only retain unread messages; rather, Kafka retains all messages for a set amount of time, and consumers are responsible to track their location in each log. Consequently, Kafka can support a large number of consumers and retain large amounts of data with very little overhead.
Reference – Big Data Analytics Communities, and DataStax.com.