Over the last few months, I have come across the topic of “Stream Processing” so often that it became hard to ignore, and it definitely seemed like a new wave of Big Data technologies. We live in a world of innovation and rapid change, so I was cautious about investing my time in every new hype. After going through a few articles, attending meetups, and listening to some nice conference presentations, I am convinced that stream processing is here to stay and is being rapidly adopted by the industry. In this blog, I present a summary of my understanding of the de facto tools used in stream processing.
Big Data infrastructure had not undergone a major change since the advent of the Hadoop stack until very recently. New technology trends like microservices, the Internet of Things, and data localization in apps have led to infrastructure requirements for processing and aggregating real-time events, and for supporting various consumer processes that can handle those events at high throughput. Until the last few years, there were mainly two patterns of programming: request/response and batch processing.
Request/response is the traditional method in which the server gets a single request from the customer and the customer gets a response, which might or might not be trivial depending on the backend of the application. Then came the need to process log files using map/reduce and produce intelligent analytics. This gave rise to batch processing based on the Hadoop stack, in which Hadoop jobs ran for hours, or sometimes overnight, to provide insights into log data and user activity.
In the last couple of years, a need arose for frameworks that can perform real-time analytics on events coming from various devices, microservices, or the click stream of a user group (based on geography etc.). This led companies such as LinkedIn, Twitter, and Google to develop in-house tools and frameworks for stream analytics. But these tools were less mature than their Batch/Hadoop counterparts. Also, there wasn’t much industry consensus around tools and technologies for stream processing. Thus, the situation was something like this:
Fig 1 . From the talk “Apache Kafka and the Next 700 Stream Processing Systems” by Jay Kreps 
Hence, it was hard for the developer community to get their hands on mature tools for stream processing. Then, a few years ago, LinkedIn open-sourced Apache Kafka, a high-throughput messaging system, which led to rapid innovation in this field. Kafka is now becoming the backbone of streaming pipelines and is being adopted rapidly. Let’s look at a few of the key concepts of Apache Kafka.
Kafka is a messaging system built for scalability and high throughput. Sounds like JMS? Kafka comes with a few big advantages over traditional JMS. I would choose Kafka over a traditional JMS queueing solution if I had to persist my messages for longer periods of time (up to days). Another reason is support for multiple consumers that read messages at their own pace, since different datastores have different write speeds depending on their core functionality. Consumers can also “replay” messages if they want, and if they go down, they can resume from the point where they left off once they are up again. These features make Kafka highly usable in an environment with heterogeneous frameworks and libraries, as each can consume messages through a polling model at a throughput it can handle, which is hard to achieve with traditional messaging systems. Let’s have a closer look at why Kafka scales so well –
Fig 2 – A Topic along with its partitions
In Kafka, a Topic is an important abstraction that represents a logical stream of data. This stream achieves parallelism by being split into parallel streams called partitions. The messages/events in each partition are identified by a unique ID, the message offset, which is a sequentially increasing number that marks the message’s position within that partition (it is not a timestamp). This image from the official docs further elaborates on this –
Fig 3 – From the official Kafka docs, Anatomy of a Topic
Thus, we can see that messages are written to a partition in sequential order and the consumer also reads them sequentially.
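To make the partition and offset idea concrete, here is a minimal in-memory sketch in Python. It is not Kafka's implementation or client API: the `Topic` class, its `send` and `read` methods, the topic name, and the round-robin choice of partition are all assumptions made for illustration.

```python
class Topic:
    """A toy topic: a list of append-only partitions (hypothetical, not Kafka's API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = 0  # round-robin cursor; an assumption of this sketch

    def send(self, message):
        """Append a message to the next partition and return (partition, offset)."""
        partition = self._next
        self._next = (self._next + 1) % len(self.partitions)
        log = self.partitions[partition]
        log.append(message)
        # The offset is simply the message's position within its partition.
        return partition, len(log) - 1

    def read(self, partition, offset):
        """Return all messages of one partition from the given offset, in order."""
        return self.partitions[partition][offset:]


topic = Topic("clicks", num_partitions=3)
placements = [topic.send(f"m{i}") for i in range(7)]

# Offsets restart at 0 in every partition and only grow within it.
first_partition_from_1 = topic.read(partition=0, offset=1)
```

Note that ordering is only defined inside a partition: `m3` follows `m0` in partition 0, but nothing is said about how `m3` relates to `m1` in partition 1.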
Let’s look at the overall architecture of Kafka.
Fig 4 – Internal Architecture of Kafka
Brokers make up the Kafka cluster, and a topic is divided into multiple partitions in order to balance load and support parallelism. Each consumer takes part in a consumer group: when we start a new consumer we specify a group label, and grouping takes place based on that label. Each message of a topic gets delivered at least once to one consumer instance of each subscribing group. You can see from the above image that partition 0 sends its messages to consumer 2 only; since consumers 1 and 2 are part of the same consumer group, only one consumer instance of that group gets the message. If each consumer is in a different group, then each message is broadcast to all groups. Similarly, consumer group 2 has only one consumer, which gets messages from all 3 partitions. The number of consumers in a group should not exceed the number of partitions of the subscribed topic. We can add a 3rd consumer to the 1st group as there are 3 partitions, but a 4th consumer would sit idle, as there would be more consumers in the group than partitions.
Brokers in Kafka are stateless, which basically means that the broker doesn’t keep a record of how many messages a consumer has consumed; the message offset is for the consumer to keep track of, not the broker. But this also makes it hard for the broker to know which messages can be deleted, so Kafka solves this problem by using a time-based SLA for the retention policy. Since messages can stay in Kafka for a while, a consumer can also rewind to an earlier offset, and if a consumer instance crashes, the partition it was reading from is assigned to another instance of the group. This distributed coordination is handled by ZooKeeper. Its responsibilities include keeping track of newly added brokers and consumers, rebalancing the partition to consumer group mapping when consumers are added or removed, and keeping track of the offset consumed in each partition. ZooKeeper maintains two registries: an ownership registry and an offset registry. The ownership registry maintains the partition to consumer group mapping, such that each consumer group has its own corresponding ownership registry, and the offset registry contains the last consumed message offset of each partition. Thus, with Kafka, it becomes easy to integrate various frameworks, each consuming at the throughput it can handle and the application requires. Here’s how things would look once you have Kafka integrated with other tools.
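The consumer group rules above can be sketched with a small assignment function. This is an illustrative round-robin scheme, not Kafka's actual assignment strategy; the function name `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread the partitions of a topic over the consumers of ONE group.

    Each partition gets exactly one owner inside the group, so a message is
    handled by a single consumer instance per group. Round-robin is an
    assumption of this sketch. Expects at least one consumer.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Two groups subscribe to the same topic; each group sees every partition.
group_1 = assign_partitions(partitions, ["c1", "c2"])
group_2 = assign_partitions(partitions, ["c3"])

# Four consumers in one group but only three partitions: one consumer is idle.
crowded = assign_partitions(partitions, ["c1", "c2", "c4", "c5"])
idle = [consumer for consumer, owned in crowded.items() if not owned]
```

Running the function once per group is what gives the broadcast behaviour between groups, while the single owner per partition gives the queue-like behaviour inside a group.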
Fig 5 – From the talk “Demystifying Stream Processing with Apache Kafka”- by Neha Narkhede 
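The retention and offset-tracking behaviour described earlier can also be sketched in a few lines of Python. Everything here is a simplification for illustration: `RetainedPartition`, the explicit `now` argument (used instead of a real clock so the example is deterministic), and the plain dictionary standing in for the offset registry are assumptions, not Kafka or ZooKeeper APIs.

```python
class RetainedPartition:
    """A toy partition that deletes messages by age, not by consumption."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.base_offset = 0   # offset of the oldest message still retained
        self.entries = []      # list of (timestamp, message)

    def append(self, message, now):
        self.entries.append((now, message))
        return self.base_offset + len(self.entries) - 1

    def expire(self, now):
        """Drop messages older than the retention window, consumed or not."""
        while self.entries and now - self.entries[0][0] > self.retention_seconds:
            self.entries.pop(0)
            self.base_offset += 1

    def read(self, offset):
        """Return retained messages starting at the given offset."""
        start = max(offset, self.base_offset) - self.base_offset
        return [message for _, message in self.entries[start:]]


partition = RetainedPartition(retention_seconds=100)
partition.append("a", now=0)
partition.append("b", now=50)
partition.append("c", now=120)

# The partition knows nothing about consumers; the next offset to read is
# kept outside it, keyed by (group, partition number).
offset_registry = {("group-1", 0): 2}

rewound = partition.read(0)          # rewind: everything is still retained
partition.expire(now=130)            # "a" is now older than 100 seconds
after_expiry = partition.read(0)     # rewinding past the retained range
resumed = partition.read(offset_registry[("group-1", 0)])
```

Because deletion depends only on age, the partition never needs to ask whether any consumer has finished with a message.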
So far we have discussed Kafka, which is the backbone of your streaming infrastructure. Now let’s look at Apache Flink, a stream processing framework that includes multiple processing libraries to enrich incoming events and messages.
I heard about Flink very recently and have been really impressed by the traction it’s gaining in the developer community. It’s also one of the most active Apache Big Data projects, and Flink meetups are coming up in all major tech cities of the world. While reading about stream processing, I realized that there are a few key features that any good stream processor should support. I was impressed to see that Flink, although pretty new, supports all of these key features, namely –
- Processing engine with support for Streaming as well as Batch
- Supporting various windowing paradigms
- Support for Stateful Streaming
- Fault tolerant and high throughput
- Complex Event Processing (CEP)
- Backpressure handling
- Easy Integration with existing Hadoop stack
- Libraries for doing Machine Learning and graph processing.
The existing Hadoop stack, which is good at batch processing, already has so many moving parts that trying to configure it for stream processing is a difficult task, since various components like Oozie (job scheduler), HDFS (and Flume for data loading), ML and graph libraries, and batch processing jobs all have to work in perfect harmony. On top of that, Hadoop has poor stream support and no easy way to handle backpressure spikes. This makes the Hadoop stack even harder to use for streaming data processing. Let’s take a look at a high-level view of Flink’s architecture.
Fig 6 – Flink’s architecture from the official docs 
For every submitted program, a client is created which does the required pre-processing and turns the program into a parallel dataflow form, which is then executed by the TaskManagers and the JobManager. The JobManager is the primary coordinator for the whole execution cycle and is responsible for allotting tasks to the TaskManagers and for resource management.
An interesting thing about Flink is that it contains so much functionality within its own framework that the number of moving parts in the streaming architecture goes down. Here are the internal Flink components –
Fig 7- From the talk “Apache Flink: What, How, Why, Who, Where? ” by Slim Baltagi 
Flink’s engine, a distributed streaming dataflow engine, supports both streaming and batch processing. Along with the ability to use existing storage and deployment infrastructure, it supports multiple domain-specific libraries such as FlinkML for machine learning, Gelly for graph analysis, Table for SQL, and FlinkCEP for complex event processing. Another interesting aspect of Flink is that existing big data jobs (Hadoop M/R, Cascading, Storm) can be executed on Flink’s engine by means of an adapter. This kind of flexibility is what makes Flink the center of the stream processing infrastructure.
As discussed above in the key feature list, two important aspects of Streaming supported by Flink are Windowing and Stateful streaming. Windowing is basically the technique of executing aggregates over streams. Windows can be broadly classified into
- Tumbling windows (no overlap)
- Sliding windows (with overlap)
The above two concepts can be explained by the following 2 images
Fig 8 – From the talk “Unified Stream and Batch Processing with Apache Flink” by Ufuk Celebi 
Fig 8 – From the talk “Unified Stream and Batch Processing with Apache Flink” by Ufuk Celebi 
In the references, I have provided a link to the Flink APIs that support stream aggregations, i.e. windowing.
Stream processing that only does basic filtering or simple transformations doesn’t need state, but more advanced concepts like aggregation on streams (windowing), complex transformations, and complex event processing make it necessary to support stateful streaming.
In the recent release of Flink, they have introduced a concept called Savepoints. The Flink task managers regularly create checkpoints of the state of the job being processed, and under the hood Savepoints are basically pointers to any of these checkpoints. Savepoints can be triggered manually, and they never expire until discarded by the user. Let’s look at an image for a clearer understanding.
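As a rough illustration of the difference between the two window types, the sketch below sums event values per window in plain Python. It does not use Flink's windowing API; the function names, the `(timestamp, value)` event shape, and the rule that windows start at multiples of the slide (never before 0) are assumptions of this example.

```python
def tumbling_windows(events, size):
    """Sum values per non-overlapping window, keyed by window start time."""
    sums = {}
    for timestamp, value in events:
        start = (timestamp // size) * size
        sums[start] = sums.get(start, 0) + value
    return sums


def sliding_windows(events, size, slide):
    """Sum values per overlapping window [start, start + size).

    Windows start at multiples of `slide`, so with slide < size one event
    contributes to several windows.
    """
    sums = {}
    for timestamp, value in events:
        # Latest window start at or before the event, then walk backwards
        # while the window still covers the event.
        start = (timestamp // slide) * slide
        while start > timestamp - size and start >= 0:
            sums[start] = sums.get(start, 0) + value
            start -= slide
    return sums


# (timestamp in seconds, value)
events = [(1, 10), (4, 20), (6, 30), (11, 40)]

tumbling = tumbling_windows(events, size=10)
sliding = sliding_windows(events, size=10, slide=5)
```

With a 10-second tumbling window, each event is counted once. With a 10-second window sliding every 5 seconds, the event at second 6 is counted in both the window starting at 0 and the one starting at 5.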
Fig 9 – From the official docs, Savepoints
Here, checkpoint C4 is the latest checkpoint, and all earlier checkpoints except C2 (that is, C1 and C3) have been discarded. C2 is still there because a Savepoint was created when C2 was the latest checkpoint, and that Savepoint now holds a pointer to C2. Initially, the job’s state is stored in memory and then checkpointed into a filesystem (like HDFS), and a savepoint is basically a URL to the HDFS location of the checkpointed state. In order to store a much larger state, the Flink team is working towards providing a state backend based on RocksDB.
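The C1 to C4 scenario from the figure can be reproduced with a small bookkeeping sketch. This is only a model of the retention rule described above, not Flink's implementation: the `CheckpointStore` class, its method names, and the savepoint name are hypothetical.

```python
class CheckpointStore:
    """Keeps the latest checkpoint plus any checkpoint a savepoint points to."""

    def __init__(self):
        self.checkpoints = {}   # checkpoint id -> state snapshot
        self.savepoints = {}    # savepoint name -> checkpoint id
        self.latest = None

    def checkpoint(self, checkpoint_id, state):
        """Record a new periodic checkpoint and drop unreferenced older ones."""
        self.checkpoints[checkpoint_id] = dict(state)
        self.latest = checkpoint_id
        self._discard_unreferenced()

    def trigger_savepoint(self, name):
        """Manually pin whatever checkpoint is currently the latest."""
        self.savepoints[name] = self.latest

    def discard_savepoint(self, name):
        """Savepoints only go away when the user discards them."""
        del self.savepoints[name]
        self._discard_unreferenced()

    def _discard_unreferenced(self):
        keep = set(self.savepoints.values()) | {self.latest}
        for checkpoint_id in list(self.checkpoints):
            if checkpoint_id not in keep:
                del self.checkpoints[checkpoint_id]


store = CheckpointStore()
store.checkpoint("C1", {"count": 1})
store.checkpoint("C2", {"count": 2})
store.trigger_savepoint("before-upgrade")   # points at C2
store.checkpoint("C3", {"count": 3})
store.checkpoint("C4", {"count": 4})

remaining = sorted(store.checkpoints)

store.discard_savepoint("before-upgrade")
remaining_after_discard = sorted(store.checkpoints)
```

After the four checkpoints, only C2 and C4 remain, matching the figure; once the savepoint is discarded, C2 is cleaned up as well.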
Here is an overview of a Streaming architecture using Kafka and Flink
Fig 10 – From the talk “Advanced Streaming Analytics with Apache Flink and Apache Kafka” by Stephan Ewen 
So far, we have discussed both Flink and Kafka. Before concluding, let’s go through the Yahoo benchmark for stream processors.
Fig 11 – From the talk “Unified Stream and Batch Processing with Apache Flink” by Ufuk Celebi 
The architecture consisted of Kafka clusters feeding the stream processors. The results of the stream transformations were published to Redis and, via Redis, made available to applications outside the architecture. As you can see, even at high throughput Storm and Flink maintained low latency. This benchmark was further extended by Data Artisans, the company behind Flink. They took Yahoo’s benchmark as a starting point and upgraded the Flink cluster’s node interconnect from the 1 GigE used by Yahoo to 10 GigE. The results were very interesting, as Flink not only outperformed Storm but also saturated the Kafka link at around 3 million events/sec.
Stream processing is at an early yet very interesting phase, and I hope that after reading this blog you will give Kafka and Flink a try on your machine. Feel free to share your feedback/comments.
- https://www.youtube.com/watch?v=9RMOc0SwRro
- http://www.confluent.io/confluent-unveils-next-generation-of-apache-kafka-as-enterprise-adoption-soars
- http://kafka.apache.org/documentation.html
- http://www.infoq.com/presentations/stream-processing-kafka
- https://ci.apache.org/projects/flink/flink-docs-release-0.7/internal_general_arch.html
- http://www.slideshare.net/sbaltagi/apacheflinkwhathowwhywhowherebyslimbaltagi-57825047
- https://www.youtube.com/watch?v=8Uh3ycG3Wew
- https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
- https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
- http://data-artisans.com/extending-the-yahoo-streaming-benchmark/