This is the introduction to a new series of blog posts about data analysis with the SMACK stack. Follow along in the coming weeks to learn more!
Over the last 20 years, the architectures of (big) data platforms have changed. Companies processing large amounts of data have moved from big enterprise data warehouse solutions to on-premises or cloud-based solutions, with Apache Hadoop (Hadoop Distributed File System, MapReduce, and YARN) as a fundamental building block. Architectures based on Hadoop tend to be focussed on (long-running) batch or offline jobs, where data is captured to storage and then processed periodically. For use in an online environment, this batch-based processing became too slow as business expectations changed.
Over the last 5-10 years, the demand for (near) real-time analysis has pushed the industry to find new solutions and architectural patterns. This has led to several ‘new’ architectural patterns like the Lambda architecture and the Kappa architecture. Both patterns focus on processing data at speed (stream processing). The Lambda architecture combines a batch layer with a speed layer, whereas the Kappa architecture relies purely on a streaming (speed) layer and removes batch-oriented processing completely.
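To make the difference between the two patterns a bit more tangible, here is a minimal sketch in plain Python. It is a toy model, not a real implementation: the page-view events, the function names and the `SpeedLayer` class are all made up for illustration. In the Lambda variant a query merges a periodically recomputed batch view with an incrementally updated real-time view; in the Kappa variant there is only one streaming code path, and reprocessing means replaying the log through it.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from all data captured so far."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: update a view incrementally as each event arrives."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1


def lambda_query(batch, realtime, page):
    """Lambda serving layer: merge the batch view and the real-time view."""
    return batch[page] + realtime[page]


def kappa_view(log):
    """Kappa: a single streaming path; to reprocess, replay the whole log."""
    layer = SpeedLayer()
    for event in log:
        layer.on_event(event)
    return layer.view


# Hypothetical page-view events, used only for this illustration.
events = [{"page": "home"}, {"page": "pricing"}, {"page": "home"}]
already_batched, recent = events[:2], events[2:]

# Lambda: older events sit in the batch view, newer ones in the speed layer.
batch = batch_view(already_batched)
speed = SpeedLayer()
for event in recent:
    speed.on_event(event)
lambda_home = lambda_query(batch, speed.view, "home")

# Kappa: every event goes through the same streaming code.
kappa_home = kappa_view(events)["home"]
```

Both variants arrive at the same count for the "home" page. The trade-off is that Lambda maintains two code paths (batch and speed), while Kappa keeps one and depends on being able to replay the event log.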
When designing a data platform there are many aspects that need to be taken into consideration:
- the type of analysis – batch, (near) real-time, or both
- the processing methodology – predictive, analytical, ad-hoc queries or reporting
- data frequency and size – how much data is expected and how frequently it arrives at the platform
- the type of data – transactional, historical, etc.
- the format of incoming data – structured, unstructured or semi-structured
- the data consumers – who will be using the results
This list is by no means exhaustive, but it’s a starting point.
Organisations processing high volumes of data used to pick a (single) vendor-backed product stack, but these days there are so many great, reliable and proven open-source solutions out there that you can easily take a best-of-breed approach and build your own stack. There is a wide variety of components to choose from, so always select them based on your specific requirements. One of the more popular, general-purpose, best-of-breed big data stacks I’ve seen lately is the SMACK stack.
The SMACK stack
The SMACK stack consists of the following technologies:
- Spark – Apache Spark™ is a fast and general engine for large-scale data processing. Spark allows you to combine SQL, streaming, and complex analytics. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. Spark supports both near-real-time processing (Spark Streaming, which works on micro-batches) and batch processing (MapReduce-style jobs that run on Spark’s own engine).
- Mesos – Apache Mesos™ abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. Mesos runs applications within its cluster, makes sure they are highly available and, in the case of a machine failure, relocates them to different nodes in the cluster.
- Akka – Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. Akka uses the Actor Model to raise the abstraction level and provide a better platform to build scalable, resilient and responsive applications. Everything in Akka is designed to work in a distributed environment: all interactions of actors use pure message passing and everything is asynchronous.
- Cassandra – Apache Cassandra™ is a proven, high-performance, durable and fault-tolerant NoSQL database. Cassandra can easily manage large amounts of data and offers robust support for clusters spanning multiple datacenters and geographical locations.
- Kafka – Apache Kafka™ is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design to allow a single cluster to serve as the central data backbone for a large organization.
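The phrase "partitioned commit log" in the Kafka description can sound abstract, so here is a small sketch of the idea in plain Python. This is a toy in-memory model and not the Kafka client API: the `PartitionedLog` class, its methods and the sensor readings are hypothetical, the CRC32 hash is only a stand-in for a real partitioner, and replication is left out entirely.

```python
import zlib


class PartitionedLog:
    """Toy in-memory model of a partitioned, append-only log (not a Kafka client)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a record and return the (partition, offset) it was written to."""
        # Records with the same key always land in the same partition,
        # so their relative order is preserved.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        """Return all records in a partition from the given offset onwards."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
p1, o1 = log.append("sensor-1", 20.5)
p2, o2 = log.append("sensor-1", 21.0)

# A consumer tracks its own offset and can re-read from any earlier position.
replayed = log.read(p1, 0)
```

Because records are only ever appended and consumers keep track of their own offsets, several consumers can read the same data independently and replay it when needed, which is what makes a log like this usable as a central data backbone.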
The tools are very easy to integrate with each other, and each serves its own purpose within a modern platform for Big Data applications. The ease of integration between them has probably helped a lot in making the stack a popular solution, but it’s not the only reason. I think the most important reasons are:
- it’s a concise toolbox that can deal with a wide variety of data processing scenarios
- it’s composed of proven, battle-tested and widely used software components. The individual components are open source and backed by a large open-source community
- the stack scales easily, and data is replicated while still preserving low latencies
- the stack can run on a single, cluster-managed platform that can handle heterogeneous loads and any kind of application
Over the next couple of weeks, we’ll be doing a deep-dive into each individual technology, so we can explain why these technologies combined are extremely powerful and give you a wide variety of options when designing your (big) data architecture.
Feel free to continue reading in the second part of this series, which covers Apache Mesos, the foundation of the SMACK stack.