


A beginner’s guide to the SMACK stack – Part 2: Mesos

In the first part of this series, we briefly looked at the technologies that make up the SMACK stack. In this post, we’ll take a closer look at one of the fundamental layers of the stack: Apache Mesos.

What is Apache Mesos?

Apache Mesos is an open source cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos manages a cluster of machines (virtual or physical) and can also be seen as a distributed systems kernel: the Mesos kernel runs on every machine in the cluster and provides applications (e.g., Hadoop, Spark, Kafka) with APIs for resource management and scheduling across an entire data center or cloud environment. One of the ideas behind Mesos is to deploy multiple distributed systems within the same shared pool of cluster nodes in order to increase resource utilization. Let’s get to know Mesos a little better by looking at some of its core concepts.

Architecture

By examining the Mesos architecture, we’ll get to know the most important concepts to understand when dealing with Mesos.

 

[Figure: Mesos architecture – a master daemon managing agent daemons on the cluster nodes, coordinated through ZooKeeper]

As you can see in the above figure, a Mesos architecture consists of a master daemon that manages agent daemons running on the nodes in the cluster. Mesos also relies on Apache ZooKeeper to operate: ZooKeeper acts as the master election service in the Mesos architecture and stores state for the Mesos nodes.
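To make this more concrete, here is a minimal sketch of how the two daemons are wired together, assuming Mesos is already installed and a ZooKeeper ensemble is reachable at the hypothetical address zk.example.com:2181:

# Start a master; it registers itself in ZooKeeper for leader election.
$ mesos-master \
    --zk=zk://zk.example.com:2181/mesos \
    --quorum=1 \
    --work_dir=/var/lib/mesos

# Start an agent on another node; it discovers the leading master
# through the same ZooKeeper path and advertises its resources to it.
$ mesos-agent \
    --master=zk://zk.example.com:2181/mesos \
    --work_dir=/var/lib/mesos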

Frameworks

Besides masters and agents, Mesos has the concept of frameworks. Frameworks within Mesos are responsible for running tasks on the Mesos agent nodes.

A Mesos framework consists of two major components:

  1. a scheduler that registers with the Mesos master to be offered resources
  2. an executor process that is launched on agent nodes to run the framework’s tasks

Mesos agents will notify the Mesos master about their available resources. Based on that information, the Mesos master determines how many resources to offer to each framework. After the Mesos master has decided which resources to offer, the scheduler then selects which of the offered resources to use. When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch.

Now that we’ve gone over the concepts of frameworks, schedulers, and executors, it’s interesting to point out that most of the components that make up the SMACK stack are available as a Mesos framework: Spark, Kafka, and Cassandra all have one. Mesos has support for a lot of different frameworks; you can find a more extensive list on the Mesos frameworks documentation page.

Jobs

Almost all data processing platforms need two different kinds of jobs:

  1. scheduled/periodic jobs – you can think of periodic batch aggregations or reporting jobs
  2. long-running jobs – stream processing or long-running application jobs

For the first kind of job, you can use the Chronos framework. Chronos is a distributed and fault-tolerant scheduler that runs on top of Apache Mesos and can be used for job orchestration. If you’re familiar with Linux, you can think of it as a distributed version of cron. Compared to regular cron, however, Chronos has a number of advantages. For instance, it allows you to schedule jobs using the ISO 8601 repeating interval notation, which enables more flexibility in job scheduling. Chronos also supports arbitrarily long dependency chains: jobs can be triggered by the completion of other jobs, which can be very useful at times.
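To make the repeating interval notation a bit more tangible, here is a minimal sketch of creating a scheduled job through the Chronos REST API. The host, port, job name, and script path are all hypothetical; the schedule R/2017-01-01T03:00:00Z/P1D reads as “repeat forever, starting January 1st at 03:00 UTC, once per day”:

$ curl -X POST -H 'Content-Type: application/json' \
    http://chronos.example.com:4400/scheduler/iso8601 \
    -d '{
          "name": "nightly-report",
          "owner": "team@example.com",
          "schedule": "R/2017-01-01T03:00:00Z/P1D",
          "command": "/opt/jobs/generate-report.sh"
        }'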

For the long-running jobs, Mesos has the Marathon framework. Marathon is a fault-tolerant and distributed ‘init’ system that can be used to deploy and run applications across a Mesos cluster. Marathon has many features that simplify running applications in a clustered environment, such as high availability, node constraints, application health checks, an API for scriptability, and service discovery. When it comes to application deployment, Marathon has built-in support for blue-green deployments, and it adds scaling and self-healing to an already big feature set. In case a machine in the cluster dies, Marathon will make sure the affected applications are automatically spawned elsewhere in the cluster, so that each application is always on and running its preconfigured number of instances.

Marathon can be used to run any kind of executable process and is also often used as a PaaS solution for running containers. Other Mesos frameworks can even be launched from Marathon. Combined with the self-healing ability, Mesos and Marathon make a very robust platform for running any kind of application.
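To get a feel for what deploying an application looks like, here is a minimal sketch of an app definition posted to Marathon’s REST API. The host name and the application itself are hypothetical; $PORT0 is the port Marathon assigns to each task:

$ curl -X POST -H 'Content-Type: application/json' \
    http://marathon.example.com:8080/v2/apps \
    -d '{
          "id": "/hello-world",
          "cmd": "python3 -m http.server $PORT0",
          "cpus": 0.25,
          "mem": 64,
          "instances": 2
        }'

Marathon will find room for the two instances somewhere in the cluster where the requested CPU and memory are available and, true to its self-healing nature, respawn them elsewhere if their node goes down.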

Getting some hands-on experience

You can build Mesos from source, but probably the easiest way to get some hands-on experience with Mesos in the context of the SMACK stack is by installing DC/OS, the open source Datacenter Operating System. DC/OS is developed by Mesosphere, the company that is also actively working on Apache Mesos, and it is built on top of Mesos. When experimenting with new technologies I always like to use Vagrant or Docker. Luckily, Mesosphere has released a DC/OS Vagrant project which we can easily use, but before we get started, make sure you have Git, Vagrant, and VirtualBox installed.

The first step is installing the vagrant-hostmanager plugin, which will add some new hostnames to our /etc/hosts file so we can easily connect to the instances.

$ vagrant plugin install vagrant-hostmanager

Now let’s clone the dcos-vagrant project:

$ git clone https://github.com/dcos/dcos-vagrant

Let’s configure a 3-node installation (one master, one private agent, and one public agent) by copying the corresponding Vagrant configuration:

$ cd dcos-vagrant
$ cp VagrantConfig-1m-1a-1p.yaml VagrantConfig.yaml

Now that we’ve configured everything, we can spin up our new DC/OS VMs.

$ vagrant up

During the installation, you might be asked for a password. Enter your local machine’s password: Vagrant needs it to alter your /etc/hosts file. It might take a little while before everything is up and running, but once it’s done you can navigate to http://m1.dcos/.
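You can check on the individual VMs at any time by asking Vagrant for their status, which lists every machine defined in VagrantConfig.yaml along with its current state:

$ vagrant status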

If everything is set up correctly, you should be able to log in and see the DC/OS dashboard.

The dashboard will show that we currently have no tasks or services running, but we’re now all set to continue with our deep dive into the SMACK stack. Let’s install Marathon in DC/OS, so we can get some sense of what it takes to install a Mesos framework.

Go to Universe -> Packages and install Marathon as a Mesos framework. Once Marathon is installed, the dashboard should show that you now have running tasks.
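If you prefer the command line over the Universe UI, the same installation can be done with the DC/OS CLI, assuming you have it installed and pointed at this cluster:

$ dcos package search marathon
$ dcos package install marathon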

And by going to Services, you should be able to see the Marathon deployment happening within the Mesos cluster.

Now, if you want, you can also open the service and go to the Marathon UI to start creating groups or applications, but we’ll leave that for next time.

Marathon allows you to deploy processes and application containers, and through the UI you can also easily scale these applications.

Mesos frameworks like Marathon, Kafka, Cassandra and Spark can also be easily scaled from within DC/OS. It’s just a matter of a few clicks.
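The same scaling can also be scripted. As a sketch, scaling the hypothetical /hello-world app from earlier to five instances through the DC/OS CLI could look like this:

$ dcos marathon app update /hello-world instances=5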

In the above example, we installed Mesos via DC/OS in Vagrant, but you can also run DC/OS in the cloud or on-premises. See the DC/OS get started page for more setup options.

Summary

Mesos is the fundamental layer of the SMACK stack that allows the applications to run efficiently. It has all the features needed for proper automatic recovery and can meet the scaling requirements that arise when your architecture hits high traffic loads. Installing the other parts of the stack is almost trivial with DC/OS and the Mesos frameworks for Kafka, Spark, and Cassandra.

In the next part of this series, we will dive deeper into Cassandra and start adding Cassandra to our local SMACK stack for storage of our data at hand.



A beginner’s guide to the SMACK stack – Part 1: Introduction

This is the introduction to a new series of blog posts about data analysis with the SMACK stack. Follow along in the coming weeks to learn more!

Over the last 20 years, the architectures of (big) data platforms have changed, and companies processing large amounts of data have moved from big enterprise data warehouse solutions to on-premise or cloud-based solutions, with Apache Hadoop (Hadoop Distributed File System, MapReduce, and YARN) as a fundamental building block. Architectures based on Hadoop tend to be focused on (long-running) batch or offline jobs, where data is captured to storage and then processed periodically. For usage in an online environment, this batch-based processing was becoming too slow, and business expectations were changing. Over the last 5-10 years, the demand for (near) real-time analysis has been pushing the industry to find new solutions and architectural patterns to achieve these new goals. This has led to several ‘new’ architectural patterns, like the Lambda architecture and the Kappa architecture. Both patterns focus on processing data at speed (stream processing); the Kappa architecture is purely focused on a streaming (speed) layer and completely removes batch-oriented processing.

When designing a data platform there are many aspects that need to be taken into consideration:

  • the type of analysis – batch, (near) real-time, or both
  • the processing methodology – predictive, analytical, ad-hoc queries, or reporting
  • data frequency and size – how much data is expected and at what frequency it arrives at the platform
  • the type of data – transactional, historical, etc.
  • the format of incoming data – structured, unstructured, or semi-structured
  • the data consumers – who will be using the results

This list is by no means exhaustive, but it’s a starting point.

Organisations processing high volumes of data used to pick a (single) vendor-backed product stack, but these days there are so many great, open source, reliable, and proven solutions out there that you can easily take a best-of-breed approach and build your own stack. There is a wide variety of components to select from, so always choose based on your specific requirements. One of the more popular, general-purpose, best-of-breed big data stacks I’ve seen lately is the SMACK stack.

The SMACK stack

The SMACK Stack: Spark, Mesos, Akka, Cassandra and Kafka.

The SMACK stack consists of the following technologies:

  • Spark – Apache Spark™ is a fast and general engine for large-scale data processing. Spark allows you to combine SQL, streaming, and complex analytics. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. Spark supports both (near) real-time processing (Spark Streaming, using micro-batches) and batch (MapReduce-style) processing.
  • Mesos – Apache Mesos™ abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built and run effectively. Mesos runs applications within its cluster, makes sure they are highly available, and relocates applications to different nodes in case of a machine failure.
  • Akka – Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. Akka uses the Actor Model to raise the abstraction level and provide a better platform to build scalable, resilient and responsive applications. Everything in Akka is designed to work in a distributed environment: all interactions of actors use pure message passing and everything is asynchronous.
  • Cassandra – Apache Cassandra™ is a proven, high-performance, durable, and fault-tolerant NoSQL database. Cassandra can easily manage large amounts of data and offers robust support for clusters spanning multiple datacenters and geographical locations.
  • Kafka – Apache Kafka™ is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design to allow a single cluster to serve as the central data backbone for a large organization.

The tools are very easy to integrate with each other, and each serves its own purpose within a modern platform for big data applications. The ease of integration between them has probably helped a lot in making the stack a popular solution, but it’s not the only reason. I think the most important reasons are that:

  1. it’s a concise toolbox that can deal with a wide variety of data processing scenarios
  2. it’s composed of proven, battle-tested, and widely used software components. The individual components are open source and backed by a large open-source community
  3. the stack scales easily and replicates data while still preserving low latencies
  4. the stack can run on a single cluster managed platform that can handle heterogeneous loads and any kind of applications

Over the next couple of weeks, we’ll do a deep dive into each individual technology, so we can elaborate on why these technologies combined are extremely powerful and give you a wide variety of options when designing your (big) data architecture.

Feel free to continue reading in the second part of this series, which covers Apache Mesos, the foundation of the SMACK stack.