Categories for Continuous Delivery



Monitoring Spring Boot applications with Prometheus and Grafana

At my current project, we’ve been building three different applications. All three are based on Spring Boot, but they have very different workloads. They’ve all made their way to the production environment and have been running steadily for quite some time now. We do regular (weekly) deployments of our applications to production with bug fixes, new features, and technical improvements. The organisation has a traditional infrastructure workflow in the sense that deployments to the VM instances on acceptance and production happen via the (remote) hosting provider.

The hosting provider is responsible for the uptime of the applications and therefore they keep an eye on system metrics through their own monitoring system. As a team, we are able to look into that system, but it doesn’t say much about the internals of our application. In the past, we’ve asked them to add some additional metrics, but their system isn’t that easy to extend with custom metrics. To us as a team, runtime statistics about our applications and the effect our changes have on overall health are crucial to understanding the impact of our work. The rest of this post gives a short description of our journey and the reasons why we chose the resulting setup.

Spring Boot Actuator and Micrometer

If you’ve used Spring Boot before, you’ve probably heard of Spring Boot Actuator. Actuator is a set of features that help you monitor and manage your application when it moves away from your local development environment and onto a test, staging or production environment. It helps expose operational information about the running application: health, metrics, audit entries, scheduled tasks, env settings, etc. You can query the information via several HTTP endpoints or via JMX beans. Being able to view the information is useful, but it’s hard to spot trends or see the behaviour over a period of time.
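
As a small illustration of the kind of information Actuator exposes, you can contribute your own entry to the health endpoint by implementing the HealthIndicator interface. The following is only a minimal sketch; the DataDirectoryHealthIndicator name and the 100 MB threshold are made up for this example:

import java.io.File;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Hypothetical health check: reports DOWN when the working directory runs low on disk space.
// The result shows up as an extra entry in the output of the /actuator/health endpoint.
@Component
public class DataDirectoryHealthIndicator implements HealthIndicator {

    private static final long MINIMUM_FREE_BYTES = 100L * 1024 * 1024; // 100 MB, arbitrary threshold

    @Override
    public Health health() {
        long freeBytes = new File(".").getFreeSpace();
        Health.Builder builder = freeBytes > MINIMUM_FREE_BYTES ? Health.up() : Health.down();
        return builder.withDetail("freeBytes", freeBytes).build();
    }
}

Note that in Spring Boot 2 the health endpoint hides details by default; set management.endpoint.health.show-details if you want to see the extra entries.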

When we recently upgraded our projects to Spring Boot 2, my team was pretty excited that we were able to start using Micrometer, a (new) instrumentation library powering the delivery of application metrics. Micrometer is now the default metrics library in Spring Boot 2 and it doesn’t just give you metrics from your Spring application; it can also deliver JVM metrics (garbage collection, memory pools, etc.) as well as metrics from the application container. Micrometer has several different libraries that can be included to ship metrics to different backends, with support for Prometheus, Netflix Atlas, CloudWatch, Datadog, Graphite, Ganglia, JMX, Influx/Telegraf, New Relic, StatsD, SignalFx, and Wavefront.
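
Besides the metrics you get for free, Micrometer also makes it easy to register your own application metrics through the MeterRegistry that Spring Boot auto-configures. A minimal sketch (the OrderService class and the orders.created metric name are purely illustrative):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

// Hypothetical service that counts how many orders it creates. The counter is published
// through whichever registry is on the classpath (Prometheus in our case).
@Service
public class OrderService {

    private final Counter ordersCreated;

    public OrderService(MeterRegistry registry) {
        this.ordersCreated = Counter.builder("orders.created")
                .description("Number of orders created")
                .register(registry);
    }

    public void createOrder() {
        // ... business logic ...
        ordersCreated.increment();
    }
}

With the Prometheus registry on the classpath, this counter shows up on the /actuator/prometheus endpoint as orders_created_total, ready to be scraped.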

Because we didn’t have a lot of control over the way our applications were deployed, we looked at the different backends supported by Micrometer. Most of the backends above work by pushing data out to a remote (cloud) service. Since the organisation we work for doesn’t allow us to push this ‘sensitive’ data to a remote party, we looked at self-hosted solutions. We did a quick scan, started looking into Prometheus (and Grafana), and soon learned that it was really easy to get a monitoring system up: we had a running setup within an hour.

To be able to use Spring Boot Actuator and Prometheus together you need to add two dependencies to your project:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Actuator has an endpoint available for Prometheus to scrape, but it’s not exposed by default, so you will need to enable the endpoint through configuration. In this case, I’ll do so via application.properties.

management.endpoint.prometheus.enabled=true
management.endpoints.web.exposure.include=prometheus,info,health

Now if you browse to http(s)://host(:8080)/actuator/prometheus you will see the output that Prometheus will scrape to get the information from your application. A small snippet of the information provided by the endpoint is shown below, but the endpoint exposes a lot more.

# HELP tomcat_global_sent_bytes_total  
# TYPE tomcat_global_sent_bytes_total counter
tomcat_global_sent_bytes_total{name="http-nio-8080",} 75776.0
tomcat_global_sent_bytes_total{name="http-nio-8443",} 1.0182049E8
# HELP tomcat_servlet_request_max_seconds  
# TYPE tomcat_servlet_request_max_seconds gauge
tomcat_servlet_request_max_seconds{name="default",} 0.0
tomcat_servlet_request_max_seconds{name="jsp",} 0.0
# HELP process_files_open The open file descriptor count
# TYPE process_files_open gauge
process_files_open 91.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.00427715996578272
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 3.58088704E8

Now that everything is configured from the application perspective, let’s move on to Prometheus itself.

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now part of the Cloud Native Computing Foundation. To get a better understanding of what Prometheus really is, let us take a look at an architectural diagram.

(Source: https://prometheus.io/docs/introduction/overview/)

The Prometheus server consists of three main components:

  • A time series database
  • A retrieval component which scrapes its targets for information
  • An HTTP server which you can use to query information stored inside the time series database

To make it even more powerful there are some additional components which you can use if you want:

  • An alert manager, which you can use to send alerts via Pagerduty, Slack, etc.
  • A push gateway in case you need to push information to prometheus instead of using the default pull mechanism
  • Grafana for visualizing data and creating dashboards

When looking at Prometheus the most appealing features for us were:

  • no reliance on distributed storage; single server nodes are autonomous
  • time series collection happens via a pull model over HTTP
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding support

To get up and running quickly, you can configure Prometheus to scrape some (existing) Spring Boot applications. Prometheus uses a file called prometheus.yml as its main configuration file. Within this configuration file, you specify the scrape targets it needs to monitor, as well as recording rules and alerting rules.

The following example shows a configuration with a set of static targets for both Prometheus itself and our Spring Boot application.

global:
  scrape_interval:   15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'bootifull-monitoring'

scrape_configs:
- job_name:       'monitoring-demo'

  # Override the global default and scrape targets from this job every 10 seconds.
  scrape_interval: 10s
  metrics_path: '/actuator/prometheus'

  static_configs:
  - targets: ['monitoring-demo:8080']
    labels:
      application: 'monitoring-demo'

- job_name: 'prometheus'

  scrape_interval: 5s

  static_configs:
  - targets: ['localhost:9090']

As you can see, the configuration is pretty simple. You can add specific labels to the targets, which can later be used for querying, filtering and creating dashboards based upon the information stored within Prometheus.

If you want to get started quickly with Prometheus and have Docker available in your environment, you can use the official Prometheus Docker image and provide a custom configuration from your host machine by running:

docker run -p 9090:9090 -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
       prom/prometheus:v2.4.3

In the above example we bind-mount the main Prometheus configuration file from the host system, so you can, for instance, use the configuration shown above. Prometheus itself has some basic graphing capabilities (as you can see in the following image), but they are more meant for ad-hoc queries.

For creating an application monitoring dashboard Grafana is much more suited.

Grafana

So what is Grafana and what role does it play in our monitoring stack?

Grafana allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data driven culture.

The cool thing about Grafana (next to the beautiful UI) is that it’s not tied to Prometheus as its single data source, like for instance Kibana is tied to Elasticsearch. Grafana can have many different data sources like AWS CloudWatch, Elasticsearch, InfluxDB, Prometheus, etc. This makes it a very good option for creating a monitoring dashboard. Grafana talks to Prometheus using the PromQL query language.

For Grafana there is also an official Docker image available for you to use. You can get Grafana up and running with a simple command.

docker run -p 3000:3000 grafana/grafana:5.2.4

Now if we connect Grafana with Prometheus as the data source and install this excellent JVM Micrometer dashboard into Grafana, we can instantly start monitoring our Spring Boot application. You will end up with a pretty mature dashboard that lets you switch between different instances of your application.

If you want to start everything all at once you can easily use docker-compose.

version: "3"
services:
  app:
    image: monitoring-demo:latest
    container_name: 'monitoring-demo'
    build:
      context: ./
      dockerfile: Dockerfile
    ports:
    - '8080:8080'
  prometheus:
    image: prom/prometheus:v2.4.3
    container_name: 'prometheus'
    volumes:
    - ./monitoring/prometheus/:/etc/prometheus/
    ports:
    - '9090:9090'
  grafana:
    image: grafana/grafana:5.2.4
    container_name: 'grafana'
    ports:
    - '3000:3000'
    volumes:
    - ./monitoring/grafana/provisioning/:/etc/grafana/provisioning/
    env_file:
    - ./monitoring/grafana/config.monitoring
    depends_on:
    - prometheus

I’ve put together a small demo project, containing a simple Spring Boot application and the above Prometheus configuration, in a GitHub repository for demo and experimentation purposes. If you want to generate some statistics, run a small load test with JMeter or Apache Bench. Feel free to use/fork it!


Elasticsearch instances for integration testing

In my latest project I have implemented all communication with my Elasticsearch cluster using the high level REST client. My next step was to automatically set up and tear down an Elasticsearch instance in order to facilitate proper integration testing. This article describes three different ways of doing so and discusses some of the pros and cons. Please refer to this repository for implementations of all three methods.

docker-maven-plugin

This generic Docker plugin allows you to bind the starting and stopping of Docker containers to Maven lifecycle phases. You specify two blocks within the plugin: configuration and executions. In the configuration block, you choose the image that you want to run (Elasticsearch 6.3.0 in this case), the ports that you want to expose, a health check and any environment variables. See the snippet below for a complete example:

<plugin>
    <groupId>io.fabric8</groupId>
    <artifactId>docker-maven-plugin</artifactId>
    <version>${version.io.fabric8.docker-maven-plugin}</version>
    <configuration>
        <imagePullPolicy>always</imagePullPolicy>
        <images>
            <image>
                <alias>docker-elasticsearch-integration-test</alias>
                <name>docker.elastic.co/elasticsearch/elasticsearch:6.3.0</name>
                <run>
                    <namingStrategy>alias</namingStrategy>
                    <ports>
                        <port>9299:9200</port>
                        <port>9399:9300</port>
                    </ports>
                    <env>
                        <cluster.name>integration-test-cluster</cluster.name>
                    </env>
                    <wait>
                        <http>
                            <url>http://localhost:9299</url>
                            <method>GET</method>
                            <status>200</status>
                        </http>
                        <time>60000</time>
                    </wait>
                </run>
            </image>
        </images>
    </configuration>
    <executions>
        <execution>
            <id>docker:start</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>start</goal>
            </goals>
        </execution>
        <execution>
            <id>docker:stop</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

You can see that I’ve bound the plugin to the pre- and post-integration-test lifecycle phases. By doing so, the Elasticsearch container will be started just before any integration tests are run and will be stopped after the integration tests have finished. I’ve used the maven-failsafe-plugin to trigger the execution of tests ending with *IT.java in the integration-test lifecycle phase.
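
For reference, such an integration test could look roughly like the sketch below. The class name and the ping assertion are made up for illustration; the hard-coded port matches the 9299 mapping from the plugin configuration above:

import static org.junit.Assert.assertTrue;

import java.io.IOException;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.junit.Test;

// Hypothetical integration test, picked up by the maven-failsafe-plugin because of the IT suffix.
public class ElasticsearchClusterIT {

    @Test
    public void clusterIsReachable() throws IOException {
        // Connect to the Elasticsearch container started in the pre-integration-test phase.
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9299, "http")))) {
            assertTrue(client.ping());
        }
    }
}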

Since this is a generic Docker plugin, there is no special functionality to easily install Elasticsearch plugins that may be needed during your integration tests. You could, however, create your own image with the required plugins and pull that image during your integration tests.

The integration with IntelliJ is also not optimal. When running an *IT.java class, IntelliJ will not trigger the correct lifecycle phases and will attempt to run your integration test without creating the required Docker container. Before running an integration test from IntelliJ, you need to manually start the container from the “Maven projects” view by running the docker:start command:

Maven Projects view in IntelliJ

After running, you will also need to run the docker:stop command to kill the container that is still running. If you forget to kill the running container and want to run mvn clean install later on, it will fail, since the build will attempt to create a container on the same port – as far as I know, the plugin does not allow for random ports to be chosen.

Pros:

  • Little setup, only requires configuration of one Maven plugin

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • No out of the box functionality to install extra Elasticsearch plugins
  • Extra dependency in your build pipeline (Docker)
  • IntelliJ does not trigger the correct lifecycle phases

elasticsearch-maven-plugin

This second plugin does not require Docker and only needs some Maven configuration to get started. See the snippet below for a complete example:

<plugin>
    <groupId>com.github.alexcojocaru</groupId>
    <artifactId>elasticsearch-maven-plugin</artifactId>
    <version>${version.com.github.alexcojocaru.elasticsearch-maven-plugin}</version>
    <configuration>
        <version>${version.org.elastic}</version>
        <clusterName>integration-test-cluster</clusterName>
        <transportPort>9399</transportPort>
        <httpPort>9299</httpPort>
    </configuration>
    <executions>
        <execution>
            <id>start-elasticsearch</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>runforked</goal>
            </goals>
        </execution>
        <execution>
            <id>stop-elasticsearch</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Again, I’ve bound the plugin to the pre- and post-integration-test lifecycle phases in combination with the maven-failsafe-plugin.

This plugin provides a way of starting the Elasticsearch instance from IntelliJ in much the same way as the docker-maven-plugin: you can run the elasticsearch:runforked command from the “Maven projects” view. However, in my case, this started the instance and then immediately exited. There is also no out of the box possibility of setting a random port for your instance. Of course, there are solutions to this at the expense of a somewhat more complex Maven configuration.

Overall, this is a plugin that seems to provide almost everything we need with a lot of configuration options. You can automatically install Elasticsearch plugins or even bootstrap your instance with data.

In practice I did have some problems using the plugin in my build pipeline. The build would sometimes fail while downloading the Elasticsearch zip, or in other cases while downloading a plugin. Your mileage may vary, but this was reason enough for me to keep looking for another solution, which brings me to plugin number three.

Pros:

  • Little setup, only requires configuration of one Maven plugin
  • No extra external dependencies
  • High amount of configuration possible

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • Poor integration with IntelliJ
  • Seems unstable

testcontainers-elasticsearch

This third plugin is different from the other two. It uses a Java testcontainer that you can configure through Java code. This gives you a lot of flexibility and requires no Maven configuration. Since there is no Maven configuration, it does require some work to make sure the Elasticsearch container is started and stopped at the correct moments.

In order to realize this, I have extended the standard SpringJUnit4ClassRunner class with my own ElasticsearchSpringRunner. In this runner, I have added a new JUnit RunListener named JUnitExecutionListener. This listener defines two methods testRunStarted and testRunFinished that enable me to start and stop the Elasticsearch container at the same points in time that the pre- and post-integration-test Maven lifecycle phases would. See the snippet below for the implementation of the listener:

public class JUnitExecutionListener extends RunListener {

    private static final Logger LOGGER = LoggerFactory.getLogger(JUnitExecutionListener.class);
    private static final String ELASTICSEARCH_IMAGE = "docker.elastic.co/elasticsearch/elasticsearch";
    private static final String ELASTICSEARCH_VERSION = "6.3.0";
    private static final String ELASTICSEARCH_HOST_PROPERTY = "nl.luminis.articles.maven.elasticsearch.host";
    private static final int ELASTICSEARCH_PORT = 9200;

    private ElasticsearchContainer container;

    @Override
    public void testRunStarted(Description description) {
        // Create a Docker Elasticsearch container when there is no existing host defined in default-test.properties.
        // Spring will use this property to configure the application when it starts.
        if (System.getProperty(ELASTICSEARCH_HOST_PROPERTY) == null) {
            LOGGER.debug("Create Elasticsearch container");
            int mappedPort = createContainer();
            System.setProperty(ELASTICSEARCH_HOST_PROPERTY, "localhost:" + mappedPort);
            String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
            RestAssured.basePath = "";
            RestAssured.baseURI = "http://" + host.split(":")[0];
            RestAssured.port = Integer.parseInt(host.split(":")[1]);
            LOGGER.debug("Created Elasticsearch container at {}", host);
        }
    }

    @Override
    public void testRunFinished(Result result) {
        if (container != null) {
            String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
            LOGGER.debug("Removing Elasticsearch container at {}", host);
            container.stop();
        }
    }

    private int createContainer() {
        container = new ElasticsearchContainer();
        container.withBaseUrl(ELASTICSEARCH_IMAGE);
        container.withVersion(ELASTICSEARCH_VERSION);
        container.withEnv("cluster.name", "integration-test-cluster");
        container.start();
        return container.getMappedPort(ELASTICSEARCH_PORT);
    }
}
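
The runner itself then only needs to register this listener before delegating to the regular Spring runner. A rough sketch of what that can look like (the actual runner in the referenced repository may differ in detail):

import org.junit.runner.notification.RunNotifier;
import org.junit.runners.model.InitializationError;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

public class ElasticsearchSpringRunner extends SpringJUnit4ClassRunner {

    public ElasticsearchSpringRunner(Class<?> clazz) throws InitializationError {
        super(clazz);
    }

    @Override
    public void run(RunNotifier notifier) {
        // Register the listener and fire the start event ourselves, because JUnit fires
        // testRunStarted before this runner gets control. The listener only creates a
        // container once, so running multiple IT classes does not spawn extra instances.
        JUnitExecutionListener listener = new JUnitExecutionListener();
        notifier.addListener(listener);
        notifier.fireTestRunStarted(getDescription());
        super.run(notifier);
    }
}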

It will create an Elasticsearch Docker container on a random port for use by the integration tests. The best thing about having this runner is that it works perfectly fine in IntelliJ. Simply right-click and run your *IT.java classes annotated with @RunWith(ElasticsearchSpringRunner.class) and IntelliJ will use the listener to set up the Elasticsearch container. This allows you to automate your build pipeline while still keeping developers happy.

Pros:

  • Neat integration with Java and therefore with your IDE
  • Sufficient configuration options out of the box

Cons:

  • More complex initial setup
  • Extra dependency in your build pipeline (Docker)

In summary, all three of the above plugins are able to realize the goal of starting an Elasticsearch instance for your integration testing. For me personally, I will be using the testcontainers-elasticsearch plugin going forward. The extra Docker dependency is not a problem since I use Docker in most of my build pipelines anyway. Furthermore, the integration with Java allows me to configure things in such a way that it works perfectly fine from both the command line and the IDE.

Feel free to check out the code behind this article, play around with the integration tests that I’ve set up there and decide for yourself which plugin suits your needs best. Please note that the project has a special Maven profile that separates unit tests from integration tests. Build the project using mvn clean install -P integration-test to run both.


Looking back on AWS Summit Benelux 2018

Last week I visited AWS Summit Benelux together with Sander. AWS Summit is all about cloud computing and the topics that surround it. This being my first AWS conference, I can say it was a really nice experience. Sure, there was room for improvement (no coffee or tea after the opening keynote being one), but other than that it was a very good event. Getting inside was a breeze with all the different check-in points, and after you entered you were directly on the exhibitor floor where a lot of Amazon partners showed their products.

Opening keynote

The day started with an introduction by Kamini Aisola, Head of Amazon Benelux. With this being my first AWS Summit, it was great to see Kamini showing some numbers about the conference: 2000 attendees and 28 technical sessions. She also showed us the growth pattern of AWS, with growth of 49% compared to last year. That’s really impressive!

Who are builders?

Shortly after, Amazon.com CTO Werner Vogels started his opening keynote. Werner showed how AWS evolved from being ‘just’ an IaaS company to now offering more than 125 different services. More than 90% of the developed services were based on customer feedback from the last couple of years. That’s probably one of the reasons why AWS is growing so rapidly and customers are adopting the AWS platform.

What I noticed throughout the entire keynote is that AWS is constantly thinking about what builders want to build (in the cloud) and what kind of tools those builders need to be successful. These tools come in different forms and sizes, but I noticed there is a certain pattern in how services evolve or grow at AWS. The overall trend I noticed during the talks is that engineers or builders should spend less time on lower level infrastructure, so they can really focus on delivering business value by leveraging the services that AWS has to offer.

During the keynote Werner ran through a couple of different focus areas for which he showed what AWS is currently offering. In this post I won’t go through all of them, because I expect you can probably watch a recording of the keynote on youtube soon, but I’ll highlight a few.

Let’s first start with the state of Machine Learning and analytics. Werner looked back at how machine learning evolved at Amazon.com and how services were developed to make machine learning more accessible for teams within the organisation. Out of this came a really nice mission statement:

AWS wants to put machine learning in the hands of every developer and data scientist.

To achieve this mission, AWS is currently offering a layered ML stack to engineers looking into using ML on the AWS platform.

The layers go from low-level libraries to pre-built functionality based on those lower level layers. I really liked the fact that these services are built in such a way that engineers can decide at which level of complexity they want to start using the ML services offered by AWS. Most of the time data engineers and data scientists will start from SageMaker or even lower, but most application developers might just want to use pre-built functionality like image recognition, text processing or speech recognition. See for instance this really awesome post on using facial recognition by my colleague Roberto.

Another example of this layered approach concerned container support on AWS. A few years back Amazon added container support to their offering with Amazon Elastic Container Service (Amazon ECS). ECS helped customers run containers on AWS without having to manage servers or run their own container orchestration software. Fast forward a few years and Amazon now also offers Amazon EKS (managed Kubernetes on Amazon), after they noticed that about 63% of managed Kubernetes clusters ran on AWS. Kubernetes has become the current industry standard when it comes to container orchestration, so this makes a lot of sense. In addition, Amazon now offers Amazon Fargate, which takes the next step and allows you as the developer to focus on running containers ‘without having to think about managing servers or clusters’.

During his keynote, Werner also mentioned the Well-Architected framework. The Well-Architected framework has been developed to help cloud architects run their applications in the cloud based on AWS best practices. When implemented correctly, it allows you to fully focus on your functional requirements to deliver business value to your customers. The framework is based on the following five pillars:

  1. Operational Excellence
  2. Security
  3. Reliability
  4. Performance Efficiency
  5. Cost Optimization

I had not heard about the framework before, so during the weekend I read through some of its documentation. Some of the items are pretty straightforward, but others might give you some insights into what it means to run applications in the cloud. One aspect of the Well-Architected framework, Security, had been recurring throughout the entire keynote.

Werner emphasised a very important point during his presentation:

Security is EVERYONE’s job

With all the data breaches happening lately I think this is a really good point to make. Security should be everybody’s number one priority these days.

During the keynote, there were a couple of customers that showed how AWS had helped them achieve a certain goal. Bastiaan Terhorst, CPO at WeTransfer, explained that being a cloud-scale company comes with certain problems. He explained how they moved from a brittle situation towards a more scalable solution. They could not modify the schema of their DB anymore without breaking the application, which is horrible once you reach a certain scale and customer base. They had to rearchitect the way they handled incoming data and used historic data for reporting. I really liked the fact that he shared some hard-learned lessons about database scalability issues that can occur when you reach a certain scale.

Tim Bogaert, CTO at de Persgroep, also showed how they moved from being a siloed organization with its own datacenters and long-running waterfall projects towards going all-in on AWS with an agile approach and teams following the “You Build It, You Run It” mantra. It was an interesting story because I see a lot of larger enterprises still struggling with these transitions.

After the morning keynote, the breakout sessions started. There were 7 parallel tracks, all with different topics, so there was plenty to choose from. During the day I attended only a few, so here goes.

Improve Productivity with Continuous Integration & Delivery

This really nice talk by Clara Ligouri (software engineer for AWS Developer Tools) and Jamie van Brunschot (Cloud engineer at Coolblue) gave a good insight into all the different tools provided by AWS to support the full development and deployment lifecycle of an application.

Clara modified some code in Cloud9 (the online IDE), debugged it, ran CI jobs, tests and deployments all from within her browser, and pushed a new change to production within a matter of minutes. It shows how far the current state of cloud-native development has really come. I looked at Cloud9 years ago, way before they were acquired by Amazon, and I’ve always been a bit skeptical when it comes to using an online IDE. I remember having some good discussions with the CTO at my former company about whether this would really be the next step for IDEs and software development in general. I’m just so comfortable with IntelliJ for Java development and it always works (even if I do not have any internet ;-)). I do wonder if anybody reading this is already using Cloud9 (or any other web IDE) and is doing their development fully in the cloud. If you do, please leave a comment; I would love to learn from your experiences. The other tools like CodePipeline and CodeDeploy definitely looked interesting, so I need to find some time to play around with them.

GDPR

Next up was a talk on GDPR. The room was quite packed. I didn’t expect that though, because everybody should be GDPR compliant by now, right? 🙂 Well, not really. Companies are still implementing changes to be compliant with GDPR. The talk by Christian Hesse looked at different aspects of GDPR like:

  • The right to data portability
  • The right to be forgotten
  • Privacy by design
  • Data breach notification

He also talked about the shared responsibility model when it comes to being GDPR compliant. AWS as the processor of personal data and the company using AWS being the controller are both responsible for making sure data stays safe. GDPR is a hot topic and I guess it will stay so for the rest of the year at least. It’s something that we as engineers will always need to keep in the back of our minds while developing new applications or features.

Serverless

In the afternoon I also attended a talk on Serverless by Prakash Palanisamy (Solutions Architect, Amazon Web Services) and Joachim den Hertog (Solutions Architect, ReSnap / Albelli). This presentation gave a nice overview of Serverless and Step Functions, but also showed new improvements like the Serverless Application Repository, safe Serverless deployments and incremental deployments. Joachim gave some insights into how Albelli was using Serverless and Machine Learning on the AWS platform for their online photo book creator application called ReSnap.

Unfortunately I had to leave early, so I missed the end of the Serverless talk and the last breakout session, but all in all AWS Summit Benelux was a very nice experience with some interesting customer cases and architectures. For a ‘free’ event it was amazingly well organized; I learned some new things and had a chance to speak with some people about how they use AWS. It has triggered me to spend some more time with AWS and its services. Let’s see what interesting things I can do on the next Luminis TechDay.

Build On!


Tracing APIs: Combining Spring’s Sleuth, Zipkin & ELK

Tracing bugs, errors and the cause of lagging performance can be cumbersome, especially when functionality is distributed over multiple microservices. In order to keep track of these issues, the usage of an ELK stack (or any similar system) is already a big step forward in creating a clear overview of all the processes of a service and finding these issues. Often bugs can be traced far more easily by using ELK than by using a plain log file, if one is even available. This approach can be optimized further; for example, you may want to see the trace logging for one specific event only. (more…)


Providing developers and testers with a representative database using Docker

When developing or testing, having a database that comes pre-filled with relevant data can be a big help with implementation or scenario walkthroughs. However, often there is only a structure dump of the production database available, or less. This article outlines the process of creating a Docker image that starts a database and automatically restores a dump containing a representative set of data. We use the PostgreSQL database in our examples, but the process outlined here can easily be applied to other databases such as MySQL or Oracle SQL.

Dumping the database

I will assume that you have access to a database that contains a copy of production or an equivalent representative set of data. You can dump your database using Postgres’ pg_dump utility:

pg_dump --dbname=postgresql://user@localhost:5432/mydatabase \
        --format=custom --file=database.dmp

We will be using the custom format option to create the dump file. This gives us a file that can easily be restored with the pg_restore utility later on and ensures the file is compressed. In case of larger databases, you may also wish to exclude certain database elements from your dump. In order to do so you have the following options:

  • The --exclude-table option, which takes a string pattern and ensures any tables matching the pattern will not be included in the dump file.
  • The --schema option, which restricts our dump to particular schemas in the database. It may be a good idea to exclude pg_catalog – this schema contains among other things the table pg_largeobject, which contains all of your database’s binaries.

See the Postgres documentation for more available options.

Distributing the dump among users

For the distribution of the dump, we will be using Docker. Postgres, MySQL and even Oracle provide you with prebuilt Docker images of their databases.

Example 1: A first attempt
In order to start an instance of a Postgres database, you can use the following docker run command to start a container based on Postgres:

docker run -p 5432:5432 --name database-dump-container \
           -e POSTGRES_USER=user -e POSTGRES_PASSWORD=password \
           -e POSTGRES_DB=mydatabase -d postgres:9.5.10-alpine

This starts a container named database-dump-container that can be reached at port 5432 with user:password as the login. Note the usage of the 9.5.10-alpine tag. This ensures that the Linux distribution that we use inside the Docker container is Alpine Linux, a distribution with a small footprint. The whole Docker image will take up about 14 MB, while the regular 9.5.10 tag would require 104 MB. We are pulling the image from Docker Hub, a public Docker registry where various open source projects host their Docker images.

Having started our Docker container, we can now copy our dump file into it. We first use docker exec to execute a command against the container we just made. In this case, we create a directory inside the Docker container:

docker exec -i database-dump-container mkdir -p /var/lib/postgresql/dumps/

Following that, we use docker cp to copy the dump file from our host into the container:

docker cp database.dmp database-dump-container:/var/lib/postgresql/dumps/

After this, we can restore our dump:

docker exec -i database-dump-container pg_restore --username=user --verbose \
                                                  --exit-on-error --format=custom \
                                                  --dbname=mydatabase /var/lib/postgresql/dumps/database.dmp

We now have a Docker container with a running Postgres instance containing our data dump. In order to actually distribute this, you will need to get it into a Docker repository. If you register with Docker Hub you can create public repositories for free. After creating your account you can log in to the registry that hosts your repositories with the following command:

docker login docker.io

Enter the username and password for your Docker Hub account when prompted.

Having done this, we are able to publish our data dump container as an image, using the following commands:

docker commit database-dump-container my-repository/database-dump-image
docker push my-repository/database-dump-image

Note that you are able to push different versions of an image by using Docker image tags.

The image is now available to other developers. It can be pulled and run on another machine using the following commands:

docker pull my-repository/database-dump-image
docker run -p 5432:5432 --name database-dump-container \
           -e POSTGRES_USER=user -e POSTGRES_PASSWORD=password \
           -e POSTGRES_DB=mydatabase -d my-repository/database-dump-image

All done! Or are we? After we run the container based on the image, we still have an empty database. How did this happen?

Example 2: Creating your own Dockerfile
It turns out that the Postgres Docker image uses Docker volumes. This separates the actual image from data and ensures that the size of the image remains reasonable. We can view what volumes Docker has made for us by using docker volume ls. These volumes can be associated with more than one Docker container and will remain, even after you have removed the container that initially spawned the volume. If you would like to remove a Docker container, including its volumes, make sure to use the -v option:

docker rm -v database-dump-container

Go ahead and execute the command; we will be recreating the container in the following steps.

So how can we use this knowledge to distribute our database including our dump? Luckily, the Postgres image provides for exactly this situation. Any scripts that are present in the Docker container under the directory /docker-entrypoint-initdb.d/ will be executed automatically upon starting a new container. This allows us to add data to the Docker volume upon starting the container. In order to make use of this functionality, we are going to have to create our own image using a Dockerfile that extends the postgres:9.5.10-alpine image we used earlier:

FROM postgres:9.5.10-alpine

RUN mkdir -p /var/lib/postgresql/dumps/
ADD database.dmp /var/lib/postgresql/dumps/
ADD initialize.sh /docker-entrypoint-initdb.d/

The contents of initialize.sh are as follows:

pg_restore --username=user --verbose --exit-on-error --format=custom \
           --dbname=mydatabase /var/lib/postgresql/dumps/database.dmp

We can build and run this Dockerfile by navigating to its directory and then executing:

docker build --rm=true -t database-dump-image .
docker run -p 5432:5432 --name database-dump-container \
           -e POSTGRES_USER=user -e POSTGRES_PASSWORD=password \
           -e POSTGRES_DB=mydatabase -d database-dump-image

After starting the container, inspect its progress using docker logs -f database-dump-container. You can see that upon starting the container, our database dump is being restored into the Postgres instance.
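
At this point developers can simply point their tools or application at localhost:5432. As a quick sanity check you could, for example, run something like the snippet below; the customers table is made up, so use any table you know exists in your dump, and make sure the org.postgresql:postgresql JDBC driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical smoke test that verifies the restored dump is reachable and contains data.
public class DatabaseDumpSmokeTest {

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/mydatabase";
        try (Connection connection = DriverManager.getConnection(url, "user", "password");
             Statement statement = connection.createStatement();
             // "customers" is an example table name; replace it with one from your own dump.
             ResultSet resultSet = statement.executeQuery("SELECT count(*) FROM customers")) {
            resultSet.next();
            System.out.println("Rows restored into customers: " + resultSet.getLong(1));
        }
    }
}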

We can now publish the image again using the earlier steps, and the image is available for use.

Conclusions and further reading

While working through this article, you have used a lot of important concepts within Docker. The first example demonstrated the usage of images and containers, combined with the exec and cp commands that are able to interact with running containers. We then demonstrated how you can publish a Docker image using Docker Hub, after which we’ve shown you how to build and run your own custom-made image. We have also touched upon some more complex topics such as Docker volumes.

After this you may wish to consult the Docker documentation to further familiarize yourself with the other commands that Docker offers.

This setup still leaves room for improvement – the current process involves quite a lot of manual work, and we’ve coupled our container with one particular database dump. Please refer to the GitHub project for automated examples of this process.