Tag Archive: database

Setting up data analytics pipeline: the best practices

The picture is courtesy of https://bit.ly/2K44Nk5 1 Datapipeline Architect Example

In a data science analogy with the automotive industry, the data plays the role of the raw-oil which is not yet ready for combustion. The data modeling phase is comparable with combustion in the engines and data preparation is the refinery process turning raw-oil to the fuel i.e., ready for combustion. In this analogy data analytics pipeline includes all the steps from extracting the oil up to combustion, driving and reaching to the destination (analogous to reach the business goals). As you can imagine, the data (or oil in this analogy) goes through a various transformation and goes from one stage of the process to another. But the question is what is the best practice in terms of data format and tooling? Although there are many tools that make the best practice sometimes very use-case specific but generally JSON is the best practice for the data-format of communication or the lingua franca and Python is the best practice for orchestration, data preparation, analytics and live production.

What is the common inefficiency and why it happens?

The current inefficiency is overusing of tabular (csv-like) data-formats for communication or lingua franca. I believe data scientists still overuse the structured data types for communication within data analytics pipeline because of standard data-frame-like data formats offered by major analytic tools such as Python and R. Data scientists start getting used to data-frame mentality forgetting the fact that tabular storage of the data is a low scale solution, not optimized for communication and when it comes to bigger sets of data or flexibility to add new fields to the data, data-frames and their tabular form are non-efficient.

DataOps Pipeline and Data Analytics

A very important aspect for analytics being ignored in some circumstances is going live and getting integrated with other systems. DataOps is about setting up a set of tools from capturing data, storing them up to analytics and integration, falling into an interdisciplinary realm of the DevOps, Data Engineering, Analytics and Software Engineering (Hereinafter I use data analytics pipeline and DataOps pipeline interchangeably.) The modeling part and probably some parts in data prep phases need a data-frame like data format but the rest of the pipeline is more efficient and robust if is JSON native. It allows adding/removing features easier and is a compact form for communication between modules.

The picture is courtesy of https://zalando-jobsite.cdn.prismic.io/zalando-jobsite/2ed778169b702ca83c2505ceb65424d748351109_image_5-0d8e25c02668e476dd491d457f605d89.jpg 2

The role of Python

Python is a great programming language used not only by the scientific community but also the application developers. It is ready to be used as back-end and by combining it with Django you can build up full-stack web applications. Python has almost everything you need to set up a DataOps pipeline and is ready for integration and live production.

Python Example: transforming CSV to JSON and storing it in MongoDB

To show some capabilities of Python in combination with JSON, I have brought a simple example. In this example, a dataframe is converted to JSON (Python dictionaries) and is stored in MongoDB. MongoDB is an important database in today’s data storage as it is JSON native storing data in a document format bringing high flexibility .

<br />### Loading packages

from pymongo import MongoClient import pandas as pd

# Connecting to the database

client = MongoClient('localhost', 27017)

# Creating database and schema

db = client.pymongo_test posts = db.posts

# Defining a dummy dataframe

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])

# Transforming dataframe to a dictionary (JSON)


# Writing to the database

result = posts.insert_one(dic) print('One post: {0}'.format(result.inserted_id))

The above example shows the ability of python in data transformation from dataframe to JSON and its ability to connect to various tooling (MongoDB in this example) in DataOps pipeline.


This article is an extension to my previous article on future of data science (https://bit.ly/2sz8EdM). In my earlier article, I have sketched the future of data science and have recommended data scientists to go towards full-stack. Once you have a full stack and various layers for DataOps / data analytics JSON is the lingua franca between modules bringing robustness and flexibility for this communication and Python is the orchestrator of various tools and techniques in this pipeline.

1: The picture is courtesy of https://cdn-images-1.medium.com/max/1600/1*8-NNHZhRVb5EPHK5iin92Q.png 2: The picture is courtesy of https://zalando-jobsite.cdn.prismic.io/zalando-jobsite/2ed778169b702ca83c2505ceb65424d748351109_image_5-0d8e25c02668e476dd491d457f605d89.jpg

Providing developers and testers with a representative database using Docker

When developing or testing, having a database that comes pre-filled with relevant data can be a big help with implementation or a scenario walk-through. However, often there is only a structure dump of the production database available, or less. This article outlines the process of creating a Docker image that starts a database and automatically restores a dump containing a representative set of data. We use the PostgreSQL database in our examples, but the process outlined here can easily be applied to other databases such as MySQL or Oracle SQL.

Dumping the database

I will assume that you have access to a database that contains a copy of production or an equivalent representative set of data. You can dump your database using Postgres’ pg_dump utility:

 pg_dump --dbname=postgresql://user@localhost:5432/mydatabase \ --format=custom --file=database.dmp 

We will be using the custom format option to create the dump file with. This gives us a file that can easily be restored with the pg_restore utility later on and ensures the file is compressed. In case of larger databases, you may also wish to exclude certain database elements from your dump. In order to do so you have the following options:

  • The --exclude-table option, which takes a string pattern and ensures any tables matching the pattern will not be included in the dump file.
  • The --schema option, which restricts our dump to particular schemas in the database. It may be a good idea to exclude pg_catalog – this schema contains among other things the table pg_largeobject, which contains all of your database’s binaries.

See the Postgres documentation for more available options.

Distributing the dump among users

For the distribution of the dump, we will be using Docker. Postgres, MySQL and even Oracle provide you with prebuilt Docker images of their databases.

Example 1: A first attempt
In order to start an instance of a Postgres database, you can use the following docker run command to start a container based on Postgres:

 docker run -p 5432:5432 --name database-dump-container \ -e POSTGRES_USER=user -e POSTGRES_PASSWORD=password \ -e POSTGRES_DB=mydatabase -d postgres:9.5.10-alpine 

This starts a container named database-dump-container that can be reached at port 5432 with user:password as the login. Note the usage of the 9.5.10-alpine tag. This ensures that the Linux distribution that we use inside the Docker container is Alpine Linux, a distribution with a small footprint. The whole Docker image will take up about 14 MB, while the regular 9.5.10 tag would require 104 MB. We are pulling the image from Docker Hub, a public Docker registry where various open source projects host their Docker images.

Having started our Docker container, we can now copy our dump file into it. We first use docker exec to execute a command against the container we just made. In this case, we create a directory inside the Docker container:

docker exec -i database-dump-container mkdir -p /var/lib/postgresql/dumps/

Following that, we use docker cp to copy the dump file from our host into the container:

docker cp database.dmp database-dump-container:/var/lib/postgresql/dumps/

After this, we can restore our dump:

 docker exec -i database-dump-container pg_restore --username=user --verbose \ --exit-on-error --format=custom \ --dbname=mydatabase /var/lib/postgresql/dumps/database.dmp 

We now have a Docker container with a running Postgres instance containing our data dump. In order to actually distribute this, you will need to get it into a Docker repository. If you register with Docker Hub you can create public repositories for free. After creating your account you can login to the registry that hosts your repositories with the following command:

docker login docker.io

Enter the username and password for your Docker Hub account when prompted.

Having done this, we are able to publish our data dump container as an image, using the following commands:

 docker commit database-dump-container my-repository/database-dump-image<br />
docker push my-repository/database-dump-image 

Note that you are able to push different versions of an image by using Docker image tags.

The image is now available to other developers. It can be pulled and ran on another machine using the following commands:

 docker pull my-repository/database-dump-image docker run -p 5432:5432 --name database-dump-container \ -e POSTGRES_USER=user -e POSTGRES_PASSWORD=password \ -e POSTGRES_DB=mydatabase -d my-repository/database-dump-image 

All done! Or are we? After we run the container based on the image, we still have an empty database. How did this happen?

Example 2: Creating your own Dockerfile
It turns out that the Postgres Docker image uses Docker volumes. This separates the actual image from data and ensures that the size of the image remains reasonable. We can view what volumes Docker has made for us by using docker volume ls. These volumes can be associated with more than one Docker container and will remain, even after you have removed the container that initially spawned the volume. If you would like to remove a Docker container, including its volumes, make sure to use the -v option:

docker rm -v database-dump-container

Go ahead and execute the command, we will be recreating the container in the following steps.

So how can we use this knowledge to distribute our database including our dump? Luckily, the Postgres image provides for exactly this situation. Any scripts that are present in the Docker container under the directory /docker-entrypoint-initdb.d/ will be executed automatically upon starting a new container. This allows us to add data to the Docker volume upon starting the container. In order to make use of this functionality, we are going to have to create our own image using a Dockerfile that extends the postgres:9.5.10-alpine image we used earlier:

FROM postgres:9.5.10-alpine

RUN mkdir -p /var/lib/postgresql/dumps/
ADD database.dmp /var/lib/postgresql/dumps/
ADD intialize.sh /docker-entrypoint-initdb.d/

The contents of initialize.sh are as follows:

 pg_restore --username=user --verbose --exit-on-error --format=custom \ --dbname=mydatabase /var/lib/postgresql/dumps/database.dmp 

We can build and run this Dockerfile by navigating to its directory and then executing:

 docker build --rm=true -t database-dump-image .<br />
docker run -p 5432:5432 --name database-dump-container \ -e POSTGRES_USER=user -e POSTGRES_PASSWORD=password -e POSTGRES_DB=mydatabase -d database-dump-image 

After starting the container, inspect its progress using docker logs -f database-dump-container. You can see that upon starting the container, our database dump is being restored into the Postgres instance.

We can now again publish the image using the earlier steps, and the image is available for usage.

Conclusions and further reading

While working through this article, you have used a lot of important concepts within Docker. The first example demonstrated the usage of images and containers, combined with commands exec and cp that are able to interact with running containers. We then demonstrated how you can publish a Docker image using Docker Hub, after which we’ve shown you how to build and run your own custom made image. We have also touched upon some more complex topics such as Docker volumes.

After this you may wish to consult the Docker documentation to further familiarize yourself with the other commands that Docker offers.

This setup still leaves room for improvement – the current process involves quite a lot of handwork, and we’ve coupled our container with one particular database dump. Please refer to the Github project for automated examples of this process.