


How to Install PySpark and Apache Spark on macOS

Here is an easy step-by-step guide to installing PySpark and Apache Spark on macOS.

Step 1: Get Homebrew

Homebrew makes installing applications and languages on macOS a lot easier. You can get Homebrew by following the instructions on its website.

In short you can install Homebrew in the terminal using this command:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Step 2: Installing xcode-select

Xcode is a large suite of software development tools and libraries from Apple. In order to install Java and Spark through the command line, we will probably need to install xcode-select.

Use the below command in your terminal to install xcode-select: xcode-select --install

You will usually get a prompt asking you to confirm the installation; click "Install" to go further with the installation.

Step 3: DO NOT use Homebrew to install Java!

The latest version of Java (at the time of writing this article) is Java 10, and Apache Spark does not officially support Java 10 yet. Homebrew will install the latest version of Java, and that causes many issues!

To install Java 8, please go to the official website: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Then, from "Java SE Development Kit 8u191", choose:

Mac OS X x64 245.92 MB jdk-8u191-macosx-x64.dmg

to download Java. Once it is downloaded, please go ahead and install it locally.

Step 4: Use Homebrew to install Apache Spark

To do so, go to your terminal and type: brew install apache-spark. Homebrew will now download and install Apache Spark; it may take some time depending on your internet connection. You can check the version of Spark using the below command in your terminal: pyspark --version

You should then see output reporting the installed Spark version.

Step 5: Install PySpark and findspark in Python

To be able to use PySpark locally on your machine, you need to install findspark and pyspark.

If you use Anaconda, use the below commands:

#Find Spark Option 1: 
     conda install -c conda-forge findspark 
#Find Spark Option 2: 
     conda install -c conda-forge/label/gcc7 findspark 
#PySpark: 
     conda install -c conda-forge pyspark

If you use regular Python, use pip install as: 
     pip install findspark 
     pip install pyspark

Step 6: Your first code in Python

After the installation is complete, you can write your first hello-world script:

    import findspark
    findspark.init()  # let Python find the local Spark installation

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    sc = SparkContext(appName="MyFirstApp")
    spark = SparkSession(sc)
    print("Hello World!")
    sc.stop()  # stopping the Spark context when done
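If you want to check that Spark itself works, and not just the imports, you can build a tiny DataFrame before the sc.stop() call. This is only a sketch with made-up sample data:

    from pyspark.sql import Row

    # A two-row DataFrame -- if this prints a small table, Spark is running fine.
    df = spark.createDataFrame([Row(name="Alice", age=34), Row(name="Bob", age=45)])
    df.show()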

The Forgotten Step in CRISP-DM and ASUM-DM Methodologies

CRISP-DM stands for the Cross-Industry Standard Process for Data Mining, an open standard for data mining that has existed since 1999. CRISP-DM suggests a set of steps for performing data mining projects in order to maximize the chance of success and minimize the common faults that occur in data-oriented projects. Later, in 2015, an extended version of CRISP-DM was proposed by IBM, called ASUM-DM (the Analytics Solutions Unified Method). ASUM-DM is an extension of CRISP-DM with the same data mining (development) steps plus an operational/deployment part.

I am personally very much a fan of CRISP-DM and ASUM-DM. In my daily consultancy life, I stick to the steps they provide because doing so minimizes the risk of project failure. I believe that properly following the CRISP-DM and ASUM-DM methodologies distinguishes a senior data scientist from a junior one. Many data scientists/data miners have the tendency to quickly model the data to reach insights, ignoring a proper understanding of the problem and the right data preparation. That is why CRISP-DM comes with clear steps that, when followed, minimize the common failures in data science/data mining projects. Having been a data miner and later a data scientist for over 12 years, I believe CRISP-DM misses one crucial step. With this article, I intend to add a new step to CRISP-DM/ASUM-DM, a step that comes from years of experience in data science.

CRISP-DM Methodology

CRISP-DM suggests these steps for data mining/data science:

  1. Business understanding: the data scientist should properly understand the business of his/her client. Why is analytics important to them? How can analytics be of great value for the business? And so on.
  2. Data understanding: the data scientist should go through all the fields within the data to understand the data like a domain expert. With a poor understanding of the data, a data scientist can barely provide high-quality data science solutions.
  3. Data preparation: the most time-consuming step in any data science project, preparing the data in a form that a model can ingest and understand.
  4. Modeling: the magical phase turning the raw data into (actionable) insights. With recent advances in data science and tooling such as AutoML and deep learning, modeling is less complicated than before.
  5. Evaluation: checking the accuracy of the model with metrics such as a confusion matrix, RMSE, MAPE, and MdAPE.
  6. Deployment: making use of the model with new data.

As you can see in the picture, CRISP-DM is an iterative approach, which matches quite well with the agile methodology. The steps can be taken in parallel, and they are flexible enough to be redone quickly once there is a modification in any previous step.

ASUM-DM methodology:

ASUM-DM adds a new deployment/operations wing to CRISP-DM. The development phase stays the same as in CRISP-DM; in deployment, however, new facets are added, such as collaboration, version control, security, and compliance.

The forgotten step in CRISP-DM and ASUM-DM:

CRISP-DM repeats itself in ASUM-DM as the development part; however, it misses an important step: data validation. My version of CRISP-DM looks like this.

Why data validation?

Data validation happens immediately after data preparation/wrangling and before modeling. During data preparation there is a high chance of things going wrong, especially in complex scenarios. Data validation ensures that modeling happens on the right data: faulty data as input to the model will generate faulty insights!

How is data validation done?

Data validation should be done by involving at least one external person who has a proper understanding of the data and the business. In my situation, it is usually my clients who are technically good enough to check my data. Once I have gone through data preparation, and just before modeling, I usually create data visualizations and hand my newly prepared data to the client. The client, with the help of SQL queries or other tools, tries to validate that my output contains no errors. Combining CRISP-DM/ASUM-DM with the agile methodology, steps can be taken in parallel, meaning you do not have to wait for the green light from data validation to start modeling. But once you get feedback from the domain expert that there are faults in the data, you need to correct them by redoing the data preparation and remodeling the data.
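To make this concrete, here is a minimal sketch (in Python with pandas) of the kind of automated checks one could run before handing the prepared data over; the column names and the 5% threshold are hypothetical, purely for illustration:

import pandas as pd

def validate_prepared_data(df: pd.DataFrame) -> list:
    """Return a list of validation issues found in the prepared dataset."""
    issues = []
    if df.empty:
        issues.append("dataset is empty")
    # Hypothetical expectations about the prepared data:
    if df["customer_id"].duplicated().any():
        issues.append("duplicate customer_id values found")
    if df["revenue"].lt(0).any():
        issues.append("negative revenue values found")
    null_share = df.isna().mean()
    for column, share in null_share[null_share > 0.05].items():
        issues.append(f"column '{column}' is {share:.0%} missing")
    return issues

prepared = pd.DataFrame({"customer_id": [1, 2, 2], "revenue": [100.0, -5.0, 30.0]})
for issue in validate_prepared_data(prepared):
    print("VALIDATION:", issue)

Such checks do not replace the domain expert's review; they just catch the obvious mistakes before you waste the expert's time.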

What are the common causes leading to a faulty output from data preparation?

Common causes are:
1. A lack of proper understanding of the data, so the logic of the data preparation is not correct.
2. Common bugs in the programming/data preparation pipeline that lead to a faulty output.
3. Data formats that cause trouble within the data preparation step, generating faulty outputs with no error trace for the data scientist/engineer to catch during preparation.

Conclusion:

In this article, I have extended CRISP-DM/ASUM-DM by adding a new step. The whole idea of these methodologies is to formalize the steps that help data scientists/data miners improve the success of their projects and reduce failures. In my version of CRISP-DM, a "data validation" step is added, which further increases the chance of success and further reduces the failures and faults of data science/data mining projects.


What I’ve Learned About Kubernetes

So, I recently finished a course about Kubernetes (and Docker and Google Cloud): Scalable Microservices with Kubernetes. Unfortunately, it’s only been a couple of days since then, and I’m already not exactly sure about what I learned.

To start with, here are some basics:

  1. Docker manages containers.
  2. Kubernetes manages containers (like Docker does) and is declarative (more about that later).
  3. Google Cloud has a similar set of services to AWS, but of course with different names.

Lastly, I’m more a fan of the command-line than of GUIs so there will be more of a focus on commands.

 


Docker

Docker manages containers. I originally thought that a container was simply a virtual instance of a computer (in other words a virtual machine or VM). However, that’s incorrect with regards to how a container is actually used.

If you use containers as if they were VMs, then that makes it hard to elegantly and efficiently build systems with or from containers. A VM has all of the services and resources that a normal computer has: does a container have all of those resources? A VM is “the world an application lives in”. Is a container a “full world”?

There were 2 things I read that changed my view about containers:

  1. A (well-known) post on Hacker News about how Docker contains an OS.
  2. A paper by Burns and Oppenheimer (Google) about “Design patterns for container-based distributed systems“.

I would recommend reading both, but for those of you who just want the raw info, here it is:

  1. Containers do not need to contain a full operating system (there goes my VM analogy!)
  2. Development with containers has reached the point that containers are being used as units — like lego-blocks — in design patterns.

So, what are containers?

Well, sure, what does Wikipedia say? Wiki says that containers are, in essence, “an operating system feature in which the kernel allows the existence of multiple isolated user-space instances“. (If you have a background in operating systems or linux, then reading about cgroups is also interesting).

Of course, it’s my blog post, so I’ll write about it if I want to. What would I say that containers are? Hmm…

Containers are containers. Hahahaha.. No, really. Containers are a place for (a piece of) software. And, to paraphrase Larry Wall (of Perl fame), that piece of software in your container is probably communicating with something else, so containers always have network interfaces or access to a shared/external filesystem.

In short, containers are powerful because you can use a container to both isolate (or “make modular”) a piece of software and run a piece of software. In other words, this is the next level of modularization (and thus reuse) and execution. Other similar mechanisms to a container in the past are:

  • compilers (which created a binary that could be run)
  • Java (isolation from the operating system)
  • Service Oriented Architecture (isolation/modularization per computer, group of services)
  • Microservices (isolation/modularization per function/domain)

Inherent in the modularization + execution model is increased scalability.

One of the key takeaways here is that you should not be putting multiple applications/pieces of software in 1 container. In fact, it seems like this might be an anti-pattern and “tight-coupling” in essence.

Actual Docker use

In terms of actual facts, commands and code that I learned:

  • Docker images are described in Dockerfiles.
    • Actually, a Dockerfile basically[1] just contains
      1. the names of what you’re using,
      2. the commands to set it up (RUN, ENV, EXPOSE, etc..) and
      3. the commands to run the actual “contained” application (CMD).
  • docker run runs the container, docker ps shows the docker processes, docker stop stops the container, and docker rm removes the container from the system.
  • There are a couple of sites that host Dockerfiles: you only store 1 Dockerfile per “repository”, although the versions may differ (but not the name).

Here’s an example Dockerfile:

FROM java:8
COPY MyJavaClass.java .
RUN javac MyJavaClass.java
CMD ["java", "MyJavaClass"]

[1] A Dockerfile contains more than just that, but for simplicity’s sake..

 


Kubernetes

Kubernetes makes me a little bit sad. That’s how awesome it is. What especially makes me sad about Kubernetes is that it’s declarative.

If you look back at the big picture of what we as coders (sorry, “software engineers”) have been working on for the last 30 years, we’ve been 1. building systems for other people and 2. making it easier for us to do (1) faster.

So (2), “making it easier for ourselves”, is great for other people and it’s great for building stuff quickly. However, in my opinion, most of the really hard (and interesting) problems are part of (2). And Kubernetes solves one more problem that future generations will never really have to solve again, which is a little sad (but a lot awesome).

Kubernetes solves scaling with containers[2].

Before we go further, if you’ve never looked at D3.js, then go do that first. You’ve already read this far and deserve a break. No, really, go away. Come back later if you’re still interested. Yes, I mean it. BYE!

[2] I’m officially calling this the first occurrence of Rietveld’s theory: the fewer words it takes to explain something in non-technical language, the more complex and impressive that thing is.

Declarative programming

You’re back! Where was I? Uhm.. Kubernetes! and D3.js! Wait, What?!?

So, if you’re used to most (imperative) programming languages, whether it’s Python, bash scripts or Java, then you’re used to spelling out everything:

int s = 0;
for (int i = 0; i < 10; ++i) {
  s += i;
}

Declarative thinking means that you don’t say what you want to do, you just say what you want to do! (Hahahaha.. ). Okay, what I actually mean is that you specify the requirements of the task as opposed to describing the steps of the task. It’s a paradigm shift.

D3.js was the first time I explicitly ran into this way of coding: you specify what you want from D3 instead of how D3 should do it. Then I also learned that SQL was declarative as well. Oh.. Duh.

Back to Kubernetes: you tell Kubernetes what you want, not how K8s should do it. (Yeah, people use K8s as an abbreviation for Kubernetes.) So, for example, that K8s should create a load balancer in front of 3 (identical) instances of a specific container. You put that in a Kubernetes config file (in YAML format), and it does that.

What’s impressive to me is the amount of network “magic” that Kubernetes is doing under-the-hood. The problems Kubernetes solves are both hard and relatively new, which says something about how much research Google has been doing in the last 2 decades.

Pods and Services…

Kubernetes pods are groupings of containers. In Burns and Oppenheimer’s paper (about container design patterns), they write that multi-container patterns that take place within one “node” are equivalent to a Kubernetes pod. An example is the Sidecar pattern: 1 pod containing 2 containers, 1 container with a web server and 1 container that streams the web server’s logs to somewhere else.

Kubernetes services are basically a way to communicate between groups of pods. (Sort of like how an ESB makes sure that different components/webservices can easily communicate with each other.)

Maybe a quick example of a service configuration file will help:

kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    myLabel: MyKey
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376

Oops, I forgot to mention that K8s also has this concept of labels, which are key-value pairs you can attach to pods (among other things). So the my-service service will route communication (messages, packets, etc.) to all of the pods that have the myLabel label with a value of MyKey. Of course, there are other types of services (and thus other ways to define them), but you can go read up on that yourself!

…and Deployments

 

Lastly, Kubernetes also has the concept of deployments. Remember that I said that Kubernetes was declarative? A deployment is a little bit like a firewall rule. Deployments are mostly used to manage the clustering/load-balancing of pods.

The documentation states: “You describe a desired state in a Deployment object, and the Deployment controller changes the actual state to the desired state at a controlled rate“. It also helpfully lists some use cases for when you would use a deployment.

The reason that I compared a K8s deployment to a firewall rule is that the deployment is not only applied, Kubernetes remembers that you’ve specified this. So if you then start to do things with Pods or Services that don’t match up to what you specified in your Deployment, you’ll run into problems.

One last tip: Kubernetes calls clusters of pods “replica sets“: you have multiple “replicas” or (identical) copies of a pod in a cluster, which makes up a.. replica set. You can thus describe what type of replica set you want in your K8s deployment.
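As a rough sketch (the names are made up, reusing the placeholder image name and label from the examples above), a deployment that asks for 3 identical copies of a pod could look something like this:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: my-deployment
spec:
  replicas: 3                  # desired state: a replica set of 3 identical pods
  selector:
    matchLabels:
      myLabel: MyKey
  template:
    metadata:
      labels:
        myLabel: MyKey         # the pods get the label that my-service selects on
    spec:
      containers:
      - name: my-container
        image: myDockerImageNameAndVersion
        ports:
        - containerPort: 9376

You would apply it with kubectl apply -f my-deployment.yaml (see the commands below), and from then on the Deployment controller keeps 3 replicas running.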

Actual Kubernetes Use

So, how would you actually use all of this?

  • kubectl is our main command.
  • kubectl apply -f my-deployment.yaml applies a deployment configuration (described in the my-deployment.yaml file)
  • If you have a docker image and just want to run it via K8s, then you can do kubectl run my-k8s-depl-name --image=myDockerImageNameAndVersion. This command also specifies a deployment.
    • kubectl run takes all sorts of other options; for example, you can use --replicas=<num-replicas> to specify how many replicas you want in your deployment.
  • If you want more granular control over your K8s resources, you can also use kubectl create to create “resources” (pods, services, etc.).
    • You would usually do something like this: kubectl create -f my-service-config.yaml.
    • Here’s a slightly more complex tutorial that covers that.
  • Of course, it’s also helpful to be able to get information about the actual status of your K8s clusters. You can use the following commands to do this:
    • You can use kubectl get to get specific information. For example, kubectl get pods or kubectl get services to list all pods or services.
    • For more in-depth information, use the kubectl describe command. For example, kubectl describe pods/nginx.
  • Lastly, there is of course kubectl delete (which deletes stuff..).

There are a bunch more commands you can read about here.

Lastly, there’s a funny but very interesting talk by Kelsey Hightower here that uses Tetris to explain Kubernetes.


 

The Google Cloud Platform

Google Cloud Services

To start with, I haven’t actually worked with any cloud providers yet, so this was my first hands-on experience with one.

It seems to me like most cloud providers are now grouping their services into 4 types:

  1. “Functions” or “Lambdas” which can be thought of as pieces of code.
    • AWS uses “Lambda” to describe their service.
    • Google Cloud calls theirs “Functions“.
  2. “Platforms” which, for us Java developers, roughly mean “application containers”. For example, a Tomcat or Jetty instance.
  3. “Containers as a Service”, which is a cloud service to manage Docker and Kubernetes resources.
    • AWS has ECS (Elastic Container Service) and EKS (Elastic Kubernetes Service).
    • The Google Cloud service is called the “Kubernetes Engine“.
  4. And, of course, “Infrastructure as a Service” which is simply Virtual Machines in the cloud.
    • AWS has EC2 (Elastic Compute Cloud).
    • The Google Cloud service is called the “Compute Engine“.

Using Google Cloud

Right.

While there are a bunch of different menus available in the Google Cloud console, one of the primary ways to interact with your Google Cloud resources is via the “Google Cloud Shell”. In short, this is an in-browser shell that seems to run on a (virtual) Linux machine.

In the Google Cloud Shell, you have your own home directory as well as a fairly standard Linux path. Of course, there are some other commands available, such as the gcloud, kubectl and docker commands. To tell the truth, it all seems to work magically. By that, what I really mean is that it’s not clear to me why those commands “just work” in the Google Cloud Shell, but they do.

There were only really 4 (or 5) commands I used with Google Cloud in the shell:

  • gcloud compute instances create <instance-name> [OPTIONS] creates a Google Compute Engine instance, which is a VM.
    • Among other options, you can specify what type of OS you want with various options.
  • gcloud compute ssh <instance-name> allows you to ssh to the Compute Engine instance.
  • You need gcloud compute zones list and gcloud config set compute/zone <zone> to list the available zones and to set the compute zone (the data-center location) that new instances are created in.
  • Lastly, for Kubernetes, I used gcloud container clusters create <kubernetes-cluster-name> to create a Google Kubernetes Engine instance (a.k.a. a Kubernetes cluster on the Google Cloud).

 

And folks, that’s all I wrote!




Techday – Smart mirror

A few months ago we had a techday at Luminis Amsterdam. These days are great for exploring, learning, innovating or just doing something cool. For a while now I have been reading about building a smart mirror with a Raspberry Pi, so my idea for this techday was to find out what we would need to do to create a smart mirror for our new office. Ideas went all over the place: we need facial recognition, machine learning, some sort of custom greeting message.. Well, enough enthusiasm!
Since creating a smart mirror also involves some carpentry, we decided to focus on the software part first 😉



Fixing the long startup time of my Java application running on macOS Sierra

At my current project, we’re developing an application based on Spring Boot. During my normal development cycle, I always start the application from within IntelliJ by means of a run configuration that deploys the application to a local Tomcat container. Spring Boot applications can run perfectly fine with an embedded container, but since we deploy the application within a Tomcat container in our acceptance and production environments, I always stick to the same deployment manner on my local machine.

After joining the project in March one thing always kept bugging me. When I started the application with IntelliJ, it always took more than 60 seconds to start the deployed application, which I thought was pretty long given the size of the application. My teammates always said they found it strange as well, but nobody bothered to spend the time to investigate the cause.

Most of us run the entire application and its dependencies (MongoDB and Elasticsearch) on our own laptops, and the application requires no remote connections, so I always wondered what the application was doing during those 60+ seconds. Just leveraging the logging framework of the Spring Boot application gives you a pretty good insight into what’s going on during the launch of the application. In the log file, there were a couple of strange jumps in time that I wanted to investigate further. Let’s take a look at a snippet of the log:

2017-05-09 23:53:10,293 INFO - Bean 'integrationGlobalProperties' of type [class java.util.Properties] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2017-05-09 23:53:15,829 INFO - Cluster created with settings {hosts=[localhost:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
2017-05-09 23:53:15,830 INFO - Adding discovered server localhost:27017 to client view of cluster
2017-05-09 23:53:16,432 INFO - No server chosen by WritableServerSelector from cluster description ClusterDescription{type=UNKNOWN, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=localhost:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
2017-05-09 23:53:20,992 INFO - Opened connection [connectionId{localValue:1, serverValue:45}] to localhost:27017
2017-05-09 23:53:20,994 INFO - Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=457426}
2017-05-09 23:53:20,995 INFO - Discovered cluster type of STANDALONE
2017-05-09 23:53:21,020 INFO - Opened connection [connectionId{localValue:2, serverValue:46}] to localhost:27017
2017-05-09 23:53:21,293 INFO - Checking unique service notification from repository: 

Now what’s interesting about the above log is that it makes a couple of multi-second jumps. The first jump is after handling the bean ‘integrationGlobalProperties’. After about 5 seconds, the application logs an entry when it tries to set up a connection to a locally running MongoDB instance. I double-checked my settings, but you can see it’s really trying to connect to a locally running instance from the log messages stating it tries to connect to ‘localhost’ on ‘27017’.
A couple of lines down it makes another jump of about 4 seconds. In that line, it is still trying to set up the proper MongoDB connection. So it takes about 10 seconds in total to connect to a locally running (almost empty) MongoDB instance. That can’t be right?!

Figuring out what was going on wasn’t that hard. I just took a couple of thread dumps and did a small Google query, which led me to this post on the IntelliJ forum and this post on StackOverflow. Both posts point out a problem similar to mine: a ‘DNS problem’ with how ‘localhost’ is resolved. The time seems to be spent in java.net.InetAddress.getLocalHost(). The writers of both posts had delays of up to 5 minutes or so, which is definitely not workable and would have pushed me to look into this problem instantly. I guess I was ‘lucky’ it only took a minute on my machine.
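If you want to check whether your machine is affected before changing anything, a tiny stand-alone test (my own sketch, not taken from those posts) is to time the lookup directly:

import java.net.InetAddress;

public class LocalHostTimer {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        InetAddress localHost = InetAddress.getLocalHost(); // the call that is slow on affected machines
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Resolved " + localHost + " in " + elapsed + " ms");
    }
}

On an affected machine this takes multiple seconds; after the hosts file fix below it should return in a few milliseconds.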

Solving the problem is actually quite simple as stated in both posts. All you have to do is make sure that your /etc/hosts file also contains the .local domain entry for ‘localhost’ entries.

While inspecting my hosts file I noticed it did contain both entries for resolving localhost on both IPv4 and IPv6.

127.0.0.1 localhost
::1       localhost

However, it was missing the .local addresses, so I added those. If you’re unsure what your hostname is, you can get it quite easily from a terminal. Just use the hostname command:

$ hostname

and it should return something like:

Jeroens-MacBook-Pro.local

In the end, the entries in your host file should look something like:

127.0.0.1   localhost Jeroens-MacBook-Pro.local
::1         localhost Jeroens-MacBook-Pro.local

Now with this small change applied to my hosts file, the application starts within 19 seconds. That’s about a third of the time it needed before! Not bad for a 30-minute investigation. I wonder if this is related to an upgraded macOS or if it exists on a clean install of macOS Sierra as well. The good thing is that this fix will apply to other applications as well, not just Java applications.


Virtualizing your development environment with Vagrant and Ansible

Tools for virtualization and provisioning like Vagrant and Ansible have been around for a while now, but I haven’t met many people that use them for their everyday development work. So I am going to try to make some PR for Vagrant and Ansible.

I first started working in virtual machines because I wanted to be independent of the hardware my employer provided me with. At that time, we moved from a desktop computer, to a frequently crashing Windows laptop, to a perfectly good System76 laptop with Ubuntu pre-installed. By the end of that transition, I did all my development work in a vm. I had an Ubuntu guest running inside an Ubuntu host and I was happy, because I knew that when I had to switch hardware again, I could just export my vm and import it somewhere else.

Back then I only used VirtualBox. Nowadays I use Vagrant and Ansible to completely automate the process of setting up a new development environment. Whenever I start working on a new project, I let Vagrant and Ansible do their magic. Vagrant takes care of configuring VirtualBox and it will also pull a base image with some OS pre-installed. Ansible does provisioning, which, in a nutshell, means installing packages and copying configuration files.
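To give a flavour of what that provisioning looks like, here is a minimal Ansible playbook sketch; the package names and file paths are just an example, not what I actually use:

---
- hosts: all
  become: yes
  tasks:
    - name: Install development packages
      apt:
        name: [git, python3-pip, openjdk-8-jdk]
        state: present
        update_cache: yes

    - name: Copy a dotfile into the vm
      copy:
        src: files/bashrc
        dest: /home/vagrant/.bashrc
        owner: vagrant
        mode: "0644"

Vagrant can point at a playbook like this from its Vagrantfile, so bringing up a fresh, fully provisioned vm is a single vagrant up.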

This approach has several advantages.

  • Your host OS will not become a monolith with all kinds of packages installed that you no longer need.
  • Each one of your projects has its own personal sandbox. It’s like Python’s virtualenv or Ruby’s RVM on an OS level.
  • If you manage to break apt-get or grub, it will only affect 1 project. And you can easily recreate that 1 vm.
  • If your hardware breaks or you need to switch hardware for some other reason, you won’t have to cry. If you backed up your VirtualBox files, you can continue working on another machine without having to reinstall anything. And if you didn’t make backups, you can just rerun Vagrant and Ansible.
  • When you close your vm, you can choose to save its state. Memory will be freed so you will be able to use all your system’s resources for something else. And when you start your vm again, it’s like you never left and you are right back in the middle of that Jupyter notebook file.
  • You can easily share your Ansible roles with your team members. If a new colleague joins, you won’t have to sit next to them for an entire day to help them set up their environment. Just sit next to them while their machine is provisioning and explain in general terms what Ansible is installing and why.
  • In case you need to transfer your work to a co-worker, you can give him/her your vm image in addition to any other files/information.
  • When you are done working on a project, make sure all your code is in git, like a proper developer. Then you can just delete your vm without leaving any debris.

And of course there are drawbacks as well:

  • Obviously, you would need to know a little bit about Vagrant and Ansible.
  • Both your hardware and host OS need to support hardware virtualization.
  • There is some overhead involved in running an OS on top of another OS. But I don’t think I ever noticed enough to complain about it. In any case, storing your VirtualBox files on an SSD will certainly help.
  • You need a little bit of extra disk space for all those guest OS’s that will start gathering dust. You should just delete old vm’s that you don’t use anymore, or at least remove them from your SSD.
  • If you use proprietary software, you might run into licensing issues. In the case of IntelliJ, it seems you cannot run 2 instances of the IDE simultaneously.
  • Occasionally you need to have 2 vm’s open at the same time. Make sure you have enough memory available to support both vm’s. If not, the 1st vm you started will be terminated while starting up or un-suspending the 2nd. Depending on the memory usage of your vm’s and your host, running 2 vm’s simultaneously on an 8 GB host can get tricky. I have learned to appreciate extra RAM as well as lightweight Linux distributions.

I hope this piece of propaganda will motivate people to try something new. On my Github account, you will find some scripts that I use to set up a development environment for data science projects that I sometimes do. Take note that it’s a work in progress and it probably won’t work out of the box, because I use my own private base box. If you are completely new to Vagrant or Ansible, it’s best you start with some tutorial.

 


What a search PhD is doing at a data solutions company

In this post, after a brief commercial for my somewhat recently defended dissertation[1], I’ll draw a couple of parallels between the work I did in academia and how we approach data science at Luminis.

My dissertation is about search, or, information retrieval: how to find relevant bits in the ever growing multitude of information around us. Machine learning pops up a lot in search these days, and also in my work, whether it is classifying queries, clustering search results, or learning to rank documents in response to a query.

All chapters handle specialized search engines, e.g., people, scientific literature, or blog search engines. Using background knowledge, we improve general algorithms. As an example, to improve Twitter search, we used hashtags to automatically train a machine learning algorithm (Chapter 7, based on [2]).

A first obvious parallel between what I did before and what we do at Luminis Amsterdam is search. Elasticsearch is one of our biggest knowledge areas, as the proportion of blog posts on this site about it indicates.

A second is that the machine learning algorithms used in my dissertation can be applied to a wide range of problems, also outside search. For example, at Luminis we use classification techniques for problems like churn prediction and prospect scoring. The algorithms I personally have most experience with, e.g., decision trees, support vector machines, Bayesian classifiers, share the property that they are interpretable. At Luminis, we feel it is important for our customers to be able to understand, maintain, and manipulate the predictive algorithms we build for them.

A third is data. Right in the beginning of my PhD, in one of my favorite projects, we performed a query log analysis of a people search engine [3]. What made this exciting for me was the fact that we were working with real data, from real people. At Luminis, we work with real data as well, e.g., data from schools, hospitals, and businesses.

A fourth is tooling. In my PhD, as my experiments grew more complex and my datasets larger and larger, I appreciated more and more the software engineering challenges associated with data science. Working at Luminis means upgrading my software toolbox in almost every aspect. Python libraries like pandas, scikit-learn, Javascript frameworks like Angular, The ELK stack (Elasticsearch, Logstash, Kibana), Spark, Java frameworks like Spring and OSGI are some of the software that I’ve started using a lot more.

A fifth is dissemination. Of course, science is all about dissemination of knowledge (after one has been the first to obtain a publishable bit of it). But at Luminis, too, we believe in sharing our knowledge, and even our code, with large open source projects like Amdatu [4]. For me personally, it means among other things that I was given the chance to prepare a talk about how one might approach a data science project for a retail business starting from just an Excel sheet and without any business input; the video [5] and code [6] are online.

A sixth is experience. At ILPS, led by Maarten de Rijke, where I did my PhD, there was a vast amount of experience with challenges and opportunities of the full range of major web search engines to smaller specialised search engines like, for example, Netherlands Institute of Sound and Vision. At Luminis, we can draw on a vast amount of experience with the challenges and opportunities of our customers—businesses, semi-public and public organisations.

Putting these ingredients together, this is one way I like to think about how we approach data science: we enable organisations to build a data driven culture that can be sustained by its people, and based on which decision makers can make responsible and informed strategies and decisions. Interpretable algorithms, insightful reports and experiments, interactive dashboards with visualisations that directly relate to what is going on under the hood in predictive algorithms, useful applications; all backed by solid software engineering, resulting in lean and maintainable code bases.

[1] Berendsen, R. W. (2015). Finding people, papers, and posts: Vertical search algorithms and evaluation. PhD Thesis, UvA. URL: http://dare.uva.nl/record/1/489897

[2] Berendsen, R., Tsagkias, M., Weerkamp, W., & De Rijke, M. (2013, July). Pseudo test collections for training and tuning microblog rankers. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval (pp. 53-62). ACM. Pre-print: http://wouter.weerkamp.com/downloads/sigir2013-pseudotestcollections.pdf

[3] Weerkamp, W., Berendsen, R., Kovachev, B., Meij, E., Balog, K., & De Rijke, M. (2011, July). People searching for people: Analysis of a people search engine log. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 45-54). ACM. Pre-print: http://www.wouter.weerkamp.com/downloads/sigir2011-peoplesearchlog.pdf

[4] https://www.luminis.eu/what-we-share/open-source/

[5] https://youtu.be/8wNl89zXlKw

[6] https://github.com/luminis-ams/devcon-2016-rb


Internal hackdays – the untold story

One of the reasons that got me interested in Luminis Amsterdam was that they have internal hackdays (days where developers, designers, and other interested parties come together to build stuff). Nowadays, having an internal hackday is not that special anymore: many companies do them, big and small. I was glad to hear during my interview that since Luminis Amsterdam started, a year back, they had already held two hackdays. And the trend continued after I joined. Unfortunately, four months had passed this year and there was no sign of our next hackday. Having organized a couple of hackdays before, I took charge.

In this article, I’ll describe what I found important and what lessons I learned from organizing a hackday at Luminis Amsterdam. There’s a lot of good advice out there, but it can be a bit tricky to pick out the most important pieces. I’ll be assuming that you’ve already convinced the business to allow the hackday to take place (and to cover the eventual expenses).

Hackday guidelines

Here is what others [1, 2, 3, 4] recommend keeping in mind when organizing a hackday:

  • define a clear objective of the hackday
  • include and invite as many people as possible (even if it’s only for half a day, even if they don’t know much about hacking)
  • organize the event in a place that is different from the place where you work every day
  • decide on the hackday logistics (choose the day(s), the schedule of the day, the technical setup, food, prizes, needed hardware/software, etc.). On some of these aspects, ask for input from participants.
  • encourage everyone to prepare for the day (coming up with ideas, creating teams etc.)
  • be clear about what will happen with the hacks
  • involve business at least in the presentations at the end
  • expect things to go wrong
  • at the beginning of the day, remind participants of the structure of the day and of the rules and objectives
  • define a theme for the day (most usually: solve a problem that your team or the company is facing)
  • explore the IT landscape (learn a new way of working, a new technology or work with new people)
  • keep the hacks isolated from the other systems
  • build the smallest thing possible as fast as possible (it’s called hackday for a reason)
  • make a few rules and respect them

Planning the hackday

Giving a bit of context about Luminis Amsterdam will give more perspective about the decisions:

  • We’re still a small company, so it seemed important to have everyone showing up for the whole day. Otherwise, we’d be having a hackday with only 3 or 4 persons.
  • Being a small company, people have a natural aversion to rules and schedules, so the day should be as flexible as possible
  • We wanted to allow people to experiment, learn and have fun together, so there were no restrictions in terms of what topics/ideas participants could work on.

The hackday plan was created by applying the advice to our context:

  • everyone had to agree on a specific day to have the hackday
  • business had to agree on a clear objective of the day: the project should bring value to the company
  • I made a schedule for the day
  • I created the rules for the day:
    • at least 2 people working per idea (to promote working together)
    • demo at the end of the day (in which it is clearly mentioned how what was worked on brings value to the company)
    • every person can pitch only 3 ideas
  • the logistics needed by ideas were delegated to those that came up with the idea
  • I asked people to put their ideas online, in a public space and give a couple of details about the ideas
  • the teams were to be formed on the hackday, after the pitching of the ideas

After the hackday

The day itself went quite well and everyone agreed that we reached the goal set at the beginning of the day: “experimenting, learning and having fun together while creating something valuable for the company”.

What went well

  • working in pairs (pair programming)
  • preparing by having the ideas and their short descriptions in a public place
  • all the ideas worked on were demo-ed
  • there were plenty of ideas presented, so there were plenty of options to choose from
  • we had fun while working with new technologies or techniques

What we want to improve next time

  • The day passed quite fast. Ideally, on the hackday, the teams are already formed, all the preconditions for developing the idea are met and all the big risks (that could stall an idea) have been eliminated. The rest of the plan of the day is entirely up to the team.
  • We encountered difficulties on the hardware project with not having the proper connectors and cables. This one will be mitigated by making sure that the team prepares before the hackday (at least 1 meeting of the whole team). You don’t want to do debugging or workarounds for half of the hackday.
  • Have a narrower domain for the ideas. We didn’t want to restrict participants from working on what they want to, but we discovered that the ideas were a bit all over the place and that makes it harder to demonstrate the value generated for the company. A narrower domain, if chosen well, will not restrict participants that much. Involve business when choosing the domain.

What advice we find crucial for the first hackdays

Based on my current experience with hackdays, I would say that the most important aspects to keep in mind when you’re in charge of organizing hackdays are:

  • having a clear objective and a theme
  • make sure that the ideas are submitted and teams are formed around these ideas. Make sure that the teams meet in advance to prepare the hackday (decide together what they will build, handle the logistics, minimize the big risks)
  • have a demo at the end of the day and make sure that business attends it
  • at the beginning of the day, remind participants of the rules and objectives
  • do not allow participants to work alone on an idea.

Codemotion Amsterdam – Day 2

Day 1 – https://amsterdam.luminis.eu/2016/05/18/codemotion-amsterdam-day-1/

The second day of the conference started with the keynote “Costs of the Cult of Expertise” by Jessica Rose. This was more of an inspirational session about valuing our skills and choosing an employer where we can properly utilize our talents. The talk discussed findings of a study that men are confident about their ability if they meet 60% of a job posting’s requirements, whereas women don’t feel confident until they meet all of them. The speaker also shared some insights about how the industry defines “expertise”, the importance of self-promotion, and the emphasis on pair programming during the interview process. The talk ended with a discussion of the “Zone of Proximal Development”, which emphasises that learning new things shouldn’t be too hard or too easy, i.e. it should help people come out of their comfort zone while not being frustratingly difficult. The next talk was about Git.

“Knowledge is Power: Getting out of trouble by understanding Git” by Steve Smith. This was a very interesting talk: initially I expected a basic Git overview, but the speaker explained some esoteric Git concepts and Git internals. The talk started by showing generic Git flows used by various dev teams, along with live demos. Then the relationship between a Git commit, a Git tree and a Git blob was explained using examples and a nice diagram, followed by an overview of branches and of git rebase, git revert and git reset to undo undesired changes. The Git “reflog” was discussed, which is basically the undo history for your repo. Merge strategies were covered, along with git bisect, which is very useful for identifying the commit that caused the build to break. Git bisect does a binary search over the list of commits between a known-good and a known-bad commit in order to spot the commit causing the regression.
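For reference, a typical bisect session (the commit hash is just a placeholder) looks roughly like this:

git bisect start
git bisect bad                  # the commit you are on is broken
git bisect good a1b2c3d         # the last commit known to be good
# git checks out a commit halfway; build/test it, then mark it:
git bisect good                 # or: git bisect bad
# repeat until git names the first bad commit, then clean up:
git bisect reset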

"How to Build a Micro-services Infrastructure in 7 Days" by Gil Tayar focused on building a modern microservice infrastructure in 7 days. The speaker shared the tools, technologies, and scenarios his team tackled while creating a modern Mesos-based infrastructure. Mesos is a cluster manager for data centers that supports Docker out of the box, which makes it suitable for deploying microservices. The presenter shared that his team used Marathon, the container orchestration tool from Mesosphere (the same company as Mesos), as the scheduler within Mesos, which in turn manages the data center resources. Service discovery is an integral element of any microservice architecture; Gil shared that his team used an nginx reverse proxy gateway for service discovery, with one instance acting as the external gateway for external requests and another instance routing requests internally. For managing the whole deployment life-cycle, command-line tools were developed using Node.js, and services were deployed via the CLI. The next talk I attended was about push notifications.

“The web is getting pushy” by Phil Nash started by presenting the fact that mobile platform developers have relied on push notifications for higher user engagement, and how web browsers have been lagging behind native apps in providing similar push notifications. After a brief overview of push notifications, the presenter quickly started demoing the feature. A Twitter hashtag was subscribed to, and whenever there was a tweet with that hashtag it appeared in a small pop-up window, even if the website’s tab was closed, thus inviting the user back to the website. Push notifications have been made possible on the web by service workers, a new browser feature that provides event-driven scripts that run independently of web pages. This session was followed by lunch.

After lunch, I attended the session “Modern Javascript with React and Redux” by Pavithra Kodmad. The session started with a quick walkthrough of the evolution of front-end technologies. The features of React were presented along with Redux, which is a state container library for React applications. React is a library from Facebook with a component-based architecture; unlike Angular, React has one-way data binding from parent to child components. React uses JSX, which adds XML to JavaScript, and the concept of a virtual DOM, a level of abstraction on top of the real DOM that improves performance, since making changes to the DOM is expensive. The speaker then discussed Redux, which contains the state of the application and acts as the single source of truth, and demoed the “time travel” debugging technique, which lets you go back in time by cancelling actions.

The next talk was “Making your elastic cluster perform” by Jettro Coenradie. Jettro is also my colleague at Luminis Amsterdam. The talk gave an overview of all the major things you need to keep in mind to have a highly performant Elasticsearch cluster, starting with the required number of nodes and their categorization into master, data and client nodes for production use cases. The speaker then shared strategies for the number of shards and replicas, and the significance of time-based indices. Memory considerations were covered, along with the fact that it’s always better to use SSD drives, especially since aggregation queries in Elasticsearch can be quite taxing. Tips for faster indexing and for lowering query latency were presented as well: when to use a match query instead of a filter and vice versa, and using Elasticsearch’s multi-field approach for special analyzers. At the end there was a demo showcasing the Profile API, which gives users insight into how a query is executed at a low level and thus helps in debugging slow queries.

The closing keynote, “Virtual Reality: Past, Present and Future” by Avinash Changa (WeMakeVR), was an inspiring talk on virtual reality. The whole evolution of VR devices over the last 30 years was presented, present-day challenges were discussed, and Avinash showed how his company is creating breakthrough VR products. The talk also included an interesting video called “The Void”, in which people play video games in physical locations instead of on TVs by wearing virtual reality goggles, giving them a much more real-life experience.


Book review: The Phoenix Project

AN EXCITING FICTION-FOR-TECHIES BOOK

On the first night, when I opened the book, it was 22:30. I stopped reading at 1:00. On the second and third night, the story repeated. Luckily, the weekend came and I managed to get back lost sleep.

When I say exciting I’m not joking: the main character gets a sudden promotion from IT manager to VP of IT Operations in a big company ($4 billion per year). What he doesn’t know is that the whole IT organization is completely crippled, and that his job is to get it healthy again.

Having to manage severity one incidents, doing weekend long deployments, getting impossible constraints and requirements from business and security are just a few of the obstacles that he encounters in his new role.

Various IT problems keep on appearing in the first part of the book. After that, things start to stabilize and you are rewarded with a happy ending 😉

Although a work of fiction, the book introduces various concepts currently used in IT: kanban, the theory of constraints, DevOps, the idea that wait time is %busy divided by %idle, the simian army, the Three Ways, and the four types of work. People working in/with IT will appreciate the quick intro to DevOps and IT management. I doubt that non-techies will find the book interesting.

The idea that impressed me most was the comparison of the IT organization with a factory. Thinking back on my development experience, I observed that there are a lot more similarities between a factory worker and a software engineer than I’m comfortable admitting.

The main ideas are summarized at the end of the book. Along with them, you will also find the books (1, 2) that the authors used when building up the plot. I warmly recommend that you read at least this part of the book.