
How to bring your Python machine learning model into production using RESTful APIs

In recent years there has been much attention on bringing machine learning models into production, whereas up to a few years ago the results of machine learning ended up in slides or dashboards.
Bringing machine learning into production is important to integrate the outputs of machine learning with other systems.

What does “bring in production” mean?

To bring a machine learning model into production means to run the model regularly and to integrate and use its output in other systems.

There are several ways to bring your machine learning models into production, such as:

  1. Build a web service around it and use it in real time (API calls, microservice architecture)
  2. Schedule your code to run regularly (with tools such as Oozie and Airflow)
  3. Stream analytics (such as Spark Streaming) for lambda/kappa architectures

The focus of this article is the first way of going to production. This method is usually used in web-based environments and microservice architectures.

Python and modeling

In this article, we build a sample machine learning model for the Online Shoppers’ Purchasing Intention Data Set, available in the UCI Machine Learning Repository.
Below you can find the code for the data preparation and modeling:


import pandas as pd # to use dataframes etc.
import numpy as np # for arithmetic operations
from time import strptime # to convert month abbreviations to numeric values
from sklearn.model_selection import train_test_split # to split up the samples
from sklearn.tree import DecisionTreeRegressor # regression tree model
from sklearn.metrics import confusion_matrix # to check the confusion matrix and evaluate the accuracy

#reading the data (the file name is an assumption; adjust it to where you stored the dataset)
dataset = pd.read_csv("online_shoppers_intention.csv")

#preparing for split
y = dataset["Revenue"].map(lambda x: int(x))

#making a copy (to be able to change the values)
df = dataset.drop("Revenue", axis=1).copy()

#data prep phase
def dataPrep(localData):
    #The problem is "June" is the full month name and the rest are abbreviations --> we turn all into abbreviations
    localData["Month"] = localData["Month"].map(lambda x: str.replace(x, "June", "Jun"))
    #Our model doesn't ingest text, so we transform it to int
    localData["Month"] = localData["Month"].map(lambda x: strptime(x, "%b").tm_mon)
    #The weekend flag should also be turned into an int
    localData["Weekend"] = localData["Weekend"].map(lambda x: int(x))
    #turning the visitor-type string into integer category codes
    localData["VisitorType"] = localData["VisitorType"].astype('category').cat.codes
    return localData

#Sending the data through the data prep phase
df = dataPrep(df)
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)

#defining the regression tree
regr_tree = DecisionTreeRegressor(max_depth=200)
#fitting the tree with the training data
regr_tree.fit(X_train, y_train)
#running the predictions
predictions = regr_tree.predict(X_test)
#looking at the confusion matrix
print(confusion_matrix(y_test, predictions))

For data scientists, the above code should be very familiar: we read the data, do a little data wrangling and model it with a decision tree.

Save the model

The next step, which does not appear regularly in a data scientist’s workflow, is to save the model to disk. This step is necessary if you bring your Python code into production.
Below you can see how joblib can be of assistance here:


import joblib # in scikit-learn versions before 0.23 this was: from sklearn.externals import joblib
joblib.dump(regr_tree, '.../model3.pkl')
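To check that the persisted model actually works, you can load it back and predict. A minimal round-trip sketch, using a tiny stand-in model and a temporary path (both are just for illustration, not the model from this article):

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeRegressor

# a tiny stand-in model, trained on toy data for illustration only
model = DecisionTreeRegressor(max_depth=2)
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# dump to disk and load it back, as a production service would do
path = os.path.join(tempfile.gettempdir(), 'model3.pkl')
joblib.dump(model, path)
loaded = joblib.load(path)

print(loaded.predict([[3]]))  # the loaded model predicts like the original
```

The loaded object behaves exactly like the one that was dumped, which is what the web service below relies on.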

Build your Flask back-end

If you are not familiar with building back-end programs and RESTful APIs, I highly recommend reading an introduction and other related materials. In short, web services and RESTful APIs are servers that provide functions; an application can call those functions remotely and get the outputs back. In our example, we call our machine learning model from anywhere over the internet via TCP/IP. Once the model is called with the data, the result of the classification is returned to the client that called the model.
Discussing the details of web services and web APIs is beyond the scope of this article, but you can find many interesting articles on this with some internet searching.
Below we use Flask to build the web service around the machine learning model.


from flask import Flask, request
from flask_restful import Resource, Api
import joblib # in scikit-learn versions before 0.23: from sklearn.externals import joblib
import pandas as pd

app = Flask(__name__)
api = Api(app)

class Classify(Resource):
    def get(self): # GET is used here because its responses can be cached, which is faster in general
        data = request.get_json() # reading the data
        data1 = pd.DataFrame.from_dict(data, orient='index') # converting the JSON into a DataFrame (our technique does not ingest JSON)
        data1 = data1.transpose() # the DataFrame converted from JSON is not columnar yet, so we transpose it
        model = joblib.load('../model3.pkl') # loading the model from disk
        result = list(model.predict(data1)) # conversion to list, because numpy.ndarray cannot be jsonified
        return result # returning the result of the classification

api.add_resource(Classify, '/classify')

if __name__ == '__main__':
    app.run()
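The from_dict/transpose step in the GET handler can be checked in isolation. A small sketch with a made-up three-field payload (the real payload contains every feature from the data preparation phase):

```python
import pandas as pd

# a made-up mini-payload; the real one contains every post-data-prep feature
payload = {"Month": 5, "Weekend": 1, "PageValues": 0.0}

data1 = pd.DataFrame.from_dict(payload, orient='index')  # one row per key
data1 = data1.transpose()                                # one column per feature
print(data1.shape)
```

The result is a single-row DataFrame with one column per feature, which is exactly the shape `model.predict()` expects.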

Test your API

You can use various techniques to test whether the back-end works; I use the Postman software to test if the API is working.
Note that we implemented a GET request in our Flask application. The motivation behind choosing GET is that web servers can cache the results, which helps with the speed of the web service.
Another consideration is that we send the data for the call in JSON format (in the shape it has after the data preparation phase), and the results come back in JSON as well.


{
    "Administrative_Duration": 0.0,
    "Informational": 0,
    "Informational_Duration": 0.0,
    "ProductRelated": 1,
    "ProductRelated_Duration": 0.0,
    "BounceRates": 0.2,
    "ExitRates": 0.2,
    "PageValues": 0.0,
    "SpecialDay": 0.0,
    "Month": 5,
    "OperatingSystems": 2,
    "Browser": 10,
    "Region": 5,
    "TrafficType": 1,
    "VisitorType": 87093223
}

Microservices architecture:

I personally like to bring machine learning into production using RESTful APIs, and the motivation behind it is the microservices architecture. A microservices architecture lets developers build loosely coupled services and enables continuous delivery and deployment.

Scaling up

To scale up your web service there are many options, of which I would recommend load balancing with Kubernetes.

Setting up data analytics pipeline: the best practices

The picture is courtesy of [1]: a data pipeline architecture example.

In a data science analogy with the automotive industry, data plays the role of crude oil, which is not yet ready for combustion. The data modeling phase is comparable to combustion in the engine, and data preparation is the refinery process that turns crude oil into fuel, i.e., makes it ready for combustion. In this analogy, the data analytics pipeline includes all the steps from extracting the oil up to combustion, driving and reaching the destination (analogous to reaching the business goals). As you can imagine, the data (the oil in this analogy) goes through various transformations as it moves from one stage of the process to another. But what is the best practice in terms of data format and tooling? Although the many available tools make the best practice somewhat use-case specific, generally JSON is the best practice for the data format of communication, the lingua franca, and Python is the best practice for orchestration, data preparation, analytics and live production.

What is the common inefficiency and why does it happen?

The current inefficiency is the overuse of tabular (CSV-like) data formats for communication, as the lingua franca. I believe data scientists still overuse structured data types for communication within the data analytics pipeline because of the standard dataframe-like formats offered by major analytics tools such as Python and R. Data scientists get used to the dataframe mentality, forgetting that tabular storage of data is a low-scale solution that is not optimized for communication; when it comes to bigger data sets, or the flexibility to add new fields, dataframes and their tabular form are inefficient.
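To make that concrete, here is a small illustration with made-up records: JSON-style records can pick up a new field without any migration, while forcing the same records into a dataframe imposes a rectangular schema and padding:

```python
import pandas as pd

# JSON-style records: each record can carry its own fields
records = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5, "referrer": "ad-campaign"},  # new field, no schema change
]

# squeezing the same records into tabular form pads the missing cells with NaN
df = pd.DataFrame(records)
print(df.columns.tolist())
print(int(df["referrer"].isna().sum()))
```

The tabular view has to grow a `referrer` column for every row, even though only one record carries that field.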

DataOps Pipeline and Data Analytics

A very important aspect of analytics that is ignored in some circumstances is going live and getting integrated with other systems. DataOps is about setting up a set of tools from capturing data and storing it, up to analytics and integration, falling into the interdisciplinary realm of DevOps, data engineering, analytics and software engineering. (Hereinafter I use data analytics pipeline and DataOps pipeline interchangeably.) The modeling part, and probably some parts of the data prep phase, need a dataframe-like data format, but the rest of the pipeline is more efficient and robust if it is JSON native. JSON allows adding and removing features more easily and is a compact form for communication between modules.

The picture is courtesy of [2].

The role of Python

Python is a great programming language used not only by the scientific community but also by application developers. It is ready to be used as a back-end, and by combining it with Django you can build full-stack web applications. Python has almost everything you need to set up a DataOps pipeline and is ready for integration and live production.

Python Example: transforming CSV to JSON and storing it in MongoDB

To show some capabilities of Python in combination with JSON, I have put together a simple example. In this example, a dataframe is converted to JSON (a Python dictionary) and stored in MongoDB. MongoDB is an important database in today’s data storage landscape, as it is JSON native, storing data in a document format that brings high flexibility.

# Loading packages
from pymongo import MongoClient
import pandas as pd

# Connecting to the database
client = MongoClient('localhost', 27017)

# Creating the database and collection
db = client.pymongo_test
posts = db.posts

# Defining a dummy dataframe
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])

# Transforming the dataframe to a dictionary (JSON)
dic = df.to_dict()

# Writing to the database
result = posts.insert_one(dic)
print('One post: {0}'.format(result.inserted_id))

The above example shows the ability of Python to transform data from a dataframe to JSON and to connect to various tools (MongoDB in this example) in a DataOps pipeline.
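Note that `to_dict()` supports several orientations, and the chosen orientation determines the document shape that ends up in MongoDB; a quick illustration with the same dummy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])

print(df.to_dict())                   # column-oriented (the default)
print(df.to_dict(orient='records'))   # row-oriented: one dictionary per row
```

For MongoDB, `orient='records'` combined with `insert_many` stores one document per dataframe row, which is often the more natural mapping.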


This article is an extension of my previous article on the future of data science, in which I sketched that future and recommended data scientists to go towards full-stack. Once you have a full stack and various layers for DataOps / data analytics, JSON is the lingua franca between modules, bringing robustness and flexibility to this communication, and Python is the orchestrator of the various tools and techniques in this pipeline.

1: The picture is courtesy of*8-NNHZhRVb5EPHK5iin92Q.png
2: The picture is courtesy of

Looking back on AWS Summit Benelux 2018

Last week I visited AWS Summit Benelux together with Sander. AWS Summit is all about cloud computing and the topics that surround cloud computing. This being my first AWS conference I can say it was a really nice experience. Sure there was room for improvement (no coffee or tea after the opening keynote being one), but other than that it was a very good experience. Getting inside was a breeze with all the different check-in points and after you entered you were directly on the exhibitor floor where a lot of Amazon partners showed their products.

Opening keynote

The day started with an introduction by Kamini Aisola, Head of Amazon Benelux. With this being my first AWS summit, it was great to see Kamini showing some numbers about the conference: 2000 attendees and 28 technical sessions. She also showed us the growth pattern of AWS, with growth of 49% compared to last year. That’s really impressive!

Who are builders?

Shortly after, CTO Werner Vogels started with his opening keynote. Werner showed how AWS evolved from being ‘just’ an IaaS company to now offering more than 125 different services. More than 90% of the developed services were based on customer feedback from the last couple of years. That’s probably one of the reasons why AWS is growing so rapidly and customers are adopting the AWS platform.

What I noticed throughout the entire keynote is that AWS is constantly thinking about what builders want to build (in the cloud) and what kind of tools those builders need to have to be successful. These tools come in different forms and sizes, but I noticed there is a certain pattern in how services evolve or are grown at AWS. The overall trend I noticed during the talks is that engineers or builders should have to spend less time focussing on lower level infrastructure and can start to really focus on delivering business value by leveraging the services that AWS has to offer.

During the keynote Werner ran through a couple of different focus areas for which he showed what AWS is currently offering. In this post I won’t go through all of them, because I expect you can probably watch a recording of the keynote on youtube soon, but I’ll highlight a few.

Let’s first start with the state of machine learning and analytics. Werner looked back at how machine learning evolved at Amazon and how services were developed to make machine learning more accessible for teams within the organisation. Out of this came a really nice mission statement:

AWS wants to put machine learning in the hands of every developer and data scientist.

To achieve this mission AWS is currently offering a layered ML stack to engineers looking into using ML on the AWS platform.

The layers go from low-level libraries to pre-built functionalities based on these lower-level layers. I really liked the fact that these services are built in such a way that engineers can decide at which level of complexity they want to start using the ML services offered by AWS. Most of the time data engineers and data scientists will start from either SageMaker or even lower, but most application developers might just want to use a pre-built functionality like image recognition, text processing or speech recognition. See for instance this really awesome post on using facial recognition by my colleague Roberto.

Another example of this layered approach concerned container support on AWS. A few years back Amazon added container support to their offering with Amazon Elastic Container Service (Amazon ECS). Amazon ECS helped customers run containers on AWS without having to manage their servers or run their own container orchestration software; ECS delivered all of this. Fast forward a few years, and Amazon now also offers Amazon EKS (managed Kubernetes on AWS), after they noticed that about 63% of managed Kubernetes clusters ran on AWS. Kubernetes has become the current industry standard when it comes to container orchestration, so this makes a lot of sense. In addition, Amazon now also offers AWS Fargate. With Fargate they take the next step: Fargate allows you as the developer to focus on running containers ‘without having to think about managing servers or clusters’.

During his keynote, Werner also mentioned the Well-Architected Framework. The Well-Architected Framework has been developed to help cloud architects run their applications in the cloud based on AWS best practices. When implemented correctly it allows you to fully focus on your functional requirements to deliver business value to your customers. The framework is based on the following five pillars:

  1. Operational Excellence
  2. Security
  3. Reliability
  4. Performance Efficiency
  5. Cost Optimization

I had not heard about the framework before, so during the weekend I read through some of its documentation. Some of the items are pretty straightforward, but others might give you some insights in what it means to run applications in the cloud. One aspect of the Well-Architected framework, Security, had been recurring throughout the entire keynote.

Werner emphasised a very important point during his presentation:

Security is EVERYONE’s job

With all the data breaches happening lately I think this is a really good point to make. Security should be everybody’s number one priority these days.

During the keynote, there were a couple of customers that showed how AWS had helped them achieve a certain goal. Bastiaan Terhorst, CPO at WeTransfer explained that being a cloud-scale company comes with certain problems. He explained how they moved from a brittle situation towards a more scalable solution. They could not modify the schema of their DB anymore without breaking the application, which is horrible if you reach a certain scale and customer base. They had to rearchitect the way they worked with incoming data and using historic data for reporting. I really liked the fact that he shared some hard-learned lessons about database scalability issues that can occur when you reach a certain scale.

Tim Bogaert, CTO at de Persgroep, also showed how they moved from being a silo-ed organization with its own datacenters and long-running waterfall projects towards going all-in on AWS with an agile approach and teams following the “You Build It, You Run It” mantra. It was an interesting story because I see a lot of larger enterprises still struggling with these transitions.

After the morning keynote, the breakout sessions started. There were 7 parallel tracks and all with different topics, so plenty to choose from. During the day I attended only a few, so here goes.

Improve Productivity with Continuous Integration & Delivery

This really nice talk by Clara Ligouri (software engineer for AWS Developer Tools) and Jamie van Brunschot (Cloud engineer at Coolblue) gave a good insight into all the different tools provided by AWS to support the full development and deployment lifecycle of an application.

Clara modified some code in Cloud9 (the online IDE), debugged some code, ran CI jobs, tests and deployments all from within her browser and pushed a new change to production within only a matter of minutes. It shows how far the current state of being a cloud-native developer has really come. I looked at Cloud9 years ago. Way before they were acquired by Amazon. I’ve always been a bit skeptical when it comes to using an online IDE. I remember having some good discussions with the CTO at my former company about if this would really be the next step for IDEs and software development in general. I’m just so comfortable with IntelliJ for Java development and it always works (even if I do not have any internet ;-)). I do wonder if anybody reading this is already using Cloud9 (or any other Web IDE) and is doing his / her development fully in the cloud. If you do, please leave a comment, I would love to learn from your experiences. The other tools like CodePipeline and CodeDeploy definitely looked interesting, so I need to find some time to play around with them.


Next up was a talk on GDPR. The room was quite packed. I didn’t expect that though, because everybody should be GDPR compliant by now right? 🙂 Well not really. Companies are still implementing changes to be compliant with GDPR. The talk by Christian Hesse looked at different aspects of GDPR like:

  • The right to data portability
  • The right to be forgotten
  • Privacy by design
  • Data breach notification

He also talked about the shared responsibility model when it comes to being GDPR compliant. AWS as the processor of personal data and the company using AWS being the controller are both responsible for making sure data stays safe. GDPR is a hot topic and I guess it will stay so for the rest of the year at least. It’s something that we as engineers will always need to keep in the back of our minds while developing new applications or features.


In the afternoon I also attended a talk on Serverless by Prakash Palanisamy (Solutions Architect, Amazon Web Services) and Joachim den Hertog (Solutions Architect, ReSnap / Albelli). This presentation gave a nice overview of Serverless and Step Functions, but also showed new improvements like the Serverless Application Repository, safe Serverless deployments and incremental deployments. Joachim gave some insights into how Albelli was using Serverless and Machine Learning on the AWS platform for their online photo book creator application called ReSnap.

Unfortunately I had to leave early, so I missed the end of the Serverless talk and the last breakout session, but all in all AWS Summit Benelux was a very nice experience with some interesting customer cases and architectures. For a ‘free’ event it was amazingly organized, I learned some new things and had a chance to speak with some people about how they used AWS. It has triggered me to spend some more time with AWS and its services. Let’s see what interesting things I can do on the next Luminis TechDay.

Build On!

A look at the source code of gensim doc2vec

Previously, we built a simple PV-DBOW-‘like’ model. We made a couple of choices, e.g., about how to generate training batches, how to compute the loss function, etc. In this blog post, we’ll take a look at the choices made in the popular gensim library. First, we’ll convince ourselves that we indeed implemented more or less the same thing :-). Then, by looking at the differences, we’ll get ideas to improve and extend our own implementation (of course, this could work both ways ;-)). The first extension we are interested in is to infer a document vector for a new document. We’ll discuss how the gensim implementation achieves this.

Disclaimer: You will notice that we’ll write this blog post in a somewhat dry, bullet-point style. You may use it for reference if you ever want to work on doc2vec. We plan to, anyway. If you see mistakes in our eyeball-interpretation of what gensim does, feel free to (gently) correct us; please refer to the same git commit version of the code against which we wrote this blog post, and use line numbers to point to code.

Code walk through of gensim’s PV-DBOW

We’ll start with the non-optimized Python module doc2vec. Note that we refer to the specific version of the code against which this blog post was written. To narrow things down, and to stay as close as possible to our own PV-DBOW implementation, we’ll first postulate some assumptions:

  • we’ll initialize the Doc2Vec class as follows d2v = Doc2Vec(dm=0, **kwargs). That is, we’ll use the PV-DBOW flavour of doc2vec.
  • We’ll use just one, unique, ‘document tag’ for each document.
  • We’ll use negative sampling.

The first thing to note about the Doc2Vec class is that it subclasses the Word2Vec class, overriding some of its methods. By prefixing methods with the class name, we’ll denote which exact method is called. The super class object is then initialized as follows, in lines 640-643, by deduction:

Word2Vec(sg=1, null_word=0, **kwargs)

sg stands for Skip-Gram. Remember from elsewhere on the net that the Skip-Gram Word2Vec model is trained to predict surrounding words (for any word in a corpus).

Upon initialisation, Word2Vec.train() is called: a model is trained. Here, some parallelisation is taken care of that I will not go into at this point. At some point, however, Doc2Vec._do_train_job() is called: in a single job, a number of documents is trained on. Since we have sg=1, Doc2Vec.train_document_dbow() is called there, for each document in the job.
In this method, the model is trained to predict each word in the document. For this, Word2Vec.train_sg_pair() is used. Only, instead of two words, this method now receives the document tag and a word: the task is to correctly predict each word given the document tag. In this method, weights are updated. It seems, then, that at each stochastic gradient descent iteration, only one training example is used.
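To make the single-example update concrete, here is a minimal numpy sketch (our own illustration, not gensim's actual code) of one PV-DBOW negative-sampling step for a single (document tag, word) pair; the sizes and sampled indices are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, vocab, dim = 3, 20, 8
doc_vecs = rng.normal(scale=0.01, size=(n_docs, dim))  # input -> hidden weights
out_vecs = np.zeros((vocab, dim))                      # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(doc_idx, word_idx, neg_idxs, lr=0.025):
    """One SGD step: predict word_idx from doc_idx, against sampled negatives."""
    h = doc_vecs[doc_idx]                      # hidden activation = the doc vector
    grad_h = np.zeros(dim)
    for idx, label in [(word_idx, 1.0)] + [(n, 0.0) for n in neg_idxs]:
        f = sigmoid(out_vecs[idx] @ h)
        g = (label - f) * lr                   # scaled prediction error
        grad_h += g * out_vecs[idx]            # accumulate gradient for the doc vector
        out_vecs[idx] += g * h                 # update the output weights
    doc_vecs[doc_idx] += grad_h                # finally update the doc vector

train_pair(doc_idx=0, word_idx=5, neg_idxs=[2, 9])
print(out_vecs[5])
```

Only one (document, word) pair is touched per call, mirroring the one-training-example-per-iteration behaviour described above.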

Comparison of ours and gensim’s Doc2Vec implementation

By just eyeballing the code, at first sight, the following similarities and differences stand out:


  • The network architecture used seems the same in both implementations: the input layer has as many neurons as there are documents in the corpus, there is one hidden layer, and the output layer has the vocabulary size.
  • Neither implementation uses regularisation.


  • One training example (document, term) per SGD iteration is used by gensim, whereas we allow computing the loss function over multiple training examples.
  • All terms in a document are offered to SGD right after each other in gensim, whereas we generate batches consisting of several term windows from different documents.
  • In gensim, the order in which training documents are offered is the same order each epoch; we randomize the order of term windows again each epoch.

Inferring document vectors

Given a new, unseen document, a document vector can still be estimated, using Doc2Vec.infer_vector(). How? Well, the idea is that we keep the word classifier that operates on the hidden space fixed. In other words, we keep the weights between the hidden layer and the output layer fixed. Now, we train a new mini-network with just one input neuron: the new document id. We optimize the network such that the document gets an optimal position in the hidden space. In other words, again, we train the weights that connect this new document id to the hidden layer. How does gensim initialize the weights for the new input neuron? Randomly, set to small values, just like it was done for the initial documents that were trained on. The training procedure consists of a fixed number of steps (we can choose how many). At each step, all words in the document are offered as training examples, one after the other.
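The same idea can be sketched in a few lines of numpy (our own rough illustration, not gensim's code): the hidden-to-output weights are frozen, and only the fresh document vector is trained. For brevity this sketch uses positive examples only, without negative sampling:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, dim, steps = 20, 8, 50
out_vecs = rng.normal(scale=0.5, size=(vocab, dim))  # pretend these are trained; frozen below

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def infer_vector(word_idxs, lr=0.025):
    vec = rng.normal(scale=0.01, size=dim)  # small random init, like for training docs
    for _ in range(steps):                  # fixed number of steps
        for w in word_idxs:                 # each word offered one after the other
            f = sigmoid(out_vecs[w] @ vec)
            vec += (1.0 - f) * lr * out_vecs[w]  # only the new doc vector moves
    return vec

v = infer_vector([3, 7, 7, 11])
print(v.shape)
```

The word classifier (`out_vecs`) is never updated, so inference positions the new document in the existing hidden space.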


We’ve seen that gensim’s implementation and ours indeed implement roughly the same thing, although there are a number of differences. This consolidates our position with our small ‘proof-of-concept’ implementation of doc2vec. We’ve eyeballed how gensim’s doc2vec implementation manages to infer a document vector for an unseen document; now we are in a position to extend our own implementation to do the same. Of course, you can do it yourself, too!

World Summit AI 2017 Amsterdam – some highlights

From Wednesday October 11th to Thursday October 12th, the World Summit AI 2017 conference was held in Amsterdam. It had a terrific line-up, with top professors in the field of AI, like Stuart Russell and Yann LeCun, and ‘our own’ Max Welling. Most of these people were actually there in person. But it wasn’t all science. Large companies like ING, Accenture, and smaller companies like BrainCreators (they were new for me) also had a large presence, with talks on the main stage, or with stands in front of the main arena. In this blog post, I’ll briefly discuss some of my favorite talks, in non-chronological order.

On the second day, Meredith Whittaker, leader of Google’s open research group and co-founder of AI Now Institute, New York University, focused on current applications of ‘AI-technologies’ in the domain of human labour. And on what’s wrong with them, mostly. My summary of the problems she outlined would be that machine learning techniques are widely applied without a solid statistical methodology. The simplest example she gave?



A soap dispenser that would not give soap to everybody. It obviously has not been tested on people with varied skin color. The field of statistics could have helped here. Who are the intended users of the soap dispensers? People. So, let’s fine-tune, or ‘train’, the soap dispensers on a representative sample of this population. Not just on a bunch of people you know, who happen to be similar to you. The same principles play a decisive role in any machine learning application. And, sadly, they are often not applied correctly, as Meredith very eloquently pointed out. She said much more about this, and there is still much more to be said about it. There certainly are many opportunities for improvement in this area.

Talking about the second conference day, who opened it? None other than Stuart Russell, one of the authors of, can I say, the AI Bible? It turns out he is a talented comedian as well. He had the whole audience in the main hall laughing. Even the end of the human species did not seem so bad when he talked about it. His main message? “I cannot fetch coffee if I’m dead”. Currently, machine learning algorithms are programmed to optimise some quantity. In technical terms, they optimise an ‘objective function’. The problem? Humans can’t state very well what they want. Having a super-intelligent AI (if we ever succeed in creating it) optimise a function that was formulated by humans may lead to problems. And we might not be able to stop this AI. It will quickly figure out that it needs to stay alive in order to optimise its objective. And therefore, it will not allow us to switch it off. Being super-intelligent, it may actually succeed in this.
Almost as a side point, professor Russell offered one possible avenue towards a solution: make algorithms uncertain about their objective function. Program AI so that it will put the interests of humans first, even if it is still unsure about what those interests are. If a human switches it off, it should therefore always gladly accept. Could there be dilemmas and difficulties here as well? Well, certainly there were a good couple of laughs with the follow-up scenarios he offered.



In his TED talk he discusses the same subject matter, so go ahead and watch that right after this blog post.

On the first day, one of the stand-out talks for me was the one by Laurens van der Maaten, now at Facebook, famous for the t-SNE algorithm. If you don’t want to know what that is, you may well skip this paragraph, because his talk was one of the more technical ones of the conference. But it was a strong talk, leading up to a novel and original combination of deep learning techniques and symbolic AI approaches. The setting was the task of visual question answering: an algorithm gets as input an image and a question about it, and has to produce a correct answer. Admittedly, the scope was narrowed down a bit further: the images in question were generated images of geometrical objects, and the questions were often about characteristics such as color, texture, size, and shape. In this artificial world, however, the proposed solution performed very well indeed: better than humans. It seems these days you cannot sell anything less anymore, no? The sentence containing the question was processed by an LSTM-based sequence-to-sequence model. The output? A small computer program, a combination of primitive functions such as ‘count’, ‘filter color’, ‘filter shape’. How are these functions executed? Well, they are themselves trained neural networks. An impressive composition of trained neural networks to achieve something bigger! A pre-print of the paper (with a nice picture of the architecture, showing everything I just wrote) is available online.

With this, I’d like to leave you now, perhaps we’ll add more summaries here later!

Coding doc2vec

In two previous posts, we googled doc2vec [1] and “implemented” [2] a simple version of a doc2vec algorithm. You could say we gave the specifications for a doc2vec algorithm, but we did not actually write any code. In this post, we’ll code doc2vec, according to our specification. Together, these three blog posts give some understanding of how doc2vec works under the hood. Understanding by building. We’ve made a Jupyter notebook [3], and we’ll walk you through it by highlighting some key parts. As a starting point for the notebook, we’ve used the word2vec implementation [4] from the Vincent Vanhoucke Udacity course. We’ve modified basically everything in it a bit, and added some new stuff. But the idea remains the same: a minimal code example for educational purposes.

What we’ll do

We’ll get and preprocess some documents from the well-known Reuters corpus. We’ll train a small doc2vec network on these documents. While we train it, the document vectors for each document are changing. Every so many steps, we use the document vectors. First, we visualise them with t-SNE in a two-dimensional space. We color-code the document vectors with class labels (the news categories from the Reuters corpus), to see if document vectors belonging to the same class are getting closer to each other. Second, we train a simple linear classifier using the document vectors as input, to predict class labels. We’ll call this prediction task the end-to-end task. We’ll observe the following:

1. The doc2vec network’s loss decreases as it trains. This is a good sign.
2. If we train doc2vec longer, the performance for the end-to-end task increases. This is another good sign.
3. The two-dimensional visualisation with color-coded document vectors is not very informative. This is too bad, but not a problem. In the end, it’s just a visualisation.


The first step is preprocessing. The following one-liner reads the entire Reuters data set into memory. If your data is very big, this is not advisable of course. But: early optimization is the root of all evil 🙂

    fileid2words = {fileid:
            [normalize(word) for word in word_tokenize(
                    reuters.raw(fileid)) if accept(word)] \
            for fileid in reuters.fileids() if accept_doc(fileid)}

In words: we iterate over Reuters files, and accept some of the documents. For each document, we accept some of the words. And the words we do accept, we normalise. Let’s take a closer look at each step.

def accept_doc(fileid):
    return fileid.startswith('training/') \
            and np.random.random() * 100 < PERCENTAGE_DOCS

We only accept documents from the training set of the Reuters corpus. And we select a random percentage of these documents according to the hyperparameter PERCENTAGE_DOCS that we have set by hand at the top of the notebook.

def accept(word):
    # Accept if not only Unicode non-word characters are present
    return re.sub(r'\W', '', word) != ''

We refuse words that consist entirely of non-word characters. The words that are refused here are taken out of the token stream before anything happens. You can play with this and refuse other words, too. For example, stopwords like ‘the’, ‘and’, etc. This may or may not be a good idea. One way of learning more about it is to play with it and see what happens to your performance in the end-to-end task.

def normalize(word):
    return word.lower()

And we lowercase all tokens. This is a first step to reduce data sparsity of natural language. There are other ideas, too. For example, replacing numbers with a special NUMBER token, or spelling out numbers with words, so that ‘123’ becomes ‘one two three’. There is always a tradeoff: normalising tokens may lead to some information loss.

After we have obtained the dictionary fileid2words, we build our vocabulary:

    count = [['__UNK__', 0], ['__NULL__', 0]]
    count.extend([(word, freq) for word, freq in collections.Counter(
            [word for words in fileid2words.values() \
            for word in words]).most_common(
                    VOCAB_SIZE - 2 + REMOVE_TOP_K_TERMS)[
                    REMOVE_TOP_K_TERMS:] if freq >= MIN_TERM_FREQ])

Here, first we flatten the dictionary, our entire dataset, to just a sequence of tokens (words). Then we count the occurrence of all the unique words. We add these counts to the counts of two special tokens: __UNK__ and __NULL__. We use the most common words as our vocabulary. We’ll remove the top k most common terms, because these tend to be stopwords. And we require the word frequency to be higher than a certain minimum. That is because we can hardly expect our network to predict a term that would only occur, say, once in the whole corpus. Occurrences of words that did not end up in our vocabulary will later on be replaced with the __UNK__ (unknown) token. So at this point no words will be taken out of the corpus anymore; they will only be replaced. One thing you can try is whether it works better to remove stopwords from the corpus entirely. Don’t worry about the __NULL__ token, it is only used when our documents are too short to even fit a single text window in (remember that in doc2vec, we try to predict words from fixed size text windows that occur in a document). That will not happen often in the Reuters corpus.
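To make the replacement step concrete, here is a toy illustration with a hypothetical mini-corpus (not the notebook’s data): we keep only the most common words in the vocabulary, and map every other token to __UNK__ rather than removing it.

```python
from collections import Counter

# Hypothetical mini-corpus, flattened to a sequence of tokens.
tokens = ["oil", "prices", "rose", "oil", "futures", "rose", "oil"]

VOCAB_SIZE = 4  # includes the two special tokens
counts = Counter(tokens)
vocab = ["__UNK__", "__NULL__"] + [w for w, _ in counts.most_common(VOCAB_SIZE - 2)]
word2id = {w: i for i, w in enumerate(vocab)}

# Out-of-vocabulary words are replaced, not removed:
ids = [word2id.get(w, word2id["__UNK__"]) for w in tokens]
```

Note how ‘prices’ and ‘futures’ both map to the __UNK__ id, so the token stream keeps its original length.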

Training the network

In Tensorflow, training a network is done in two steps. First, you define the model. You can think of the model as a graph. Second, you run the model. We’ll take a look at the first step, how our model is defined. First: the input.

# Input data
dataset = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
labels = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])

The dataset is defined as a placeholder with the shape of a simple array that contains ints. When we run the model, it will contain document ids. Nothing more, nothing less. It will contain as many document ids as we want to feed the stochastic gradient descent algorithm in a single iteration. The labels placeholder is also a vector. It will contain integers that represent words from the vocabulary. So, basically, for each document id we want to predict a word that occurs in it. In our implementation, we make sure that a batch contains one or more text windows. So if we use a text window of size 8, a batch will contain one or more sequences of eight consecutive words. Next, we take a look at the weights in our neural network:

# Weights
embeddings = tf.Variable(
        tf.random_uniform([len(doclens), EMBEDDING_SIZE],
                          -1.0, 1.0))
softmax_weights = tf.Variable(
        tf.truncated_normal([vocab_size, EMBEDDING_SIZE],
                            stddev=1.0 / np.sqrt(EMBEDDING_SIZE)))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

You can think of embeddings as the transpose of the matrix D from our previous post [2]. In its rows, it has a document vector of length EMBEDDING_SIZE for each document id. This document vector is also called an “input vector”. You can also think of embeddings as the weights between the input layer and the middle layer of our small doc2vec network. When we run the session, we will initialize the embeddings variable with random weights between -1.0 and 1.0.

The softmax_weights are the weights between the middle layer and the output layer of our network. You can also think of them as the matrix U from our previous post. On its rows, it has an “output vector” of length EMBEDDING_SIZE for each word in the vocabulary. When we run the model in our session, we will initialize these weights with (truncated) normally distributed random variables with mean zero and a standard deviation that is inversely proportional to EMBEDDING_SIZE. Why are these variables initialized using a normal distribution, instead of with a uniform distribution like we used for the embeddings? The short answer is: because this way of initialisation has apparently worked well in the past. You can try different initialisation schemes yourself, and see what it does to your end-to-end performance. The long answer: well, perhaps that’s food for another blog post.

The softmax_biases are initialised here with zeroes. In our previous post, we mentioned that softmax biases are often used, but omitted them in our final loss function. Here, we used them, because the word2vec implementation we based this notebook on used them. And the function we use for negative sampling wants them, too.

The activation in the middle layer, or, alternatively, the estimated document vector for a document id is given by embed:

embed = tf.nn.embedding_lookup(embeddings, dataset)

tf.nn.embedding_lookup will provide us with fast lookup of a document vector for a given document id.
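Under the hood, this lookup is nothing more than row selection: for each document id in the batch, take the corresponding row of the embeddings matrix. A plain-Python equivalent (with made-up numbers) of the line above:

```python
# 3 documents, EMBEDDING_SIZE = 2 (made-up weights for illustration)
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
batch = [2, 0, 2]  # document ids in one batch

# What tf.nn.embedding_lookup(embeddings, batch) computes, conceptually:
embed = [embeddings[doc_id] for doc_id in batch]
```

Each row of `embed` is the current document vector for the corresponding document id in the batch.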

Finally, we are ready to compute the loss function that we’ll minimise:

loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
                softmax_weights, softmax_biases, embed,
                labels, NUM_SAMPLED, vocab_size))

Here, tf.nn.sampled_softmax_loss takes care of negative sampling for us. tf.reduce_mean will compute the average loss over all the training examples in our batch.

As an aside, if you take a look at the complete source code in the notebook, you’ll notice that we also have a function test_loss. That function does not use negative sampling. It should not, because negative sampling underestimates the true loss of the network. It is only used because it is faster to compute than the real loss. When you run the notebook, you will see that the training losses it prints are always lower than the test losses. One other remark about the test loss is the following: the test examples are taken from the same set of documents as the training examples! This is because in our network, we have no input to represent a document that we have never seen before. The text windows that have to be predicted for a given document id are different though, in the test set.

A two-d visualisation of the document vectors with T-SNE

As we train our network, as we minimise our loss function, we keep making small changes to all the weights in the network. You can think of the weights between the embedding layer and the output layer as weights that affect where the decision boundaries of our output neurons are located. And you can think of the weights between our input layer and the embedding layer as weights that determine where our documents are projected in the embedding space. These weights help determine our document vectors. We train our network to predict words from text windows sampled from documents. So, intuitively, as we train longer, the documents change location in the embedding space in such a way that it becomes easier to predict the words that occur in these documents. And now here comes an important point. If we can predict which words occur in a document, then perhaps we can also say something about the topics in a document, or about the genre or category of the document. This is in fact an important motivation for using doc2vec as a step in a document classification algorithm.

Okay. So, as we train to predict words, our document vectors keep changing location in the embedding space. If we want to use the document vectors as input for a document classifier, we hope that documents with the same class are located close to each other in the embedding space. If we want to make a visualisation that will show if this is happening, we need to use a dimensionality reduction algorithm, because we can’t visualise our embedding space if it has more than three dimensions. Because we are interested in the distance between document vectors, we use T-SNE. T-SNE is an algorithm that aims to project points close to each other that were also close to each other in the original high-dimensional space. And the other way around for points that are far apart in the high-dimensional space. And after we visualise the document vectors in 2D, we color code the points: purple points have the most common Reuters class as one of the class labels. In our experiments this was usually the class ‘earn’. The rest of the points do not have this label. Sadly, in experiments we did not see a clear purple cluster yet. Often, it looked something like this:

However, it is hard to measure the quality of the document vectors by visualising them like this. To do that more quantitatively, we need to perform the end-to-end task that we want to use the document vectors for.

Predicting the most common class label from document vectors

The end-to-end task that we’ll address in this post is to classify for each document whether or not it has the class label for the most common Reuters class (‘earn’, in most of the Reuters samples in my experiments). This is a binary classification task. As input we’ll use the document vectors that are learned by our doc2vec network. As algorithm we’ll use a linear classifier, because the inventors of doc2vec specifically mention that paragraph vectors are suitable as input for any kind of linear classification algorithm. It makes sense, too: doc2vec itself has only one classification layer on top of the embedding layer. Each neuron in its output layer has an activation that is just a weighted sum, just a linear function, of the document vector in the embedding layer. And then all that happens after that is the softmax function. Predicting the class label of a document is easier than predicting the words in a document. So our first try will just be to use a simple linear classifier for document classification as well. Just to show that the classification algorithm we use does not have to be a neural network, we’ll use a support vector machine with a linear kernel.

As an evaluation metric, we calculate precision and recall for both classes; then we calculate the harmonic mean of these two for each class: this is the F1 score. Then we take the weighted average of the F1 scores of the two classes, weighted by the number of observations of each class. We report scores on a random subset of the Reuters training set. We do not report scores on the Reuters test set, because we are still in an exploratory phase here. In general, you should use test data sparingly; in this post, we’ll not use it at all!
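Written out by hand, this metric looks as follows; sklearn’s `f1_score(average='weighted')` computes the same thing, and the labels below are made-up toy data:

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (y_true.count(c) / total) * f1  # weight by class support
    return score
```

For example, with true labels [1, 1, 1, 0] and predictions [1, 1, 0, 0], class 1 gets F1 = 0.8 and class 0 gets F1 = 2/3, and the support-weighted average is about 0.77.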

Every so many steps of training, we pause training of the network, and we use the current document vectors to train a classifier. In one of our experiments, where we trained doc2vec on all documents in the Reuters training set, we got the following F1 results:

| Number of SGD iterations (steps) | Average F1 | PV-DBOW training loss | PV-DBOW test loss |
|---|---|---|---|
| 30K | 0.53 | 2.6 | 4.6 |
| 60K | 0.57 | 2.4 | 4.3 |
| 90K | 0.60 | 2.3 | 4.2 |
| 100K | 0.61 | 2.2 | 4.2 |

What you can see here in the third and fourth column is that the training loss and test loss of our doc2vec network slowly decreases as we are training. We are slowly getting better at predicting the words from the document ids. But what is more important is that you can see in the second column that the performance on the end-to-end task (document classification) is also increasing as we are training the network longer. That means we are on the right track, even if the absolute performance is not yet that impressive.

Now that we have walked you through the main steps of our Jupyter notebook [3], go ahead and run it yourself! All you need is a working installation of Python 3, numpy, pandas, sklearn, Tensorflow and Jupyter Notebook. pandas is not essential, we only use it to output some descriptive statistics quickly. You can implement alternatives for all of the many choices we had along the way. You can try different values for the configuration variables at the top of the notebook. It should be quite possible to improve the performance that we got in our first experiments: good luck! In the beginning, do set `PERCENTAGE_DOCS` to a lower value, e.g., something like 5 percent. 100K steps of training on all training set documents in the Reuters dataset took about an hour on a single desktop computer with 8 cores. You don’t want to wait that long just to try out if everything works.



Implementing doc2vec

In a previous post [1], we’ve taken a look at what doc2vec is, who invented it, what it does, what implementations exist, and what has been written about it. In this post, we will work out the details of a version of the simplest doc2vec algorithm: PV-DBOW. Our approach may not be exactly the same as in the original paper [7]. Our aim is a starting point: minimal, easy to understand, and still interesting enough to be expected to work. In a follow-up post, we may develop a Tensorflow implementation. This post should be detailed enough for you to do it yourself, perhaps you’ll beat us to it 🙂 For your inspiration, also refer to the doc2vec Tensorflow code here [18]; that code implements PV-DM, the other doc2vec algorithm.

If you have already watched the videos from the Udacity course on deep learning by Vincent Vanhoucke [2] or you already have done some other neural network tutorials, I guess this post should be quite easy to follow. But if this post series is one of the first things you read on neural nets, perhaps you will still be able to follow along, looking up some things along the way yourself.

Before we dive in, I would like to recall the overall strategy from the previous post [1]. Recall that we use doc2vec because we want vector representations of documents, paragraphs or sentences. These vectors should have a fixed, not too large size. We will use these vectors for some task, for example document classification. Doc2vec has the following strategy:

1. We’re training a small network on a task that is somehow related to the end-to-end task.
2. In the middle of this small network we have a small layer.
3. After training, we remove the output layer of our small network (picture literally breaking the network apart).
4. The middle layer is now the output layer of our network, and it produces document vectors now.

It is a quite common pattern to pre-train a network and then use part of that network for some other purpose. Doc2vec is just one example. By working it out in detail, you may get some intuition about why it works. And this way of thinking you can then apply to other tasks, and build similar networks yourself. And, on the side, you may get something out of the details themselves.

The distributed bag of words model (PV-DBOW)

Recall that doc2vec algorithms were originally named paragraph vector algorithms. The authors of the original paper [7] start with a variant of paragraph vectors they call the distributed memory model (PV-DM). Later, they introduce another variant, the distributed bag of words model (PV-DBOW). For this blog post, we’ll start with the latter one, because it is the simpler model of the two. Starting with the simpler model will make things easier to learn. It is also easier to use. It has fewer parameters that need to be trained. As a consequence, it needs less data to train on. The network we’re training in PV-DBOW looks like this:


It looks a bit different than Figure 3 in the paper [7]. Here, we have drawn each and every unit of the network. Also, we have only room for one word in the output layer on the right. In the paper, Figure 3 has multiple words in the output layer. Our version looks a bit simpler. What we have here is a neural network with three layers. Let’s go over all of them.

The input layer, on the left, has N units. N is the number of documents in the notation used in [7]. During training, we will present a document in the input layer. The document will be encoded as a one-hot vector. Each unit in the input layer corresponds to a single document-id in your corpus. When we present a document, the unit corresponding to its id will be set to 1, and all other units will be set to 0. Together, the units form the document vector \vec{d}. Surprising, isn’t it? The only information we use as input is the document id.

The middle layer has size p. p is a free parameter that is chosen by you. It is the size of the paragraph vectors that doc2vec will output for you. The big fat arrow between the input layer and the hidden layer means that the two layers are fully connected. Each unit in the input layer is connected to each unit in the middle layer. Each connection has a certain weight. We should initialize these weights randomly, with small values, prior to training the network.

If you know a bit about neural networks, you might expect that at this point I am going to say something about a threshold in the middle layer, or about an activation function. But doc2vec, like word2vec, uses a very simple middle layer. One reason for this is that it speeds up computation quite a bit. The activation in that middle layer is just a weighted sum of the activation in the input layer. A compact way to represent this is:

 \hat{\vec{d}} = D\vec{d}

Here, D is a p\times N matrix that contains all the weights. \hat{\vec{d}} is the activation in the middle layer. It is the so-called paragraph or document vector produced by the network for \vec{d}. That is why we denote it here with \hat{\vec{d}}: we view it as an estimation, a representation of \vec{d} in a lower, p-dimensional space. It is also called an “input vector” [7]. Because \vec{d} is a one-hot vector, we have that the n‘th column in D contains \hat{\vec{d_n}}: the paragraph vector for the document with id n. A layer like the middle layer in our network here is often called an embedding layer.
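You can check the column-selection claim with a tiny numeric example (made-up weights, p = 2 and N = 3): multiplying D by a one-hot vector just picks out one column.

```python
# D is p x N: each column is the paragraph vector of one document.
D = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
d = [0, 1, 0]  # one-hot encoding of document id 1 (the second document)

# The matrix-vector product D . d, written out:
d_hat = [sum(row[n] * d[n] for n in range(len(d))) for row in D]
# d_hat is exactly the second column of D
```

This is also why frameworks implement this step as an embedding lookup instead of an actual matrix multiplication: selecting a column is much cheaper.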

The output layer has M units. In [7], M is used to denote the number of unique words in the vocabulary. The activity in each of its units corresponds to a single word in the vocabulary. The output layer can be seen as a crude estimation of a term vector; I’ve called it \vec{l} in the picture: l for “logits”, a common name for activations in the output layer of a network. As you can see, the output layer is fully connected to the hidden layer. We can represent this with a matrix multiplication again:

 \vec{l} = U\hat{\vec{d}}

Here, U is an M\times p matrix that contains all the weights between the middle and the output layer. It is quite common to also have biases in the output layer, in which case we would write l = U\hat{\vec{d}} + \vec{b}. In expositions of the related word2vec algorithm, biases are sometimes included [5, 7] and sometimes excluded [16]. We omit them here. Note that U has a row for every word in our vocabulary. These rows are sometimes also called “output vectors” for the corresponding words. Let’s refer to these vectors here as \vec{u_m}, where m is the index of the corresponding word in the output layer. Then, we have

 l_m = \vec{u_m}\cdot\hat{\vec{d}}\mathrm{.}

Back to our overall strategy. After training, D contains the document vectors that we will take out and re-use as part of another application. Another way to think about this, is that after training, we will remove the last layer of the network. That would make the middle layer the output layer of the network. And what will it give us if we feed it a document vector? A low-dimensional document vector of a fixed size.

Next, we work on what our network will learn. Or, equivalently, what our objective function is. The function we want to minimise.

The softmax function and a loss function

In [7], we read that we will force the model to predict words randomly sampled from the paragraph. In our model above, we use documents as paragraphs. So what happens is that we give the network a document id, and we ask it to predict a randomly sampled word from the corresponding document. Amazing, isn’t it? Can we hope to achieve this? Well, remember that we can give the network a huge quantity of unique examples: as many as we have words in our corpus! And, in practice, during training, we’ll feed it our entire corpus many times.

Two connected functions we are going to add to our model in this section are the softmax function and a loss function:


The first three layers here are the same as before. The output layer activations are now passed into the softmax function s:

 s(l_i) = \frac{e^{l_i}}{\sum_k{e^{l_k}}}

Some properties of this function, that you can verify yourself:

1. 0 < s(l_i) < 1
2. \sum_i s(l_i) = 1
3. It boosts the activation of the units with the highest activation.
4. If the activations in the logits are larger, the s(l_i) values will get closer to one and zero: the network will be more “sure” of its predictions.
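A direct implementation of the softmax formula above makes it easy to verify these properties yourself (the logits below are made-up numbers):

```python
import math

def softmax(logits):
    """s(l_i) = exp(l_i) / sum_k exp(l_k)"""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

s = softmax([1.0, 2.0, 3.0])
# Properties 1 and 2: each value lies in (0, 1) and they sum to one.
# Property 3: the largest logit gets the largest probability.
```

(A production implementation would subtract the maximum logit before exponentiating, to avoid overflow; we skip that here for clarity.)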

The softmax is used for multi-class classification tasks. In multi-class classification tasks each observation is labeled with one of several classes. The task that we set our network, however, is a multi-class multi-label classification [4]: an observation can be labeled with one or more of several classes. Our network is asked to predict the words (multiple labels) in the document from the document id. How do we fix this? Well, as a training example, we’ll feed a document (observation) to our network, each time with a single target word (label). Each time the document is presented as a training example, possibly a different target word is used as label. In this, we follow the same pattern as in the word2vec algorithm Skip-Gram [5, 7]. This algorithm is what inspired PV-DBOW [7].

The result of the softmax function is the fourth layer in the above picture, which I’ve called \hat{\vec{t}}. I’ve called it \hat{\vec{t}} because it is meant to be an approximation of the actual term to be predicted. Because of the first two properties of the softmax function, we can now more or less legally think of the output of our network as an estimated probability distribution over the classes for an input document.

To train our network, we need to measure how well these probabilities were estimated for sampled target terms. For this, we compare \hat{\vec{t}} to the target term vector. That vector is all the way on the right in the picture above; it is called \vec{t}. This is a one-hot vector. We compare \vec{t} and \hat{\vec{t}} using a loss function. The loss function should be low when \hat{\vec{t}} and \vec{t} are similar. A common choice [7, 8, 9, 10] based on cross entropy is:

 E(\hat{\vec{t}}, \vec{t}) = -\sum_i t_i \log{\hat{t}_i}

Here are some properties of E that you can verify for yourself:

1. E(\hat{\vec{t}}, \vec{t}) > 0
2. If t_i is zero, the i‘th element of the sum will be zero, regardless of the value of \hat{t}_i, our prediction.
3. E is low when our prediction, \hat{t}_i, is close to 1.0 for the element of our target vector t_i that is one.

The second fact puzzled me for a bit. Does it not matter at all what a network predicts for units that are zero in the target vector? What if my network predicts a high value for all output neurons? Well, the softmax function prevents this. If all logits are equally high, the normalisation will make them all low. And the network will then incur loss on the unit that is one in the target vector. So, the combination of the softmax and this cross entropy loss is important.

Because \vec{t} is a one-hot vector, we can simplify E a bit further. Let y be the index where t_y = 1. Then

 E(\hat{\vec{t}}, \vec{t}) = -\log{\hat{t}_y}
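You can check numerically that the general form and the simplified one-hot form agree (the prediction vector below is a made-up example):

```python
import math

def cross_entropy(t_hat, t):
    """E(t_hat, t) = -sum_i t_i * log(t_hat_i)"""
    return -sum(t_i * math.log(p) for t_i, p in zip(t, t_hat))

t_hat = [0.1, 0.7, 0.2]  # made-up network output (already softmaxed)
t = [0, 1, 0]            # one-hot target, so y = 1
loss = cross_entropy(t_hat, t)
# Because t is one-hot, this equals -log(t_hat[1]) = -log(0.7)
```

Only the prediction at the hot index contributes to the loss, exactly as property 2 above says.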

Training our neural network involves minimising a loss function on a set of training points. For that, we need a loss function that compares a set of training predictions to a set of corresponding correct target vectors. Let’s denote our set of training target vectors as T, and let \vec{t_j} denote the j‘th target vector. As a loss function, we’ll use

 L = \frac{1}{|T|} \sum_{j=1}^{|T|}E(\hat{\vec{t_j}}, \vec{t_j})\mathrm{,}

we are simply averaging E over a set of training examples here.

Combining all our equations up to this point, it is not hard to write out L in full, you can try it yourself:

 L = \frac{1}{|T|} \sum_{j=1}^{|T|}{-\vec{u_{y_j}}\cdot\hat{\vec{d_j}} + \log{\sum_{m=1}^M{ e^{\vec{u_m}\cdot\hat{\vec{d_j}}}}}

Here, y_j is the index of the correct word for training example j in the target vector \vec{t_j}. So, \vec{u_{y_j}} is just the output vector corresponding to that word.

Now there is only one thing to add to our loss function: regularisation. This is usually an important part in trying to prevent a network from overfitting. Still, in the doc2vec and word2vec papers [7, 11, 12] it is not mentioned; we may experiment with omitting it as well. An interesting aspect of doc2vec is that we are not really interested in good prediction on a test set of our network. Rather, we are interested that the paragraph vectors we get in D in the end will be useful for some other purpose. In any case, a common regularisation approach is L2 regularisation:

 L' = L + \lambda \frac{1}{2}||\vec{w}||_2^2

Here, ||\vec{w}||_2^2 is the L2 norm (length) of \vec{w}, squared. \vec{w} here is one big vector with all the network weights in it. In our network, that is all the elements of the matrices U and D. \lambda is a parameter that we have to set in advance, it is not learned by stochastic gradient descent. Such a parameter is often called a hyperparameter.

One way to think about this regularisation is that we are rewarding weight configurations with small weights on average. And now if you think about the fourth property of the softmax function as discussed above, then you can translate this as rewarding weight configurations where the network is less sure of itself.
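The penalty term itself is simple to compute; as a sketch (the weight matrices and \lambda below are made-up numbers, standing in for U and D):

```python
def l2_penalty(weight_matrices, lam):
    """lambda * 0.5 * squared L2 norm of all weights, flattened together."""
    squared_norm = sum(w * w for mat in weight_matrices
                       for row in mat for w in row)
    return lam * 0.5 * squared_norm

# Two tiny made-up weight matrices, lambda = 0.1:
penalty = l2_penalty([[[1.0, 2.0]], [[3.0]]], 0.1)
```

During training, this penalty is simply added to the data loss L, so every SGD step also nudges all weights slightly toward zero.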

Now that we have defined our loss function, we have the bull by the horns. It constitutes the full definition of the optimisation problem we are solving, and the full definition of our network. In the next section we very briefly sketch how we minimise L.

Stochastic gradient descent

It would require another series of blog posts to fully explain the way neural networks are trained with backpropagation and gradient descent, but fortunately it is not hard to find a lot of material on that. It is important to understand the details [11], but we give only a very brief outline here, for completeness.

For a given set of input vectors \vec{d_j}, everything in L is constant save the weights in D and U. Our goal is to find a configuration for the weights such that L is minimised. Actually, this could be solved analytically for the complete training set, but for a big training set that would be computationally inefficient. Instead, the idea of stochastic gradient descent is to sample some training examples randomly, so \{\vec{d_1}, \ldots, \vec{d_j}, \ldots\} is then a relatively small set of training examples. Such a set of examples is often called a batch. For this batch, again, L can be thought of as a function of the weights in D and U. These weights span a weight space. During training, we compute the gradient of L in the current point in the weight space. This gradient is the direction in which L increases most sharply. What we want to do is move in the opposite direction, but only slowly, using a learning rate that we will have to determine in advance (including a possible decay function and / or decay rate for this learning rate). The SGD iteration results in a set of small adjustments to all the weights: the weight deltas. This process of finding the weight deltas for all the weights in the network is often called backpropagation.
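The update rule itself is tiny. As a sketch (treating the weights as one flat list and taking the gradients as given, i.e., ignoring how backpropagation computes them):

```python
def sgd_step(weights, gradients, learning_rate):
    """Move each weight a small step against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Made-up numbers: two weights, their gradients, learning rate 0.1.
updated = sgd_step([1.0, 2.0], [0.5, -0.5], 0.1)
# The first weight decreases, the second increases: each moves opposite
# to the sign of its gradient.
```

Everything else in SGD (batching, learning-rate decay, and backpropagation itself) is machinery for producing the `gradients` argument efficiently.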

Stochastic gradient descent (SGD) is a very successful algorithm and it is the driving force behind the success of neural networks. Now that we have sketched SGD, it is time to define the final nuts and bolts of our first, simple doc2vec implementation. We start with how we will sample training examples.

Generating training batches

A training example is a tuple (docid, word) in our network. When we sample such tuples, there are a couple of choices to make. The first and most important is what to do with the notion of a text window. Then, we have to deal with the extremes in term frequencies: in natural language, there are many extremely rare words, as well as some extremely frequent words. And last but not least, we have to do something about the computational complexity of our loss function, which contains a summation over our entire vocabulary.

Using a text window

From the description of PV-DBOW [7], it is not easy to determine what is done: “(…) at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the Paragraph Vector.”. The simplest possible interpretation of this is to just randomly sample words. Let’s call this option A. If we go with option A, then we would end up with a model that can predict the words for a given document. Given the document, this would give us a global idea of its contents. This makes option A related to methods like LSA, LDA, and ALS [19].

The description just given suggests more. If we just randomly sample words, why mention a text window at all? The authors mention that PV-DBOW is similar to the word2vec Skip-Gram model [7]. A PV-DBOW approach inspired by the Skip-Gram model is to sample a text window, and then offer all the words in this text window in the same batch together, i.e., in the same iteration of stochastic gradient descent. Let’s call this option B. We could rewrite our loss function a little bit to make it more explicit that we want to correctly predict sets of words that are close together. Then, we could postulate a Naive Bayes assumption: given a document, and a text window, words in the window occur independently of each other. Then we could simplify our loss function and show that option B implements minimising that simplified loss function. See [16] for a similar derivation for Skip-Gram. If we follow option B, we would end up with a model that can predict sets of words that are close together in documents. Words that are close together often have some syntactic and semantic relations to each other. Option B would capture some of that. As do the word2vec algorithms CBOW and Skip-Gram. Since this is a major strength of these algorithms, our first bet would be to use option B for PV-DBOW.
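The option-B batching just described can be sketched as follows (a hypothetical helper of our own, not code from the paper): each batch is built from whole sampled text windows, and every training example is a (document id, word id) pair.

```python
import random

def generate_batch(doc_tokens, batch_size, window_size):
    """doc_tokens maps document ids to lists of word ids."""
    doc_ids = sorted(doc_tokens)
    batch, labels = [], []
    while len(batch) < batch_size:
        doc_id = random.choice(doc_ids)               # sample a document
        tokens = doc_tokens[doc_id]
        start = random.randrange(len(tokens) - window_size + 1)
        for word_id in tokens[start:start + window_size]:
            batch.append(doc_id)                      # input: the document id only
            labels.append(word_id)                    # target: a word from the window
    return batch[:batch_size], labels[:batch_size]

# Made-up corpus: word ids per document id.
docs = {0: [10, 11, 12, 13, 14], 1: [20, 21, 22, 23]}
inputs, targets = generate_batch(docs, batch_size=6, window_size=3)
```

Because all words of a window enter the same batch, one SGD iteration pushes the document vector toward predicting words that are close together, which is the point of option B.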

Dealing with term frequencies

Some terms are very infrequent: a common approach is to ignore these terms and focus efforts only on the most frequent, e.g., 100K, 500K, or 1M words. Remaining terms could be mapped to a catch-all token ‘UNKNOWN’ if we want to keep their position in the text [5]. We will do this initially, too.

Other terms are very frequent: words like ‘the’, ‘a’, ‘of’, etc., do not convey much meaning at all (although they have syntactic functions). In [12], for the Skip-Gram model, training words are sampled with a lower probability if they have a very high frequency, according to a slightly odd heuristic formula. How to combine this with text windows is not immediately obvious. Do we discard each word in our corpus with a probability based on term frequency, and then act as if it was never there, before we even sample text windows? In our implementation, we will initially not make any adjustments for high-frequency words; no adjustments are mentioned in [16] either. And if things don’t work, our first try will probably be a crude but deterministic and often used heuristic: remove the top K terms from the corpus before doing anything.
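The frequency cut-off with a catch-all token can be sketched in a few lines (a hypothetical helper; the token name ‘UNKNOWN’ follows [5]):

```python
from collections import Counter

def truncate_vocabulary(corpus, max_vocab_size):
    """Keep only the max_vocab_size most frequent terms; map the rest to
    'UNKNOWN' so their positions in the text are preserved."""
    counts = Counter(word for doc in corpus for word in doc)
    keep = {word for word, _ in counts.most_common(max_vocab_size)}
    return [[word if word in keep else "UNKNOWN" for word in doc] for doc in corpus]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "rare", "axolotl"]]
truncated = truncate_vocabulary(corpus, max_vocab_size=2)
```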

Negative sampling

Finally, we will have to take some steps to make training our network computationally feasible. To see why, consider how big our vocabulary could typically be, and then think of computing the softmax function over and over again during training. We have a choice of methods to fix this, namely hierarchical softmax [12] or negative sampling [12, 16]. We’ll opt for negative sampling. The basic idea in terms of our network structure is that in our SGD iteration we will work with a very small output layer. It will contain the correct terms, and only a handful of randomly sampled incorrect terms [2]. This translates to some changes to our loss function [12]. A good exposition of these changes for the word2vec Skip-Gram model is given in [16]. A similar line of reasoning can be followed for PV-DBOW.
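A minimal sketch of that sampling step (uniform sampling for brevity; [12] actually draws negatives from the unigram distribution raised to the power 3/4):

```python
import random

def sample_negatives(vocab_size, positive_ids, k, rng=random):
    """Draw k 'incorrect' term ids for the small output layer, avoiding
    the correct terms. Uniform here for simplicity."""
    positives = set(positive_ids)
    negatives = []
    while len(negatives) < k:
        candidate = rng.randrange(vocab_size)
        if candidate not in positives:
            negatives.append(candidate)
    return negatives

negatives = sample_negatives(vocab_size=1000, positive_ids=[3, 7], k=5)
```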


We now have a detailed recipe for a PV-DBOW implementation. The level of detail is such that we should now be able to implement it in Tensorflow. Tensorflow has primitives for the first two layers (the embeddings), the output layer (just a matrix multiplication), the softmax-loss-function combo, and even negative sampling. What remains for us to do is to supply the training batches, to tie it all together, and to choose sensible values for the hyperparameters. We may follow up on this post with an example PV-DBOW implementation; otherwise, we hope you now feel confident enough to implement it yourself!


7. Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of ICML, 2014.
10. Pattern Recognition and Machine Learning. Bishop, 2006, p 209
11. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
12. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

Googling doc2vec

On this site, recently, we featured a blog post [12] that used Doc2vec [4]. What is Doc2vec? Where does it come from? What does it do? Why use doc2vec, instead of other algorithms that do the same? What implementations exist? Where can I read more about it? If you, like me, are curious about these questions, read on.

So what is Doc2vec and where does it come from? In recent years some Google papers were published by Tomas Mikolov and friends about a neural network that could be trained to produce so-called paragraph vectors [1, 2, 3, 9]. The authors did not release software with their research papers, so others have tried to implement it. Doc2vec is an implementation of paragraph vectors by the authors of gensim, a widely used library for numerical methods in the field of natural language processing (NLP). The name is different, but it is the same algorithm: doc2vec just sounds better than paragraph vectors. It is also a fitting name, because doc2vec builds upon the same neural network architectures that underlie those other famous algorithms that go by the name word2vec. If you don’t know word2vec, Google it; there are plenty of resources where you can learn about it. Resources to learn about doc2vec, however, are just a bit less abundant, so in this post we’ll google it for you.

First, what does doc2vec do? Well, it gives you vectors of a fixed length (to be determined by you) that can represent text fragments of varying size, such as sentences, paragraphs, or documents. It achieves this by training a small neural network to perform a prediction task. To train a network, you need labels. In this case, the labels will come from the texts themselves. After the network has been trained, you can re-use a part of it, and this part will give you your sentence / paragraph / document vectors. These vectors can then be used in various algorithms, including document classification [12]. One of the success factors for using doc2vec is the answer to this question: is the task you are using the doc2vec vectors for usefully related to the way doc2vec was trained?

What benefits does doc2vec offer over other methods? There are many ways to represent sentences, paragraphs or documents as a fixed size vector. The simplest way is to create a vocabulary of all the words in a corpus, and represent each document by a vector that has an entry for each word in the vocabulary. But such a vector would be quite large, and it would be quite sparse, too (it would contain many zeroes). Some algorithms have difficulty working with sparse and high dimensional vectors. Doc2vec yields vectors of a more manageable size, as determined by you. Again, there are many algorithms that do this for you, such as LDA [18], LSI [19], or Siamese CBOW [17], to name a recent one by a former colleague. To argue for one or the other, what researchers would normally do is implement the prediction task they care about with several algorithms, and then measure which algorithm performs best. For example, in [9] paragraph vectors are compared to LDA for various tasks; the authors conclude that paragraph vectors outperform LDA. This does not mean that doc2vec will always be best for your particular application. But perhaps it is worth trying out. Running experiments with doc2vec is one way of learning about what it does, when you can use it, when it works well, and when it is less useful.

In terms of implementations, we’ve already mentioned the Doc2Vec class in gensim [4]. There’s also an implementation in deeplearning4j [15]. And Facebook’s fastText may have an implementation, too [16]. Since I like working with Tensorflow, I’ve googled “doc2vec tensorflow” and found a promising, at first sight clean and concise, implementation [13]. And a nice discussion about a few lines of Tensorflow code as well, with the discussion shifting to the gensim implementation [11]. Implementations in other low level neural network frameworks may exist.

Zooming in just a bit, it turns out that doc2vec is not just one algorithm; rather, it refers to a small group of alternative algorithms. And these are of course extended in new research. For example, in [9], a modified version of a particular doc2vec algorithm is used, according to a blog post about the paper [10]. In that blog post, some details on the extension in [9] are given, based on correspondence with the authors of [9]. According to the author of [10], no implementation of that extension exists yet. In general, it may be impossible to recreate exactly the same algorithms as the authors of the original papers used. Rather, studying concrete implementations is another way of learning about how doc2vec algorithms work.

A third way of learning more is reading. If you know a bit about how neural networks work, you can start by checking the original papers [1, 2, 3, 9]. There are some notes on the papers, too, in blogs by various people [10, 14]. The papers and the blog posts leave some details to the reader, and are not primarily intended as lecture material. The Stanford course on deep learning for NLP has some good lecture notes on some of the algorithms leading up to doc2vec [7], but doc2vec itself is not covered. There are enough posts explaining how to use the gensim Doc2Vec class [5, 6, 8, 12]. Some of these posts do include some remarks on the workings of Doc2Vec [5, 6, 8] or even perform experiments with it [6, 8, 12]. But they do not really drill down to the details of the neural net itself. I could not find a blog post explaining the neural net layout in [13], or reporting on experiments with [13].

Now that you have come this far, wouldn’t it be nice to take a closer look at how doc2vec, the algorithm, actually works? With the aim of adding some detail and elaboration to the concise exposition in the original papers. And perhaps we can add and discuss some working code, if not too much of it is needed! Stay tuned for more on this.


  1. Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of ICML, 2014.
  2. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
  3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
  9. Andrew M. Dai, Christopher Olah, Quoc V. Le. Document Embedding with Paragraph Vectors, NIPS 2014.
  17. Tom Kenter, Alexey Borisov, Maarten de Rijke. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. ACL 2016.


Alternating Least Squares with implicit feedback – The search for alpha

So you want to build a recommendation engine with ALS… You search the internet for some example code in your language of choice… You copy paste the code and start tweaking… But then you realize that your data is different from all the examples you found online. You don’t have explicit ratings in some range from 1 to 10; instead, you have click events where 1 means ‘clicked’. Will you still be able to use ALS? And if so, how?

A brief recap on Collaborative Filtering and Alternating Least Squares

Collaborative Filtering is the technique of predicting the preference of a user by collecting the preferences of many users. The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue i than to have the opinion on i of a person chosen randomly (Wikipedia).
A preference takes the form of a (user, item, rating) triple. Collecting them yields a sparse matrix R_{u \times i} of known ratings of users u for items i. The task is to fill in the missing values. In the Latent Model approach to Collaborative Filtering, we do this by decomposing R_{u \times i} into a user matrix X_{u \times g} and an items matrix Y_{i \times g} so that we can find the ‘hidden’ features g of users and items. (In the case of movies one could think of these hidden features as genres.) By taking the product of the user matrix and item matrix, we can reconstruct the (complete) ratings matrix \hat{R} = X \cdot Y^T. Or for individual ratings: \hat{r}_{ui} = x_{u}^T y_i
To compute these factors, we will first randomly initialize X and Y and iteratively recompute them by minimizing the loss function L:
 \sum\limits_{u,i} (r_{ui} - x_u^T y_i)^{2} + \lambda \Big( \sum\limits_u \|x_u\|^{2} + \sum\limits_i \|y_i\|^{2} \Big)
The first term in L is the sum of squared errors and the second term is used for regularization. In order to minimize the loss function, we will take the derivatives with respect to x and y and solve for 0.

 \begin{aligned} \frac{\partial L}{\partial x_u} &= 0 \\ -2\sum\limits_i(r_{ui} - x_u^T y_i) y_i^T + 2 \lambda x_u^T &= 0 \\ -2 (r_u^T - x_u^T Y^T)Y + 2 \lambda x_u^T &= 0 \\ -2 r_u^T Y + 2 x_u^T Y^T Y + 2 \lambda x_u^T &= 0 \\ x_u^T Y^T Y + \lambda x_u^T &= r_u^T Y \\ x_u^T \big( Y^T Y + \lambda I \big) &= r_u^T Y \\ \big( Y^T Y + \lambda I \big) x_{u} &= Y^T r_u \\ x_u &= \big( Y^T Y + \lambda I \big)^{-1} Y^T r_u \end{aligned}

And for y:
 \begin{aligned} \frac{\partial L}{\partial y_i} &= 0 \\ -2\sum\limits_u(r_{ui} - y_i^T x_u) x_u^T + 2 \lambda y_i^T &= 0 \\ y_i &= \big( X^T X + \lambda I \big)^{-1} X^T r_i \end{aligned}

Recomputing x_{u} and y_i can be done with Stochastic Gradient Descent, but this is a non-convex optimization problem. We can convert it into a set of quadratic problems, by keeping either x_u or y_i fixed while optimizing the other. In that case, we can iteratively solve x and y by alternating between them until the algorithm converges. This is Alternating Least Squares.
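The alternating updates derived above fit in a few lines of NumPy. The sketch below assumes a small dense ratings matrix and, naively, treats missing ratings as zeros, just to show the mechanics:

```python
import numpy as np

def als(R, n_factors=2, n_iters=20, lam=0.1, seed=0):
    """Alternating least squares: solve the closed-form updates for X and Y
    in turn (a fixed iteration count here, for simplicity)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(size=(n_users, n_factors))
    Y = rng.normal(size=(n_items, n_factors))
    reg = lam * np.eye(n_factors)
    for _ in range(n_iters):
        # x_u = (Y^T Y + lambda I)^-1 Y^T r_u, solved for all users at once
        X = np.linalg.solve(Y.T @ Y + reg, Y.T @ R.T).T
        # y_i = (X^T X + lambda I)^-1 X^T r_i, solved for all items at once
        Y = np.linalg.solve(X.T @ X + reg, X.T @ R).T
    return X, Y

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
X, Y = als(R)
R_hat = X @ Y.T  # the reconstructed ratings matrix
```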

Implicit Feedback

Unfortunately, you don’t have ratings, you have click events. And a ‘click’ event does not necessarily mean a user really likes the product; the user could be curious about the product for some other reason. Even when you are using ‘buy’ events you are not in the clear, because people buy gifts for other people all the time. Furthermore, the absence of a click event does not imply a dislike for a product i. So can you still use ALS? Yes, you can still use ALS, but you have to take into account the fact that you have implicit ratings/feedback. Luckily, your preferred machine learning library shows there is an ‘implicit’ switch on the ALS interface and that there is an ‘alpha’ parameter involved as well.

So what is this \alpha character?

To understand alpha, we should go to the source, which is Hu et al. 2008 (1). They suggest splitting each rating into a preference and a confidence level. The preference is calculated by mapping the rating to a binary value.
 p_{ui} = \begin{cases} 1, & r_{ui} > 0 \\ 0, & r_{ui} = 0 \end{cases}

The confidence level is defined as:
c_{ui} = 1 + \alpha r_{ui}
For a rating of 0 we would have a minimum confidence of 1 and if the rating increases, the confidence increases accordingly. The rate of increase is controlled by alpha. So alpha reflects how much we value observed events versus unobserved events. Factors are now computed by minimizing the following loss function L:
 \sum\limits_{u,i} c_{ui} (p_{ui} - x_{u}^{T}y_i)^{2} + \lambda \Big( \sum\limits_u \|x_u\|^{2} + \sum\limits_i \|y_i\|^{2} \Big)

Now suppose that for a given rating r_{ui} that x_{u}^T y_i is very large, so that the squared residual (p_{ui} - x_{u}^{T}y_i)^{2} is very large, then the rating r_{ui} has a big impact on our loss function. And it should! x_{u}^T y_i will be drawn towards the 0-1 range, which is a good thing, because we want to predict whether the event will occur or not (0 or 1).
Alternatively, suppose that for a given rating r_{ui} we have observed many events, and suppose also that our initial x_{u}^T y_i value is close to 0, so that the squared residual (p_{ui} - x_{u}^{T}y_i)^{2} approximates 1, then the rating r_{ui} will still have quite some impact on our loss function, because our confidence c_{ui} is large. Again, this is a good thing, because in this case, we want x_{u}^T y_i to go towards 1.
If either the confidence level is low or the residual is low, there is not much impact on the loss function, so the update of x_u and y_i will be small.
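With the confidence weights, the closed-form update for a single user becomes x_u = \big( Y^T C^u Y + \lambda I \big)^{-1} Y^T C^u p_u, as derived in (1), where C^u is the diagonal matrix of that user’s confidences. A per-user sketch in NumPy (the function name and toy data are our own):

```python
import numpy as np

def implicit_user_update(Y, r_u, alpha=40.0, lam=0.1):
    """Confidence-weighted least-squares update for one user:
    x_u = (Y^T C^u Y + lambda I)^-1 Y^T C^u p_u  (Hu et al. 2008).
    alpha=40 is the value suggested in the paper."""
    c_u = 1.0 + alpha * r_u              # confidence per item
    p_u = (r_u > 0).astype(float)        # binary preference per item
    A = Y.T @ np.diag(c_u) @ Y + lam * np.eye(Y.shape[1])
    b = Y.T @ (c_u * p_u)
    return np.linalg.solve(A, b)

Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x_u = implicit_user_update(Y, np.array([3.0, 0.0, 1.0]))
predictions = Y @ x_u  # higher for the clicked items than the unclicked one
```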


Now that we have some background on this alpha, we can safely copy-paste the recommender engine code we found online and expand it so that it includes the alpha parameter. All that is left now is some extra parameter tuning and we are done! After that, we can run our final model on the test set and we can calculate the root mean squared error… Wait.. What..? Somehow, that metric just doesn’t feel right. Oh, well… Enough for today 🙂


(1) Y. Hu, Y. Koren and C. Volinsky, “Collaborative Filtering for Implicit Feedback Datasets”, 2008

Machine learning – an example

In my previous blog post, I tried to give some intuition on what neural networks do. I explained that when given the right features, the neural network can generalize and identify regions of the same class in the feature space. The feature space consisted of only 2 dimensions so that it could be easily visualized. In this post, I want to look into a more practical problem of text classification. Specifically, I will use the Reuters 21578 news article dataset. I will describe a classification algorithm for this dataset that will utilize a novel feature extraction algorithm for text called doc2vec.

I will also make the point that because we use machine learning, which means the machine will do most of the work, the same algorithms can be used on any kind of text data and not just news articles. The algorithm will not contain any business logic that is specific to news articles. Especially the neural network is a very reusable part. In machine learning theory, a neural net is known as a universal approximator. That means that it can be used to approximate many interesting functions. In practical terms, it means you can use the same neural network architecture for image data, text data, audio data and much more. So trying to understand one application of a neural network can help you understand many other machine learning applications.

Training the Doc2vec model

In the previous post, I explained how important it is to select the right features. Doc2vec is an algorithm that extracts features from text documents. As the name implies, it converts documents to vectors. How exactly it does that is beyond the scope of this blog (do see the paper at: ), but its interface is pretty simple. Below is the python code to create vectors from a collection of documents:

# Imports: gensim for the Doc2Vec model, NLTK for the Reuters corpus and tokenizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.corpus import reuters
from nltk.tokenize import word_tokenize

# Load the reuters news articles and convert them to TaggedDocuments
taggedDocuments = [TaggedDocument(words=word_tokenize(reuters.raw(fileId)), tags=[i]) for i, fileId in enumerate(reuters.fileids())]

# Create the doc2vec model
doc2vec = Doc2Vec(size=doc2vec_dimensions, min_count=2, iter=10, workers=12)

# Build the vocabulary from the corpus, then train the doc2vec model
doc2vec.build_vocab(taggedDocuments)
doc2vec.train(taggedDocuments)

(for the complete script see:

To get some intuition on what doc2vec does let’s convert some documents to vectors and look at their properties. The following code will convert documents from the topic jobs and documents from the topic trade to document vectors. With the help of dimensionality reduction tools (PCA and TSNE) we can reduce these high dimensional vectors to 2 dimensions. See scripts/ for the code. These tools work in such a way that coordinates in the high dimensional space that are far apart are also far apart in the 2-dimensional space and vice versa for coordinates that are near each other.


(see the source code at:

What you see here are the document classes, red for the “job” topic documents and blue for the “trade” topic documents. You can easily see that there are definitely regions with more red than blue dots. By doing this we can get some intuition that the features we selected can be used to make a distinction between these 2 classes. Keep in mind that the classifier can use the high dimensional features which probably show a better distinction than this 2-dimensional plot.

Another thing we can do is calculate the similarity between 2 doc vectors (see the similarity function of doc2vec for that: gensim.models.doc2vec.DocvecsArray#similarity). If I pick 50 job vectors, their average similarity to each other is 0.16. The average similarity between 50 trade vectors is 0.13. If we now look at the average similarity between 50 job vectors and 50 trade vectors, we get a lower number: 0.02. We see that the trade vectors are farther apart from the job vectors than they are from each other. This gives us some more intuition that our vectors contain information about the content of the news article.
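The similarity gensim reports here is cosine similarity. A sketch in NumPy, with made-up three-dimensional document vectors standing in for real doc2vec output:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical document vectors: two 'job' articles and one 'trade' article
job_a = np.array([0.9, 0.1, 0.2])
job_b = np.array([0.8, 0.2, 0.1])
trade = np.array([0.1, 0.9, 0.3])

within_topic = cosine_similarity(job_a, job_b)
between_topic = cosine_similarity(job_a, trade)
```

As with the real doc vectors above, vectors of the same topic score higher than vectors of different topics.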

There is also a function that given some example vectors finds the top n similar documents, see gensim.models.doc2vec.DocvecsArray#most_similar. This can also be useful to see if your trained doc2vec model can distinguish between classes. Given a news article, we expect to find more news articles of the same topic nearby.

Training the classifier

Now that we have a trained doc2vec model that can create a document vector given some text we can use that vector to train a neural network in recognizing the class of a vector.

Important to understand is that the doc2vec algorithm is an unsupervised algorithm. During training, we didn’t give it any information about the topic of the news article. We just gave it the raw text of the news article. The models we create during the training phase will be stored and will later be used in the prediction phase. Schematically our algorithm looks like this (for the training phase):

For the classifier, we will use a neural network that will train on all the articles in the training set (the reuters dataset is split up in a training and test set, the test set will later be used to validate the accuracy of the classifier). The code for the classifier looks like this:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(input_dim=doc2vec_dimensions, output_dim=500, activation='relu'))
model.add(Dense(output_dim=1200, activation='relu'))
model.add(Dense(output_dim=400, activation='relu'))
model.add(Dense(output_dim=600, activation='relu'))
model.add(Dense(output_dim=train_labels.shape[1], activation='sigmoid'))
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

(for the complete script see:

This code will build a neural network with 4 hidden layers. For each topic in the Reuters dataset, there will be an output neuron returning a value between 0 and 1 for the probability that the news article is about the topic. Keep in mind that a news article can have several topics. So the classifier can indicate the probability of more than 1 topic at once. This is achieved by using the binary_crossentropy loss function. Schematically the neural network will look like this:


Given a doc vector, the neural network will give a prediction between 0 and 1 for each topic. After the training phase both the model of the doc2vec algorithm and for the neural network will be stored so that they later can be used for the prediction phase.
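The binary_crossentropy choice is what makes multiple topics per article possible: each output neuron is scored as an independent yes/no question. A NumPy sketch of that per-topic loss (our own illustration, not the Keras internals):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent per-topic sigmoid outputs."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))))

# Hypothetical article tagged with both 'grain' and 'wheat' (the first two topics)
y_true = np.array([1.0, 1.0, 0.0, 0.0])
good_loss = binary_crossentropy(y_true, np.array([0.9, 0.8, 0.1, 0.2]))
bad_loss = binary_crossentropy(y_true, np.array([0.1, 0.2, 0.9, 0.8]))
```

Predictions close to the true topic labels get a low loss; predictions that confidently pick the wrong topics get a high one.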


When using a neural network it’s important not to have too few dimensions or too many. If the number of dimensions is too low, the coordinates in the feature space will end up too close to each other, which makes it hard to distinguish them from each other. Too many dimensions will cause the feature space to be too large, and the neural network will have problems relating data points of the same class. Doc2vec is a great tool that will create vectors that are not that large, the default being 300 dimensions. In the Doc2vec paper, it’s mentioned that this is the main advantage over techniques that create a dimension for every unique word in the text, which can result in tens or hundreds of thousands of dimensions.

Prediction phase

For the prediction phase, we load the trained doc2vec model and the trained classifier model.


When we feed the algorithm the text of a news article the doc2vec algorithm will convert it to a doc vector and based on that the classifier will predict a topic. During the training phase, I withheld a small set of news articles from the training of the classifier. We can use that set to evaluate the accuracy of the predictions by comparing the predicted topic with the actual topic. Here are some predicted topics next to their actual topics:

predicted: [‘ship’] – actual: [‘ship’]

predicted: [‘coffee’] – actual: [‘coffee’, ‘lumber’, ‘palm-oil’, ‘rubber’, ‘veg-oil’]

predicted: [‘grain’, ‘wheat’] – actual: [‘grain’, ‘wheat’]

predicted: [] – actual: [‘gold’]

predicted: [‘earn’] – actual: [‘acq’]

predicted: [‘acq’] – actual: [‘tin’]

predicted: [‘interest’, ‘money-fx’] – actual: [‘interest’, ‘money-fx’]


Hopefully, by now I’ve given some intuition on what machine learning is.

First, your data needs to be converted to meaningful feature vectors with just the right amount of dimensions. You can verify the contents of your feature vectors by:

  • Reducing them to 2 dimensions and plot them on a graph to see if similar things end up near each other
  • Given a datapoint, find the closest other data points and see if they are similar

You need to divide your dataset into a training and a test set. Then you can train your classifier on the training set and verify it against the test set.
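For the Reuters dataset the split is predefined, but in general such a split can be as simple as this (a hypothetical helper of our own):

```python
import random

def split_dataset(items, test_fraction=0.2, seed=42):
    """Shuffle a dataset and divide it into a training and a test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

documents = list(range(100))
train_set, test_set = split_dataset(documents)
```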

While this is a process that takes a while to understand and get used to, the very interesting thing is that this algorithm can be used for a lot of different use cases. This post describes classifying news articles. I’ve found that using exactly the same algorithm for other kinds of predictions, sentiment prediction for example, works exactly the same. It’s just a matter of swapping out the topics with positive or negative sentiment.

I’ve used the algorithm on other kinds of text documents: user reviews, product descriptions and medical data. The interesting thing is that the code changes required to apply these algorithms to other domains are minimal: you can use exactly the same algorithms, because the machine itself learns the business logic. Because of that, the code doesn’t need to change. Understanding the described algorithm is not just learning a way to predict the topic of a news article. Basically, it can predict anything based on text data, as long as you have examples. For me, as a software engineer, this is quite surprising. Usually, code is specific to an application and cannot be reused in another application. With machine learning, I can make software that can be applied in multiple very different domains. Very cool!

Source code

Further reading

For more practical tips on machine learning see the paper “A Few Useful Things to Know about Machine Learning” at:

About doc2vec: