Tag Archive: doc2vec

Implementing doc2vec

In a previous post [1], we’ve taken a look at what doc2vec is, who invented it, what it does, what implementations exist, and what has been written about it. In this post, we will work out the details of a version of the simplest doc2vec algorithm: PV-DBOW. Our approach may not be exactly the same as in the original paper [7]. Our aim is a starting point: minimal, easy to understand, and still interesting enough to be expected to work. In a follow-up post, we may develop a Tensorflow implementation. This post should be detailed enough for you to do it yourself, perhaps you’ll beat us to it 🙂 For your inspiration, also refer to the doc2vec Tensorflow code here [18]; that code implements PV-DM, the other doc2vec algorithm.

If you have already watched the videos from the Udacity course on deep learning by Vincent van Hoecken [2] or you already have done some other neural network tutorials, I guess this post should be quite easy to follow. But if this post series is one of the first things you read on neural nets, perhaps you will still be able to follow along, looking up some things along the way yourself.

Before we dive in, I would like to recall the overall strategy from the previous post [1]. Recall that we use doc2vec because we want vector representations of documents, paragraphs or sentences. These vectors should have a fixed, not too large size. We will use these vectors for some task, for example document classification. Doc2vec has the following strategy:

1. We’re training a small network on a task that is somehow related to the end-to-end task.
2. In the middle of this small network we have a small layer.
3. After training, we remove the output layer of our small network (picture literally breaking the network apart).
4. The middle layer is now the output layer of our network, and it produces document vectors now.

It is a quite common pattern to pre-train a network and then use part of that network for some other purpose. Doc2vec is just one example. By working it out in detail, you may get some intuition about why it works. And this way of thinking you can then apply to other tasks, and build similar networks yourself. And, on the side, you may get something out of the details themselves.

The distributed bag of words model (PV-DBOW)

Recall that doc2vec algorithms were originally named paragraph vector algorithms. The authors of the original paper [7] start with a variant of paragraph vectors they call the distributed memory model (PV-DM). Later, they introduce another variant, the distributed bag of words model (PV-DBOW). For this blog post, we’ll start with the latter one, because it is the simpler model of the two. Starting with the simpler model will make things easier to learn. It is also easier to use. It has less parameters that need to be trained. As a consequence, it needs less data to train on. The network we’re training in PV-DBOW looks like this:


It looks a bit different than Figure 3 in the paper [7]. Here, we have drawn each and every unit of the network. Also, we have only room for one word in the output layer on the right. In the paper, Figure 3 has multiple words in the output layer. Our version looks a bit simpler. What we have here is a neural network with three layers. Let’s go over all of them.

The input layer, on the left, has N units. N is the number of documents in the notation used in [7]. During training, we will present a document in the input layer. The document will be encoded as a one-hot vector. Each unit in the input layer corresponds to a single document-id in your corpus. When we present a document, the unit corresponding to its id will be set to 1, and all other units will be set to 0. Together, the units form the document vector \vec{d}. Surprising, isn’t it? The only information we use as input is the document id.

The middle layer has size p. p is a free parameter that is chosen by you. It is the size of the paragraph vectors that doc2vec will output for you. The big fat arrow between the input layer and the hidden layer means that the two layers are fully connected. Each unit in the input layer is connected to each unit in the middle layer. Each connection has a certain weight. We should initialize these weights randomly, with small values, prior to training the network.

If you know a bit about neural networks, you might expect that at this point I am going to say something about a threshold in the middle layer, or about an activation function. But doc2vec, like word2vec, uses a very simple middle layer. One reason for this is that it speeds up computation quite a bit. The activation in that middle layer is just a weighted sum of the activation in the input layer. A compact way to represent this is:

 \hat{\vec{d}} = D\vec{d}

Here, D is an p\times N matrix that contains all the weights. \hat{\vec{d}} is the activation in the middle layer. It is the so-called paragraph or document vector produced by the network for \vec{d}. That is why we denote it here with \hat{\vec{d}}: we view it as an estimation, a representation of \vec{d} in a lower, p-dimensional space. It is also called an “input vector” [7]. Because \vec{d} is a one-hot vector, we have that the n‘th column in D contains \hat{\vec{d_n}}: the paragraph vector for the document with id n. A layer like the middle layer in our network here is often called an embedding layer.

The output layer has M units. In [7], M is used to denote the number of unique words in the vocabulary. The output layer activity in each of its units corresponds to a single word in the vocabulary. The output layer can be seen as a crude estimation of a term vector, I’ve called it \vec{l} in the picture: l for “logits”: a common name for activations in the output layer of a network. As you can see, the output layer is fully connected to the hidden layer. We can represent this with a matrix multiplication again:

 \vec{l} = U\hat{\vec{d}}

Here, U is an M\times p matrix that contains all the weights between the middle and the output layer. It is quite common to also have biases in the output layer, in which case we would write l = U\hat{\vec{d}} + \vec{b}. In expositions of the related word2vec algorithm, biases are sometimes included [5, 7] and sometimes excluded [16]. We omit them here. Note that U has a row for every word in our vocabulary. These rows are sometimes also called “output vectors” for the corresponding words. Let’s refer to these vectors here as \vec{u_m}, where m is the index of the corresponding word in the output layer. Then, we have

 l_m = \vec{u_m}\cdot\hat{\vec{d}}\mathrm{.}

Back to our overall strategy. After training, D contains the document vectors that we will take out and re-use as part of another application. Another way to think about this, is that after training, we will remove the last layer of the network. That would make the middle layer the output layer of the network. And what will it give us if we feed it a document vector? A low-dimensional document vector of a fixed size.

Next, we work on what our network will learn. Or, equivalently, what our objective function is. The function we want to minimise.

The softmax function and a loss function

In [7], we read that we will force the model to predict words randomly sampled from the paragraph. In our model above, we use documents as paragraphs. So what happens is that we give the network a document id, and we ask it to predict a randomly sampled word from the corresponding document. Amazing, isn’t it? Can we hope to achieve this? Well, remember that we can give the network a huge quantity of unique examples: as many as we have words in our corpus! And, in practice, during training, we’ll feed it our entire corpus many times.

Two connected functions we are going to add to our model in this section are the softmax function and a loss function:


The first three layers here are the same as before. The output layer activations are now passed into the softmax function s:

 s(l_i) = \frac{e^{l_i}}{\sum_k{e^{l_k}}}

Some properties of this function, that you can verify yourself:

1. 0 < s(l_i) < 1
2. \sum_i s(l_i) = 1
3. It boosts the activation of the units with the highest activation.
4. If the activation in the logits is higher, s(l_i) values will get closer to one and zero: the network will be more “sure” of its predictions.

The softmax is used for multi-class classification tasks. In multi-class classification tasks each observation is labeled with one of several classes. The task that we set our network, however, is a multi-class multi-label classification [4]: an observation can be labeled with one or more of several classes. Our network is asked to predict the words (multiple labels) in the document from the document id. How do we fix this? Well, as a training example, we’ll feed a document (observation) to our network, each time with a single target word (label). Each time the document is presented as a training example, possibly a different target word is used as label. In this, we follow the same pattern as in the word2vec algorithm Skip-Gram [5, 7]. This algorithm is what inspired PV-DBOW [7].

The result of the softmax function is the fourth layer in the above picture, which I’ve called \hat{\vec{t}}. I’ve called it \hat{\vec{t}} because it is meant to be an approximation of the actual term to be predicted. Because of the first two properties of the softmax function, we can now more or less legally think of the output of our network as an estimated probability distribution over the classes for an input document.

To train our network, we need to measure how well these probabilities were estimated for sampled target terms. For this, we compare \hat{\vec{t}} to the target term vector. That vector is all the way in the right in the picture above, it is called \vec{t}. This is a one-hot vector. We compare \vec{t} and \hat{\vec{t}} using a loss function. The loss function should be low when \hat{\vec{t}} and \vec{t} are similar. A common choice [7, 8, 9, 10] based on cross entropy is:

 E(\hat{\vec{t}}, \vec{t}) = -\sum_i t_i \log{\hat{t}_i}

Here are some properties of E that you can verify for yourself:

1. E(\hat{\vec{t}}, \vec{t}) > 0
2. If t_i is zero, the i‘th element of the sum will be zero, regardless of the value \hat{t}_i}, our prediction.
3. E is low when our prediction, \hat{t}_i, is close to 1.0 for the element of our target vector t_i that is one.

The second fact puzzled me for a bit. Does it not matter at all what a network predicts for units that are zero in the target vector? What if my network predicts a high value for all output neurons? Well, the softmax function prevents this. If all logits are equally high, the normalisation will make them all low. And the network will then incur loss on the unit that is one in the target vector. So, the combination of the softmax and this cross entropy loss is important.

Because \vec{t} is a one-hot vector, we can simplify E a bit further. Let y be the index where t_y = 1. Then

 E(\hat{\vec{t}}, \vec{t}) = -\log{\hat{t}_y}

Training our neural network involves minimising a loss function on a set of training points. For that, we need a loss function that compares a set of training predictions to a set of corresponding correct target vectors. Let’s denote our set of training target vectors as T, and let \vec{t_j} denote the j‘th target vector. As a loss function, we’ll use

 L = \frac{1}{|T|} \sum_{j=1}^{|T|}E(\hat{\vec{t_j}}, \vec{t_j})\mathrm{,}

we are simply averaging E over a set of training examples here.

Combining all our equations up to this point, it is not hard to write out L in full, you can try it yourself:

 L = \frac{1}{|T|} \sum_{j=1}^{|T|}{-\vec{u_{y_j}}\cdot\hat{\vec{d_j}} + \log{\sum_{m=1}^M{ e^{\vec{u_m}\cdot\hat{\vec{d_j}}}}}

Here, y_j is the index of the correct word for training example j in the target vector \vec{t_j}. So, \vec{u_{y_j}} is just the output vector corresponding to that word.

Now there is only one thing to add to our loss function: regularisation. This is usually an important part in trying to prevent a network from overfitting. Still, in the doc2vec and word2vec papers [7, 11, 12] it is not mentioned; we may experiment with omitting it as well. An interesting aspect of doc2vec is that we are not really interested in good prediction on a test set of our network. Rather, we are interested that the paragraph vectors we get in D in the end will be useful for some other purpose. In any case, a common regularisation approach is L2 regularisation:

 L' = L + \lambda \frac{1}{2}||\vec{w}||_2^2

Here, ||\vec{w}||_2^2 is the L2 norm (length) of \vec{w}, squared. \vec{w} here is one big vector with all the network weights in it. In our network, that is all the elements of the matrices U and D. \lambda is a parameter that we have to set in advance, it is not learned by stochastic gradient descent. Such a parameter is often called a hyperparameter.

One way to think about this regularisation is that we are rewarding weight configurations with small weights on average. And now if you think about the fourth property of the softmax function as discussed above, then you can translate this as rewarding weight configurations where the network is less sure of itself.

Now that we have defined our loss function, we have the bull by the horns. It constitutes the full definition of the optimisation problem we are solving, and the full definition of our network. In the next section we very briefly sketch how we minimise L.

Stochastic gradient descent

It would require another series of blog posts to fully explain the way neural networks are trained with backpropagation and gradient descent, but fortunately it is not hard to find a lot of material on that. It is important to understand the details [11], but we give only a very brief outline here, for completeness.

For a given set of input vectors \vec{d_j}, everything in L is constant save the weights in D and U. Our goal is to find a configuration for the weights such that L is minimised. This can be analytically solved actually for the complete training set. But for a big training set that would be computationally inefficient. Instead the idea of stochastic gradient descent is to sample some training samples randomly, so \{\vec{d_1}, \ldots, \vec{d_j}, \ldots\} is then a relatively small set of training examples. Such a set of examples is often called a batch. For this batch, again, L can be thought of as a function of the weights in D and U. These weights span a weight space. During training, we compute the gradient of L in the current point in the weight space. This gradient is the direction in which L increases most sharply. What we want to do is move in the opposite direction, but only slowly, using a learning rate that we will have to determine in advance (including a possible decay function and / or decay rate for this learning rate). The SGD iteration results in a set of small adjustments to all the weights: the weight delta’s. This process of finding the weight delta’s for all the weights in the network is often called backpropagation.

Stochastic gradient descent (SGD) is a very successful algorithm and it is the driving force behind the success of neural networks. Now that we have sketched SGD, it is time to define the final nuts and bolts of our first, simple doc2vec implementation. We start with how we will sample training examples.

Generating training batches

A training example is a tuple (docid, word) in our network. When we sample such tuples, there are a couple of choices to make. The first and most important is what to do with the notion of a text window. Then, we have to deal with the extremes in term frequencies: in natural language, there are many extremely rare words, as well as some extremely frequent words. And last but not least, we have to do something about the computational complexity of our loss function, which contains a summation over our entire vocabulary.

Using a text window

From the description of PV-DBOW [7], it is not easy to determine what is done: “(…) at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the Paragraph Vector.”. The simplest possible interpretation of this is to just randomly sample words. Let’s call this option A. If we go with option A, then we would end up with a model that can predict the words for a given document. Given the document, this would give us a global idea of its contents. This makes option A related to methods like LSA, LDA, and ALS [19].

The description just given suggests more. If we just randomly sample words, why mention a text window at all? The authors mention that PV-DBOW is similar to the word2vec Skip-Gram model [7]. A PV-DBOW approach inspired by the Skip-Gram model is to sample a text window, and then offer all the words in this text window in the same batch together, i.e., in the same iteration of stochastic gradient descent. Let’s call this option B. We could rewrite our loss function a little bit to make it more explicit that we want to correctly predict sets of words that are close together. Then, we could postulate a Naive Bayes assumption: given a document, and a text window, words in the window occur independently of each other. Then we could simplify our loss function and show that option B implements minimising that simplified loss function. See [16] for a similar derivation for Skip-Gram. If we follow option B, we would end up with a model that can predict sets of words that are close together in documents. Words that are close together often have some syntactic and semantic relations to each other. Option B would capture some of that. As do the word2vec algorithms CBOW and Skip-Gram. Since this is a major strength of these algorithms, our first bet would be to use option B for PV-DBOW.

Dealing with term frequencies

Some terms are very infrequent: a common approach is to ignore these terms and focus efforts only on the most frequent, e.g., 100K, 500K, or 1M words. Remaining terms could be mapped to a catch-all token ‘UNKNOWN’ if we want to keep their position in the text [5]. We will do this initially, too.

Other terms are very frequent, words like ‘the’, ‘a’, ‘of’, etc., and do not convey much meaning at all (although they have syntactic functions). In [12], for the Skip-Gram model, training words are sampled with a lower probability if they have a very high frequency, according to a slightly odd heuristic formula. How to combine this with text windows is not immediately obvious. Do we discard each word in our corpus with a probability based on term frequency, and then act as if it was never there, before we even sample text windows? In our implementation, initially, we will not make any adjustments for high frequency words at first; no adjustments are mentioned in [16] either. And if things don’t work, perhaps our first try will be a very crude but deterministic and often used heuristic: remove the top K terms from the corpus before doing anything.

Negative sampling

Finally, we will have to take some steps to make training our network computationally feasible. To see why, consider how big our vocabulary could typically be, and then think of computing the softmax function over and over again during training. We have a choice of methods to fix this, namely hierchical softmax [12] or negative sampling [12, 16]. We’ll opt for negative sampling. The basic idea in terms of our network structure is that in our SGD iteration we will work with a very small output layer. It will contain the correct terms, and only a handful of randomly sampled incorrect terms [2]. This translates to some changes to our loss function [12]. A good exposition of these changes for the word2vec Skip-Gram model is given in [16]. A similar line of reasoning can be followed for PV-DBOW.


We now have a detailed recipe for a PV-DBOW implementation. The level of detail is such that we should now be able to implement it in Tensorflow. Tensorflow has primitives for the first two layers (the embeddings), the output layer (just a matrix multiplication), the softmax-loss-function combo, and even negative sampling. What remains for us to do is to supply the training batches, to tie it all together, and to choose sensible values for the hyperparameters. We may follow-up on this post with an example PV-DBOW implementation, and otherwise, we hope you now feel confident enough to implement it yourself!


1. https://amsterdam.luminis.eu/2017/01/16/googling-doc2vec/
2. https://classroom.udacity.com/courses/ud730
3. https://www.tensorflow.org/api_docs/python/nn/classification
4. https://en.wikipedia.org/wiki/Multi-label_classification
5. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb
6. https://miguelmalvarez.com/2015/03/20/classifying-reuters-21578-collection-with-python-representing-the-data/
7. Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
8. https://classroom.udacity.com/courses/ud730/lessons/6370362152/concepts/63798118260923#
9. https://www.r-bloggers.com/making-sense-of-logarithmic-loss/
10. Pattern Recognition and Machine Learning. Bishop, 2006, p 209
11. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
12. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
13. https://radimrehurek.com/gensim/models/doc2vec.html
14. https://rare-technologies.com/doc2vec-tutorial/
15. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
16. http://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf
17. https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b#.xo9cr44zd
18. https://github.com/wangz10/tensorflow-playground/blob/master/doc2vec.py
19. https://amsterdam.luminis.eu/2016/12/04/alternating-least-squares-implicit-feedback-search-alpha/

Googling doc2vec

On this site, recently, we featured a blog post [12] that used Doc2vec [4]. What is Doc2vec? Where does it come from? What does it do? Why use doc2vec, instead of other algorithms that do the same? What implementations exist? Where can I read more about it? If you, like me, are curious about these questions, read on.

So what is Doc2vec and where does it come from? In recent years some Google papers were published by Tomas Mikolov and friends about a neural network that could be trained to produce so-called paragraph vectors [1, 2, 3, 9]. The authors did not release software with their research papers. So others have tried to implement it. Doc2vec is an implementation of paragraph vectors by the authors of gensim, a much used library for numerical methods in the field of natural language processing (NLP). The name is different, but it is the same algorithm: doc2vec just sounds better than paragraph vectors. It is also a nicer name, because doc2vec builds upon the same neural network architectures that underly those other famous algorithms that go by the name word2vec. If you don’t know word2vec, Google it, there are plenty of resources where you can learn about it. Resources to learn about doc2vec, however, are just a bit less abundant, so in this post we’ll google it for you.

First, what does doc2vec do? Well, it gives you vectors of a fixed length–to be determined by you–that can represent text fragments of varying size, such as sentences, paragraphs, or documents. It achieves this by training a small neural network to perform a prediction task. To train a network, you need labels. In this case, the labels will come from the texts themselves. After the network has been trained, you can re-use a part of it, and this part will give you your sentence / paragraph / document vectors. These vectors can then be used in various algorithms, including document classification [12]. One of the success factors for using doc2vec will be the answer to this: the task you are using the doc2vec vectors for, is it related to the way doc2vec was trained, in a useful way?

What benefits does doc2vec offer over other methods? There are many ways to represent sentences, paragraphs or documents as a fixed size vector. The simplest way is to create a vocabulary of all the words in a corpus, and represent each document by a vector that has an entry for each word in the vocabulary. But such a vector would be quite large, and it would be quite sparse, too (it would contain many zeroes). Some algorithms have difficulty working with sparse and high dimensional vectors.  doc2vec yields vectors of a more manageable size, as determined by you. Again, there are many algorithms that do this for you, such as LDA [18], LSI [19], or Siamese CBOW [17], to name a recent one by a former colleague. To argue for the one or the other, what researchers would normally do is implement the prediction task they care about with several algorithms, and then measure which algorithm performed best. For example, in [9] paragraph vectors are compared to LDA for various tasks; the authors conclude that paragraph vectors outperform LDA. This does not mean that doc2vec will always be best for your particular application. But perhaps it is worth trying out. Running experiments with doc2vec is one way of learning about what it does, when you can use it, when it works well, and when it is less useful.

In terms of implementations, we’ve already mentioned the Doc2Vec class in gensim [4]. There’s also an implementation in deeplearning4j [15]. And Facebook’s fastText may have an implementation, too [16]. Since I like working with Tensorflow, I’ve googled “doc2vec tensorflow” and found a promising, at first sight clean and consise implementation  [13]. And a nice discussion about a few lines of Tensorflow code as well, with the discussion shifting to the gensim implementation [11]. Implementations in other low level neural network frameworks may exist.

Zooming in just a bit, it turns out that doc2vec is not just one algorithm, but rather it refers to a small group of alternative algorithms. And these are of course extended in new research. For example, in [9], a modified version of a particular doc2vec algorithm is used, according to a blog post about the paper [10]. In that blog post, some details on the extension in [9] are given, based on correspondence with the authors of [9].  An implementation of that extension is believed not to be there yet by the author of [10]. In general, it may be impossible to recreate exactly the same algorithms as the authors of the original papers used. Rather, studying concrete implementations is another way of learning about how doc2vec algorithms work.

A third way of learning more is reading. If you know a bit about how neural networks work, you can start by checking the original papers [1, 2, 3, 9]. There are some notes on the papers, too, in blogs by various people [10, 14]. The papers and the blog posts leave some details to the reader, and are not primarily intended as lecture material. The Stanford course on deep learning for NLP has some good lecture notes on some of the algorithms leading up to doc2vec [7], but doc2vec itself is not covered. There are enough posts explaining how to use the gensim Doc2Vec class [5, 6, 8, 12]. Some of these posts do include some remarks on the workings of Doc2Vec [5, 6, 8] or even perform experiments with it [6, 8, 12]. But they do not really drill down to the details of the neural net itself. I could not find a blog post explaining the neural net layout in [13], or reporting on experiments with [13].

Now that you have come this far, wouldn’t it be nice to set out to take a look at how doc2vec, the algorithm, works? With the aim to add some detail and elaboration to the concise exposition in the original papers. And perhaps we can add and discuss some working code, if not too much of it is needed! Stay tuned for more on this.


  1. Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
  2. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
  3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
  4. https://radimrehurek.com/gensim/models/doc2vec.html
  5. https://rare-technologies.com/doc2vec-tutorial/
  6. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
  7. http://cs224d.stanford.edu/lecture_notes/LectureNotes1.pdf
  8. https://ireneli.eu/2016/07/27/nlp-05-from-word2vec-to-doc2vec-a-simple-example-with-gensim/
  9. Andrew N. Dai, Christopher Olah, Quoc V. Le. Document Embedding with Paragraph Vectors, NIPS 2014.
  10. http://building-babylon.net/2015/06/03/document-embedding-with-paragraph-vectors/
  11. https://groups.google.com/forum/#!topic/gensim/0GVxA055yOU
  12. https://amsterdam.luminis.eu/2016/11/15/machine-learning-example/
  13. https://github.com/wangz10/tensorflow-playground/blob/master/doc2vec.py
  14. https://blog.acolyer.org/2016/06/01/distributed-representations-of-sentences-and-documents/
  15. https://deeplearning4j.org/doc2vec
  16. https://github.com/facebookresearch/fastText/issues/26
  17. Tom Kenter, Alexey Borisov, Maarten de Rijke. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. ACL 2016.
  18. https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
  19. https://en.wikipedia.org/wiki/Latent_semantic_analysis


Machine learning – an example

In my previous blog post, I tried to give some intuition on what neural networks do. I explained that when given the right features, the neural network can generalize and identify regions of the same class in the feature space. The feature space consisted of only 2 dimensions so that it could be easily visualized. In this post, I want to look into a more practical problem of text classification. Specifically, I will use the Reuters 21578 news article dataset. I will describe a classification algorithm for this dataset that will utilize a novel feature extraction algorithm for text called doc2vec.

I will also make the point that because we use machine learning, which means the machine will do most of the work, the same algorithms can be used on any kind of text data and not just news articles. The algorithm will not contain any business logic that is specific to news articles. Especially the neural network is a very reusable part. In machine learning theorem a neural net is known as a universal approximator. That means that it can be used to approximate many interesting functions. In practical terms, it means you can use the same neural network architecture for image data, text data, audio data and much more. So trying to understand one application of a neural network can help you understand much more machine learning applications.

Training the Doc2vec model

In the previous post, I explained how important it is to select the right features. Doc2vec is an algorithm that extracts features from text documents. As the name implies it converts documents to vectors. How exactly it does that is beyond the scope of this blog (do see the paper at: https://arxiv.org/pdf/1405.4053v2.pdf) but its interface is pretty simple. Below is the python code to create vectors from a collection of documents:

# Load the reuters news articles and convert them to TaggedDocuments
taggedDocuments = [TaggedDocument(words=word_tokenize(reuters.raw(fileId)), tags=[i]) for i, fileId in enumerate(reuters.fileids())]

# Create and train the doc2vec model
doc2vec = Doc2Vec(size=doc2vec_dimensions, min_count=2, iter=10, workers=12)

# Build the word2vec model from the corpus
# Build the doc2vec model from the corpus

(for the complete script see: reuters-doc2vec-train.py)

To get some intuition on what doc2vec does let’s convert some documents to vectors and look at their properties. The following code will convert documents from the topic jobs and documents from the topic trade to document vectors. With the help of dimensionality reduction tools (PCA and TSNE) we can reduce these high dimensional vectors to 2 dimensions. See scripts/doc2vec-news-article-plot.py for the code. These tools work in such a way that coordinates in the high dimensional space that are far apart are also far apart in the 2-dimensional space and vice versa for coordinates that are near each other.


(see the source code at: doc2vec-news-article-plot.py)

What you see here are the document classes, red for the “job” topic documents and blue for the “trade” topic documents. You can easily see that there are definitely regions with more red than blue dots. By doing this we can get some intuition that the features we selected can be used to make a distinction between these 2 classes. Keep in mind that the classifier can use the high dimensional features which probably show a better distinction than this 2-dimensional plot.

Another thing we can do is calculate the similarity between 2 doc vectors (see the similarity function of doc2vec for that: gensim.models.doc2vec.DocvecsArray#similarity). If I pick 50 job vectors their average similarity to each other is 0.16. The average similarity between 50 trade vectors is 0.13 If we now look at what the average similarity between 50 job vectors and 50 trade vectors we get a lower number: 0.02. We see that the trade vectors are farther apart from the job vectors than that they are from each other. We get some more intuition that our vectors contain information about the content of the news article.

There is also a function that given some example vectors finds the top n similar documents, see gensim.models.doc2vec.DocvecsArray#most_similar. This can also be useful to see if your trained doc2vec model can distinguish between classes. Given a news article, we expect to find more news articles of the same topic nearby.

Training the classifier

Now that we have a trained doc2vec model that can create a document vector given some text we can use that vector to train a neural network in recognizing the class of a vector.

Important to understand is that the doc2vec algorithm is an unsupervised algorithm. During training, we didn’t give it any information about the topic of the news article. We just gave it the raw text of the news article. The models we create during the training phase will be stored and will later be used in the prediction phase. Schematically our algorithm looks like this (for the training phase):

For the classifier, we will use a neural network that will train on all the articles in the training set (the reuters dataset is split up in a training and test set, the test set will later be used to validate the accuracy of the classifier). The code for the classifier looks like this:

model = Sequential()
model.add(Dense(input_dim=doc2vec_dimensions, output_dim=500, activation='relu'))
model.add(Dense(output_dim=1200, activation='relu'))
model.add(Dense(output_dim=400, activation='relu'))
model.add(Dense(output_dim=600, activation='relu'))
model.add(Dense(output_dim=train_labels.shape[1], activation='sigmoid'))
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

(for the complete script see: reuters-classifier-train.py)

This code will build a neural network with 3 hidden layers. For each topic in the Reuters dataset, there will be an output neuron returning a value between 0 and 1 for the probability that the news article is about the topic. Keep in mind that a news article can have several topics. So the classifier can indicate the probability of more than 1 topic at once. This is achieved by using the binary_crossentropy loss function. Schematically the neural network will look like this:


Given a doc vector, the neural network will give a prediction between 0 and 1 for each topic. After the training phase both the model of the doc2vec algorithm and for the neural network will be stored so that they later can be used for the prediction phase.


When using a neural network it’s important not to have too few dimensions or too many. If the number of dimensions is too low the coordinates in the feature space will end up too close to each other which makes it hard to distinguish them from each other. Too many dimensions will cause the feature space to be too large, the neural network will have problems to relate data points of the same class. Doc2vec is a great tool that will create vectors that are not that large, the default being 300 dimensions. In the Doc2vec paper, it’s mentioned that this is the main advantage over techniques that create a dimension for every unique word in the text. This will create in the 10s or 100s thousand dimensions.

Prediction phase

For the prediction phase, we load the trained doc2vec model and the trained classifier model.


When we feed the algorithm the text of a news article the doc2vec algorithm will convert it to a doc vector and based on that the classifier will predict a topic. During the training phase, I withheld a small set of news articles from the training of the classifier. We can use that set to evaluate the accuracy of the predictions by comparing the predicted topic with the actual topic. Here are some predicted topics next to their actual topics:

predicted: [‘ship’] – actual: [‘ship’]

predicted: [‘coffee’] – actual: [‘coffee’, ‘lumber’, ‘palm-oil’, ‘rubber’, ‘veg-oil’]

predicted: [‘grain’, ‘wheat’] – actual: [‘grain’, ‘wheat’]

predicted: [] – actual: [‘gold’]

predicted: [‘earn’] – actual: [‘acq’]

predicted: [‘acq’] – actual: [‘tin’]

predicted: [‘interest’, ‘money-fx’] – actual: [‘interest’, ‘money-fx’]


Hopefully, by now I’ve given some intuition on what machine learning is.

First, your data needs to be converted to meaningful feature vectors with just the right amount of dimensions. You can verify the contents of your feature vectors by:

  • Reducing them to 2 dimensions and plot them on a graph to see if similar things end up near each other
  • Given a datapoint, find the closest other data points and see if they are similar

You need to divide your dataset in a training and a test set. Then you can train your classifier on the training set and verify it against the test set.

While this is a process that takes a while to understand and getting used to, the very interesting thing is that this algorithm can be used for a lot of different use cases. This algorithm describes classifying news articles. I’ve found that using exactly the same algorithm on other kinds of predictions, sentiment prediction for example, works exactly the same. It’s just a matter of swapping out the topics with positive or negative sentiment.

I’ve used the algorithm on other kinds of text documents: user reviews, product descriptions and medical data. The interesting thing is that the code changes required to apply these algorithms on other domains are minimal, you can use exactly the same algorithms. This is because the machine itself learns the business logic. Because of that the code doesn’t need to change. Understanding the described algorithm is not just learning a way to predict the topic of a news article. Basically, it can predict anything based on text data as long as you have examples. For me as software engineer, this is quite surprising. Usually, code is specific to an application and cannot be reused in another application. With machine learning, I can make software that can be applied in multiple very different domains. Very cool!

Source code


Further reading

For more practical tips on machine learning see the paper “A Few Useful Things to Know about Machine Learning” at: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

About doc2vec: