Tag Archive: machine learning



How to bring your Python machine learning model into production using RESTful APIs

In recent years there has been much attention on bringing machine learning models into production, whereas until a few years ago the results of machine learning ended up in slides or dashboards.
Bringing machine learning into production matters because it lets you integrate the outputs of machine learning with other systems.

What does “bring in production” mean?

Bringing a machine learning model into production means running the model regularly and integrating and using the model's output in other systems.

There are several ways to bring your machine learning models into production, such as:

  1. Build a web service around the model and call it in real time (API calls, microservice architecture)
  2. Schedule your code to run regularly (for example with Oozie or Airflow)
  3. Stream analytics (such as Spark Streaming) for a lambda/kappa architecture

The focus of this article is the first approach. This method is typically used in web-based environments and microservices architectures.

Python and modeling

In this article, we build a sample machine learning model for the Online Shoppers' Purchasing Intention dataset, available and discussed at https://bit.ly/2UnSeRX.
Below you can find the code for the data preparation and modeling:


import pandas as pd # to work with dataframes
import numpy as np # for arithmetic operations
from time import strptime # to convert month abbreviations to numeric values
from sklearn.model_selection import train_test_split # to split up the samples
from sklearn.tree import DecisionTreeRegressor # regression tree model
from sklearn.metrics import confusion_matrix # to check the confusion matrix and evaluate the accuracy

#reading the data
dataset=pd.read_csv(".../online_shoppers_intention.csv")

#preparing for split
df=dataset.drop("Revenue",axis=1)
y=dataset["Revenue"].map(lambda x: int(x))
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)

#making a copy (to be able to change the values)
X_train=X_train.copy()
X_test=X_test.copy()

#data prep phase
def dataPrep(localData):
    # "June" is written in full while the other months are abbreviated --> turn everything into abbreviations
    localData["Month"]= localData["Month"].map(lambda x: str.replace(x,"June","Jun"))
    # our model doesn't ingest text, so we transform the month into an int
    localData["Month"]= localData["Month"].map(lambda x: strptime(x,"%b").tm_mon)
    # the Weekend flag should also be turned into an int
    localData["Weekend"]= localData["Weekend"].map(lambda x: int(x))
    # turning the visitor-type string into integer category codes
    localData["VisitorType"] = localData["VisitorType"].astype('category').cat.codes
    return localData

#Sending the data through data prep phase
X_train=dataPrep(X_train)
X_test=dataPrep(X_test)
#define the regression tree
regr_tree = DecisionTreeRegressor(max_depth=200)
#fitting the tree on the training data
regr_tree.fit(X_train, y_train)
# running the predictions
pred=regr_tree.predict(X_test)
# looking at the confusion matrix (rounding, because a regression tree can return fractional leaf values)
confusion_matrix(y_test, (pred > 0.5).astype(int))


For data scientists, the above code should be very familiar: we read the data, do a little data wrangling and fit a decision tree.

Save the model

The next step, which does not regularly appear in a data scientist's workflow, is to save the model to disk. This step is necessary if you bring your Python code into production.
Below you can see how "joblib" can be of assistance here:


from sklearn.externals import joblib
joblib.dump(regr_tree, '.../model3.pkl')
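
A quick sanity check (a minimal sketch, assuming the same X_test from the modeling step and the same file path) is to load the model back and verify that it reproduces the in-memory model's predictions:

from sklearn.externals import joblib
import numpy as np

# load the model back from disk
restored_model = joblib.load('.../model3.pkl')

# the restored model should reproduce the predictions of the in-memory model
assert np.array_equal(regr_tree.predict(X_test), restored_model.predict(X_test))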


Build your Flask back-end

If you are not familiar with building back-end programs and RESTful APIs, I highly recommend reading https://bit.ly/2AVTwxW and other related materials. In short, web services and RESTful APIs are servers that provide functions; an application can call those functions remotely and get the outputs back. In our example, we call our machine learning model from anywhere over the internet via the TCP/IP protocol. Once the model is called with the data, the classification result is returned to the client, i.e. the computer that called the machine learning model.
Discussing the details of web services and web APIs is beyond the scope of this article, but you can find many interesting articles on the subject with an internet search.
Below we use Flask to build a web service around the machine learning model.


from flask import Flask, request
from flask_restful import Resource, Api
from sklearn.externals import joblib
import pandas as pd


app = Flask(__name__)
api = Api(app)



class Classify(Resource):
    def get(self): # GET is used here because its responses can be cached, which generally helps speed
        data = request.get_json() # reading the JSON payload
        data1 = pd.DataFrame.from_dict(data,orient='index') # converting the data into a DataFrame (our model does not ingest JSON)
        data1=data1.transpose() # the DataFrame built from the JSON dict is row-oriented, so we transpose it into columns
        model = joblib.load('../model3.pkl') # loading the model from disk
        result = list(model.predict(data1)) # conversion to list because numpy.ndarray cannot be jsonified
        return result # returning the result of classification 



api.add_resource(Classify, '/classify')

app.run(port=5001)


Test your API

You can use various tools to test whether the back-end works. I use Postman to check that the API is working.
Keep in mind that we implemented a GET request in our Flask application. The motivation for choosing GET is that web servers can cache the results, which helps the speed of the web service.
Another consideration is that we send the data as JSON (in the shape it has after the data preparation phase), and the results come back as JSON as well.


{
    "Administrative":0,
    "Administrative_Duration": 0.0,
    "Informational": 0,
    "Informational_Duration": 0.0,
    "ProductRelated": 1,
    "ProductRelated_Duration": 0.0,
    "BounceRates": 0.2,
    "ExitRates": 0.2,
    "PageValues": 0.0,
    "SpecialDay": 0.0,
    "Month": 5,
    "OperatingSystems": 2,
    "Browser": 10,
    "Region": 5,
    "TrafficType": 1,
    "VisitorType": 87093223,
    "Weekend":0
}
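
If you prefer testing from code rather than Postman, a small sketch with the requests library could look like the following (the payload is the example above and the port matches the app.run(port=5001) call; the returned value depends on the trained model):

import requests

payload = {
    "Administrative": 0, "Administrative_Duration": 0.0,
    "Informational": 0, "Informational_Duration": 0.0,
    "ProductRelated": 1, "ProductRelated_Duration": 0.0,
    "BounceRates": 0.2, "ExitRates": 0.2, "PageValues": 0.0,
    "SpecialDay": 0.0, "Month": 5, "OperatingSystems": 2,
    "Browser": 10, "Region": 5, "TrafficType": 1,
    "VisitorType": 87093223, "Weekend": 0
}

# GET request with a JSON body, matching the Flask resource defined above
response = requests.get("http://localhost:5001/classify", json=payload)
print(response.json())  # e.g. [0.0] or [1.0]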


Microservices architecture

I personally like bringing machine learning into production using RESTful APIs, and the motivation behind it is the microservices architecture. A microservices architecture lets developers build loosely coupled services and enables continuous delivery and deployment.

Scaling up

To scale up your web service there are many options, of which I would recommend load balancing using Kubernetes.


Looking back on AWS Summit Benelux 2018

Last week I visited AWS Summit Benelux together with Sander. AWS Summit is all about cloud computing and the topics that surround cloud computing. This being my first AWS conference I can say it was a really nice experience. Sure there was room for improvement (no coffee or tea after the opening keynote being one), but other than that it was a very good experience. Getting inside was a breeze with all the different check-in points and after you entered you were directly on the exhibitor floor where a lot of Amazon partners showed their products.

Opening keynote

The day started with an introduction by Kamini Aisola, Head of Amazon Benelux. With this being my first AWS summit it was great to see Kamini showing some numbers about the conference: 2000 attendees and 28 technical sessions. She also showed us the growth pattern of AWS, with growth of 49% compared to last year. That's really impressive!

Who are builders?

Shortly after, Amazon.com CTO Werner Vogels started with his opening keynote. Werner showed how AWS evolved from being ‘just’ an IaaS company to now offering more than 125 different services. More than 90% of the developed services were based on customer feedback from the last couple of years. That’s probably one of the reasons why AWS is growing so rapidly and customers are adopting the AWS platform.

What I noticed throughout the entire keynote is that AWS is constantly thinking about what builders want to build (in the cloud) and what kind of tools those builders need to have to be successful. These tools come in different forms and sizes, but I noticed there is a certain pattern in how services evolve or are grown at AWS. The overall trend I noticed during the talks is that engineers or builders should have to spend less time focussing on lower level infrastructure and can start to really focus on delivering business value by leveraging the services that AWS has to offer.

During the keynote Werner ran through a couple of different focus areas for which he showed what AWS is currently offering. In this post I won’t go through all of them, because I expect you can probably watch a recording of the keynote on youtube soon, but I’ll highlight a few.

Let’s first start with the state of Machine Learning and analytics. Werner looked back at how machine learning evolved at Amazon.com and how services were developed to make machine learning more accessible for teams within the organisation. Out of this came a really nice mission statement:

AWS wants to put machine learning in the hands of every developer and data scientist.

To achieve this mission, AWS currently offers a layered ML stack to engineers looking into using ML on the AWS platform.

The layers go from low-level libraries to pre-built functionalities based on these lower level layers. I really liked the fact that these services are built in such a way that engineers can decide at which level of complexity they want to start using the ML services offered by AWS. Most of the time data engineers and data scientists will start from SageMaker or even lower, but most application developers might just want to use a pre-built functionality like image recognition, text processing or speech recognition. See for instance this really awesome post on using facial recognition by my colleague Roberto.

Another example of this layered approach was container support on AWS. A few years back Amazon added container support to their offering with Amazon Elastic Container Service (Amazon ECS). Amazon ECS helped customers run containers on AWS without having to manage servers or their own container orchestration software; ECS delivered all of this. Fast forward a few years and Amazon now offers Amazon EKS (managed Kubernetes on AWS) after they noticed that about 63% of managed Kubernetes clusters ran on AWS. Kubernetes has become the current industry standard when it comes to container orchestration, so this makes a lot of sense. In addition, Amazon now also offers Amazon Fargate. With Fargate they take the next step: Fargate allows you as the developer to focus on running containers 'without having to think about managing servers or clusters'.

During his keynote, Werner also mentioned the Well-Architected framework. The Well-Architected framework has been developed to help cloud architects run their applications in the cloud based on AWS best practices. When implemented correctly it allows you to fully focus on your functional requirements to deliver business value to your customers. The framework is based on the following five pillars:

  1. Operational Excellence
  2. Security
  3. Reliability
  4. Performance Efficiency
  5. Cost Optimization

I had not heard about the framework before, so during the weekend I read through some of its documentation. Some of the items are pretty straightforward, but others might give you some insights into what it means to run applications in the cloud. One aspect of the Well-Architected framework, Security, had been recurring throughout the entire keynote.

Werner emphasised a very important point during his presentation:

Security is EVERYONE’s job

With all the data breaches happening lately I think this is a really good point to make. Security should be everybody’s number one priority these days.

During the keynote, there were a couple of customers that showed how AWS had helped them achieve a certain goal. Bastiaan Terhorst, CPO at WeTransfer explained that being a cloud-scale company comes with certain problems. He explained how they moved from a brittle situation towards a more scalable solution. They could not modify the schema of their DB anymore without breaking the application, which is horrible if you reach a certain scale and customer base. They had to rearchitect the way they worked with incoming data and using historic data for reporting. I really liked the fact that he shared some hard-learned lessons about database scalability issues that can occur when you reach a certain scale.

Tim Bogaert, CTO at de Persgroep, also showed how they moved from being a silo-ed organization with its own datacenters and long-running waterfall projects towards being all-in on AWS with an agile approach and teams following the "You Build It, You Run It" mantra. It was an interesting story because I see a lot of larger enterprises still struggling with these transitions.

After the morning keynote, the breakout sessions started. There were 7 parallel tracks, all with different topics, so there was plenty to choose from. During the day I attended only a few, so here goes.

Improve Productivity with Continuous Integration & Delivery

This really nice talk by Clara Ligouri (software engineer for AWS Developer Tools) and Jamie van Brunschot (Cloud engineer at Coolblue) gave a good insight into all the different tools provided by AWS to support the full development and deployment lifecycle of an application.

Clara modified some code in Cloud9 (the online IDE), debugged some code, ran CI jobs, tests and deployments all from within her browser, and pushed a new change to production within a matter of minutes. It shows how far cloud-native development has come. I looked at Cloud9 years ago, way before they were acquired by Amazon. I've always been a bit skeptical when it comes to using an online IDE. I remember having some good discussions with the CTO at my former company about whether this would really be the next step for IDEs and software development in general. I'm just so comfortable with IntelliJ for Java development and it always works (even if I do not have any internet ;-)). I do wonder if anybody reading this is already using Cloud9 (or any other web IDE) and is doing his or her development fully in the cloud. If you do, please leave a comment, I would love to learn from your experiences. The other tools like CodePipeline and CodeDeploy definitely looked interesting, so I need to find some time to play around with them.

GDPR

Next up was a talk on GDPR. The room was quite packed. I didn’t expect that though, because everybody should be GDPR compliant by now right? 🙂 Well not really. Companies are still implementing changes to be compliant with GDPR. The talk by Christian Hesse looked at different aspects of GDPR like:

  • The right to data portability
  • The right to be forgotten
  • Privacy by design
  • Data breach notification

He also talked about the shared responsibility model when it comes to being GDPR compliant. AWS as the processor of personal data and the company using AWS being the controller are both responsible for making sure data stays safe. GDPR is a hot topic and I guess it will stay so for the rest of the year at least. It’s something that we as engineers will always need to keep in the back of our minds while developing new applications or features.

Serverless

In the afternoon I also attended a talk on Serverless by Prakash Palanisamy (Solutions Architect, Amazon Web Services) and Joachim den Hertog (Solutions Architect, ReSnap / Albelli). This presentation gave a nice overview of Serverless and Step Functions, but also showed new improvements like the Serverless Application Repository, safe Serverless deployments and incremental deployments. Joachim gave some insights into how Albelli was using Serverless and Machine Learning on the AWS platform for their online photo book creator application called ReSnap.

Unfortunately I had to leave early, so I missed the end of the Serverless talk and the last breakout session, but all in all AWS Summit Benelux was a very nice experience with some interesting customer cases and architectures. For a ‘free’ event it was amazingly organized, I learned some new things and had a chance to speak with some people about how they used AWS. It has triggered me to spend some more time with AWS and its services. Let’s see what interesting things I can do on the next Luminis TechDay.

Build On!


Alternating Least Squares with implicit feedback – The search for alpha

So you want to build a recommendation engine with ALS… You search the internet for some example code in your language of choice… You copy paste the code and start tweaking… But then you realize that your data is different from all the examples you found online. You don’t have explicit ratings in some range from 1 to 10; instead, you have click events where 1 means ‘clicked’. Will you still be able to use ALS? And if so, how?

A brief recap on Collaborative Filtering and Alternating Least Squares

Collaborative Filtering is the technique of predicting the preference of a user by collecting the preferences of many users. The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue i than to have the opinion on i of a person chosen randomly (Wikipedia).
A preference takes the form of a (user, item, rating) triple. Collecting them yields a sparse matrix R_{u \times i} of known ratings of users u for items i. The task is to fill in the missing values. In the Latent Model approach to Collaborative Filtering, we do this by decomposing R_{u \times i} into a user matrix X_{u \times g} and an items matrix Y_{i \times g} so that we can find the ‘hidden’ features g of users and items. (In the case of movies one could think of these hidden features as genres.) By taking the product of the user matrix and item matrix, we can reconstruct the (complete) ratings matrix \hat{R} = X \cdot Y^T. Or for individual ratings: \hat{r}_{ui} = x_{u}^T y_i
To compute these factors, we will first randomly initialize X and Y and iteratively recompute them by minimizing the loss function L:
 \sum\limits_{u,i} (r_{ui} - x_u^T y_i)^{2} + \lambda \Big( \sum\limits_u \|x_u\|^{2} + \sum\limits_i \|y_i\|^{2} \Big)
The first term in L is the sum of squared errors and the second term is used for regularization. In order to minimize the loss function, we will take the derivatives with respect to x and y and solve for 0.

 \begin{aligned} \frac{\partial L}{\partial x_u} &= 0 \\ -2\sum\limits_i(r_{ui} - x_u^T y_i) y_i^T + 2 \lambda x_u^T &= 0 \\ -2 (r_u^T - x_u^T Y^T)Y + 2 \lambda x_u^T &= 0 \\ -2 r_u^T Y + 2 x_u^T Y^T Y + 2 \lambda x_u^T &= 0 \\ x_u^T Y^T Y + \lambda x_u^T &= r_u^T Y \\ x_u^T \big( Y^T Y + \lambda I \big) &= r_u^T Y \\ \big( Y^T Y + \lambda I \big) x_{u} &= Y^T r_u \\ x_u &= \big( Y^T Y + \lambda I \big)^{-1} Y^T r_u \end{aligned}

And for y:
 \begin{aligned} \frac{\partial L}{\partial y_i} &= 0 \\ -2\sum\limits_u(r_{ui} - y_i^T x_u) x_u^T + 2 \lambda y_i^T &= 0 \\ y_i &= \big( X^T X + \lambda I \big)^{-1} X^T r_i \end{aligned}

Recomputing x_{u} and y_i can be done with Stochastic Gradient Descent, but this is a non-convex optimization problem. We can convert it into a set of quadratic problems, by keeping either x_u or y_i fixed while optimizing the other. In that case, we can iteratively solve x and y by alternating between them until the algorithm converges. This is Alternating Least Squares.
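
To make the alternation concrete, here is a minimal NumPy sketch of the closed-form updates derived above (an illustrative sketch, assuming a small dense ratings matrix R with missing ratings stored as zeros; real implementations exploit sparsity):

import numpy as np

def als(R, n_factors=10, n_iters=15, lam=0.1):
    n_users, n_items = R.shape
    X = np.random.rand(n_users, n_factors)  # user factors
    Y = np.random.rand(n_items, n_factors)  # item factors
    I = np.eye(n_factors)

    for _ in range(n_iters):
        # keep Y fixed and solve x_u = (Y^T Y + lambda*I)^-1 Y^T r_u for every user
        A = Y.T @ Y + lam * I
        for u in range(n_users):
            X[u] = np.linalg.solve(A, Y.T @ R[u, :])
        # keep X fixed and solve y_i = (X^T X + lambda*I)^-1 X^T r_i for every item
        B = X.T @ X + lam * I
        for i in range(n_items):
            Y[i] = np.linalg.solve(B, X.T @ R[:, i])
    return X, Y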

Implicit Feedback

Unfortunately, you don’t have ratings, you have click events. And a ‘click’ event does not necessarily mean a user really likes the product; the user could be curious about the product for some other reason. Even when you are using ‘buy’ events you are not in the clear, because people buy gifts for other people all the time. Furthermore, the absence of a click event does not imply a dislike for a product i. So can you still use ALS? Yes, you can still use ALS, but you have to take into account the fact that you have implicit ratings/feedback. Luckily, your preferred machine learning library shows there is an ‘implicit’ switch on the ALS interface and that there is an ‘alpha’ parameter involved as well.

So what is this \alpha character?

To understand alpha, we should go to the source, which is Hu et al. 2008 (1). The authors suggest splitting each rating into a preference and a confidence level. The preference is calculated by capping the rating to a binary value.
 p_{ui} = \begin{cases} 1, & r_{ui} > 0 \\ 0, & r_{ui} = 0 \end{cases}

The confidence level is defined as:
c_{ui} = 1 + \alpha r_{ui}
For a rating of 0 we would have a minimum confidence of 1 and if the rating increases, the confidence increases accordingly. The rate of increase is controlled by alpha. So alpha reflects how much we value observed events versus unobserved events. Factors are now computed by minimizing the following loss function L:
 \sum\limits_{u,i} c_{ui} (p_{ui} - x_{u}^{T}y_i)^{2} + \lambda \Big( \sum\limits_u \|x_u\|^{2} + \sum\limits_i \|y_i\|^{2} \Big)

Now suppose that for a given rating r_{ui}, x_{u}^T y_i is very large, so that the squared residual (p_{ui} - x_{u}^{T}y_i)^{2} is very large; then the rating r_{ui} has a big impact on our loss function. And it should! x_{u}^T y_i will be drawn towards the 0-1 range, which is a good thing, because we want to predict whether the event will occur or not (0 or 1).
Alternatively, suppose that for a given rating r_{ui} we have observed many events, and suppose also that our initial x_{u}^T y_i value is close to 0, so that the squared residual (p_{ui} - x_{u}^{T}y_i)^{2} approximates 1. The rating r_{ui} will then still have quite some impact on our loss function, because our confidence c_{ui} is large. Again, this is a good thing, because in this case we want x_{u}^T y_i to move towards 1.
If either the confidence level is low or the residual is low, there is not much impact on the loss function, so the update of x_u and y_i will be small.
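
Following Hu et al., the confidence weights change the user update into x_u = \big( Y^T C^u Y + \lambda I \big)^{-1} Y^T C^u p_u, where C^u is the diagonal matrix of confidences c_{ui}. As an illustration, a minimal NumPy sketch of a single user update (assuming an item factor matrix Y and a row of raw event counts r_u; alpha=40 is just a common starting point, not a recommendation):

import numpy as np

def update_user(Y, r_u, alpha=40.0, lam=0.1):
    # Y   : item factor matrix (n_items x n_factors)
    # r_u : raw event counts for this user (n_items,)
    n_factors = Y.shape[1]
    p_u = (r_u > 0).astype(float)  # binary preference p_ui
    c_u = 1.0 + alpha * r_u        # confidence c_ui = 1 + alpha * r_ui
    # x_u = (Y^T C^u Y + lambda*I)^-1 Y^T C^u p_u
    YtCY = (Y * c_u[:, None]).T @ Y
    x_u = np.linalg.solve(YtCY + lam * np.eye(n_factors), Y.T @ (c_u * p_u))
    return x_u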

Conclusion

Now that we have some background on this alpha, we can safely copy-paste the recommender engine code we found online and expand it so that it includes the alpha parameter. All that is left now is some extra parameter tuning and we are done! After that, we can run our final model on the test set and we can calculate the root mean squared error… Wait.. What..? Somehow, that metric just doesn’t feel right. Oh, well… Enough for today 🙂

References

(1) Y. Hu, Y. Koren and C. Volinsky, “Collaborative Filtering for Implicit Feedback Datasets”, 2008


Machine learning – an example

In my previous blog post, I tried to give some intuition on what neural networks do. I explained that when given the right features, the neural network can generalize and identify regions of the same class in the feature space. The feature space consisted of only 2 dimensions so that it could be easily visualized. In this post, I want to look into a more practical problem of text classification. Specifically, I will use the Reuters 21578 news article dataset. I will describe a classification algorithm for this dataset that will utilize a novel feature extraction algorithm for text called doc2vec.

I will also make the point that, because we use machine learning, which means the machine does most of the work, the same algorithms can be used on any kind of text data and not just news articles. The algorithm will not contain any business logic that is specific to news articles. Especially the neural network is a very reusable part. In machine learning theory, a neural net is known as a universal approximator. That means that it can be used to approximate many interesting functions. In practical terms, it means you can use the same neural network architecture for image data, text data, audio data and much more. So trying to understand one application of a neural network can help you understand many more machine learning applications.

Training the Doc2vec model

In the previous post, I explained how important it is to select the right features. Doc2vec is an algorithm that extracts features from text documents. As the name implies it converts documents to vectors. How exactly it does that is beyond the scope of this blog (do see the paper at: https://arxiv.org/pdf/1405.4053v2.pdf) but its interface is pretty simple. Below is the python code to create vectors from a collection of documents:

from nltk.corpus import reuters
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

doc2vec_dimensions = 300  # dimensionality of the document vectors (doc2vec's default)

# Load the reuters news articles and convert them to TaggedDocuments
taggedDocuments = [TaggedDocument(words=word_tokenize(reuters.raw(fileId)), tags=[i]) for i, fileId in enumerate(reuters.fileids())]

# Create the doc2vec model
doc2vec = Doc2Vec(size=doc2vec_dimensions, min_count=2, iter=10, workers=12)

# Build the vocabulary from the corpus
doc2vec.build_vocab(taggedDocuments)
# Train the doc2vec model on the corpus
doc2vec.train(taggedDocuments)

(for the complete script see: reuters-doc2vec-train.py)

To get some intuition on what doc2vec does let’s convert some documents to vectors and look at their properties. The following code will convert documents from the topic jobs and documents from the topic trade to document vectors. With the help of dimensionality reduction tools (PCA and TSNE) we can reduce these high dimensional vectors to 2 dimensions. See scripts/doc2vec-news-article-plot.py for the code. These tools work in such a way that coordinates in the high dimensional space that are far apart are also far apart in the 2-dimensional space and vice versa for coordinates that are near each other.
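
For intuition, a rough sketch of what such a plotting script does (jobVectors and tradeVectors are placeholder lists of document vectors for the two topics; the actual code is in doc2vec-news-article-plot.py):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# jobVectors / tradeVectors: doc2vec vectors for the "job" and "trade" documents (placeholders)
vectors = np.vstack([jobVectors, tradeVectors])
colors = ['red'] * len(jobVectors) + ['blue'] * len(tradeVectors)

# first reduce with PCA (faster), then map to 2 dimensions with TSNE
reduced = PCA(n_components=50).fit_transform(vectors)
coords = TSNE(n_components=2).fit_transform(reduced)

plt.scatter(coords[:, 0], coords[:, 1], c=colors)
plt.show()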

(figure: 2-dimensional projection of the document vectors; red = "job" topic, blue = "trade" topic)

(see the source code at: doc2vec-news-article-plot.py)

What you see here are the document classes, red for the “job” topic documents and blue for the “trade” topic documents. You can easily see that there are definitely regions with more red than blue dots. By doing this we can get some intuition that the features we selected can be used to make a distinction between these 2 classes. Keep in mind that the classifier can use the high dimensional features which probably show a better distinction than this 2-dimensional plot.

Another thing we can do is calculate the similarity between 2 doc vectors (see the similarity function of doc2vec for that: gensim.models.doc2vec.DocvecsArray#similarity). If I pick 50 job vectors, their average similarity to each other is 0.16. The average similarity between 50 trade vectors is 0.13. If we now look at the average similarity between 50 job vectors and 50 trade vectors, we get a lower number: 0.02. We see that the trade vectors are farther away from the job vectors than they are from each other. This gives us some more intuition that our vectors contain information about the content of the news articles.

There is also a function that given some example vectors finds the top n similar documents, see gensim.models.doc2vec.DocvecsArray#most_similar. This can also be useful to see if your trained doc2vec model can distinguish between classes. Given a news article, we expect to find more news articles of the same topic nearby.
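
As an illustration, such checks could look roughly like this with the (older) gensim API referenced above; jobIds and tradeIds are placeholder lists of document tags for the two topics:

import itertools
import numpy as np

def average_similarity(doc2vec, tags_a, tags_b):
    # average pairwise similarity between two sets of trained document vectors
    sims = [doc2vec.docvecs.similarity(a, b)
            for a, b in itertools.product(tags_a, tags_b) if a != b]
    return np.mean(sims)

print(average_similarity(doc2vec, jobIds, jobIds))    # similarity within the "job" topic
print(average_similarity(doc2vec, jobIds, tradeIds))  # similarity between "job" and "trade"

# top 10 most similar documents to a given document
print(doc2vec.docvecs.most_similar([jobIds[0]], topn=10))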

Training the classifier

Now that we have a trained doc2vec model that can create a document vector given some text we can use that vector to train a neural network in recognizing the class of a vector.

Important to understand is that the doc2vec algorithm is an unsupervised algorithm. During training, we didn't give it any information about the topic of the news article. We just gave it the raw text of the news article. The models we create during the training phase will be stored and will later be used in the prediction phase. Schematically our algorithm looks like this (for the training phase):

(figure: schematic of the training phase)

For the classifier, we will use a neural network that trains on all the articles in the training set (the Reuters dataset is split into a training and a test set; the test set will later be used to validate the accuracy of the classifier). The code for the classifier looks like this:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(input_dim=doc2vec_dimensions, output_dim=500, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(output_dim=1200, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(output_dim=400, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(output_dim=600, activation='relu'))
model.add(Dropout(0.3))
# one output neuron per topic; sigmoid gives each topic an independent probability
model.add(Dense(output_dim=train_labels.shape[1], activation='sigmoid'))
# binary_crossentropy treats every topic as an independent yes/no decision (multi-label)
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

(for the complete script see: reuters-classifier-train.py)

This code will build a neural network with four hidden layers. For each topic in the Reuters dataset, there will be an output neuron returning a value between 0 and 1 for the probability that the news article is about that topic. Keep in mind that a news article can have several topics, so the classifier can indicate the probability of more than one topic at once. This is achieved by using the binary_crossentropy loss function. Schematically the neural network will look like this:

(figure: schematic of the neural network classifier, with one output neuron per topic)

Given a doc vector, the neural network will give a prediction between 0 and 1 for each topic. After the training phase, both the doc2vec model and the neural network model will be stored so that they can later be used in the prediction phase.
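
Training the classifier on the document vectors and persisting both models could look roughly like this (a sketch; train_vectors and train_labels are the inferred document vectors and the one-hot topic labels from the training set, and the file names are arbitrary):

# train the classifier on the document vectors
model.fit(train_vectors, train_labels, nb_epoch=10, batch_size=32, validation_split=0.1)

# persist both models for the prediction phase
doc2vec.save('reuters.doc2vec')      # gensim doc2vec model
model.save('reuters-classifier.h5')  # keras classifier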

Dimensionality

When using a neural network it's important not to have too few dimensions or too many. If the number of dimensions is too low, the coordinates in the feature space end up too close to each other, which makes it hard to distinguish them from each other. Too many dimensions will cause the feature space to be too large, and the neural network will have problems relating data points of the same class. Doc2vec is a great tool that creates vectors that are not that large, the default being 300 dimensions. In the doc2vec paper, it's mentioned that this is the main advantage over techniques that create a dimension for every unique word in the text, which easily results in tens or hundreds of thousands of dimensions.

Prediction phase

For the prediction phase, we load the trained doc2vec model and the trained classifier model.

(figure: schematic of the prediction phase)

When we feed the algorithm the text of a news article the doc2vec algorithm will convert it to a doc vector and based on that the classifier will predict a topic. During the training phase, I withheld a small set of news articles from the training of the classifier. We can use that set to evaluate the accuracy of the predictions by comparing the predicted topic with the actual topic. Here are some predicted topics next to their actual topics:

title: AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS HIT
predicted: [‘ship’] – actual: [‘ship’]

title: INDONESIAN COMMODITY EXCHANGE MAY EXPAND
predicted: [‘coffee’] – actual: [‘coffee’, ‘lumber’, ‘palm-oil’, ‘rubber’, ‘veg-oil’]

title: SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE
predicted: [‘grain’, ‘wheat’] – actual: [‘grain’, ‘wheat’]

title: WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRALIA
predicted: [] – actual: [‘gold’]

title: SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERGER
predicted: [‘earn’] – actual: [‘acq’]

title: SUBROTO SAYS INDONESIA SUPPORTS TIN PACT EXTENSION
predicted: [‘acq’] – actual: [‘tin’]

title: BUNDESBANK ALLOCATES 6.1 BILLION MARKS IN TENDER
predicted: [‘interest’, ‘money-fx’] – actual: [‘interest’, ‘money-fx’]
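
As a rough sketch of this prediction phase (assuming the models saved earlier, a list of topic names in the same order as the training labels, and an arbitrary probability threshold of 0.5):

from gensim.models.doc2vec import Doc2Vec
from keras.models import load_model
from nltk.tokenize import word_tokenize

doc2vec = Doc2Vec.load('reuters.doc2vec')
classifier = load_model('reuters-classifier.h5')

def predict_topics(text, topics, threshold=0.5):
    # convert the raw article text to a document vector
    vector = doc2vec.infer_vector(word_tokenize(text))
    # one probability per topic, in the same order as the training labels
    probabilities = classifier.predict(vector.reshape(1, -1))[0]
    return [topic for topic, p in zip(topics, probabilities) if p > threshold]

print(predict_topics(article_text, reuters_topics))  # article_text and reuters_topics are placeholders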

Conclusion

Hopefully, by now I’ve given some intuition on what machine learning is.

First, your data needs to be converted to meaningful feature vectors with just the right number of dimensions. You can verify the contents of your feature vectors by:

  • Reducing them to 2 dimensions and plot them on a graph to see if similar things end up near each other
  • Given a datapoint, find the closest other data points and see if they are similar

You need to divide your dataset into a training and a test set. Then you can train your classifier on the training set and verify it against the test set.

While this is a process that takes a while to understand and get used to, the very interesting thing is that this algorithm can be used for a lot of different use cases. This post describes classifying news articles, but I've found that using exactly the same algorithm for other kinds of predictions, sentiment prediction for example, works exactly the same. It's just a matter of swapping out the topics with positive or negative sentiment.

I've used the algorithm on other kinds of text documents: user reviews, product descriptions and medical data. The interesting thing is that the code changes required to apply these algorithms to other domains are minimal; you can use exactly the same algorithms. This is because the machine itself learns the business logic, so the code doesn't need to change. Understanding the described algorithm is not just learning a way to predict the topic of a news article; basically, it can predict anything based on text data, as long as you have examples. For me as a software engineer, this is quite surprising. Usually, code is specific to an application and cannot be reused in another application. With machine learning, I can make software that can be applied in multiple very different domains. Very cool!

Source code

https://github.com/luminis-ams/blog-text-classification

Further reading

For more practical tips on machine learning see the paper “A Few Useful Things to Know about Machine Learning” at: https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

About doc2vec:
https://radimrehurek.com/gensim/models/doc2vec.html
https://deeplearning4j.org/word2vec


Machine learning from a software engineer’s perspective

A couple of years ago I started to pursue my interest in machine learning. At that time, machine learning was more of an academic area of computing and it was hard to put into practice. In this article, I want to show you that this has changed over the last year. Machine learning frameworks and libraries are improving and are becoming more practical. There seems to be an effort to put the newest machine learning algorithms in the hands of software developers to put them into practical use.

(more…)