Tag Archive: elastic



Creating an Elastic Canvas for Twitter while visiting Elasticon 2018

The past week we visited Elasticon 2018 in San Francisco. In our previous blog post we wrote about the Keynote and some of the more interesting new features of the elastic stack. In this blog post, we take one of the cool new products for a spin: Canvas. But what is Canvas?

Canvas is a composable, extendable, creative space for live data. With Canvas you can combine dynamic data, coming for instance from a query against Elasticsearch, with nice-looking graphs. You can also use tables and images and combine them with the data visualizations to create stunning, dynamic infographics. In this blog post, we create a Canvas for the tweets tagged with Elasticon during the last day of the conference last week.

Below is the canvas we are going to create. It contains a number of different elements. The top row contains a pie chart with the languages of the tweets, a bar chart with the number of tweets per time unit, followed by the total number of tweets tracked during the second day of Elasticon. The next two elements use the sentiment of the tweets, which was obtained using IBM Watson. Byron wrote a basic integration with Watson; he will give more details in a next blog post. The pie chart shows the complete sentiment results, and the green smiley on the right shows the percentage of positive tweets out of all tweets that could be analyzed without an error and were not neutral.

Overview canvas

With the example in place, it is time to discuss how to create these canvases yourself: first some information about the installation, then a few of the general concepts, and finally sample code for the elements we used.

Installing canvas

Canvas is installed as a plugin into Kibana. You also need to install X-Pack in Elasticsearch as well as in Kibana. The steps are well described on the Canvas installation page. Beware though: installing the plugins in Kibana takes some time. Elastic is working on improving this, but for now we have to deal with it.

If everything is installed, open Kibana in your browser. At this moment you could start creating the canvas; however, you have no data, so you have to import some data first. We used Logstash with a Twitter input and an Elasticsearch output. We cannot go into too much detail or this blog post will be way too long; we might do this in a next blog post. For now, it is enough to know we have an index called twitter that contains tweets.
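To give you an idea, below is a minimal sketch of such a Logstash pipeline. The credentials are placeholders for your own Twitter application keys, and the sentiment enrichment we mention later is not part of this sketch.

input {
  twitter {
    # placeholders, use the keys and tokens of your own Twitter application
    consumer_key       => "<consumer_key>"
    consumer_secret    => "<consumer_secret>"
    oauth_token        => "<oauth_token>"
    oauth_token_secret => "<oauth_token_secret>"
    # track everything mentioning elasticon
    keywords           => ["elasticon"]
    full_tweet         => true
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "twitter"
  }
}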

Creating the canvas with the first element

When clicking on the Canvas tab we can create a new Workpad. A Workpad can contain one or multiple pages, and each page can contain multiple elements. A Workpad defines the size of the screen, so it is best to create it for a specific screen size. At Elasticon they had multiple monitors, some of them horizontal, others vertical. You can also choose a background color. These options can be found on the right side of the screen, in the Workpad settings bar.

It is good to know that you can create a backup of your Workpad from the Workpads screen; there is a small download button on the right side. Restoring a Workpad is done by dropping the exported JSON into the dialog.

New work pad

Time to add our first element to the page. Use the plus sign at the bottom of the screen to add an element. You can choose from a number of elements. The first one we’ll try is the pie chart. When adding the pie chart, we see data in the chart. Hmm, how come? We did not select any data. Canvas comes with a default data source that is used in all the elements. This way we immediately see what the element looks like, which is ideal to play around with all the options. Most options are available using the settings on the right. With the pie, you’ll see options for the slice labels and the slice angles. You can also see the Chart style and Element style. These configuration sections show a plus-sign button; with this button you can add options like color palette, text size and color. For the element, you can set a background color, border color, opacity and padding.

Add element

Next, we want to assign our own data source to the element. After adding our own data source we most likely have to change the parameters for the element as well; in this case, we have to change the slice labels and angles. Changing the data source is done using the Change Datasource button/link at the bottom. At the moment there are 4 data sources: demo data, demo prices, Elasticsearch query and timeline query. I choose the Elasticsearch query, select the index, don’t use a specific query and select only the fields I need. Selecting only the fields I need can speed up the element, as we then parse just the data we actually use. In this example, we only use the sentiment label.

Change data source

The last thing I want to mention here is the Code view. After pushing the >_ Code button you’ll see a different view of your element. In this view, you configure the element through code. This is more powerful than the settings window, but with great power comes great responsibility: it is easy to break stuff here. The code is organized in different steps, and the output of each step is, of course, the input for the next step. In this specific example, there are five steps: first a filter step, next the data source, then a point series that is required for a pie chart, the pie itself, and finally the render step. If you change something using the settings, the code tab gets updated immediately. If I add a background color to the container, the render step becomes:

render containerStyle={containerStyle backgroundColor="#86d2ed"}

If you make changes in the code block, use the Run button to apply the changes. In the next sections, we will only work in this code tab, just because it is easier to show you.

Code view

Adding more elements

The basics of the available elements and functions are documented here. We won’t go into details for all the different elements we have added; some of them use the defaults and you can therefore easily add them yourself. The first one I do want to explain is the Twitter logo with the number of tweets in it. This is actually two different elements. The logo is a static image. The number is more interesting: it makes use of the escount function and the markdown element. Below is the code.

filters
 | escount index="twitter"
 | as num_tweets
 | markdown "{{rows.0.num_tweets}}" font={font family="'Open Sans', Helvetica, Arial, sans-serif" size=60 align="left" color="#ffffff" weight="undefined" underline=false italic=false}

The filters function is used to facilitate filtering (usually by time) using the special filter element. The next item is escount, which does what you expect: it counts the number of items in the provided index. You can also provide a query to limit the results, but we did not need that. The output of escount is a number, which is a problem when sending it to a markdown element, as the markdown element only accepts a datatable. Therefore we have to use the function as, which accepts a number and changes it into a datatable. The markdown element accepts a table and exposes it as rows. That is why we use rows to obtain the first row and, from that row, the column num_tweets. When playing with this element it is easy to remove the markdown line; Canvas will then render the table by default. Below is the output for only the first two lines, as well as the output after adding the third line (as num_tweets).

200

num_tweets
200

Next up are the text and the photo belonging to the actual tweets. The photo is a bit different from the Twitter logo, as it is a dynamic photo. In the code below you can see that the image element has a dataurl attribute. We can use this attribute to get one cell from the provided data table. The getCell function has attributes for the row number as well as the name of the column.

esdocs index="twitter*" sort="@timestamp, desc" fields="media_url" count=5 query=""
 | image mode="contain" dataurl={getCell c="media_url" r=2}
 | render

With the text of the tweet it is a bit different. Here we want to use the markdown element; however, we do not have a dataurl attribute, so we have to come up with a different strategy. If we want to obtain the third item, we select the top 3 and from that top 3 we take the last item.

filters 
| esdocs index="twitter*" sort="@timestamp, desc" fields="text, user.name, created_at" query="" 
| head 3 
| tail 1 
| mapColumn column=created_at_formatted fn=${getCell created_at | formatdate 'YYYY-MM-DD HH:mm:ss'} 
| markdown "{{#each rows}}
**{{'user.name'}}** 

(*{{created_at_formatted}}*)

{{text}}
{{/each}}" font={font family="'American Typewriter', 'Courier New', Courier, Monaco, mono" size=18 align="right" color="#b83c6f" weight="undefined" underline=false italic=false}

The line that starts with mapColumn is a way to format the date. mapColumn adds a new column with the name provided by the column attribute and the value as the result of a function. That function can be a chain of functions; in this case, we obtain the column created_at of the datatable and pass it to the formatdate function.

Creating the partly green smiley

The most complicated element was the smiley that turns greener the more positive tweets we see. The positiveness of the tweets was determined using the IBM Watson interface. In the end it is a combination of two images, one grey smiley and one green smiley, where the green smiley is only shown for a specific percentage. This is done with the revealImage function. First, we show the complete code.

esdocs index="twitter*" fields="sentiment_label" count=10000 
| ply by="sentiment_label" fn=${rowCount | as "row_count"} 
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="error"} then=false else=true}
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="neutral"} then=false else=true}
| staticColumn column="total" value={math "sum(row_count)"} 
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="positive"} then=true else=false}
| staticColumn column="percent" value={math "divide(row_count, total)"} 
| getCell "percent" 
| revealImage image={asset "asset-488ae09a-d267-4f75-9f2f-e8f7d588fae1"} emptyImage={asset "asset-0570a559-618a-4e30-8d8e-64c90ed91e76"}

The first line is like we have seen before: select all rows from the twitter index. The second line does a kind of grouping of the rows: it groups by the values of sentiment_label, and the value per group is a row count as specified by the function. If I remove all the other lines we can see the output of just the ply function.

sentiment_label         row_count
negative                32
positive                73
neutral                 81
error                   14

The next steps filter out the rows for error and neutral, then we add a column for the total number of tweets with a positive or negative label. Now each row has this value. Check the following output.

sentiment_label         row_count       total
negative                32              105
positive                73              105

The next line removes the negative row, then we add a column with the percentage, obtain just one cell and call the revealImage function. This function takes a number as input and has attributes for the image as well as the empty or background image.

That gives us all the different elements on the canvas.

Concluding

We really like the options you have with Canvas. You can easily create good-looking dashboards that contain static resources, help texts and images, combined with dynamic data coming from Elasticsearch and, in the future, most likely other sources.

There are of course some improvements possible. It would be nice if we could also select doc_value fields, and being able to use aggregations in a query would be nice as well.

We will closely monitor its progress, as we believe this is going to be a very interesting technology to keep using in the future.


Elasticon 2018 Day 1

The past few days have been fantastic. Together with Byron I am visiting San Francisco. We have seen amazing sights, but yesterday the reason why we came started: day one of Elasticon. It opened with the keynote, showing us cool new features to come and some interesting news. In this blog post, I want to give you a short recap of the keynote and tell you what I think was important.

Elasticon opening

Rollups

With more and more data finding its way into elasticsearch, some indexes become too large for their purpose. We do not need to keep all data of the past weeks and months; we just want to keep the data needed for the aggregations we show on a dashboard. Think about an index containing logs of our web server, with charts for HTTP status codes, response times, browsers, etc. Now you can create a rollup configuration providing the aggregations we want to keep, a cron expression telling it when to run, and some additional information about how much data to keep. The result is a new index with a lot less data that you can keep for your dashboards.

More information about the rollup functionality can be found here.
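The exact API was still being finalised around the time of the keynote, but to give an impression, a rollup job as it later appeared in the 6.x series looks roughly like the sketch below. The index and field names (weblogs, http_status, response_time) are made up for this example.

PUT _xpack/rollup/job/weblogs_rollup
{
  "index_pattern": "weblogs-*",
  "rollup_index": "weblogs_rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "@timestamp", "interval": "60m" },
    "terms": { "fields": ["http_status"] }
  },
  "metrics": [
    { "field": "response_time", "metrics": ["min", "max", "avg"] }
  ]
}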

Canvas

Last year at Elasticon, Canvas was already shown. Elastic continued with the idea and it is starting to look amazing. With Canvas you can create beautiful dashboards that go a big step further than the standard dashboards in Kibana. You can customise almost everything you see. It comes with options to put an image in the background, a lot of color options, and new sorts of data visualisations, integrated of course with elasticsearch. In a next blog post I’ll come up with a demo; it is looking very promising. Want to learn more about it? Check this blog post.

Kubernetes/Docker logging

One of the areas I still need to learn is the Docker / Kubernetes ecosystem. But if you are looking for a solution to monitor your complete Kubernetes platform, have a look at all the options that elastic has these days. Elastic has impressive support for monitoring all the running containers. It comes with standard dashboards in Kibana and now has a dedicated space in Kibana called the Infra tab. More information about the options and how to get started can be found here.

Presenting Geo/Gis data

A very cool demo was given on how to present data on a map. The demo showed where all Elasticon attendees were coming from. The visual component has an option to create different layers. So you can add data to give the different countries a color based on the number of attendees, show the bigger cities people are coming from as small circles in a next layer, use a logo of the venue in another layer, etc. Really interesting if you are into geodata. It all makes use of the Elastic Maps Service. If you want more information about this, you can find it here.

Elastic Site Search

Up till now the news was about new ways to handle your logs coming from application monitoring, infrastructure components and applications. We did not hear about new things around search, until the new product called Elastic Site Search was shown. This was previously known as Swiftype. With Google declaring its Google Search Appliance end of life, this is a very interesting replacement. Working with relevance, synonyms and search analytics becomes a lot easier with this new product. More information can be found here.

Elastic cloud sizing

If you previously looked at the cloud options elastic offers, you might have noticed that choosing elastic nodes did not give you a lot of flexibility: when choosing the amount of required memory, you also got a fixed amount of disk space. With the upcoming release, you have a lot more flexibility when creating your cluster. You can configure different flavours of clusters, one of them being a hot-warm cluster: dedicated master nodes, hot nodes for recent indices with more RAM and faster disks, and warm nodes containing the older indices with bigger disks. This is a good improvement if you want to create a cluster in the cloud. More information can be found here.

Opening up X-Pack

Shay told a good story about creating a company that supports an open source product. Building a company on support alone is almost impossible in the long run, so they started working on commercial additions, now called X-Pack. The problem with these products was that the code was not available, so working with elastic to help them improve the product was not possible. That is why they are now opening up their repositories. Beware: it is not becoming free software. You still need to pay, but it now becomes a lot easier to interact with elastic about how stuff works. Next to that, they are going to make it easier to work with the free parts of X-Pack: just ask for a license once instead of every year. And if I understood correctly, the download will contain the free features in easier installation packages. More information about the what and why can be found in this blog post from Shay Banon.

Conference Party

Nice party, but I had to sign a contract prohibiting me from telling stories about it. I do plan on writing more blog posts in the coming two days.


A fresh look at Logstash

Soon after the release of elasticsearch it became clear that elasticsearch was good at more than providing search. It turned out that it could be used to store logs very effectively, which is why Logstash started using elasticsearch. Logstash contained standard parsers for Apache httpd logs, file monitoring plugins to obtain the logs, plugins to extend and filter the content, and plugins to send the content to elasticsearch. That is Logstash in a nutshell, back in the day. Of course the logs had to be shown as well, and for that a tool called Kibana was created: a nice tool to build highly interactive dashboards to show and analyse your data. Together they became the famous ELK stack. Nowadays we have a lot more options in all these tools. We have the Ingest node in elasticsearch to pre-process documents before they move into elasticsearch, we have Beats to monitor files, databases, machines, etc. And we have very nice new Kibana dashboards. Time to re-investigate what the combination of Logstash, Elasticsearch and Kibana can do. In this blog post I’ll focus on Logstash.

X-Pack

As the company elastic has to make some money as well, they have created a product called X-Pack. X-Pack has a lot of features that sometimes span multiple products. There is a security component; by using it you can make users log in to Kibana and secure your content. Other interesting parts of X-Pack are machine learning, graph and monitoring. Parts of X-Pack can be used free of charge, although you do need a license; for other parts you need a paid license. I personally like the monitoring part, so I regularly install X-Pack. In this blog post I’ll also investigate the X-Pack features for Logstash. I’ll focus on out-of-the-box functionality and mostly on what all these nice new things like monitoring and pipeline viewing bring us.

Using the version 6 release candidate

As elastic has already given us an RC1 of their complete stack, I’ll use this one for the evaluation. Beware though, this is still a release candidate, so not production ready.

What does Logstash do

If you have never really heard about Logstash, let me give you a very short introduction. Logstash can be used to obtain data from a multitude of different sources, then filter, transform and enrich the data, and finally store it in, again, a multitude of destinations. Example data sources are relational databases, files, queues and websockets. Logstash ships with a large number of filter plugins; with these we can process data to exclude some fields. We can also enrich data, look up information about IP addresses, or look up records belonging to an id in, for instance, elasticsearch or a database. After the lookup we can add data to the document or event that we are handling before sending it to one or more outputs. Outputs can be elasticsearch or a database, but also queues like Kafka or RabbitMQ.

In later releases Logstash started to add more features that a tool handling large amounts of data over longer periods needs. Things like monitoring and clustering of nodes were introduced, as well as persisting incoming data to disk. By now Logstash, in combination with Kibana and Elasticsearch, is used by very large companies but also by a lot of start-ups to monitor their servers and handle all sorts of interesting data streams.

Enough of this talk, let us get our hands dirty. First step install everything on our developer machines.

Installation

I’ll focus on the developer machine; if you want to install it on a server, please refer to the extensive Logstash documentation.

First download the zip or tar.gz file and extract it to a convenient location. Now create a folder where you can store the configuration files. To keep the files small and to show you that you can split them, I create three different files in this folder: input.conf, filters.conf and output.conf. The most basic configuration is one with stdin for input, no filters and stdout for output. Below are the contents of the input and output files; the filters file stays empty for now.

input {
	stdin{}
}
output { 
	stdout { 
		codec => rubydebug
	}
}

Time to start logstash. Step into the downloaded and extracted folder with the logstash binaries and execute the following command.

bin/logstash -r -f ../logstashblog/

The -r flag can be used during development to reload the configuration on change. Beware, this does not work with the stdin plugin. With -f we tell Logstash to load a configuration file or directory, in our case a directory containing the three mentioned files. When Logstash is ready it will print something like this:

[2017-10-28T19:00:19,511][INFO ][logstash.pipeline        ] Pipeline started {"pipeline.id"=>"main"}
The stdin plugin is now waiting for input:
[2017-10-28T19:00:19,526][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}

Now you can type something and the result is the created document or event that went through the almost empty pipeline. The thing to notice is that we now have a field called message containing the text we entered.

Just some text for input
{
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
    "@timestamp" => 2017-10-28T17:02:18.185Z,
       "message" => "Just some text for input"
}

Now that we know it is working, I want you to have a look at the monitoring options available through the REST endpoint.

http://localhost:9600/

{
"host": "Jettros-MBP.fritz.box",
"version": "6.0.0-rc1",
"http_address": "127.0.0.1:9600",
"id": "20290d5e-1303-4fbd-9e15-03f549886af1",
"name": "Jettros-MBP.fritz.box",
"build_date": "2017-09-25T20:32:16Z",
"build_sha": "c13a253bb733452031913c186892523d03967857",
"build_snapshot": false
}

You can use the same url with different endpoints to get information about the node, the plugins, stats and hot threads:
http://localhost:9600/_node
http://localhost:9600/_node/plugins
http://localhost:9600/_node/stats
http://localhost:9600/_node/hot_threads

It becomes a lot more fun if we have a UI, so let us install X-Pack into Logstash. Before we can run Logstash with monitoring enabled, we need to install elasticsearch and kibana with X-Pack installed into those as well. Refer to the X-Pack documentation on how to do it.
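As a rough sketch (assuming the default layout of the three distributions), the plugin install commands look like this, each run from the root of the corresponding distribution:

# install X-Pack into each of the three products
bin/elasticsearch-plugin install x-pack
bin/kibana-plugin install x-pack
bin/logstash-plugin install x-pack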

For now I disable security by adding the following line to both kibana.yml and elasticsearch.yml: xpack.security.enabled: false. After installing X-Pack into Logstash we have to add the following lines to the logstash.yml file in the config folder:

xpack.monitoring.elasticsearch.url: ["http://localhost:9200"] 
xpack.monitoring.elasticsearch.username:
xpack.monitoring.elasticsearch.password:

Notice the empty username and password; this is required when security is disabled. Now move over to Kibana, check the monitoring tab (the heart-shaped figure) and click on Logstash. In the first screen you can see the events; they could be zero, so please enter some events. Now move to the pipeline tab. Of course with our basic pipeline this is not very exciting, but imagine what it will show later on.

Screen Shot 2017 10 28 at 19 52 46

Time to get some real input.

Import the Signalmedia dataset

Signalmedia has provided a dataset you can use for research. More information about the dataset and how to obtain it can be found here. The dataset contains exactly 1 million news documents. You can download it as a single file that contains a JSON document on each line. The JSON documents have the following format:

{
   "id": "a080f99a-07d9-47d1-8244-26a540017b7a",
   "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
   "title": "Pay up or face legal action: DBKL",
   "media-type": "News",
   "source": "My Sinchew",
   "published": "2015-09-15T10:17:53Z"
}

We want to import this big file with all the JSON documents as separate documents into elasticsearch using logstash. The first step is to create a logstash input. Use the path to point to the file. We can use the logstash file plugin to load the file, tell it to start at the beginning and mark each line as a JSON document. The file plugin has more options you can use. It can also handle rolling files that are used a lot in logging.

input {
	file {
        path => "/Volumes/Transcend/signalmedia-1m.jsonl"
        codec => "json"
        start_position => beginning 
    }
}

That is it, with the stdout plugin and the rubydebug codec this would give the following output.

{
          "path" => "/Volumes/Transcend/signalmedia-1m.jsonl",
    "@timestamp" => 2017-10-30T18:49:45.948Z,
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
            "id" => "a080f99a-07d9-47d1-8244-26a540017b7a",
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Notice that besides the fields we expected (id, content, title, media-type, source and published) we also got some additional fields. Before sending this to elasticsearch we want to clean it up: we do not need path, host, @timestamp and @version. There is also something special about the field id. We want to use the id field to create the document in elasticsearch, but we do not want to add it to the document itself. If we need the value of id in the output plugin later on, without adding it as a field to the document, we can move it to the @metadata object. That is exactly what the first part of the filter does. The second part removes the fields we do not need.

filter {
	mutate {
		copy => {"id" => "[@metadata][id]"}
	}
	mutate {
		remove_field => ["@timestamp", "@version", "host", "path", "id"]
	}
}

With these filters in place the output of the same document would become:

{
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Now the content is ready to be sent to elasticsearch, so we need to configure the elasticsearch output plugin. When sending data to elasticsearch you first need to think about creating the index and the mapping that goes with it. In this example I am going to create an index template. I am not going to explain a lot about the mappings as this is not an elasticsearch blog, but with the following code we insert the mapping template when connecting to elasticsearch and we can insert all documents. Do look at the way the document_id is created. Remember we talked about @metadata and how we copied the id field into it? This is the reason why we did it: now we use that value as the id of the document when inserting it into elasticsearch.

output {
	elasticsearch {
		index => "signalmedia"
		document_id => "%{[@metadata][id]}"
		document_type => "doc"
		manage_template => "true"
		template => "./signalmedia-template.json"
		template_name => "signalmediatemplate"
	}
	stdout { codec => dots }
}

Notice there are two outputs configured: the elasticsearch output of course, but also a stdout. This time not with the rubydebug codec, as that would be way too verbose; we use the dots codec instead. This codec prints a dot for each document it parses.

For completeness I also want to show the mapping template. In this case I placed it in the root folder of the Logstash distribution; usually this would of course be an absolute path somewhere else.

{
  "index_patterns": ["signalmedia"],
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 3
  },
  "mappings": {
    "doc": {
      "properties": {
        "source": {
          "type": "keyword"
        },
        "published": {
          "type": "date"
        },
        "title": {
          "type": "text"
        },
        "media-type": {
          "type": "keyword"
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Now we want to import all the million documents and have a look at the monitoring along the way. Let’s do it.

Screen Shot 2017 10 30 at 20 50 36
Screen Shot 2017 10 30 at 20 48 21

Running a query

Of course we have to prove the documents are now available in elasticsearch. So let’s execute one of my favourite queries, one that makes use of the new significant text aggregation. First the request and then parts of the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "netherlands"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

Just a very small part of the response; I stripped out a lot of the elements to make it easier to read. Good to see that we get dutch as a significant word when searching for the netherlands, and of course geenstijl.

"buckets": [
  {"key": "netherlands","doc_count": 527},
  {"key": "dutch","doc_count": 196},
  {"key": "mmsi","doc_count": 7},
  {"key": "herikerbergweg","doc_count": 4},
  {"key": "konya","doc_count": 14},
  {"key": "geenstijl","doc_count": 3}
]

Concluding

Good to see the nice UI options in Kibana; the pipeline viewer is very useful. In a next blog post I’ll be looking at Kibana and all the new and interesting things in there.


Elasticsearch 6 is coming

For some time now, elastic has been releasing pre-release versions of the new major version, elasticsearch 6. At this moment the latest edition is already RC1, so it is time to start thinking about migrating to the latest and greatest. What backwards compatibility issues will you run into and what new features can you start using? This blog post gives a summary of the items that are most important to me, based on the projects that I do. First we’ll have a look at the breaking changes, then we move on to new features and interesting upgrades.

Breaking changes

Most of the breaking changes come from the elasticsearch documentation that you can of course also read yourself.

Migrating indexes from previous versions

As with all major releases, only indexes created in the prior major version can be migrated automatically. So if you have an index created in 2.x, migrated it to 5.x and now want to start using 6.x, you have to use the reindex API to first reindex it into a new 5.x index before migrating.
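A reindex call is a simple sketch like the one below; the index names are of course placeholders, and the destination index has to be created with the desired mappings first.

POST _reindex
{
  "source": { "index": "index-created-in-2x" },
  "dest": { "index": "index-recreated-in-5x" }
}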

Index types

In elasticsearch 6 the first step is taken towards indexes without types: a new index can only contain a single type, while indexes migrated from 5.x can keep using multiple types. Starting with elasticsearch 5.6 you can already prevent people from creating indexes with multiple types, which will make it easier to migrate to 6.x when it becomes available. By applying the following configuration option you can prevent people from creating multiple types in one index:

index.mapping.single_type: true

More reasoning about why the types need to be removed can be found in the elasticsearch documentation on removal of types. Also, if you are into parent-child relationships in elasticsearch and are curious about the implications of not being able to use multiple types, check the parent-join documentation page. Yes, we will get joins in elasticsearch :-), though with very limited use.

Java High Level REST Client

This was already introduced in 5.6, but it is still good to know as this will be the replacement for the Transport Client. As you might know I am also creating some code to use in Java applications on top of the Low Level REST Client for Java, which is also being used by this new client. More information about my work can be found here: part 1 and part 2.

Uniform response for create/update and delete requests

At the moment a create request returns a response field created with the value true/false, and a delete request returns found with the value true/false. If you are parsing the response and using these fields, you can no longer do so; use the result field instead. It will have the value created or updated in case of an index request and deleted or not_found in case of a delete request.
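As an illustration, an index request and its response in 6.0 look roughly like this (the index name, type and values are just an example):

PUT my_index/doc/1
{
  "title": "some document"
}

{
  "_index": "my_index",
  "_type": "doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}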

Translog change

The translog is used to keep documents that have not been flushed to disk yet by elasticsearch. In prior releases the translog files were removed when elasticsearch had performed a flush. However, due to optimisations made for recovery, keeping the translog around can speed up the recovery process. Therefore the translog is now kept for 12 hours or up to a maximum of 512 MB by default.
More information about the translog can be found here: Translog.
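If the defaults do not suit you, they can be overridden per index; a small sketch, assuming the 6.0 setting names:

PUT my_index/_settings
{
  "index.translog.retention.age": "12h",
  "index.translog.retention.size": "512mb"
}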

Java Client

In a lot of Java projects the Java client is used; I have used it as well for numerous projects. However, with the introduction of the High Level REST Client, Java projects should move away from the Transport Client. If you want or need to keep using it for now, there are some changes in packages and some methods have been removed. The one I used the most is the order for aggregations, think about Terms.Order and Histogram.Order; they have been replaced by BucketOrder.

Index mappings

There are two important changes that can affect your way of working with elastic. The first is the way booleans are handled. In indexes created in version 6, a boolean accepts only two values: true and false. All other values will result in an exception.

The second change is the _all field. In prior versions, by default an _all field was created in which the values of all fields were copied as strings and analysed with the standard analyser. This field was used by queries like the query_string. There was however a performance penalty, as we now had to analyse and index a potentially big field, so it soon became a best practice to disable the field. In elasticsearch 6 the field is disabled by default and it cannot be configured for indices created with elasticsearch 6. If you still use the query_string query, it is now executed against each field. You should be very careful with the query_string query. It comes with a lot of power: users get a lot of options to create their own query. But with great power comes great responsibility. They can create very heavy queries, and they can create queries that break without a lot of feedback. More information about the query_string. If you still want to give your users more control, but the query_string query is one step too far, think about creating your own search DSL. Some ideas can be found in my previous blog posts: Creating a search DSL and Part 2 of creating a search DSL.

Booting elasticsearch

Some things have changed in the startup options. You cannot configure the user elasticsearch runs with if you use the deb or rpm packages, and the elasticsearch.yml file location is now configured differently. Now you have to export the path where all configuration files (elasticsearch.yml, jvm.options and log4j2.properties) can be found. You do this through an environment variable ES_PATH_CONF containing the path to the config folder. I use this regularly on my local machine. As I have multiple projects running, often with different versions of elasticsearch, I have set up a structure where I put my config files in folders separate from the elasticsearch distributable. Find the structure in the image below. In the beginning I just copy the config files to my project-specific folder. When I start the project with the script startNode.sh, the following script is executed.

Elastic folder structure

#!/bin/bash

CURRENT_PROJECT=$(pwd)

export ES_PATH_CONF=$CURRENT_PROJECT/config

DATA=$CURRENT_PROJECT/data
LOGS=$CURRENT_PROJECT/logs
REPO=$CURRENT_PROJECT/backups
NODE_NAME=Node1
CLUSTER_NAME=playground

BASH_ES_OPTS="-Epath.data=$DATA -Epath.logs=$LOGS -Epath.repo=$REPO -Enode.name=$NODE_NAME -Ecluster.name=$CLUSTER_NAME"

ELASTICSEARCH=$HOME/Development/elastic/elasticsearch/elasticsearch-6.0.0-rc1

$ELASTICSEARCH/bin/elasticsearch $BASH_ES_OPTS

Now when you need additional configuration options, add them to the elasticsearch.yml. If you need more memory for the specific project, change the jvm.options file.

Plugins

When indexing pdf or word documents, a lot of you out there have been using the mapper-attachments plugin. This was already deprecated and has now been removed; you can switch to the ingest attachment plugin. Never heard about ingest? Ingest can be used to pre-process documents before they are indexed by elasticsearch. It is a lightweight variant of Logstash, running within elasticsearch. Be warned though that plugins like the attachment processor can be heavy on your cluster, so it is wise to have a separate node for ingest. Curious about what you can do to ingest the contents of a pdf? The next few steps show you the commands to create the ingest pipeline, send a document to it and obtain it again or create a query.
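Before creating the pipeline, the plugin itself needs to be installed on the elasticsearch node (a sketch, assuming the standard plugin tooling), followed by a restart of the node:

bin/elasticsearch-plugin install ingest-attachment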

First create the ingest pipeline

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

Now when indexing a document containing the attachment as a base64 encoded string in the field data we need to tell elasticsearch to use a pipeline. Check the parameter in the url: pipeline=attachment. This is the name used when creating the pipeline.

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": ""
}

We could stop here, but how do we get base64 encoded input from, for instance, a pdf? On Linux and the Mac you can use the base64 command for that. Below is a script that reads a specific pdf and creates a base64 encoded string out of it. This string is then pushed to elasticsearch.

#!/bin/bash

pdf_doc=$(base64 ~/Desktop/Programma.pdf)

curl -XPUT "http://localhost:9200/my_index/my_type/my_id?pipeline=attachment" -H 'Content-Type: application/json' -d '{"data" : "'"$pdf_doc"'"}'

Scripts

If you are heavily into scripting in elasticsearch you need to check a few things. Changes have been made to the use of the lang attribute when obtaining or updating scripts; you cannot provide it any more. Also, support for languages other than Painless has been removed.

Search and query DSL changes

Most of the changes in this area are very specific. I am not going to sum them all up; please check the original documentation. Some of them I do want to mention as they are important to me.

  • If you are constructing queries and it can happen that you end up with an empty query, you can no longer provide an empty object { }. You will get an exception if you keep doing so.
  • Bool queries had a disable_coord parameter; with this you could influence the score function to not use missing search terms as a penalty for the score. This option has been removed.
  • You could transform a match query into a match_phrase query by specifying a type. This is no longer possible; you should just create a phrase query now if you need it (see the sketch below). Therefore the slop parameter has also been removed from the match query.
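As an illustration of that last point, a simple phrase query now looks like this; the index and field names are just placeholders:

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "heavy rain",
        "slop": 1
      }
    }
  }
}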

Calculating the score

In the beginning of elasticsearch, the score for a document based on an executed query was calculated using an adjusted formula for TF/IDF. It turned out that for fields containing smaller amounts of text TF/IDF was less ideal, so the default scoring algorithm was replaced by BM25. Moving away from TF/IDF to BM25 was the topic for version 5. Now with 6 they have removed two mechanisms from the scoring: Query Normalization and the Coordination Factor. Query Normalization was always hard to explain during trainings. It was an attempt to normalise the scores of queries, which should make it possible to compare them. However, it did not work and you still should not compare scores of different queries. The Coordination Factor was a penalty applied when searching for multiple terms and not all of them were found. You could easily see this when using the explain API.

That is it for the breaking changes. Again, there are more changes that you might want to investigate if you are really into all the elasticsearch details; then have a look at the original documentation.

Next up, cool new features

Now let us zoom in on some of the newer features or interesting upgrades.

Sequence Numbers

Sequence Numbers are now assigned to all index, update and delete operations. Using this number, a shard that went offline for a moment can ask the primary shard for all operations after a certain sequence number. If the translog is still available (remember that we mentioned in the beginning that the translog is now kept around for 12 hours or 512 MB by default), the missing operations can be sent to the shard, preventing a complete recovery of all the shard's contents.

Test Normalizer using analyse endpoint

One of the most important parts of elasticsearch is configuring the mapping for your documents: how do you adjust the terms that you can search for based on the provided text? If you are not sure and you want to try out a specific tokeniser and filter combination, you can use the analyze endpoint. Have a look at the following code sample and response, where we try out a whitespace tokeniser with a lowercase filter.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "coenradie",
      "start_offset": 7,
      "end_offset": 16,
      "type": "word",
      "position": 1
    }
  ]
}

As you can see we now get two tokens, and the uppercase characters are replaced by their lowercase counterparts. What if we do not want the text to become two terms, but want it to stay as one term, while still replacing the uppercase characters with their lowercase counterparts? This was not possible in the beginning. However, with the introduction of the normalizer, a special analyser for fields of type keyword, it became possible. In elasticsearch 6 we now have the functionality to use the analyze endpoint for normalizers as well. Check the following code block for an example.

PUT people
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "normalizer": {
        "name_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

GET people/_analyze
{
  "normalizer": "name_normalizer",
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro coenradie",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}

LogLog-Beta

Ever heard about HyperLogLog or even HyperLogLog++? Well, then you must be happy with LogLog-Beta. Some background: elasticsearch comes with a Cardinality Aggregation, which can be used to calculate, or better, estimate the number of distinct values. If we wanted to calculate an exact value, we would have to keep a map with all unique values in it, which would require an extensive amount of memory. You can specify a threshold under which the number of unique values is close to exact; however, the maximum value for this threshold is 40000. Before, elasticsearch used the HyperLogLog++ algorithm to estimate the unique values. With the new algorithm, called LogLog-Beta, there are better results with lower error margins and still the same performance.
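As a small sketch, this is what such a cardinality aggregation looks like on the signalmedia index used later in this post; precision_threshold is the threshold mentioned above:

GET signalmedia/_search
{
  "size": 0,
  "aggs": {
    "unique_sources": {
      "cardinality": {
        "field": "source",
        "precision_threshold": 40000
      }
    }
  }
}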

Significant Text Aggregation

For some time the Significant Terms Aggregation has been available. The idea behind this aggregation is to find terms that are common in a specific scope and less common in a more general scope. So imagine we want to find, from logs with page visits, the users of our website that place relatively many orders compared to the pages they visit. You cannot find them by just counting the number of orders; you need to find the users that are more common in the set of orders than in the set of page visits. In prior versions this was already possible with terms, so with non-analysed fields. By enabling field data or doc_values you could use small analysed fields, but for larger text fields this was a performance problem. Now, with the Significant Text Aggregation, we can overcome this problem. It also comes with an interesting functionality to deduplicate text (think about emails with the original text in a reply, or retweets).

Sounds a bit too vague? OK, let’s have an example. In the elasticsearch documentation they use a dataset from Signal Media. As it is an interesting dataset to work with, I will also use it; you can try it out yourself as well. I downloaded the file and imported it into elasticsearch using Logstash. This gist should help you. Now on to the query and the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "rain"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

So we are looking for documents with the word rain. In these documents we are going to look for terms that occur more often than in the global context.

{
  "took": 248,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 11722,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_sampler": {
      "doc_count": 600,
      "keywords": {
        "doc_count": 600,
        "bg_count": 1000000,
        "buckets": [
          {
            "key": "rain",
            "doc_count": 544,
            "score": 69.22167699861609,
            "bg_count": 11722
          },
          {
            "key": "showers",
            "doc_count": 164,
            "score": 32.66807368214775,
            "bg_count": 2268
          },
          {
            "key": "rainfall",
            "doc_count": 129,
            "score": 24.82562838569881,
            "bg_count": 1846
          },
          {
            "key": "thundery",
            "doc_count": 28,
            "score": 20.306396677050884,
            "bg_count": 107
          },
          {
            "key": "flooding",
            "doc_count": 153,
            "score": 17.767450110864743,
            "bg_count": 3608
          },
          {
            "key": "meteorologist",
            "doc_count": 63,
            "score": 16.498915662650603,
            "bg_count": 664
          },
          {
            "key": "downpours",
            "doc_count": 40,
            "score": 13.608547008547008,
            "bg_count": 325
          },
          {
            "key": "thunderstorms",
            "doc_count": 48,
            "score": 11.771851851851853,
            "bg_count": 540
          },
          {
            "key": "heatindex",
            "doc_count": 5,
            "score": 11.56574074074074,
            "bg_count": 6
          },
          {
            "key": "southeasterlies",
            "doc_count": 4,
            "score": 11.104444444444447,
            "bg_count": 4
          }
        ]
      }
    }
  }
}

Interesting terms when looking for rain: showers, rainfall, thundery, flooding, etc. These terms could now be returned to the user as possible candidates for improving their search results.

Concluding

That is it for now. I haven’t even scratched the surface of all the new cool stuff in the other components like X-Pack, Logstash and Kibana. More to come.


Monitoring our Devcon app backend with Elastic Beats

Last week was our Luminis conference called Devcon 2017. For the third time we organised this conference, with around 500 attendees and 20 speakers. This year we wanted to have a mobile app with the conference program and information about speakers. The app was created with the Ionic framework and the backend is a Spring Boot application. Before and during the conference we wanted to monitor the servers: the log files, hardware statistics and uptime. I used a number of beats to collect data, stored the data in elasticsearch and showed nice dashboards with Kibana. In this blog post I’ll explain the different beats, show you how to set up a system like this yourself and show you some of the interesting charts that you can create from the collected data.

Beats

Beats is the platform for single-purpose data shippers. They install as lightweight agents and send data from hundreds or thousands of machines to Logstash or Elasticsearch. ~ Elastic homepage

Beats is a library that makes it easier to build single-purpose data shippers. Elastic ships a number of beats itself, but the community has been creating their own beats as well. There are 4 beats I have experience with; for our monitoring wishes we needed three of them.

  • Filebeat – Used to monitor files; can deal with multiple files in one directory and has modules for files in well-known formats like nginx, Apache httpd, etc.
  • Metricbeat – Monitors the resources of our hardware; think about used memory, used cpu, disk space, etc.
  • Heartbeat – Used to check endpoints for availability; can check how long it took to connect and whether the remote system is up.
  • Packetbeat – Takes a deep dive into the packets going over the wire. It has a number of protocols to sniff like http, dns and amqp. It also understands the packets being sent to applications like MySQL, MongoDB and Redis.

The idea behind a beat is that it has some input, a file or an endpoint, and an output. The output can be elasticsearch, logstash but also a file. All beats come with default index templates that tell elasticsearch to create indexes with the right mapping. They also come with predefined Kibana dashboards that are easy to install. Have a look at the image below for an example.

Screencapture kibana beats cpu load

That should give you an idea of what beats are all about. In the next section I’ll show you the setup of our environment.

The backend and the monitoring servers

I want to tell you about the architecture of our application. It is a basic Spring Boot application with things like security, caching and spa enabled. We use MySQL as a datastore. The app consists of two main parts: the first being an API for the mobile app, the second a graphical user interface created with Thymeleaf. The GUI enables us to edit the conference properties like the speakers, the talks, the used twitter tags, etc. We installed Filebeat and Metricbeat on the backend. We had a second server running elasticsearch; this second server is also the host for Heartbeat. The next image shows an overview of the platform.

Overview infrastructuur beats

All beats were installed using the debian packages and using the installation guides from the elastic website. As the documentation is thorough I won’t go into details here. I do want to show the configurations that I used for the different beats.

FileBeat

Filebeat was used to monitor the nginx access log files. Filebeat makes use of modules with predefined templates; in our case we use the nginx module. Below is the configuration that we used.

filebeat.modules:
- module: nginx

output.elasticsearch:
  hosts: ["10.132.29.182:9200"]

As we have been doing with Logstash for a while, we want to enhance the log lines with things like browser extraction and geo enrichment. Elastic these days has the option to use ingest pipelines for this purpose. More information about how to install these ingest components can be found here:

https://www.elastic.co/guide/en/elasticsearch/plugins/5.3/ingest-user-agent.html
https://www.elastic.co/guide/en/elasticsearch/plugins/5.3/ingest-geoip.html
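Installing them boils down to two plugin installs on the elasticsearch node (a sketch, assuming the 5.3 plugin tooling), followed by a restart:

bin/elasticsearch-plugin install ingest-user-agent
bin/elasticsearch-plugin install ingest-geoip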

An example of the filebeat dashboard in Kibana is below.

Screencapture kibana nginx

Metricbeat

The next step is monitoring the CPU, load factor and memory usage per process. Installing Metricbeat is easy when using the debian package. Below is the configuration I used on the server.

metricbeat.modules:

- module: system
  metricsets:
    - cpu
    - load
    - filesystem
    - fsstat
    - memory
    - network
    - process
  enabled: true
  period: 30s
  processes: ['.*']

- module: mysql
  metricsets: ["status"]
  enabled: true
  period: 30s
  hosts: ["user:password@tcp(127.0.0.1:3306)/"]


output.elasticsearch:
  hosts: ["10.132.29.182:9200"]

As you can see, this one is bigger than the filebeat config. I configure the metrics to measure and how often to measure them; in this example we measure every 30 seconds. We have two modules: the system module and the mysql module. With the mysql module we get specific metrics about the mysql process. Below is an idea of the available metrics. Interesting to see the number of commands and threads.

{
  "_index": "metricbeat-2017.04.06",
  "_type": "metricsets",
  "_id": "AVtFR6tN2PYBRsZuE_1r",
  "_score": null,
  "_source": {
    "@timestamp": "2017-04-06T21:59:36.211Z",
    "beat": {
      "hostname": "devcon-api-2017",
      "name": "devcon-api-2017",
      "version": "5.3.0"
    },
    "metricset": {
      "host": "127.0.0.1:3306",
      "module": "mysql",
      "name": "status",
      "rtt": 3186
    },
    "mysql": {
      "status": {
        "aborted": {
          "clients": 24,
          "connects": 2
        },
        "binlog": {
          "cache": {
            "disk_use": 0,
            "use": 0
          }
        },
        "bytes": {
          "received": 91999863,
          "sent": 447340795
        },
        "command": {
          "delete": 281,
          "insert": 2802,
          "select": 161874,
          "update": 7
        },
        "connections": 68,
        "created": {
          "tmp": {
            "disk_tables": 3126,
            "files": 6,
            "tables": 27404
          }
        },
        "delayed": {
          "errors": 0,
          "insert_threads": 0,
          "writes": 0
        },
        "flush_commands": 1,
        "max_used_connections": 11,
        "open": {
          "files": 22,
          "streams": 0,
          "tables": 311
        },
        "opened_tables": 7794,
        "threads": {
          "cached": 6,
          "connected": 3,
          "created": 13,
          "running": 1
        }
      }
    },
    "type": "metricsets"
  },
  "fields": {
    "@timestamp": [
      1491515976211
    ]
  },
  "highlight": {
    "metricset.module": [
      "@kibana-highlighted-field@mysql@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1491515976211
  ]
}

Another example report was already shown in the introduction of this post.

HeartBeat

This beat can be used to monitor the availability of other services. I used it to monitor the availability of our backend as well as our homepage. Configuration is as easy as this.

# Configure monitors
heartbeat.monitors:
- type: http
  urls: ["https://amsterdam.luminis.eu:443"]
  schedule: '@every 60s'

- type: http
  urls: ["https://api.devcon.luminis.amsterdam:443/monitoring/ping"]
  schedule: '@every 30s'

dashboards.enabled: true

output.elasticsearch:
  hosts: ["localhost:9200"]

You see both monitors: the first checks its url every 60 seconds, the second every 30 seconds. With this beat you can explicitly enable the Kibana dashboards in the configuration. An example dashboard is presented in the image below. Not so funny to see that our website had been unavailable for a moment; luckily this was not our mobile app backend.

Screencapture kibana heartbeat

Custom dashboards

Of course you can use all the prefabricated dashboards. You can however still create your own dashboards, combining the available views that you need to analyse your platform. In my case I want to create a new view that gives an indication of the usage of the api urls over time. Kibana 5.3 comes with a new chart type called the heatmap chart. In the next few images I am stepping through the creation of this chart; at the end I'll also do an analysis of the chart for the conference day. Our heatmap shows the number of calls to a specific url: the darker the block, the more hits. First we create a new visualisation and choose the heatmap.

Screencapture kibana new chart 1

Next we need to choose the index to take the documents from. In our case we use the filebeat index.

Screencapture kibana new chart 2

Next choose the x-axis.

Screencapture kibana new chart 3

We create buckets for each time period, so a date histogram is what we need.

Screencapture kibana new chart 4

Now we need a sub aggregation to divide the histogram buckets per url. So we add a sub aggregation in the interface and choose the Y-axis.

Screencapture kibana new chart 5

The sub aggregation can be a terms aggregation; use the field nginx.access.url.

Screencapture kibana new chart 6

Have a look at the image above: we see a number of urls we are not interested in. They have very few hits or are simply not interesting for our analysis. We can do two things to filter out specific urls. One of them is excluding values; these options are hidden under the Advanced section, open it by clicking it at the bottom left. There we can enter a minimum doc count of 2. We can also exclude a number of urls with a pattern.

Screencapture kibana new chart 7
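
For reference, the filtering that Kibana applies here roughly corresponds to a terms aggregation with the min_doc_count and exclude options. Below is a sketch of such a request; the index pattern and the exclude pattern are just examples, not the exact ones I used:

GET filebeat-*/_search
{
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "field": "nginx.access.url",
        "min_doc_count": 2,
        "exclude": "/favicon.ico|/robots.txt"
      }
    }
  }
}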

Finally, let us have a better look at a larger version of this chart. We can see a few things in it. First of all the url /monitoring/ping: this is steady throughout the whole time period, which is no surprise as we call this url every 30 seconds using the heartbeat plugin. The second row with url /api/conferences/1 is a complete json representation of the program, with session details as well as speaker information. It was used most at the beginning of the conference and usage stopped around 17:00 when the last sessions started. At the beginning of the day people marked their favourite sessions. People were not consistent in rating sessions, which can be seen in the last line. Then there is the url /icon-256.png. This is an image used by android devices when a push notification is received. The dark green blocks were exactly the moments when we sent out a push notification.

Screencapture kibana new chart 8

Wrapping up

The test with using beats for monitoring felt really good. I like how easy they are to set up and the information you can obtain using the different kinds of beats. I will definitely use them more in the future.


Creating an elasticsearch plugin, the basics

Elasticsearch is a search solution based on Lucene. It comes with a lot of features to enrich the search experience. Some of these features have been recognised as very useful in the analytics scene as well. Interacting with elasticsearch mainly takes place through the REST endpoint. You can do everything using the different available endpoints: create new indexes, insert documents, search for documents and lots of other things. Still, some things are not available out of the box. If you need an analyser that is not available by default, you can install it as a plugin. If you need security, you can install a plugin. If you need alerting, you can install it as a plugin. I guess you get the idea by now. The plugin extension option is nice, but might be a bit hard to begin with. Therefore, in this blog post I am going to write a few plugins. I'll point you to some of the resources I used to get them running and I want to give you some inspiration for your own ideas for cool plugins that extend the elasticsearch functionality.

Bit of history

In the releases prior to version 5 there were two types of plugins: site plugins and java plugins. Site plugins were used extensively; some well known examples are Head, HQ and Kopf. Kibana and Marvel also started out as site plugins. It was a nice feature, however not the core of elasticsearch. Therefore the elastic team deprecated site plugins in 2.3 and the support was removed in 5.0.

How does it work

The default elasticsearch installation already provides a script to install plugins. You can find it in the bin folder. You can install plugins from repositories but also from a local path. A plugin comes in the form of a jar file.

Plugins need to be installed on every node of the cluster. Installation is as simple as the following command.

bin/elasticsearch-plugin install file:///path/to/elastic-basic-plugin-5.1.2-1-SNAPSHOT.zip

In this case we install the plugin from our own hard drive. The plugins have a dependency on the elastic core and therefore need to have the exact same version as the elastic version you are using. So for each elasticsearch release you have to create a new version of the plugin. In the example I have created the plugin for elasticsearch 5.1.2.

Start with our own plugin

Elastic uses gradle internally to build the project; I still prefer maven over gradle. Luckily David Pilato wrote a good blog post about creating the maven project, so I am not going to repeat all of his steps. Feel free to take a peek at the pom.xml I used in my plugin.

Create BasicPlugin that does nothing

The first step in the plugin is to create a class that starts the plugin. Below is a class with just one piece of functionality: it prints a statement in the log that the plugin is installed.

public class BasicPlugin extends Plugin {
    private final static Logger LOGGER = LogManager.getLogger(BasicPlugin.class);
    public BasicPlugin() {
        super();
        LOGGER.warn("Create the Basic Plugin and installed it into elasticsearch");
    }
}

The next step is to configure the plugin as described by David Pilato in the blog post I mentioned before. We need to add the maven assembly plugin using the file src/main/assemblies/plugin.xml. In this file we refer to another very important file: src/main/resources/plugin-descriptor.properties. With all this in place we can run maven to create the plugin jar.

mvn clean package -DskipTests
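
The plugin-descriptor.properties file mentioned above tells elasticsearch how to load the plugin. As an indication, for this plugin it could contain something like the following; the exact values, and especially the package name, are my assumption based on the versions used in this post:

# plugin-descriptor.properties (sketch, values assumed)
description=Basic plugin demonstrating the elasticsearch plugin mechanism
version=5.1.2-1-SNAPSHOT
name=elastic-basic-plugin
# fully qualified name of the class extending Plugin; the package here is a guess
classname=nl.gridshore.elastic.plugin.basic.BasicPlugin
java.version=1.8
elasticsearch.version=5.1.2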

In the folder target/releases you'll now find the file elastic-basic-plugin-5.1.2-1-SNAPSHOT.zip, which is a jar file in disguise: we could change the extension to jar and there would be no difference. Now use the command from above to install it. If you get a message that the plugin is already there, you need to remove it first

bin/elasticsearch-plugin remove elastic-basic-plugin

Then, after installing the plugin, you'll find the following line in the log of elasticsearch when it starts

[2017-01-31T13:42:01,629][WARN ][n.g.e.p.b.BasicPlugin    ] Create the Basic Plugin and installed it into elasticsearch

This is of course a bit silly, so let us create a new rest endpoint that checks whether the elasticsearch cluster contains an index called jettro.

Create a new REST endpoint

The inspiration for this endpoint came from another blog post by David Pilato: Creating a new rest endpoint.

When creating a new endpoint you have to extend the class org.elasticsearch.rest.BaseRestHandler. But before we go there, we first register it in our plugin. To do that we implement the interface org.elasticsearch.plugins.ActionPlugin and its method getRestHandlers.

public class BasicPlugin extends Plugin implements ActionPlugin {
    private final static Logger LOGGER = LogManager.getLogger(BasicPlugin.class);
    public BasicPlugin() {
        super();
        LOGGER.warn("Create the Basic Plugin and installed it into elasticsearch");
    }

    @Override
    public List<Class<? extends RestHandler>> getRestHandlers() {
        return Collections.singletonList(JettroRestAction.class);
    }
}

Next is implementing the JettroRestAction class. Below is the first part: the constructor and the method that handles the request. In the constructor we define the url patterns that this endpoint supports; they should be clear from the code. Functionality wise: if you call the endpoint without an action, or with an action that does not exist, we return a message; if you ask for existence, we return true or false. This handling is done in the prepareRequest method.

public class JettroRestAction extends BaseRestHandler {

    @Inject
    public JettroRestAction(Settings settings, RestController controller) {
        super(settings);
        controller.registerHandler(GET, "_jettro/{action}", this);
        controller.registerHandler(GET, "_jettro", this);
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) throws IOException {
        String action = request.param("action");
        if (action != null && "exists".equals(action)) {
            return createExistsResponse(request, client);
        } else {
            return createMessageResponse(request);
        }
    }
}

We have two utility classes that transform data into XContent: Message and Exists. The implementations of the two methods, createExistsResponse and createMessageResponse, can be found here.
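
To give an idea of what such a response method looks like, below is a minimal sketch of createMessageResponse. The actual implementation in the repository may differ and the message text is made up here:

private RestChannelConsumer createMessageResponse(RestRequest request) {
    return channel -> {
        // Build a small JSON object and send it back with a 200 OK status
        XContentBuilder builder = channel.newBuilder();
        builder.startObject();
        builder.field("message", "Please provide an action, for instance: exists");
        builder.endObject();
        channel.sendResponse(new BytesRestResponse(RestStatus.OK, builder));
    };
}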

Time to re-install the plugin: first build it with maven, then remove the old version and install the new one. Now we can test it in a browser or with curl. I personally use httpie to do the following requests.

Screen Shot 2017 01 31 at 15 23 10

This way we can create our own custom endpoint. Next we dive a little bit deeper into the heart of elastic: we are going to create a custom filter that can be used in an analyser.

Create a custom Filter

The first part is registering the filter in the BasicPlugin class. We need to implement the interface org.elasticsearch.plugins.AnalysisPlugin and override the method getTokenFilters. We register a factory class that instantiates the filter class. The registration is done using a name that can later be used to reference the filter. The method looks like this:

    @Override
    public Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        return Collections.singletonMap("jettro", JettroTokenFilterFactory::new);
    }

The implementation of the factory is fairly basic:

public class JettroTokenFilterFactory extends AbstractTokenFilterFactory {
    public JettroTokenFilterFactory(IndexSettings indexSettings, 
                                    Environment environment, 
                                    String name, 
                                    Settings settings) {
        super(indexSettings, name, settings);
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new JettroOnlyTokenFilter(tokenStream);
    }
}

The filter we are going to create has somewhat strange functionality: it only accepts tokens that are equal to jettro. All other tokens are removed.

public class JettroOnlyTokenFilter extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public JettroOnlyTokenFilter(TokenStream in) {
        super(in);
    }

    @Override
    protected boolean accept() throws IOException {
        return termAtt.toString().equals("jettro");
    }
}

Time to test my freshly created filter. We can do that using the _analyze endpoint

curl -XGET 'localhost:9200/_analyze' -d '
{
  "tokenizer" : "standard",
  "filter" : ["jettro"],
  "text" : "this is a test for jettro"
}'

The response now is

{"tokens":[{"token":"jettro","start_offset":19,"end_offset":25,"type":"","position":5}]}

Concluding

That is it: we have created the foundations of a plugin (thanks to David Pilato), we have written our own _jettro endpoint and we have created a filter that only accepts one specific word, jettro. Ok, I agree the plugin in itself is not very useful, however the construction of the plugin is reusable. Hope you like it and stay tuned for more elastic plugin blogs. We're working on an extension to the synonyms plugin and have some ideas for other plugins.


Elasticsearch 5 is coming, what is new and improved?

The people at elastic have been working on the new 5.0 release of elasticsearch and of all the other products in their stack as well. I have been playing around with the new features since the first alpha release and wrote some blog posts about them. With release candidate 1 out, it is time to write a bit about the new features that I like and the (breaking) changes that I feel are important. Since it is a big release I need a big blog post, so don't say I did not warn you.



Using the new elasticsearch 5 percolator

In the upcoming version 5 of elasticsearch the implementation of the percolator has changed a lot. The percolator moved from being a separate endpoint and API to being a member of the search API. In the new version you execute a percolator query, and the big advantage is that you can now use everything in that query that you could already use in all other queries. In this blogpost I am going to show how to use the new percolator by building a very basic news notification service.



The new elasticsearch java Rest Client – part 2

In the previous blog post I started writing about using the new Rest based client for java in elasticsearch 5. In this blogpost I am going to continue that quest by experimenting with some layers of abstraction using spring features and other considerations.

Spring Factory Bean for RestClient

First I introduce the client factory, responsible for creating the singleton client of type RestClient. It also creates the Sniffer to check for available nodes, like we discussed in the first blog post (see references). The code is the same as before, therefore I am not going to show it again; check the class nl.gridshore.elastic.RestClientFactoryBean in the git repo referenced later on. The RestClientFactoryBean extends the spring AbstractFactoryBean. The result of using the spring factory pattern is that this factory object creates the RestClient instance, which can be injected into other beans.
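
For readers who skipped part 1, below is a minimal sketch of what such a factory bean could look like. It leaves out the Sniffer and uses a hard-coded host, both simplifications on my part:

public class RestClientFactoryBean extends AbstractFactoryBean<RestClient> {

    @Override
    public Class<?> getObjectType() {
        return RestClient.class;
    }

    @Override
    protected RestClient createInstance() {
        // AbstractFactoryBean produces a singleton by default, so this runs only once
        return RestClient.builder(new HttpHost("localhost", 9200, "http")).build();
    }

    @Override
    protected void destroyInstance(RestClient instance) throws Exception {
        // Close the low level client when the Spring context shuts down
        instance.close();
    }
}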

The goal for the application we are going to create is an employee search tool. Therefore I have created an Employee object that I use for indexing, but also as the query result. I do not want the elastic client execution code to be mixed with the employee business code, so I have created an abstraction layer for interacting with the elastic client. The code that executes index requests for employees and queries for employees uses this abstraction layer.
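
Based on the _source in the search response shown below, the Employee object could look roughly like this; getters and setters are omitted and the exact class in the repository may differ:

public class Employee {
    private String name;
    private String email;
    private List<String> specialties;

    @JsonProperty("phone_number")
    private String phoneNumber;

    // getters and setters omitted
}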

The EmployeeService knows the QueryTemplateFactory and uses it to get an instance of a QueryTemplate. The QueryTemplateFactory knows the RestClientFactoryBean, which is responsible for maintaining the one instance of RestClient. The QueryTemplateFactory creates a new instance of the QueryTemplate for each call and injects the RestClient. The QueryTemplate can now be used to inject a string based query and to handle the execution and the response. Of course that is not really what we want, concatenating strings to create a query; there is a better way of doing this using a template engine. But first I want to talk about handling the response.

Handling the response

Handling a response can be challenging. For a basic query it is not too hard, but when interacting with nested structures, and later on when adding aggregations, it becomes a lot harder. For now we focus on a basic response that looks like this:

{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.7373906,
    "hits": [
      {
        "_index": "luminis",
        "_type": "ams",
        "_id": "AVX57b0YY0m5tWxWetET",
        "_score": 0.7373906,
        "_source": {
          "name": "Jettro Coenradie",
          "email": "jettro@gridshore.nl",
          "specialties": [
            "java",
            "elasticsearch",
            "angularjs"
          ],
          "phone_number": "+31612345678"
        }
      }
    ]
  }
}

The QueryTemplate must do the boilerplate, and from the Employee perspective we want to provide just enough information to convert the _source parts into Employee objects. I use a combination of Jackson and generics to accomplish this. First let us have a look at the code that interacts with the QueryTemplate:

public List<Employee> queryForEmployees(String name) {
    Map<String, Object> params = new HashMap<>();
    params.put("name", name);
    params.put("operator", "and");

    QueryTemplate<Employee> queryTemplate = queryTemplateFactory.createQueryTemplate();
    queryTemplate.setIndexString(INDEX);
    queryTemplate.setQueryFromTemplate("find_employee.twig", params);
    queryTemplate.setQueryTypeReference(new EmployeeTypeReference());

    return queryTemplate.execute();
}

As you can see from the code, the QueryTemplate receives the name of the index to query, the name of the twig template (see the next section) and the parameters used by the twig template. The final thing we need to provide is the TypeReference, which is necessary for Jackson to handle the generics. The type reference looks like this:

public class EmployeeTypeReference extends TypeReference<QueryResponse<Employee>> {
}


Finally we call the execute method of the QueryTemplate and notice that it returns a list of Employee objects. To see how this works we have to take a look at the response handling by the QueryTemplate.

public List<T> execute() {
    List<T> result = new ArrayList<>();

    this.queryService.executeQuery(indexString, query(), entity -> {
        try {
            QueryResponse<T> queryResponse = jacksonObjectMapper.readValue(entity.getContent(), this.typeReference);

            queryResponse.getHits().getHits().forEach(tHit -> {
                result.add(tHit.getSource());
            });
        } catch (IOException e) {
            logger.warn("Cannot execute query", e);
        }
    });

    return result;
}


Using Jackson we convert the json based response from the elasticsearch client into a QueryResponse object. Notice that we have a generic type 'T'; Jackson only knows how to handle this when passed the right TypeReference. The java object tree resembles the json structure:

public class QueryResponse<T> {
    private Long took;

    @JsonProperty(value = "timed_out")
    private Boolean timedOut;

    @JsonProperty(value = "_shards")
    private Shards shards;

    private Hits<T> hits;
}

public class Hits<T> {
    private List<Hit<T>> hits;
    private Long total;

    @JsonProperty("max_score")
    private Double maxScore;
}

public class Hit<T> {
    @JsonProperty(value = "_index")
    private String index;

    @JsonProperty(value = "_type")
    private String type;

    @JsonProperty(value = "_id")
    private String id;

    @JsonProperty(value = "_score")
    private Double score;

    @JsonProperty(value = "_source")
    private T source;
}

Notice what happens with the generic type 'T': the source is converted into the provided type. Now look back at the code of the execute method of the QueryTemplate. There we obtain the hits from the QueryResponse and loop over them to obtain the Employee objects.

Using the twig template language

Why did I choose to use a template language to create the query? It is not really safe to just concatenate a few strings, including user input, into one long string and provide that to the QueryTemplate executor. I wanted to stay close to json to make copy-pasting from the elastic console as easy as possible. However, I also wanted some nice features to make creating queries easier. I found jTwig to be powerful enough and very easy to use in java. I do not want to write an extensive blogpost about jTwig; if you want to know more about it, please check the references.

The following code block shows how we use jTwig:

public void setQueryFromTemplate(String templateName, Map<String, Object> modelParams) {
    JtwigTemplate template = JtwigTemplate.classpathTemplate("templates/" + templateName);
    JtwigModel model = JtwigModel.newModel();
    modelParams.forEach(model::with);

    this.query = template.render(model);
}

First we load the template and create the model. In my case I load the template from a file on the class path in the templates folder. Next I add the provided parameters to the model and finally I render the template using the model. The next code block shows the template.

{
    "query":{
{% if (length(name)==0) %}
        "match_all": {}
{% else %}
        "match":{
            "name": {
                "query": "{{ name }}",
                "operator": "{{ default(operator, 'or') }}"
            }
        }
{% endif %}
    }
}

This template supports two parameters: name and operator. First I check whether the user provided something to search for. If the name is empty we just return the match_all query. If the name has length, a match query is created on the field name. Also notice that we can provide a default value for the operator parameter: if no operator is provided we make it or.
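
For example, rendering this template with name set to "Jettro Coenradie" and operator set to "and" produces, whitespace aside, the following query:

{
    "query": {
        "match": {
            "name": {
                "query": "Jettro Coenradie",
                "operator": "and"
            }
        }
    }
}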

Concluding

That is it: we now have an easy way to write down a query in a jTwig template and parse the results using Jackson. It would be very easy to query for another object instead of Employee if we wanted to. The indexing side is similar to the query side; feel free to check this out yourself. All code is available in the github repo.

References

http://jtwig.org – Template language used for the queries.
https://amsterdam.luminis.eu/2016/07/07/the-new-elasticsearch-java-rest-client/ – part 1 of this blog series
https://github.com/jettro/elasticclientdemo – The sample project


Anomaly detection over a normally distributed aggregation in Elasticsearch

In my first month at Luminis I got the chance to dive into a data set stored in Elasticsearch. My assignment was easily phrased: detect the outliers in this data set. Anomaly detection is an interesting technique we would like to apply to different data sets. It can be useful in several business cases, for example to investigate whether something is wrong with these anomalies. This blog post is about detecting the outlying bucket(s) in an aggregation. The assumption is that the data is normally distributed; this can be tested with normality tests, but there is no mathematical definition of what constitutes an outlier.

The domain of this data set is messages to and from different suppliers. The point of interest is whether the number of messages linked to a supplier differs in one week from all the other weeks. The same can be done with different time intervals or over other aggregations, as long as they are normally distributed. A number field can also be used instead of the number of messages in an aggregation (the bucket document count), or, if the success status of a message is stored, that can be used as a boolean or enumeration.

The first step is to choose an interesting aggregation on the data that gives normally distributed bucket sizes. In the example below I chose a Date Histogram Aggregation with a weekly interval. This means that the data is divided into buckets of a week. The messages to and from the suppliers were sent evenly over the weeks, so the sizes of the buckets are normally distributed. Then the following mathematical theory can be applied to determine the anomalies.

The second step is to determine the lower and upper bound of the interval in which most of the normally distributed data lies. This is done by calculating the mean and the standard deviation of the number field; in this case these are calculated over the document count of the weekly buckets. In empirical sciences it is conventional to consider points that don't lie within three standard deviations of the mean an outlier, because 99.73% of the points lie within that interval (the 3-sigma rule). The lower bound of the interval is calculated by subtracting three times the standard deviation from the mean, and the upper bound by adding it to the mean. In Elasticsearch this can be done with the Extended Stats Bucket Aggregation, a Pipeline Aggregation that can be connected to the previous aggregation. We set the parameter sigma to 3.0. Note that Elasticsearch calculates the population standard deviation by taking the root of the population variance. This is mathematically correct here, because the entire population of messages is stored. (If there were more data than we have and we wanted to determine the outliers in our sample, then merely applying Bessel's correction wouldn't be enough; an unbiased estimation of the standard deviation would need to be used.) Both steps can be done in the same GET request in Sense.

GET indexWithMessages/_search
{
  "size": 0,
  "aggs": {
    "byWeeklyAggregation": {
      "date_histogram": {
        "field": "datumOntvangen",
        "interval": "week"
      }
    },
    "extendedStatsOfDocCount": {
      "extended_stats_bucket": {
        "buckets_path": "byWeeklyAggregation>_count",
        "sigma": 3.0
      }
    }
  }
}

The third step is to do the same query with the same aggregation, but this time only select the buckets with a document count outside the calculated interval. This can be done by connecting a different Pipeline Aggregation, namely the Bucket Selector Aggregation. The lower and upper bound need to be taken from the std_deviation_bounds of the extendedStatsOfDocCount aggregation in the response; in my case these are 11107.573992258556 and 70207.24418955963, respectively. Unfortunately the buckets path syntax doesn't allow going up in the aggregation tree, otherwise the two requests could be combined. It is possible to get the lower bound and the upper bound out of the Extended Stats Aggregation in a buckets path, but the syntax is not intuitive. See my question on discuss and the issue raised from it.

GET indexWithMessages/_search
{
  "size": 0,
  "aggs": {
    "byWeeklyAggregation": {
      "date_histogram": {
        "field": "datumOntvangen",
        "interval": "week"
      },
      "aggs": {
        "outlierSelector": {
          "bucket_selector": {
            "buckets_path": {
              "doc_count": "_count"
            },
            "script": "doc_count < 11107.573992258556 || 70207.24418955963 < doc_count"
          }
        }
      }
    }
  }
}

Now we have the weeks (buckets) that are outliers. Further investigation with domain knowledge can be done with this information. In this case, it could for example be a vacation or a supplier could have done less or more in a certain week.