Tag Archive: elasticsearch



Elasticsearch instances for integration testing

In my latest project I have implemented all communication with my Elasticsearch cluster using the high level REST client. My next step was to set up and tear down an Elasticsearch instance automatically in order to facilitate proper integration testing. This article describes three different ways of doing so and discusses some of their pros and cons. Please refer to this repository for implementations of all three methods.

docker-maven-plugin

This generic Docker plugin allows you to bind the starting and stopping of Docker containers to Maven lifecycle phases. You specify two blocks within the plugin: configuration and executions. In the configuration block, you choose the image that you want to run (Elasticsearch 6.3.0 in this case), the ports that you want to expose, a health check and any environment variables. See the snippet below for a complete example:

<plugin>
    <groupId>io.fabric8</groupId>
    <artifactId>docker-maven-plugin</artifactId>
    <version>${version.io.fabric8.docker-maven-plugin}</version>
    <configuration>
        <imagePullPolicy>always</imagePullPolicy>
        <images>
            <image>
                <alias>docker-elasticsearch-integration-test</alias>
                <name>docker.elastic.co/elasticsearch/elasticsearch:6.3.0</name>
                <run>
                    <namingStrategy>alias</namingStrategy>
                    <ports>
                        <port>9299:9200</port>
                        <port>9399:9300</port>
                    </ports>
                    <env>
                        <cluster.name>integration-test-cluster</cluster.name>
                    </env>
                    <wait>
                        <http>
                            <url>http://localhost:9299</url>
                            <method>GET</method>
                            <status>200</status>
                        </http>
                        <time>60000</time>
                    </wait>
                </run>
            </image>
        </images>
    </configuration>
    <executions>
        <execution>
            <id>docker:start</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>start</goal>
            </goals>
        </execution>
        <execution>
            <id>docker:stop</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

You can see that I’ve bound the plugin to the pre- and post-integration-test lifecycle phases. By doing so, the Elasticsearch container will be started just before any integration tests are run and will be stopped after the integration tests have finished. I’ve used the maven-failsafe-plugin to trigger the execution of tests ending with *IT.java in the integration-test lifecycle phase.

Since this is a generic Docker plugin, there is no special functionality to easily install Elasticsearch plugins that may be needed during your integration tests. You could, however, create your own image with the required plugins pre-installed and pull that image during your integration tests.

The integration with IntelliJ is also not optimal. When running an *IT.java class, IntelliJ will not trigger the correct lifecycle phases and will attempt to run your integration test without creating the required Docker container. Before running an integration test from IntelliJ, you need to manually start the container from the “Maven projects” view by running the docker:start command:

Maven Projects view in IntelliJ

After running, you will also need to run the docker:stop command to kill the container that is still running. If you forget to kill the running container and want to run mvn clean install later on, it will fail, since the build will attempt to create a container on the same port; as far as I know, the plugin does not allow random ports to be chosen.

Pros:

  • Little setup, only requires configuration of one Maven plugin

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • No out of the box functionality to install extra Elasticsearch plugins
  • Extra dependency in your build pipeline (Docker)
  • IntelliJ does not trigger the correct lifecycle phases

elasticsearch-maven-plugin

This second plugin does not require Docker and only needs some Maven configuration to get started. See the snippet below for a complete example:

<plugin>
    <groupId>com.github.alexcojocaru</groupId>
    <artifactId>elasticsearch-maven-plugin</artifactId>
    <version>${version.com.github.alexcojocaru.elasticsearch-maven-plugin}</version>
    <configuration>
        <version>${version.org.elastic}</version>
        <clusterName>integration-test-cluster</clusterName>
        <transportPort>9399</transportPort>
        <httpPort>9299</httpPort>
    </configuration>
    <executions>
        <execution>
            <id>start-elasticsearch</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>runforked</goal>
            </goals>
        </execution>
        <execution>
            <id>stop-elasticsearch</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Again, I’ve bound the plugin to the pre- and post-integration-test lifecycle phases in combination with the maven-failsafe-plugin.

This plugin provides a way of starting the Elasticsearch instance from IntelliJ in much the same way as the docker-maven-plugin: you can run the elasticsearch:runforked command from the “Maven projects” view. However, in my case this started the instance and then immediately exited. There is also no out of the box possibility of setting a random port for your instance. Of course, there are solutions to this at the expense of a somewhat more complex Maven configuration.

Overall, this is a plugin that seems to provide almost everything we need with a lot of configuration options. You can automatically install Elasticsearch plugins or even bootstrap your instance with data.

In practice I did have some problems using the plugin in my build pipeline. The build would sometimes fail while downloading the Elasticsearch zip, or in other cases while attempting to download a plugin. Your mileage may vary, but this was reason enough for me to keep looking for another solution, which brings me to plugin number three.

Pros:

  • Little setup, only requires configuration of one Maven plugin
  • No extra external dependencies
  • High amount of configuration possible

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • Poor integration with IntelliJ
  • Seems unstable

testcontainers-elasticsearch

This third plugin is different from the other two. It uses a Java testcontainer that you can configure through Java code. This gives you a lot of flexibility and requires no Maven configuration. Since there is no Maven configuration, it does require some work to make sure the Elasticsearch container is started and stopped at the correct moments.

In order to realize this, I have extended the standard SpringJUnit4ClassRunner class with my own ElasticsearchSpringRunner. In this runner, I have added a new JUnit RunListener named JUnitExecutionListener. This listener defines two methods testRunStarted and testRunFinished that enable me to start and stop the Elasticsearch container at the same points in time that the pre- and post-integration-test Maven lifecycle phases would. See the snippet below for the implementation of the listener:

public class JUnitExecutionListener extends RunListener {

    private static final Logger LOGGER = LoggerFactory.getLogger(JUnitExecutionListener.class);
    private static final String ELASTICSEARCH_IMAGE = "docker.elastic.co/elasticsearch/elasticsearch";
    private static final String ELASTICSEARCH_VERSION = "6.3.0";
    private static final String ELASTICSEARCH_HOST_PROPERTY = "nl.luminis.articles.maven.elasticsearch.host";
    private static final int ELASTICSEARCH_PORT = 9200;

    private ElasticsearchContainer container;

    @Override
    public void testRunStarted(Description description) {
        // Create a Docker Elasticsearch container when there is no existing host defined in default-test.properties.
        // Spring will use this property to configure the application when it starts.
        if (System.getProperty(ELASTICSEARCH_HOST_PROPERTY) == null) {
            LOGGER.debug("Create Elasticsearch container");
            int mappedPort = createContainer();
            System.setProperty(ELASTICSEARCH_HOST_PROPERTY, "localhost:" + mappedPort);
            String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
            RestAssured.basePath = "";
            RestAssured.baseURI = "http://" + host.split(":")[0];
            RestAssured.port = Integer.parseInt(host.split(":")[1]);
            LOGGER.debug("Created Elasticsearch container at {}", host);
        }
    }

    @Override
    public void testRunFinished(Result result) {
        if (container != null) {
            String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
            LOGGER.debug("Removing Elasticsearch container at {}", host);
            container.stop();
        }
    }

    private int createContainer() {
        container = new ElasticsearchContainer();
        container.withBaseUrl(ELASTICSEARCH_IMAGE);
        container.withVersion(ELASTICSEARCH_VERSION);
        container.withEnv("cluster.name", "integration-test-cluster");
        container.start();
        return container.getMappedPort(ELASTICSEARCH_PORT);
    }
}

It will create an Elasticsearch Docker container on a random port for use by the integration tests. The best thing about having this runner is that it works perfectly fine in IntelliJ. Simply right-click and run your *IT.java classes annotated with @RunWith(ElasticsearchSpringRunner.class) and IntelliJ will use the listener to set up the Elasticsearch container. This allows you to automate your build pipeline while still keeping developers happy.
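For completeness, below is a minimal sketch of what such a runner could look like. This is my own assumption rather than the exact class from the repository: it calls testRunStarted itself and registers the listener with the notifier before delegating to the regular Spring runner.

import org.junit.runner.notification.RunNotifier;
import org.junit.runners.model.InitializationError;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

// Hypothetical sketch; the real ElasticsearchSpringRunner in the repository may differ.
public class ElasticsearchSpringRunner extends SpringJUnit4ClassRunner {

    // Shared across test classes so the container is created at most once per JVM.
    private static final JUnitExecutionListener LISTENER = new JUnitExecutionListener();

    public ElasticsearchSpringRunner(Class<?> clazz) throws InitializationError {
        super(clazz);
    }

    @Override
    public void run(RunNotifier notifier) {
        // Ensure the container exists before any test in this class runs. The listener
        // checks the host system property itself, so repeated calls are harmless.
        LISTENER.testRunStarted(getDescription());
        // Register the listener so the notifier can call testRunFinished at the end of the run.
        notifier.addListener(LISTENER);
        super.run(notifier);
    }
}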

Pros:

  • Neat integration with Java and therefore with your IDE
  • Sufficient configuration options out of the box

Cons:

  • More complex initial setup
  • Extra dependency in your build pipeline (Docker)

In summary, all three of the above plugins are able to realize the goal of starting an Elasticsearch instance for your integration testing. For me personally, I will be using the testcontainers-elasticsearch plugin going forward. The extra Docker dependency is not a problem since I use Docker in most of my build pipelines anyway. Furthermore, the integration with Java allows me to configure things in such a way that it works perfectly fine from both the command line and the IDE.

Feel free to check out the code behind this article, play around with the integration tests that I’ve set up there and decide for yourself which plugin suits your needs best. Please note that the project has a special Maven profile that separates unit tests from integration tests. Build the project using mvn clean install -P integration-test to run both.


A fresh look at Logstash

Soon after the release of Elasticsearch it became clear that it was good at more than providing search: it turned out that it could also be used to store logs very effectively. That is why Logstash started using Elasticsearch. It contained standard parsers for Apache httpd logs. To obtain the logs it had file monitoring plugins, it had plugins to extend and filter the content, and it had plugins to send the content to Elasticsearch. That is Logstash in a nutshell, back in the day. Of course the logs also had to be shown, and for that a tool called Kibana was created. Kibana was a nice tool to create highly interactive dashboards to show and analyse your data. Together they became the famous ELK suite. Nowadays we have a lot more options in all these tools. We have the ingest node in Elasticsearch to pre-process documents before they are indexed, we have Beats to monitor files, databases, machines, etc., and we have very nice new Kibana dashboards. Time to re-investigate what the combination of Logstash, Elasticsearch and Kibana can do. In this blog post I’ll focus on Logstash.

X-Pack

As the company Elastic has to make some money as well, they have created a product called X-Pack. X-Pack has a lot of features that sometimes span multiple products. There is a security component; by using it you can make users log in when using Kibana and secure your content. Other interesting parts of X-Pack are machine learning, graph and monitoring. Parts of X-Pack can be used free of charge, though you do need a license; for other parts you need a paid license. I personally like the monitoring part, so I regularly install X-Pack. In this blog post I’ll also investigate the X-Pack features for Logstash. I’ll focus on out-of-the-box functionality and mostly on what all these nice new things like monitoring and pipeline viewing bring us.

Using the version 6 release candidate

As Elastic has already given us an RC1 of their complete stack, I’ll use that one for the evaluation. Beware though, this is still a release candidate, so it is not production ready.

What does Logstash do

If you have never really heard about Logstash, let me give you a very short introduction. Logstash can be used to obtain data from a multitude of different sources, then filter, transform and enrich the data, and finally store it in, again, a multitude of destinations. Example data sources are relational databases, files, queues and websockets. Logstash ships with a large number of filter plugins; with these we can process data to exclude some fields. We can also enrich data, look up information about IP addresses, or look up records belonging to an id in, for instance, Elasticsearch or a database. After the lookup we can add data to the document or event that we are handling before sending it to one or more outputs. Outputs can be Elasticsearch or a database, but also queues like Kafka or RabbitMQ.

In later releases Logstash started to add more features that a tool handling large amounts of data over longer periods needs. Things like monitoring and clustering of nodes were introduced, as well as persisting incoming data to disk. By now Logstash, in combination with Kibana and Elasticsearch, is used by very large companies but also by a lot of start-ups to monitor their servers and handle all sorts of interesting data streams.

Enough of this talk, let us get our hands dirty. The first step is to install everything on our developer machine.

Installation

I’ll focus on the developer machine; if you want to install it on a server, please refer to the extensive Logstash documentation.

First download the zip or tar.gz file and extract it to a convenient location. Now create a folder where you can store the configuration files. To keep the files small and to show you that you can split them, I create three different files in this folder: input.conf, filters.conf and output.conf. The most basic configuration is one with stdin for input, no filters and stdout for output. Below are the contents of the input and output files:

input {
	stdin{}
}
output { 
	stdout { 
		codec => rubydebug
	}
}

Time to start Logstash. Step into the downloaded and extracted folder with the Logstash binaries and execute the following command:

bin/logstash -r -f ../logstashblog/

The -r flag can be used during development to reload the configuration on change. Beware, this does not work with the stdin plugin. With -f we tell Logstash to load a configuration file or directory, in our case the directory containing the three files mentioned above. When Logstash is ready it will print something like this:

[2017-10-28T19:00:19,511][INFO ][logstash.pipeline        ] Pipeline started {"pipeline.id"=>"main"}
The stdin plugin is now waiting for input:
[2017-10-28T19:00:19,526][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}

Now you can type something and the result is the created document or event that went through the almost empty pipeline. The thing to notice is that we now have a field called message containing the text we entered.

Just some text for input
{
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
    "@timestamp" => 2017-10-28T17:02:18.185Z,
       "message" => "Just some text for input"
}

Now that we know it is working, have a look at the monitoring options available through the REST endpoint:

http://localhost:9600/

{
"host": "Jettros-MBP.fritz.box",
"version": "6.0.0-rc1",
"http_address": "127.0.0.1:9600",
"id": "20290d5e-1303-4fbd-9e15-03f549886af1",
"name": "Jettros-MBP.fritz.box",
"build_date": "2017-09-25T20:32:16Z",
"build_sha": "c13a253bb733452031913c186892523d03967857",
"build_snapshot": false
}

You can use the same url with different endpoints to get information about the node, the plugins, stats and hot threads:
http://localhost:9600/_node
http://localhost:9600/_node/plugins
http://localhost:9600/_node/stats
http://localhost:9600/_node/hot_threads

It becomes a lot more fun if we have a UI, so let us install X-Pack into Logstash. Before we can run Logstash with monitoring enabled, we need to install Elasticsearch and Kibana with X-Pack installed into those as well. Refer to the X-Pack documentation on how to do that.

The basic commands to install X-Pack into Elasticsearch and Kibana are very easy. For now I disable security by adding the following line to both kibana.yml and elasticsearch.yml: xpack.security.enabled: false. After installing X-Pack into Logstash we have to add the following lines to the logstash.yml file in the config folder:

xpack.monitoring.elasticsearch.url: ["http://localhost:9200"] 
xpack.monitoring.elasticsearch.username:
xpack.monitoring.elasticsearch.password:

Notice the empty username and password; this is required when security is disabled. Now move over to Kibana, open the monitoring tab (the heart-shaped icon) and click on Logstash. In the first screen you can see the events; they could be zero, so please enter some events first. Now move to the pipeline tab. Of course with our basic pipeline there is not much to see yet, but imagine what it will show later on.

The Logstash pipeline view in the Kibana monitoring UI

Time to get some real input.

Import the Signalmedia dataset

Signalmedia has provided a dataset you can use for research. More information about the dataset and how to obtain it can be found here. The dataset contains exactly 1 million news documents. You can download it as a single file that contains one JSON document per line. Each JSON document has the following format:

{
   "id": "a080f99a-07d9-47d1-8244-26a540017b7a",
   "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
   "title": "Pay up or face legal action: DBKL",
   "media-type": "News",
   "source": "My Sinchew",
   "published": "2015-09-15T10:17:53Z"
}

We want to import this big file with all the JSON documents as separate documents into Elasticsearch using Logstash. The first step is to create a Logstash input. We can use the Logstash file plugin to load the file, point the path option at the file, tell it to start at the beginning and parse each line as a JSON document. The file plugin has more options you can use; it can also handle rolling files, which are used a lot in logging.

input {
	file {
        path => "/Volumes/Transcend/signalmedia-1m.jsonl"
        codec => "json"
        start_position => beginning 
    }
}

That is it, with the stdout plugin and the rubydebug codec this would give the following output.

{
          "path" => "/Volumes/Transcend/signalmedia-1m.jsonl",
    "@timestamp" => 2017-10-30T18:49:45.948Z,
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
            "id" => "a080f99a-07d9-47d1-8244-26a540017b7a",
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Notice that besides the fields we expected (id, content, title, media-type, source and published) we also get some additional fields. Before sending this to Elasticsearch we want to clean it up: we do not need path, host, @timestamp and @version. There is also something special about the id field. We want to use it as the id of the document in Elasticsearch, but we do not want to add it to the document itself. If we need the value of a field in the output plugin later on without adding it to the document, we can copy it to the @metadata object. That is exactly what the first part of the filter does. The second part removes the fields we do not need.

filter {
	mutate {
		copy => {"id" => "[@metadata][id]"}
	}
	mutate {
		remove_field => ["@timestamp", "@version", "host", "path", "id"]
	}
}

With these filters in place the output of the same document would become:

{
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Now the content is ready to be sent to Elasticsearch, so we need to configure the elasticsearch output plugin. When sending data to Elasticsearch you first need to think about creating the index and the mapping that goes with it. In this example I am going to use an index template. I am not going to explain a lot about the mapping, as this post is about Logstash rather than Elasticsearch. With the following configuration we install the mapping template when connecting to Elasticsearch, after which we can insert all documents. Do look at the way the document_id is created. Remember how we copied the id field into @metadata? This is the reason why we did it: now we use that value as the id of the document when inserting it into Elasticsearch.

output {
	elasticsearch {
		index => "signalmedia"
		document_id => "%{[@metadata][id]}"
		document_type => "doc"
		manage_template => "true"
		template => "./signalmedia-template.json"
		template_name => "signalmediatemplate"
	}
	stdout { codec => dots }
}

Notice there are two outputs configured: the elasticsearch output of course, but also a stdout. This time not with the rubydebug codec, as that would be way too verbose; we use the dots codec instead. This codec prints a dot for each document it processes.

For completeness I also want to show the mapping template. In this case I placed it in the root folder of the Logstash distribution; usually this would of course be an absolute path somewhere else.

{
  "index_patterns": ["signalmedia"],
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 3
  },
  "mappings": {
    "doc": {
      "properties": {
        "source": {
          "type": "keyword"
        },
        "published": {
          "type": "date"
        },
        "title": {
          "type": "text"
        },
        "media-type": {
          "type": "keyword"
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Now we want to import all the million documents and have a look at the monitoring along the way. Let’s do it.

Logstash monitoring in Kibana during the import of the documents

Running a query

Of course we have to prove the documents are now available in Elasticsearch, so let’s execute one of my favourite queries, which makes use of the new significant text aggregation. First the request and then parts of the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "netherlands"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

Below is just a very small part of the response; I stripped out a lot of the elements to make it easier to read. It is good to see dutch show up as a significant term when searching for netherlands, and of course geenstijl.

"buckets": [
  {"key": "netherlands","doc_count": 527},
  {"key": "dutch","doc_count": 196},
  {"key": "mmsi","doc_count": 7},
  {"key": "herikerbergweg","doc_count": 4},
  {"key": "konya","doc_count": 14},
  {"key": "geenstijl","doc_count": 3}
]

Concluding

Good to see the nice UI options in Kibana. The pipeline viewer is very useful. In a next blog post I’ll be looking at Kibana and all the new and interesting things in there.


Elasticsearch 6 is coming

For some time now, Elastic has been releasing preview versions of the new major release, Elasticsearch 6. At this moment the latest edition is already RC1, so it is time to start thinking about migrating to the latest and greatest. What backwards compatibility issues will you run into and what new features can you start using? This blog post gives a summary of the items that are most important to me, based on the projects that I do. First we’ll have a look at the breaking changes, then we move on to new features and interesting upgrades.

Breaking changes

Most of the breaking changes described here come from the Elasticsearch documentation, which you can of course also read yourself.

Migrating indexes from previous versions

As with every major release, only indexes created in the previous major version can be migrated automatically. So if you have an index created in 2.x, migrated it to 5.x and now want to start using 6.x, you have to use the reindex API to first reindex it into a new 5.x index before migrating.

Index types

In Elasticsearch 6 the first step is taken towards indexes without types: a new index allows only a single type, while indexes migrated from 5.x can keep using multiple types. Starting with Elasticsearch 5.6 you can already prevent people from creating indexes with multiple types, which will make it easier to migrate to 6.x when it becomes available. Apply the following configuration option to enforce a single type per index:

index.mapping.single_type: true

More reasoning about why types need to be removed can be found in the Elasticsearch documentation on the removal of types. Also, if you are into parent-child relationships in Elasticsearch and are curious what the implications of not being able to use multiple types are, check the parent-join documentation page. Yes, we will get joins in Elasticsearch :-), though with very limited use.

Java High Level REST Client

This was already introduced in 5.6, but it is still good to know, as this will be the replacement for the Transport Client. As you might know, I am also creating some code to use in Java applications on top of the Low Level REST Client for Java that is also used by this new client. More information about my work can be found here: part 1 and part 2.

Uniform response for create/update and delete requests

At the moment a create request returns a response field created with true/false, and a delete request returns found with true/false. If you are parsing the response and using these fields, that no longer works; use the result field instead. It will have the value created or updated in case of a create request and deleted or not_found in case of a delete request.
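As an illustration, this is roughly how such a check could look with the Java client; a minimal sketch, assuming you work with the usual IndexResponse and DeleteResponse objects.

import org.elasticsearch.action.DocWriteResponse;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.index.IndexResponse;

public class ResultFieldCheck {

    // Instead of the removed created flag, inspect the result field.
    public static boolean wasCreated(IndexResponse response) {
        return response.getResult() == DocWriteResponse.Result.CREATED;
    }

    // Instead of the removed found flag, inspect the result field.
    public static boolean wasDeleted(DeleteResponse response) {
        return response.getResult() == DocWriteResponse.Result.DELETED;
    }
}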

Translog change

The translog is used to keep operations that have not been flushed to disk yet by Elasticsearch. In prior releases the translog files were removed when Elasticsearch had performed a flush. However, due to optimisations made for recovery, having the translog around can speed up the recovery process. Therefore the translog is now kept by default for 12 hours, up to a maximum size of 512 MB.
More information about the translog can be found here: Translog.

Java Client

In a lot of Java projects the Java client is used; I have used it as well in numerous projects. However, with the introduction of the High Level REST Client, Java projects should move away from the Transport Client. If you want or need to keep using it for now, be aware that some packages have changed and some methods have been removed. The one I used most is the ordering of aggregations; think of Terms.Order and Histogram.Order. They have been replaced by BucketOrder.
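A minimal sketch of the change, assuming the 6.x Java API and a made-up terms aggregation on a source field:

import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.BucketOrder;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;

public class AggregationOrderExample {

    public static TermsAggregationBuilder topSources() {
        // Before 6.0 the last line would have been .order(Terms.Order.count(false)).
        return AggregationBuilders.terms("by_source")
                .field("source")
                .order(BucketOrder.count(false)); // descending by document count
    }
}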

Index mappings

There are two important changes that can affect your way of working with Elasticsearch. The first is the way booleans are handled. In indexes created in version 6, a boolean accepts only two values: true and false. All other values will result in an exception.

The second change is the _all field. In prior versions, by default an _all field was created into which the values of all fields were copied as strings and analysed with the standard analyser. This field was used by queries like the query_string query. There was however a performance penalty, as we had to analyse and index a potentially big extra field. Soon it became a best practice to disable the field. In Elasticsearch 6 the field is disabled by default and it cannot be enabled for indices created with Elasticsearch 6. If you still use the query_string query, it is now executed against each field. You should be very careful with the query_string query. It comes with a lot of power: users get a lot of options to create their own query. But with great power comes great responsibility. They can create very heavy queries, and they can create queries that break without a lot of feedback. More information can be found in the query_string documentation. If you still want to give your users more control, but the query_string query is one step too far, think about creating your own search DSL. Some ideas can be found in my previous blog posts: Creating a search DSL and Part 2 of creating a search DSL.

Booting elasticsearch

Some things changed with the startup options. You cannot configure the user Elasticsearch runs with if you use the deb or rpm packages, and the location of the configuration files is now configured differently. You now have to export the path where all configuration files (elasticsearch.yml, jvm.options and log4j2.properties) can be found, using the environment variable ES_PATH_CONF. I use this regularly on my local machine. As I often have multiple projects running with different versions of Elasticsearch, I have set up a structure where I put my config files in folders separate from the Elasticsearch distributable. You can find the structure in the image below. In the beginning I just copy the config files to my project-specific folder. When I start the project with the script startNode.sh, the following script is executed.

Elastic folder structure

#!/bin/bash

CURRENT_PROJECT=$(pwd)

export ES_PATH_CONF=$CURRENT_PROJECT/config

DATA=$CURRENT_PROJECT/data
LOGS=$CURRENT_PROJECT/logs
REPO=$CURRENT_PROJECT/backups
NODE_NAME=Node1
CLUSTER_NAME=playground

BASH_ES_OPTS="-Epath.data=$DATA -Epath.logs=$LOGS -Epath.repo=$REPO -Enode.name=$NODE_NAME -Ecluster.name=$CLUSTER_NAME"

ELASTICSEARCH=$HOME/Development/elastic/elasticsearch/elasticsearch-6.0.0-rc1

$ELASTICSEARCH/bin/elasticsearch $BASH_ES_OPTS

Now when you need additional configuration options, add them to the elasticsearch.yml. If you need more memory for the specific project, change the jvm.options file.

Plugins

When indexing PDF or Word documents, a lot of you out there have been using the mapper-attachments plugin. This was already deprecated; now it has been removed. You can switch to the ingest attachment plugin. Never heard about ingest? Ingest can be used to pre-process documents before they are indexed by Elasticsearch. It is a lightweight alternative to Logstash, running within Elasticsearch. Be warned though that plugins like the attachment processor can be heavy on your cluster, so it is wise to have a separate node for ingest. Curious what you can do to ingest the contents of a PDF? The next few steps show you the commands to create the ingest pipeline, send a document to it and obtain it again or query it.

First create the ingest pipeline:

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

Now when indexing a document containing the attachment as a base64 encoded string in the field data, we need to tell Elasticsearch to use a pipeline. Check the parameter in the URL: pipeline=attachment. This is the name used when creating the pipeline.

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": ""
}

We could stop here, but how do we get base64 encoded input from, for instance, a PDF? On Linux and macOS you can use the base64 command for that. Below is a script that reads a specific PDF and creates a base64 encoded string out of it. This string is then pushed to Elasticsearch.

#!/bin/bash

pdf_doc=$(base64 ~/Desktop/Programma.pdf)

curl -XPUT "http://localhost:9200/my_index/my_type/my_id?pipeline=attachment" -H 'Content-Type: application/json' -d '{"data" : "'"$pdf_doc"'"}'

Scripts

If you are heavy into scripting in Elasticsearch you need to check a few things. Changes have been made to the use of the lang attribute when obtaining or updating scripts: you can no longer provide it. Also, support for languages other than Painless has been removed.

Search and query DSL changes

Most of the changes in this area are very specific. I am not going to sum them all up; please check the original documentation. Some of them I do want to mention, as they are important to me.

  • If you are constructing queries and it can happen that you end up with an empty query, you can no longer provide an empty object { }. You will get an exception if you keep doing so.
  • Bool queries had a disable_coord parameter; with it you could tell the scoring function not to penalise the score for missing search terms. This option has been removed.
  • You could transform a match query into a match_phrase query by specifying a type. This is no longer possible; you should just create a match_phrase query if you need it. Therefore the slop parameter has also been removed from the match query.

Calculating the score

In the beginning of Elasticsearch, the score for a document matching a query was calculated using an adjusted formula for TF/IDF. It turned out that for fields containing smaller amounts of text TF/IDF was less than ideal, so the default scoring algorithm was replaced by BM25. Moving away from TF/IDF to BM25 was the scoring topic of version 5. Now, with 6, two mechanisms have been removed from the scoring: query normalization and the coordination factor. Query normalization was always hard to explain during trainings. It was an attempt to normalise the scores of queries, so that it would be possible to compare them. However, it did not work, and you still should not compare scores of different queries. The coordination factor was a penalty applied when searching for multiple terms and not all of them were found. You could easily see this when using the explain API.

That is it for the breaking changes. There are more changes that you might want to investigate if you are really into all the Elasticsearch details; have a look at the original documentation.

Next up, cool new features

Now let us zoom in on some of the newer features or interesting upgrades.

Sequence Numbers

Sequence numbers are now assigned to all index, update and delete operations. Using this number, a shard that went offline for a moment can ask the primary shard for all operations after a certain sequence number. If the translog is still available (remember that the translog is now kept around for 12 hours or up to 512 MB by default), the missing operations can be sent to the shard, preventing a complete resync of the shard’s contents.

Test Normalizer using analyse endpoint

One of the most important parts of Elasticsearch is configuring the mapping for your documents: how do you adjust the terms that you can search for based on the provided text? If you are not sure and you want to try out a specific tokeniser and filter combination, you can use the analyze endpoint. Have a look at the following code sample and response, where we try out a whitespace tokeniser with a lowercase filter.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "coenradie",
      "start_offset": 7,
      "end_offset": 16,
      "type": "word",
      "position": 1
    }
  ]
}

As you can see we now get two tokens and the uppercase characters are replaced by their lowercase counterparts. What if we do not want the text to become two terms, but want it to stay as one term, while still replacing the uppercase characters with their lowercase counterparts? This was not possible in the beginning. However, with the introduction of the normalizer, a special analyser for fields of type keyword, it became possible. In Elasticsearch 6 we now have the functionality to use the analyze endpoint for normalizers as well. Check the following code block for an example.

PUT people
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "normalizer": {
        "name_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

GET people/_analyze
{
  "normalizer": "name_normalizer",
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro coenradie",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}

LogLog-Beta

Ever heard about HyperLogLog or even HyperLogLog++? Well, then you will be happy with LogLog-Beta. Some background: Elasticsearch comes with a cardinality aggregation which can be used to calculate, or rather estimate, the number of distinct values. If we wanted an exact value, we would have to keep a map containing all unique values, which would require an extensive amount of memory. You can specify a threshold under which the number of unique values will be close to exact; however, the maximum value for this threshold is 40000. Before, Elasticsearch used the HyperLogLog++ algorithm to estimate the number of unique values. With the new algorithm, LogLog-Beta, you get better results with lower error margins at the same performance.
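You do not have to change anything to benefit from the new algorithm; the cardinality aggregation itself stays the same. A minimal sketch using the Java API, with a made-up field name, where the precision threshold controls up to which point the count stays close to exact:

import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.cardinality.CardinalityAggregationBuilder;

public class DistinctSourcesExample {

    public static CardinalityAggregationBuilder distinctSources() {
        return AggregationBuilders.cardinality("distinct_sources")
                .field("source")
                .precisionThreshold(40000); // the maximum allowed threshold
    }
}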

Significant Text Aggregation

For some time the significant terms aggregation has been available. The idea behind this aggregation is to find terms that are common in a specific scope and less common in a more general scope. Imagine we want to find, from logs with page visits, those users of our website that place relatively many orders compared to the pages they visit. You cannot find them by just counting the number of orders; you need to find the users that are more common in the set of orders than in the set of page visits. In prior versions this was already possible with terms, i.e. not-analysed fields. By enabling fielddata or doc_values you could use small analysed fields, but for larger text fields this was a performance problem. Now, with the significant text aggregation, we can overcome this problem. It also comes with interesting functionality to deduplicate text (think about emails with the original text in a reply, or retweets).

Sounds a bit too vague? OK, let’s have an example. The Elasticsearch documentation uses a dataset from Signal Media; as it is an interesting dataset to work with, I will also use it, and you can try it out yourself as well. I downloaded the file and imported it into Elasticsearch using Logstash. This gist should help you. Now on to the query and the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "rain"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

So we are looking for documents with the word rain. In these documents we are going to look for terms that occur more often than in the global context.

{
  "took": 248,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 11722,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_sampler": {
      "doc_count": 600,
      "keywords": {
        "doc_count": 600,
        "bg_count": 1000000,
        "buckets": [
          {
            "key": "rain",
            "doc_count": 544,
            "score": 69.22167699861609,
            "bg_count": 11722
          },
          {
            "key": "showers",
            "doc_count": 164,
            "score": 32.66807368214775,
            "bg_count": 2268
          },
          {
            "key": "rainfall",
            "doc_count": 129,
            "score": 24.82562838569881,
            "bg_count": 1846
          },
          {
            "key": "thundery",
            "doc_count": 28,
            "score": 20.306396677050884,
            "bg_count": 107
          },
          {
            "key": "flooding",
            "doc_count": 153,
            "score": 17.767450110864743,
            "bg_count": 3608
          },
          {
            "key": "meteorologist",
            "doc_count": 63,
            "score": 16.498915662650603,
            "bg_count": 664
          },
          {
            "key": "downpours",
            "doc_count": 40,
            "score": 13.608547008547008,
            "bg_count": 325
          },
          {
            "key": "thunderstorms",
            "doc_count": 48,
            "score": 11.771851851851853,
            "bg_count": 540
          },
          {
            "key": "heatindex",
            "doc_count": 5,
            "score": 11.56574074074074,
            "bg_count": 6
          },
          {
            "key": "southeasterlies",
            "doc_count": 4,
            "score": 11.104444444444447,
            "bg_count": 4
          }
        ]
      }
    }
  }
}

Interesting terms when looking for rain: showers, rainfall, thundery, flooding, etc. These terms could now be returned to the user as possible candidates for improving their search results.

Concluding

That is it for now. I haven’t even scratched the surface of all the new cool stuff in the other components like X-Pack, Logstash and Kibana. More to come.


Part 2 of Creating a search DSL

In my previous blog post I wrote about creating your own search DSL using Antlr. In that post I discussed the Antlr grammar for constructs like AND/OR, multiple words and combining words. In this blog post I show how to use the visitor mechanism to generate actual Elasticsearch queries.

If you have not read the first post yet, please do so; it will make it easier to follow along. If you want the code, please visit the GitHub page.

https://amsterdam.luminis.eu/2017/06/28/creating-search-dsl/
Github repository

What queries to use

In the previous blog post we ended up with some of the queries we want to support:

  • apple
  • apple OR juice
  • apple raspberry OR juice
  • apple AND raspberry AND juice OR Cola
  • “apple juice” OR applejuice

Based on these queries we have some choices to make. The first query seems obvious: searching for one word becomes a match query. However, which field do you want to search in? In Elasticsearch there is a special field called the _all field. In the examples we use the _all field; however, it would be easy to create a query against a number of specific fields using a multi_match instead, as in the sketch below.
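To make that choice concrete, here are the two variants as JSON strings, built the same way the visitor later in this post builds its strings; the specific fields title and content are made up for the example.

public class TermQueryVariants {

    // The variant used in this post: a match query against the _all field.
    public static String matchAllField(String words) {
        return "{\"match\": {\"_all\":\"" + words + "\"}}";
    }

    // A possible alternative: a multi_match query against a couple of specific fields.
    public static String multiMatchFields(String words) {
        return "{\"multi_match\": {\"query\":\"" + words + "\",\"fields\": [\"title\",\"content\"]}}";
    }
}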

In the second example we have two words with OR in between. The most basic implementation would again be a match query, since the match query by default uses OR when you supply multiple words. However, in our DSL the operands of OR can be plain terms as well as and-queries, and a term can itself be a quoted term. Therefore, to translate apple OR juice we need to create a bool query. Now look at the last example, where we use quotes. One would expect quotes to keep the words together; in Elasticsearch we use the match_phrase query to accomplish this.

As the current DSL is fairly simple, creating the queries is not that hard. But a lot more extensions are possible that make use of more advanced query options: using wildcards could result in fuzzy queries, using title:apple could search in one specific field, and using single quotes could mean an exact match, for which we would need the term query.

Now that you have an idea of the queries we need, let us have a look at the code and see the Antlr DSL in action.

Generate json queries

As mentioned in the introduction, we are going to use the visitor to walk the tree. Of course we need to create the tree first. Below is the code to create the tree.

static SearchdslParser.QueryContext createTreeFromString(String searchString) {
    CharStream charStream = CharStreams.fromString(searchString);
    SearchdslLexer lexer = new SearchdslLexer(charStream);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    SearchdslParser parser = new SearchdslParser(commonTokenStream);

    return parser.query();
}
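
To give an idea of how the tree and a visitor come together, here is a small usage sketch; SearchDslQueryVisitor is a made-up name for the String-producing visitor whose methods are shown below.

// Hypothetical usage; the actual visitor class name in the repository may differ.
SearchdslParser.QueryContext tree = createTreeFromString("apple AND raspberry OR juice");
String elasticsearchQuery = new SearchDslQueryVisitor().visit(tree);
System.out.println(elasticsearchQuery);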

As mentioned in the previous post, the parser and the visitor classes are generated by Antlr. Methods are generated for visiting the different nodes of the tree. Check the class SearchdslBaseVisitor for the methods you can override.

    To understand what happens, it is best to have a look at the tree itself. Below is an image of the tree that we are going to visit.

    Antlr4 parse tree

    We visit the tree from the top. The first method, or node, that we visit is the top level query. Below is the code of the visit method.

    @Override
    public String visitQuery(SearchdslParser.QueryContext ctx) {
        String query = visitChildren(ctx);
    
        return
                "{" +
                    "\"query\":" + query +
                "}";
    }
    

    Every visit method generates a string; for the query we simply visit all possible children and wrap the result in a JSON string with a query element. In the image we only see an orQuery child, but it could also be a term or an andQuery. By calling the visitChildren method we continue to walk the tree. The next step is visitOrQuery.

    @Override
    public String visitOrQuery(SearchdslParser.OrQueryContext ctx) {
        List<String> shouldQueries = ctx.orExpr().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", shouldQueries);
    
        return
                "{\"bool\": {" +
                        "\"should\": [" +
                            query +
                        "]" +
                "}}";
    }
    

    When creating an OR query we use the bool query with a should clause. Next we have to obtain the queries to include in the should clause. We obtain the orExpr items from the orQuery and for each orExpr we again call the visit method. This time we visit the orExpr node; this node does not contain important information for us, therefore we let the template method just call visitChildren. orExpr nodes can contain a term or an andQuery. Let us have a look at visiting the andQuery first.

    @Override
    public String visitAndQuery(SearchdslParser.AndQueryContext ctx) {
        List<String> mustQueries = ctx.term().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", mustQueries);
        
        return
                "{" +
                        "\"bool\": {" +
                            "\"must\": [" +
                                query +
                            "]" +
                        "}" +
                "}";
    }
    

    Notice how closely this resembles the orQuery; the big difference is that we now use the bool query with a must part. We are almost there. The next step is the term node. This node contains words to transform into a match query, or it contains a quotedTerm. The next code block shows the visit method of a term.

    @Override
    public String visitTerm(SearchdslParser.TermContext ctx) {
        if (ctx.quotedTerm() != null) {
            return visit(ctx.quotedTerm());
        }
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    
    private String obtainWords(List<TerminalNode> words) {
        if (words == null || words.isEmpty()) {
            return "";
        }
        List<String> foundWords = words.stream().map(TerminalNode::getText).collect(Collectors.toList());
        
        return String.join(" ", foundWords);
    }
    

    Notice we first check whether the term contains a quotedTerm. If it does not, we obtain the words and combine them into one string. The final step is to visit the quotedTerm node.

    @Override
    public String visitQuotedTerm(SearchdslParser.QuotedTermContext ctx) {
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match_phrase\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    

    Notice we turn this part into a match_phrase query; other than that it is almost the same as the term visitor. Finally we can generate the complete query.

    Example

    "multi search" && find && doit OR succeed && nothing

    {
      "query": {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  {
                    "match_phrase": {
                      "_all": "multi search"
                    }
                  },
                  {
                    "match": {
                      "_all": "find"
                    }
                  },
                  {
                    "match": {
                      "_all": "doit"
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "must": [
                  {
                    "match": {
                      "_all": "succeed"
                    }
                  },
                  {
                    "match": {
                      "_all": "nothing"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
    

    In the codebase on GitHub there is also a Jackson JsonNode based visitor, if you do not like the string based approach.

    That is about it. I am planning on extending the example further; if I add some interesting new concepts, I’ll get back to you with a part 3.


    Creating a search DSL

    As an (elastic)search expert, I regularly visit customers. For these customers I often do a short analysis of their search solution and I give advice about improvements they can make. It is always interesting to look at solutions customers come up with. At one of my most recent customers I noticed a search solution based on a very extensive search DSL (Domain Specific Language) created with Antlr. I knew about Antlr, but never thought about creating my own search DSL.

    To better understand the options of Antlr and to practice with creating my own DSL I started experimenting with it. In this blog post I’ll take you on my learning journey. I am going to create my own very basic search DSL.

    Specifying the DSL

    First we need to define the queries we would like our users to enter. Below are some examples:

    • tree – This is an easy one, just one word to search for
    • tree apple – Two words to look up
    • tree apple AND sell – Find matching content for tree apple, but also containing sell.
    • tree AND apple OR juice – Find matching content containing the terms tree and apple or containing the term juice.
    • “apple tree” OR juice – Find content having the terms apple and tree next to each other in the right order (Phrase query) or having the term juice.

    These are the combinations we need to support. In the next sections we set up our environment and I explain the basics of Antlr that you need to understand to follow along.

    Setting up Antlr for your project

    There are lots of resources about setting up your local Antlr environment; I personally learned most from tomassetti. I prefer to use Maven to gather the required dependencies, and I use the Maven Antlr plugin to generate the Java classes based on the lexer and grammar rules.

    I also installed Antlr using Homebrew, but you do not really need this for this blog post.

    You can find the project on Github: https://github.com/jettro/search-dsl

    I generally just load the Maven project into IntelliJ and get everything running from there. If you don’t want to use an IDE, you can also do this with pure Maven.

    proj_home #> mvn clean install
    proj_home #> mvn dependency:copy-dependencies
    proj_home #> java -classpath "target/search-dsl-1.0-SNAPSHOT.jar:target/dependency/*"  nl.gridshore.searchdsl.RunStep1
    

    Of course you can change the RunStep1 into one of the other three classes.

    Antlr introduction

    This blog post does not intend to explain all the ins and outs of Antlr, but there are a few things you need to know if you want to follow along with the code samples.

    • Lexer – A program that takes a phrase and obtains tokens from it. Examples of lexer rules are AND, consisting of the characters ‘AND’ or the special characters ‘&&’, and WORD, consisting of upper- or lowercase characters and numbers. Tokens coming out of a lexer contain the type of the token as well as the characters matched by that token.
    • Grammar – Rules that make use of the Lexer to create the syntax of your DSL. The result is a parser that creates a ParseTree out of your phrase. For example, we have a grammar rule query that parses a phrase like tree AND apple into the following ParseTree. The Grammar rule is: query : term (AND term)+ ;.
    • ParseTree – The tree created by Antlr from the provided phrase, using the grammar and lexer. Antlr also comes with a tool to create a visual representation of the tree; see an example below. In this blog post we write our own code to walk the tree, but there are two better alternatives: the classic Listener pattern and the Visitor pattern.
      Antlr4 parse tree 1
    • Listener – Antlr generates some parent classes to create your own listener. The idea behind a listener is that you receive events when a new element is started and when the element is finished. This resembles how, for instance, a SAX parser works.
    • Visitor – Antlr generates some parent classes to create your own Visitors. With a visitor you start visiting your top level element, then you visit the children, that way you recursively go down the tree. In a next blog post we’ll discuss the visitor pattern in depth.

    Search DSL Basics

    In this section we are going to create the DSL in four small steps. For each step we have a StepXLexerRules.g4 and a StepXSearchDsl.g4 file containing the Antlr lexer and grammar rules. Each step also contains a Java file with the name RunStepX.

    Step 1

    In this step we want to support queries like:

    • apple
    • apple juice
    • apple1 juice

    lexer
    WORD        : ([A-z]|[0-9])+ ;
    WS          : [ \t\r\n]+ -> skip ;
    
    grammar
    query       : WORD+ ;
    

    In all the Java examples we start the same way. I’ll explain this setup here and will not go into it again in the other steps.

    Lexer lexer = new Step1LexerRules(CharStreams.fromString("apple juice"));
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    
    Step1SearchDslParser parser = new Step1SearchDslParser(commonTokenStream);
    Step1SearchDslParser.QueryContext queryContext = parser.query();
    
    handleWordTokens(queryContext.WORD());
    

    First we create the Lexer, which is generated by Antlr. Its input is a stream of characters created using the class CharStreams. From the Lexer we obtain a stream of tokens, which is the input for the parser. The parser is also generated by Antlr. Using the parser we can obtain the queryContext. Notice the method query: it has the same name as the first grammar rule.
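
    The handleWordTokens method is not shown above; a minimal sketch that produces output in the format shown below could look like this (the implementation in the repository may differ slightly). It uses java.util.List and org.antlr.v4.runtime.tree.TerminalNode.

    // Hypothetical sketch of handleWordTokens: print the number of WORD tokens
    // followed by the matched text of each token.
    private void handleWordTokens(List<TerminalNode> words) {
        System.out.print("WORDS (" + words.size() + "): ");
        words.forEach(word -> System.out.print(word.getText() + ","));
        System.out.println();
    }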

    In this basic example a query consists of at least one WORD and a WORD consists of upper and lower case characters and numbers. The output for the first step is:

    Source: apple
    WORDS (1): apple,
    Source: apple juice
    WORDS (2): apple,juice,
    Source: apple1 juice
    WORDS (2): apple1,juice,
    

    In the next step we are extending the DSL with an option to keep words together.

    Step 2

    In the previous step you got the option to search for one or multiple words. In this step we are adding the option to keep some words together by surrounding them with quotes. We add the following lines to the lexer and grammar.

    lexer
    QUOTE   : ["];
    
    grammar
    query               : term ;
    
    term                : WORD+|quotedTerm;
    quotedTerm          : QUOTE WORD+ QUOTE ;
    

    Now we can support queries like

    • apple
    • “apple juice”

    The addition to the lexer is QUOTE; the grammar becomes slightly more complex. The query now is a term, and a term is either multiple WORDs or a quotedTerm: multiple WORDs surrounded by QUOTEs. In Java we have to check, on the termContext obtained from the queryContext, whether the term contains WORDs or a quotedTerm. That is what is shown in the next code block.

    Step2SearchDslParser.TermContext termContext = queryContext.term();
    handleTermOrQuotedTerm(termContext);
    
    private void handleTermOrQuotedTerm(Step2SearchDslParser.TermContext termContext) {
        if (null != termContext.quotedTerm()) {
            handleQuotedTerm(termContext.quotedTerm());
        } else {
            handleWordTokens(termContext.WORD());
        }
    }
    
    private void handleQuotedTerm(Step2SearchDslParser.QuotedTermContext quotedTermContext) {
        System.out.print("QUOTED ");
        handleWordTokens(quotedTermContext.WORD());
    }
    

    Notice how we determine whether the termContext contains a quotedTerm simply by checking whether quotedTerm() returns null. The output then becomes:

    Source: apple
    WORDS (1): apple,
    Source: "apple juice"
    QUOTED WORDS (2): apple,juice,
    

    Time to take the next step: this time we make it possible to explicitly query for one term or the other.

    Step 3

    In this step we make it optional for a term to match, as long as at least one of the other terms matches. Example queries are:

    • apple
    • apple OR juice
    • “apple juice” OR applejuice

    The change to the Lexer is just one new token type, OR. The grammar has to change as well: the query now supports either a term or an orQuery. An orQuery consists of a term followed, at least once, by OR and another term.

    lexer
    OR      : 'OR' | '||' ;
    
    grammar
    query   : term | orQuery ;
    orQuery : term (OR term)+ ;
    

    The handling in Java is straightforward now: again some null checks and handler methods.

    if (queryContext.orQuery() != null) {
        handleOrContext(queryContext.orQuery());
    } else {
        handleTermContext(queryContext.term());
    }
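
    The handleOrContext method is not shown here; a hypothetical sketch could look like the method below, given that the generated OrQueryContext exposes the matched terms via term().

    // Hypothetical sketch of handleOrContext: print a header and handle every term of the OR.
    private void handleOrContext(Step3SearchDslParser.OrQueryContext orQueryContext) {
        System.out.println("Or query: ");
        orQueryContext.term().forEach(this::handleTermContext);
    }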
    

    The output of the program then becomes:

    Source: apple
    WORDS (1): apple,
    Source: apple OR juice
    Or query: 
    WORDS (1): apple,
    WORDS (1): juice,
    Source: "apple juice" OR applejuice
    Or query: 
    QUOTED WORDS (2): apple,juice,
    WORDS (1): applejuice,
    

    In the final step we want to make the OR complete by also adding an AND.

    Step 4

    In the final step for this blog we are going to introduce AND. With AND in the mix we can make more complicated combinations. What would you make of one AND two OR three OR four AND five? In my DSL the AND binds more strongly than the OR, so this becomes (one AND two) OR three OR (four AND five). A document matches if it contains one and two, or four and five, or three. The Lexer changes only a little: again we just add a token type, this time for AND. The grammar has to introduce some new rules, so it is good to have an overview of the complete grammar.

    query               : term | orQuery | andQuery ;
    
    orQuery             : orExpr (OR orExpr)+ ;
    orExpr              : term|andQuery;
    
    andQuery            : term (AND term)+ ;
    term                : WORD+|quotedTerm;
    quotedTerm          : QUOTE WORD+ QUOTE ;
    

    As you can see, we introduced an orExpr, which is either a term or an andQuery. We changed the orQuery to be an orExpr followed by at least one combination of OR and another orExpr. The query now is a term, an orQuery or an andQuery. Some examples below.

    • apple
    • apple OR juice
    • apple raspberry OR juice
    • apple AND raspberry AND juice OR Cola
    • “apple juice” OR applejuice

    The Java code becomes a bit repetitive by now, so let us move to the output of the program right after a quick sketch of the top level dispatch.
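
    The sketch below is hypothetical, assuming handleAndContext and handleTermContext follow the same pattern as the earlier handlers:

    // Hypothetical sketch of the step 4 dispatch over the three query alternatives.
    if (queryContext.orQuery() != null) {
        handleOrContext(queryContext.orQuery());
    } else if (queryContext.andQuery() != null) {
        handleAndContext(queryContext.andQuery());
    } else {
        handleTermContext(queryContext.term());
    }


    With that in place, the output of the program becomes: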

    Source: apple
    WORDS (1): apple,
    Source: apple OR juice
    Or query: 
    WORDS (1): apple,
    WORDS (1): juice,
    Source: apple raspberry OR juice
    Or query: 
    WORDS (2): apple,raspberry,
    WORDS (1): juice,
    Source: apple AND raspberry AND juice OR Cola
    Or query: 
    And Query: 
    WORDS (1): apple,
    WORDS (1): raspberry,
    WORDS (1): juice,
    WORDS (1): Cola,
    Source: "apple juice" OR applejuice
    Or query: 
    QUOTED WORDS (2): apple,juice,
    WORDS (1): applejuice,
    

    Concluding

    That is it for now. Of course this is not the most complicated search DSL, and you can most likely come up with other interesting constructs; the goal of this blog post was to get you underway. In the next blog post I intend to discuss and show how to create a visitor that builds a real elasticsearch query based on the DSL.


    Looking ahead: new field collapsing feature in Elasticsearch

    At Luminis Amsterdam, search is one of our main focus points. Because of that, we keep a close eye on upcoming features.

    Only a few weeks ago, I noticed that the following pull request (“Add field collapsing for search request”) was merged into the Elasticsearch code base, tagged for the 5.3/6.x release. This feature allows you to group your search results based on a specific key. In the past, this was only possible by using a combination of an ‘aggregation’ and ‘top hits’.

    Now a good question would be: ‘why would I want this?’ or ‘what is this grouping you are talking about?’. Imagine having a website where you sell Apple products: MacBooks, iPhones, iPads, and so on. Let’s say that, because of functional requirements, we have to create separate documents for each variant of each device (e.g. separate documents for iPad Air 2 32GB Silver, iPad Air 2 32GB Gold, etc.). When a user searches for the word ‘iPad’ without any result grouping, your users will see search results for all the iPads you are selling. This could mean that your result list looks like the following:

    1. iPad Air 2 32GB Pink
    2. iPad Air 2 128GB Pink
    3. iPad Air 2 32GB Space Grey
    4. iPad Air 2 128GB Space Grey
    5. ..
    6. ..
    7. ..
    8. ..
    9. ..
    10. iPad Pro 12.9 Inch 32GB Space Grey
    11. iPad case with happy colourful pictures on it.

    Now for the sake of this example, let’s say we only show 10 products per page. If our user was really looking for an iPad case, he wouldn’t see that product; instead, he would be shown a long list of ‘the same’ iPad. This is not really user-friendly. A better approach would be to group all the iPad Air 2 products into one result, so that it takes up only one spot in the search results list. You would then have to think of a visual presentation that notifies the user that there are more variants of that same product.

    As mentioned before, grouping of results was already possible in older versions of Elasticsearch, but the downside of the old approach was that it uses a lot of memory when computed on big data sets, and paginating the results was not (really) possible. An example:

    GET shop/_search
    {
      "size": 0,
      "query": {
        "match": {
          "title": "iPad"
        }
      },
      "aggs": {
        "collapse_by_id": {
          "terms": {
            "field": "family_id",
            "size": 10,
            "order": {
              "max_score": "desc"
            }
          },
          "aggs": {
            "max_score": {
              "max": {
                "script": "_score"
              }
            },
            "top_hits_for_family": {
              "top_hits": {
                "size": 3
              }
            }
          }
        }
      }
    }
    
    • We perform a Terms aggregation on the family_id, which results in the grouping we want. Next, we can use top_hits to get the documents belonging to that family.

    All seems well. Now let’s say we have a website where users view 10 products per page. For users to go to the next page, we would have to execute the same query, increase the number of term buckets to 20 and throw away the first 10 results. Aggregations take quite some processing power, so constantly aggregating over the complete set will not perform well on a big data set. Another way would be to skip the first page of results by executing the query for page 2 combined with a filter that excludes the families already shown. All in all, this would be a lot of extra work to achieve a field collapsing feature.

    Now that Elasticsearch has added the field collapsing feature, this becomes a lot easier. You can download my gist with some setup if you want to play along with the example. The gist contains some settings/mappings, test data and the queries which I will be showing you in a minute.

    Alongside the query, aggregations, suggestions, sorting/pagination options and so on, Elasticsearch has added a new ‘collapse’ section:

    GET shop/_search
    {
      "query": {
        "match": {
          "title": "Ipad"
        }
      },
      "collapse": {
        "field": "family_id"
      }
    }
    

    The simplest version of collapse only takes a field name on which to form the grouping. If we execute this query, it will generate the following result:

    "hits": {
        "total": 6,
        "max_score": null,
        "hits": [
          {
            "_index": "shop",
            "_type": "product",
            "_id": "5",
            "_score": 0.078307986,
            "_source": {
              "title": "iPad Pro ipad",
              "colour": "Space Grey",
              "brand": "Apple",
              "size": "128gb",
              "price": 899,
              "family_id": "apple-5678"
            },
            "fields": {
              "family_id": [
                "apple-5678"
              ]
            }
          },
          {
            "_index": "shop",
            "_type": "product",
            "_id": "1",
            "_score": 0.05406233,
            "_source": {
              "title": "iPad Air 2",
              "colour": "Silver",
              "brand": "Apple",
              "size": "32gb",
              "price": 399,
              "family_id": "apple-1234"
            },
            "fields": {
              "family_id": [
                "apple-1234"
              ]
            }
          }
        ]
      }
    

    Notice the total in the query response, which shows the total number of documents that matched the query. Our hits only contain 2 results, but if we look at the ‘fields’ section of the result, we can see our two unique family_ids. The best matching document for each family_id is returned in the search results.

    It is also possible to retrieve the documents directly for each family_id by adding an inner_hits block inside collapse:

    GET shop/_search
    {
      "query": {
        "match": {
          "title": "iPad"
        }
      },
      "collapse": {
        "field": "family_id",
        "inner_hits": {
          "name": "collapsed_by_family_id",
          "from": 1,
          "size": 2
        }
      }
    }
    
    • You can use ‘from: 1’ to exclude the first hit in each family, since it is already returned as the representative hit for that family

    Which results in:

    "hits": {
        "total": 6,
        "max_score": null,
        "hits": [
          {
            "_index": "shop",
            "_type": "product",
            "_id": "5",
            "_score": 0.078307986,
            "_source": {
              "title": "iPad Pro ipad",
              "colour": "Space Grey",
              "brand": "Apple",
              "size": "128gb",
              "price": 899,
              "family_id": "apple-5678"
            },
            "fields": {
              "family_id": [
                "apple-5678"
              ]
            },
            "inner_hits": {
              "collapsed_family_id": {
                "hits": {
                  "total": 2,
                  "max_score": 0.078307986,
                  "hits": [
                    {
                      "_index": "shop",
                      "_type": "product",
                      "_id": "6",
                      "_score": 0.066075005,
                      "_source": {
                        "title": "iPad Pro",
                        "colour": "Space Grey",
                        "brand": "Apple",
                        "size": "256gb",
                        "price": 999,
                        "family_id": "apple-5678"
                      }
                    }
                  ]
                }
              }
            }
          },
          {
            "_index": "shop",
            "_type": "product",
            "_id": "1",
            "_score": 0.05406233,
            "_source": {
              "title": "iPad Air 2",
              "colour": "Silver",
              "brand": "Apple",
              "size": "32gb",
              "price": 399,
              "family_id": "apple-1234"
            },
            "fields": {
              "family_id": [
                "apple-1234"
              ]
            },
            "inner_hits": {
              "collapsed_family_id": {
                "hits": {
                  "total": 4,
                  "max_score": 0.05406233,
                  "hits": [
                    {
                      "_index": "shop",
                      "_type": "product",
                      "_id": "2",
                      "_score": 0.05406233,
                      "_source": {
                        "title": "iPad Air 2",
                        "colour": "Gold",
                        "brand": "Apple",
                        "size": "32gb",
                        "price": 399,
                        "family_id": "apple-1234"
                      }
                    },
                    {
                      "_index": "shop",
                      "_type": "product",
                      "_id": "3",
                      "_score": 0.05406233,
                      "_source": {
                        "title": "iPad Air 2",
                        "colour": "Space Grey",
                        "brand": "Apple",
                        "size": "32gb",
                        "price": 399,
                        "family_id": "apple-1234"
                      }
                    }
                  ]
                }
              }
            }
          }
        ]
      }
    

    Paging was an issue with the old approach, but since documents are now grouped inside the search results, paging works out of the box, the same way it does for normal queries and with the same limitations.

    A lot of people in the community have been waiting for this feature and I’m excited that it has finally arrived. You can play around with the data set and try some more ‘collapsing’ (e.g. by colour, brand, size, etc.). I hope this gave you a small overview of what’s to come in the upcoming 5.3/6.x release.


    Creating an elasticsearch plugin, the basics

    Elasticsearch is a search solution based on Lucene. It comes with a lot of features to enrich the search experience, some of which have been recognised as very useful in the analytics scene as well. Interacting with elasticsearch mainly takes place through its REST endpoints. You can do everything using the different available endpoints: create new indexes, insert documents, search for documents and lots of other things. Still, some things are not available out of the box. If you need an analyser that is not available by default, you can install it as a plugin. If you need security, you can install a plugin. If you need alerting, you can install it as a plugin. I guess you get the idea by now. The plugin extension option is nice, but it might be a bit hard to get started with. Therefore, in this blog post I am going to write a few plugins. I’ll point you to some of the resources I used to get them running and I want to give you some inspiration for your own ideas for cool plugins that extend the elasticsearch functionality.

    Bit of history

    In the releases prior to version 5 there were two types of plugins: site plugins and Java plugins. Site plugins were used extensively; some well known examples are Head, HQ and Kopf. Kibana and Marvel also started out as site plugins. It was a nice feature, however not the core of elasticsearch. Therefore the elastic team deprecated site plugins in 2.3 and removed the support in 5.0.

    How does it work

    The default elasticsearch installation already provides a script to install plugins. You can find it in the bin folder. You can install plugins from repositories but also from a local path. A plugin comes in the form of a jar file.

    Plugins need to be installed on every node of the cluster. Installation is as simple as the following command.

    bin/elasticsearch-plugin install file:///path/to/elastic-basic-plugin-5.1.2-1-SNAPSHOT.zip
    

    In this case we install the plugin from our own hard drive. The plugins have a dependency on the elastic core and therefore need to have the exact same version as the elastic version you are using. So for each elasticsearch release you have to create a new version of the plugin. In the example I have created the plugin for elasticsearch 5.1.2.

    Start with our own plugin

    Elastic uses Gradle internally to build the project, but I still prefer Maven over Gradle. Luckily David Pilato wrote a good blog post about creating the Maven project, so I am not going to repeat all of his steps. Feel free to take a peek at the pom.xml I used in my plugin.

    Create BasicPlugin that does nothing

    The first step in the plugin is to create a class that starts the plugin. Below is the class, which has just one piece of functionality: it prints a statement in the log that the plugin has been installed.

    public class BasicPlugin extends Plugin {
        private final static Logger LOGGER = LogManager.getLogger(BasicPlugin.class);
        public BasicPlugin() {
            super();
            LOGGER.warn("Create the Basic Plugin and installed it into elasticsearch");
        }
    }
    

    The next step is to configure the plugin as described by David Pilato in the blog post I mentioned before. We need to add the Maven assembly plugin using the file src/main/assemblies/plugin.xml. In this file we refer to another very important file, src/main/resources/plugin-descriptor.properties. With all this in place we can run Maven to package the plugin:

    mvn clean package -DskipTests
    

    In the folder target/releases you’ll now find the file elastic-basic-plugin-5.1.2-1-SNAPSHOT.zip, which is a jar file in disguise; we could change the extension to jar and there would be no difference. Now use the command from above to install it. If you get a message that the plugin is already there, you need to remove it first:

    bin/elasticsearch-plugin remove elastic-basic-plugin
    

    After installing the plugin you’ll find the following line in the elasticsearch log on startup:

    [2017-01-31T13:42:01,629][WARN ][n.g.e.p.b.BasicPlugin    ] Create the Basic Plugin and installed it into elasticsearch
    

    This is of course a bit silly, so let us create a new REST endpoint that checks whether the elasticsearch cluster contains an index called jettro.

    Create a new REST endpoint

    The inspiration for this endpoint came from another blog post by David Pilato: Creating a new rest endpoint.

    When creating a new endpoint you have to extend the class org.elasticsearch.rest.BaseRestHandler. But before we go there, we first register it in our plugin. To do that we implement the interface org.elasticsearch.plugins.ActionPlugin and its method getRestHandlers.

    public class BasicPlugin extends Plugin implements ActionPlugin {
        private final static Logger LOGGER = LogManager.getLogger(BasicPlugin.class);
        public BasicPlugin() {
            super();
            LOGGER.warn("Create the Basic Plugin and installed it into elasticsearch");
        }
    
        @Override
        public List<Class<? extends RestHandler>> getRestHandlers() {
            return Collections.singletonList(JettroRestAction.class);
        }
    }
    

    Next is implementing the JettroRestAction class. Below is the first part: the constructor and the method that handles the request. In the constructor we register the URL patterns that this endpoint supports; they should be clear from the code. Functionality wise: if you call the endpoint without an action, or with an action that does not exist, we return a message; if you ask for existence we return true or false. This handling is done in the prepareRequest method.

    public class JettroRestAction extends BaseRestHandler {
    
        @Inject
        public JettroRestAction(Settings settings, RestController controller) {
            super(settings);
            controller.registerHandler(GET, "_jettro/{action}", this);
            controller.registerHandler(GET, "_jettro", this);
        }
    
        @Override
        protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) throws IOException {
            String action = request.param("action");
            if (action != null && "exists".equals(action)) {
                return createExistsResponse(request, client);
            } else {
                return createMessageResponse(request);
            }
        }
    }
    

    We have two utility classes that transform data into XContent: Message and Exists. The implementations of the two methods createExistsResponse and createMessageResponse can be found here.
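
    Just to give an idea of the shape of such a method, below is a hypothetical sketch of createMessageResponse; the message text and the exact shape of the Message helper are assumptions, the real implementation is in the repository linked above.

    // Hypothetical sketch only. It assumes Message is a small ToXContent-style helper
    // with a toXContent(XContentBuilder, Params) method.
    private RestChannelConsumer createMessageResponse(RestRequest request) {
        Message message = new Message("Use the action 'exists' to check for the index jettro");
        return channel -> {
            XContentBuilder builder = channel.newBuilder();
            message.toXContent(builder, request);
            channel.sendResponse(new BytesRestResponse(RestStatus.OK, builder));
        };
    }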

    Time to re-install the plugin: first build it with Maven, remove the old version and install the new one. Now we can test it in a browser or with curl. I personally use httpie to do the following requests:

    (screenshot: httpie GET requests to the _jettro and _jettro/exists endpoints)

    This way we can create our own custom endpoint. Next we dive a little bit deeper into the heart of elastic. We are going to create a custom filter that can be used in an analyser.

    Create a custom Filter

    The first part is registering the filter in the BasicPlugin class. We need to implement the interface org.elasticsearch.plugins.AnalysisPlugin and override the method getTokenFilters. We register a factory class that instantiates the filter class. The registration is done using a name that can later be used to refer to the filter. The method looks like this:

        @Override
        public Map<String, AnalysisModule.AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
            return Collections.singletonMap("jettro", JettroTokenFilterFactory::new);
        }
    

    The implementation of the factory is fairly basic

    public class JettroTokenFilterFactory extends AbstractTokenFilterFactory {
        public JettroTokenFilterFactory(IndexSettings indexSettings, 
                                        Environment environment, 
                                        String name, 
                                        Settings settings) {
            super(indexSettings, name, settings);
        }
    
        @Override
        public TokenStream create(TokenStream tokenStream) {
            return new JettroOnlyTokenFilter(tokenStream);
        }
    }
    

    The filter we are going to create has somewhat strange functionality: it only accepts tokens that are exactly jettro. All other tokens are removed.

    public class JettroOnlyTokenFilter extends FilteringTokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    
        public JettroOnlyTokenFilter(TokenStream in) {
            super(in);
        }
    
        @Override
        protected boolean accept() throws IOException {
            return termAtt.toString().equals("jettro");
        }
    }
    

    Time to test my freshly created filter. We can do that using the _analyze endpoint:

    curl -XGET 'localhost:9200/_analyze' -d '
    {
      "tokenizer" : "standard",
      "filter" : ["jettro"],
      "text" : "this is a test for jettro"
    }'
    

    The response now is

    {"tokens":[{"token":"jettro","start_offset":19,"end_offset":25,"type":"","position":5}]}
    

    Concluding

    That is it: we have created the foundations for a plugin (thanks to David Pilato), we have written our own _jettro endpoint and we have created a filter that only accepts one specific word, jettro. OK, I agree the plugin in itself is not very useful, however the construction of the plugin is reusable. I hope you liked it; stay tuned for more elastic plugin blogs. We’re working on an extension to the synonyms plugin and have some ideas for other plugins.


    Elasticsearch 5 is coming, what is new and improved?

    The guys at elastic have been working on the new 5.0 release of elasticsearch and of all the other products in their stack as well. From the first alpha release I have been playing around with the new features and wrote some blog posts about the ones I tried. With release candidate 1 out, it is time to write a bit about the new features that I like, and the (breaking) changes that I feel are important. Since it is a big release I need a big blog post, so don’t say I did not warn you.

    (more…)


    Using the new elasticsearch 5 percolator

    In the upcoming version 5 of elasticsearch the implementation of the percolator has changed a lot. The percolator has moved from being a separate endpoint and API to being part of the search API: in the new version you execute a percolator query. A big advantage is that you can now use everything in a percolator query that you can already use in all other queries. In this blog post I am going to show how to use the new percolator by building a very basic news notification service.

    (more…)


    Upgrade your elasticsearch 1.x cluster to 5.x

    Today I was reading about breaking changes in the upcoming 5.0 release of elasticsearch. I read that indexes created in 1.x cannot be migrated to 5.x; even if they were created in 1.x and already migrated to 2.x, they cannot run in 5.x. Luckily there is a migration plugin for 2.4.x. To be a little bit prepared, I wanted to try it out for myself.

    (more…)