


Being relevant with Solr

Ok, I have to admit, I am not (yet) a Solr expert. However, I have been working with elasticsearch for years, and the fundamentals of obtaining relevant results are the same for Solr and Elasticsearch. For a customer, I am fine-tuning search results on an outdated Solr version. They are on Solr 4.8.1; yes, an upgrade is planned for the future. Still, they want to improve their search results now. Using my search knowledge I started getting into Solr, and I liked what I saw: a query with matching algorithms, filters to limit the documents that need to be considered, and on top of that lots of boosting options. So many boosting options that I had to experiment a lot to get to the right results.

In this blog post, I am going to explain what I did with Solr coming from an elasticsearch background. I do not intend to create a complete guideline on how to use Solr. I’ll focus on the bits that surprised me and the stuff that deals with tuning the edismax type of query.

A little bit of context

Imagine you are running a successful eCommerce website. Or even better, you have created a superb online shopping experience. With an excellent marketing strategy, you are generating lots of traffic to your website. But, yes there is a but, sales are not as expected. Maybe they are even a bit disappointing. You start evaluating the visits to your site using analytics. When going through your search analytics you notice that most of the executed searches do not result in a click and therefore not in sales. It looks like the results shown to your visitors for their searches are far from optimal.

So we need better search results.

Getting to know your data

Before we start writing queries, we need to have an idea about our data. We need to create inverted indexes per field, or for combinations of fields, using different analyzers. We need to be able to search for parts of sentences, but also be able to boost matching sentences. We want to be able to find matches across combinations of fields. Imagine you sell books and you want people to be able to look for the latest book by Dan Brown, called Origin. Users might enter a search query like Dan Brown Origin. It might become a challenge if you have structured data like:

{
    "author": "Dan Brown",
    "title": "Origin"
}

How would you do it if people want the latest Dan Brown? What if you want to help people choose by using the popularity of books, based on ratings or sales? Or what if people want to look at all the books in the Harry Potter series? Of course, we need to have the right data to be able to facilitate our visitors with these new requirements. We also need a media_type field later on. With the media type, we can filter on all eBooks, for example. So the data becomes something like the following block.

{
    "id": "9781400079162",
    "author": "Dan Brown",
    "title": "Origin",
    "release_date": "2017-10-08T00:00:00Z",
    "rating": 4.5,
    "sales_pastyear": 239,
    "media_type": "ebook"
}

Ranking requirements

Based on analysis and domain knowledge we have the following thoughts translated into requirements for the ranking of search results:

  • Recent books are more important than older books
  • Books with a higher rating are more important than lower-rated books
  • Unrated books are more important than low-rated books
  • Books that are sold more often in the past year are more important than unsold books
  • Normal text matching rules should be applied

Mapping data to our index

In Solr, you create a schema.xml to map the expected data to specific types. You can also use the copyField functionality to create new fields that are analyzed differently or are a combination of the provided fields. An example could be a field that contains all other searchable fields. In our case, we could create a field containing the author as well as the title. This field is analyzed in the most optimal way for matching: we add a tokenizer, but also filters for lowercasing, stop words, diacritics, and compound words. We also have fields that are meant more for boosting, using phrases and numbers or dates. We want fields like title and author to support phrases but also full matches; a rough schema sketch follows the list below. With this, we get a few extra search requirements:

  • Documents of which the exact author or title matches the query should be more important
  • Documents of which the title contains the words in the query in the same order are more important
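
To make this a bit more concrete, a schema.xml fragment for the combined field could look like the sketch below. This is a minimal sketch; the field and type names are assumptions, not taken from the actual project.

<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="combined_author_title" type="text_general" indexed="true" stored="false" multiValued="true"/>

<copyField source="author" dest="combined_author_title"/>
<copyField source="title" dest="combined_author_title"/>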

With these rules, we can start to create a query and apply our matching and boosting requirements.

The Query

Creating the query was my biggest surprise when moving to Solr. Another configuration mechanism is the solrconfig.xml. This file configures the Solr node. It gives you the option to create your own endpoint for a query that comes with lots of defaults. One thing we can do, for instance, is create an endpoint that automatically filters on ebooks; calling this endpoint will only ever search ebooks. Below you’ll find a sample of the config that does just that.

<requestHandler name="/ebook" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str> <!-- Return the response as JSON -->
    <str name="fq">media_type:ebook</str> <!-- Filter on all items of media_type ebook -->
    <str name="qf">combined_author_title</str> <!-- Search in the field combined_author_title -->
  </lst>
</requestHandler>

For our own query, we’ll need other options that Solr provides, in the form of the edismax query parser. This comes by default with options to boost your results using phrases, but also for boosting on ratings, release dates, etc. Below is an image giving you an idea of what the query should do.

Next, I’ll show you how this translates into the Solr configuration.

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="ps">2</str>
    <str name="mm">3</str>
    <str name="pf">author^4 title^2</str>
    <str name="pf2">author^4 title^2</str>
    <str name="pf3">author^4 title^2</str>
    <str name="bq">author_full^5 title_full^5</str>
    <str name="boost">product(div(def(rating,4),4),recip(ms(NOW/DAY,releasedate),3.16e-11,1,1),log(product(5,sales_pastyear)))</str>
    <str name="qf">combined_author_title</str>
    <str name="defType">edismax</str>
    <str name="lowercaseOperators">false</str>
  </lst>
</requestHandler>

I am not going over all the different parameters. For multi-term queries we use phrases; these are configured with pf, pf2, and pf3. The mm parameter is also used for multi-term queries: it controls the number of terms that have to match. So if you use three terms, they all have to match. The edismax query also supports using AND/OR when you need more control over which terms to match. With lowercaseOperators set to false we prevent a lowercase and/or from being interpreted as boolean operators.

With respect to boosting there is bq, whose scores are added to the overall score. With the boost field, we do a multiplication. Look also at the diagram. Also notice that bq works with text-related scores, while boost works with numeric scores.
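
For reference, a request against this handler could then be as simple as the example below (core name and search terms are just an example). Adding debugQuery=true shows how each boost contributes to the final score, which helped me a lot while experimenting.

http://localhost:8983/solr/books/select?q=dan+brown+origin&debugQuery=true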

That is about it for now. I think it is good to look at the differences between Solr and Elasticsearch. I like the idea of creating a query endpoint with Solr. Of course, you can do the same with Elasticsearch; the JSON API for creating a query is really flexible, but you have to create the constructs Solr gives you yourself.


Elasticsearch instances for integration testing

In my latest project I have implemented all communication with my Elasticsearch cluster using the high-level REST client. My next step was to automatically set up and tear down an Elasticsearch instance in order to facilitate proper integration testing. This article describes three different ways of doing so and discusses some of the pros and cons. Please refer to this repository for implementations of all three methods.

docker-maven-plugin

This generic Docker plugin allows you to bind the starting and stopping of Docker containers to Maven lifecycle phases. You specify two blocks within the plugin: configuration and executions. In the configuration block, you choose the image that you want to run (Elasticsearch 6.5.3 in this case), the ports that you want to expose, a health check and any environment variables. See the snippet below for a complete example:

<plugin>
    <groupId>io.fabric8</groupId>
    <artifactId>docker-maven-plugin</artifactId>
    <version>${version.io.fabric8.docker-maven-plugin}</version>
    <configuration>
        <imagePullPolicy>always</imagePullPolicy>
        <images>
            <image>
                <alias>docker-elasticsearch-integration-test</alias>
                <name>docker.elastic.co/elasticsearch/elasticsearch:6.5.3</name>
                <run>
                    <namingStrategy>alias</namingStrategy>
                    <ports>
                        <port>9299:9200</port>
                        <port>9399:9300</port>
                    </ports>
                    <env>
                        <cluster.name>integration-test-cluster</cluster.name>
                    </env>
                    <wait>
                        <http>
                            <url>http://localhost:9299</url>
                            <method>GET</method>
                            <status>200</status>
                        </http>
                        <time>60000</time>
                    </wait>
                </run>
            </image>
        </images>
    </configuration>
    <executions>
        <execution>
            <id>docker:start</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>start</goal>
            </goals>
        </execution>
        <execution>
            <id>docker:stop</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

You can see that I’ve bound the plugin to the pre- and post-integration-test lifecycle phases. By doing so, the Elasticsearch container will be started just before any integration tests are run and will be stopped after the integration tests have finished. I’ve used the maven-failsafe-plugin to trigger the execution of tests ending with *IT.java in the integration-test lifecycle phase.
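
For completeness, the failsafe configuration itself is minimal. Something along these lines should do; the version property name is an assumption, and by default the plugin picks up classes ending in *IT.java:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-failsafe-plugin</artifactId>
    <version>${version.maven-failsafe-plugin}</version>
    <executions>
        <execution>
            <goals>
                <goal>integration-test</goal>
                <goal>verify</goal>
            </goals>
        </execution>
    </executions>
</plugin>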

Since this is a generic Docker plugin, there is no special functionality to easily install Elasticsearch plugins that may be needed during your integration tests. You could, however, create your own image with the required plugins and pull that image during your integration tests.
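
As an illustration, such a custom image could be built from a Dockerfile along these lines (the analysis-icu plugin is just an example of a plugin you might need):

FROM docker.elastic.co/elasticsearch/elasticsearch:6.5.3
RUN bin/elasticsearch-plugin install --batch analysis-icu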

The integration with IntelliJ is also not optimal. When running an *IT.java class, IntelliJ will not trigger the correct lifecycle phases and will attempt to run your integration test without creating the required Docker container. Before running an integration test from IntelliJ, you need to manually start the container from the “Maven projects” view by running the docker:start command:

Maven Projects view in IntelliJ

After running, you will also need to run the docker:stop command to kill the container that is still running. If you forget to kill the running container and want to run a mvn clean install later on, it will fail, since the build will attempt to create a container on the same port – as far as I know, the plugin does not allow for random ports to be chosen.

Pros:

  • Little setup, only requires configuration of one Maven plugin

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • No out of the box functionality to install extra Elasticsearch plugins
  • Extra dependency in your build pipeline (Docker)
  • IntelliJ does not trigger the correct lifecycle phases

elasticsearch-maven-plugin

This second plugin does not require Docker and only needs some Maven configuration to get started. See the snippet below for a complete example:

<plugin>
    <groupId>com.github.alexcojocaru</groupId>
    <artifactId>elasticsearch-maven-plugin</artifactId>
    <version>${version.com.github.alexcojocaru.elasticsearch-maven-plugin}</version>
    <configuration>
        <version>6.5.3</version>
        <clusterName>integration-test-cluster</clusterName>
        <transportPort>9399</transportPort>
        <httpPort>9299</httpPort>
    </configuration>
    <executions>
        <execution>
            <id>start-elasticsearch</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>runforked</goal>
            </goals>
        </execution>
        <execution>
            <id>stop-elasticsearch</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Again, I’ve bound the plugin to the pre- and post-integration-test lifecycle phases in combination with the maven-failsafe-plugin.

This plugin provides a way of starting the Elasticsearch instance from IntelliJ in much the same way as the docker-maven-plugin: you can run the elasticsearch:runforked command from the “Maven projects” view. However, in my case this started the container and then immediately exited. There is also no out-of-the-box possibility of setting a random port for your instance. There are solutions to this, but at the expense of a somewhat more complex Maven configuration.

Overall, this is a plugin that seems to provide almost everything we need with a lot of configuration options. You can automatically install Elasticsearch plugins or even bootstrap your instance with data.

In practice I did have some problems using the plugin in my build pipeline. The build would sometimes fail upon downloading the Elasticsearch ZIP, or in other cases when attempting to download a plugin. Your mileage may vary, but this was reason for me to keep looking for another solution. Which brings me to plugin number three.

Pros:

  • Little setup, only requires configuration of one Maven plugin
  • No extra external dependencies
  • High amount of configuration possible

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • Poor integration with IntelliJ
  • Seems unstable

testcontainers-elasticsearch

This third option is different from the other two. It uses a Java test container that you configure through Java code. This gives you a lot of flexibility and requires no Maven configuration. Since there is no Maven configuration, it does require some work to make sure the Elasticsearch container is started and stopped at the correct moments.

In order to realize this, I have extended the standard SpringJUnit4ClassRunner class with my own ElasticsearchSpringRunner. In this runner, I have added a new JUnit RunListener named JUnitExecutionListener. This listener defines two methods, testRunStarted and testRunFinished, that enable me to start and stop the Elasticsearch container at the same points in time as the pre- and post-integration-test Maven lifecycle phases would. See the snippet below for the implementation of the listener:

public class JUnitExecutionListener extends RunListener {

    private static final Logger log = LoggerFactory.getLogger(JUnitExecutionListener.class);

    private static final String ELASTICSEARCH_IMAGE = "docker.elastic.co/elasticsearch/elasticsearch";
    private static final String ELASTICSEARCH_VERSION = "6.5.3";
    private static final String ELASTICSEARCH_HOST_PROPERTY = "spring.elasticsearch.rest.uris";
    private static final int ELASTICSEARCH_PORT = 9200;

    private ElasticsearchContainer container;
    private RunNotifier notifier;

    public JUnitExecutionListener(RunNotifier notifier) {
        this.notifier = notifier;
    }

    @Override
    public void testRunStarted(Description description) {
        try {
            if (System.getProperty(ELASTICSEARCH_HOST_PROPERTY) == null) {
                log.debug("Create Elasticsearch container");
                int mappedPort = createContainer();
                System.setProperty(ELASTICSEARCH_HOST_PROPERTY, "localhost:" + mappedPort);
                String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
                RestAssured.basePath = "";
                RestAssured.baseURI = "http://" + host.split(":")[0];
                RestAssured.port = Integer.parseInt(host.split(":")[1]);
                log.debug("Created Elasticsearch container at {}", host);
            }
        } catch (Exception e) {
            notifier.pleaseStop();
            throw e;
        }
    }

    @Override
    public void testRunFinished(Result result) {
        if (container != null) {
            String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
            log.debug("Removing Elasticsearch container at {}", host);
            container.stop();
        }
    }

    private int createContainer() {
        container = new ElasticsearchContainer();
        container.withBaseUrl(ELASTICSEARCH_IMAGE);
        container.withVersion(ELASTICSEARCH_VERSION);
        container.withEnv("cluster.name", "integration-test-cluster");
        container.start();
        return container.getMappedPort(ELASTICSEARCH_PORT);
    }
}

It will create an Elasticsearch Docker container on a random port for use by the integration tests. The best thing about this runner is that it works perfectly fine in IntelliJ. Simply right-click and run your *IT.java classes annotated with @RunWith(ElasticsearchSpringRunner.class) and IntelliJ will use the listener to set up the Elasticsearch container. This allows you to automate your build pipeline while still keeping developers happy.
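
As an illustration, a hypothetical integration test class could then look like the sketch below; the class name and the health check are my own example, and the Spring annotations depend on your project setup.

@RunWith(ElasticsearchSpringRunner.class)
@SpringBootTest
public class BookSearchIT {

    @Test
    public void clusterShouldBeReachable() {
        // By the time this test runs, the listener has started the container and
        // configured RestAssured to point at the randomly mapped Elasticsearch port.
        RestAssured.get("/_cluster/health").then().statusCode(200);
    }
}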

Pros:

  • Neat integration with both Java and therefore your IDE
  • Sufficient configuration options out of the box

Cons:

  • More complex initial setup
  • Extra dependency in your build pipeline (Docker)

In summary, all three of the above plugins are able to realize the goal of starting an Elasticsearch instance for your integration testing. For me personally, I will be using the testcontainers-elasticsearch plugin going forward. The extra Docker dependency is not a problem since I use Docker in most of my build pipelines anyway. Furthermore, the integration with Java allows me to configure things in such a way that it works perfectly fine from both the command line and the IDE.

Feel free to check out the code behind this article, play around with the integration tests that I’ve set up there and decide for yourself which plugin suits your needs best.


Setting up data analytics pipeline: the best practices

Figure: data pipeline architecture example (picture courtesy of https://bit.ly/2K44Nk5)

In a data science analogy with the automotive industry, data plays the role of the raw oil that is not yet ready for combustion. The data modeling phase is comparable to combustion in the engine, and data preparation is the refinery process turning raw oil into fuel, i.e., ready for combustion. In this analogy the data analytics pipeline includes all the steps from extracting the oil up to combustion, driving and reaching the destination (analogous to reaching the business goals). As you can imagine, the data (or oil in this analogy) goes through various transformations as it moves from one stage of the process to another. But the question is: what is the best practice in terms of data format and tooling? Although there are many tools and the best practice can be very use-case specific, generally JSON is the best practice for the data format of communication, the lingua franca, and Python is the best practice for orchestration, data preparation, analytics and live production.

What is the common inefficiency and why does it happen?

The common inefficiency is the overuse of tabular (CSV-like) data formats for communication, as the lingua franca. I believe data scientists still overuse structured data types for communication within the data analytics pipeline because of the standard data-frame-like formats offered by major analytic tools such as Python and R. Data scientists get used to the data-frame mentality and forget that tabular storage of data is a low-scale solution that is not optimized for communication; when it comes to bigger sets of data, or flexibility to add new fields to the data, data frames and their tabular form are inefficient.

DataOps Pipeline and Data Analytics

A very important aspect of analytics that is ignored in some circumstances is going live and getting integrated with other systems. DataOps is about setting up a set of tools from capturing data and storing it, up to analytics and integration, falling into the interdisciplinary realm of DevOps, Data Engineering, Analytics and Software Engineering (hereinafter I use data analytics pipeline and DataOps pipeline interchangeably). The modeling part, and probably some parts of the data prep phase, need a data-frame-like data format, but the rest of the pipeline is more efficient and robust if it is JSON native. It allows adding and removing features more easily and is a compact form for communication between modules.

Figure: DataOps pipeline (picture courtesy of https://zalando-jobsite.cdn.prismic.io/zalando-jobsite/2ed778169b702ca83c2505ceb65424d748351109_image_5-0d8e25c02668e476dd491d457f605d89.jpg)

The role of Python

Python is a great programming language used not only by the scientific community but also by application developers. It is ready to be used as a back end, and by combining it with Django you can build full-stack web applications. Python has almost everything you need to set up a DataOps pipeline and is ready for integration and live production.

Python Example: transforming CSV to JSON and storing it in MongoDB

To show some capabilities of Python in combination with JSON, here is a simple example. In this example, a dataframe is converted to JSON (Python dictionaries) and stored in MongoDB. MongoDB is an important database in today’s data storage landscape as it is JSON native, storing data in a document format, which brings high flexibility.

# Loading packages
from pymongo import MongoClient
import pandas as pd

# Connecting to the database
client = MongoClient('localhost', 27017)

# Creating database and schema
db = client.pymongo_test
posts = db.posts

# Defining a dummy dataframe
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])

# Transforming the dataframe to a dictionary (JSON)
dic = df.to_dict()

# Writing to the database
result = posts.insert_one(dic)
print('One post: {0}'.format(result.inserted_id))

The above example shows the ability of Python to transform data from a dataframe to JSON and its ability to connect to various tooling (MongoDB in this example) in the DataOps pipeline.
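
Reading the document back is equally simple. A minimal sketch, reusing the posts collection and the insert result from the example above:

# Fetch the stored document and rebuild a dataframe from it
stored = posts.find_one({'_id': result.inserted_id})
stored.pop('_id')  # drop MongoDB's internal id before rebuilding the dataframe
df_restored = pd.DataFrame(stored)
print(df_restored)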

Recap

This article is an extension of my previous article on the future of data science (https://bit.ly/2sz8EdM). In that article, I sketched the future of data science and recommended data scientists to move towards full stack. Once you have a full stack with various layers for DataOps / data analytics, JSON is the lingua franca between modules, bringing robustness and flexibility to this communication, and Python is the orchestrator of the various tools and techniques in the pipeline.




Creating an Elastic Canvas for Twitter while visiting Elasticon 2018

The past week we visited Elasticon 2018 in San Francisco. In our previous blog post we wrote about the Keynote and some of the more interesting new features of the elastic stack. In this blog post, we take one of the cool new products for a spin: Canvas. But what is Canvas?

Canvas is a composable, extendable, creative space for live data. With Canvas you can combine dynamic data, coming for instance from a query against Elasticsearch, with nice-looking graphs. You can also use tables and images and combine them with the data visualizations to create stunning, dynamic infographics. In this blog post, we create a Canvas about the tweets with the tag Elasticon during the last day of the Elastic conference last week.

Below is the canvas we are going to create. It contains a number of different elements. The top row contains a pie chart with the language of the tweets, a bar chart with the number of tweets per time unit, followed by the total number of tracked tweets during the second day of Elasticon. The next two elements use the sentiment of the tweets. This was obtained using IBM Watson; Byron wrote a basic integration with Watson and will give more details in a next blog post. The pie chart shows the complete results, and the green smiley on the right shows the percentage of positive tweets out of all tweets that could be analyzed without an error and were not neutral.

Overview canvas

With the example in place, it is time to discuss how to create these canvasses yourself: first some information about the installation, then a few of the general concepts, and finally sample code for the elements used.

Installing canvas

Canvas is part of Kibana. You have to install Canvas as a plugin into Kibana, and you need to install X-Pack in Elasticsearch as well as in Kibana. The steps are well described on the installation page of Canvas. Beware though, installing the plugins into Kibana takes some time. They are working on improving this, but we have to deal with it for the moment.

If everything is installed, open Kibana in your browser. At this moment you could start creating the canvas, however, you have no data. So you have to import some data first. We used Logstash with a Twitter input and an Elasticsearch output. I cannot go into too much detail or else this blog post will be way too long; I might do this in a next blog post. For now, it is enough to know we have an index called twitter that contains tweets.

Creating the canvas with the first element

When clicking on the Canvas tab we can create a new Workpad. A Workpad can contain one or multiple pages and each page can contain multiple elements. A Workpad defines the size of the screen, so it is best to create it for a specific screen size. At Elasticon they had multiple monitors, some of them horizontal, others vertical. You can also choose a background color. These options can be found on the right side of the screen in the Workpad settings bar.

It is good to know that you can create a backup of your Workpad from the Workpads screen; there is a small download button on the right side. Restoring a Workpad is done by dropping the exported JSON into the dialog.

New work pad

Time to add our first element to the page. Use the plus sign at the bottom of the screen to add an element. You can choose from a number of elements. The first one we’ll try is the pie chart. When adding the pie chart, we see data in the chart. Hmm, how come, we did not select any data? Canvas comes with a default data source that is used in all the elements. This way we immediately see what the element looks like, which is ideal for playing around with all the options. Most options are available using the settings on the right. With the pie, you’ll see options for the slice labels and the slice angles. You can also see the Chart style and Element style. These configuration options show a plus-sign button; with this button, you can add options like the color palette and text size and color. For the element, you can set a background color, border color, opacity and padding.

Add element

Next, we want to assign our own data source to the element. After adding our own data source we most likely have to change the parameters for the element as well; in this case, we have to change the slice labels and angles. Changing the data source is done using the button at the bottom: click the Change Datasource button/link. At the moment there are four data sources: demo data, demo prices, Elasticsearch query and Timelion query. I choose the Elasticsearch query, select the index, don’t use a specific query and select the fields I need. Selecting only the fields I need can speed up the element, as we only parse the data that we actually need. In this example, we only use the sentiment label.

Change data source

The last thing I want to mention here is the Code view. After pushing the >_ Code button you’ll see a different view of your element. In this view, you get a code approach. This is more powerful than the settings window, but with great power comes great responsibility: it is easy to break stuff here. The code is organized in different steps, and the output of each step is the input for the next step. In this specific example, there are five steps: first a filter step, next the data source, then a point series that is required for a pie chart, the pie itself, and finally the render step. If you change something using the settings, the code tab gets updated immediately. If I add a background color to the container, the render step becomes:

render containerStyle={containerStyle backgroundColor="#86d2ed"}
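
To give an idea of the full chain, the expression for the sentiment pie chart could look roughly like the sketch below; the exact pointseries arguments are an assumption and may differ from the element we actually used.

filters
 | esdocs index="twitter*" fields="sentiment_label" query=""
 | pointseries color="sentiment_label" size="count(sentiment_label)"
 | pie
 | render containerStyle={containerStyle backgroundColor="#86d2ed"}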

If you make changes in the code block, use the Run button to apply the changes. In the next sections, we will only work in this code tab, just because it is easier to show to you.

Code view

Adding more elements

The basics of the available elements and functions are documented here. We won’t go into detail for all the different elements we have added; some of them use the defaults and you can therefore easily add them yourself. The first one I do want to explain is the Twitter logo with the number of tweets in it. This is actually two different elements. The logo is a static image. The number is more interesting: it makes use of the escount function and the markdown element. Below is the code.

filters
 | escount index="twitter"
 | as num_tweets
 | markdown "{{rows.0.num_tweets}}" font={font family="'Open Sans', Helvetica, Arial, sans-serif" size=60 align="left" color="#ffffff" weight="undefined" underline=false italic=false}

The filters function is used to facilitate filtering (usually by time) using the special filter element. The next item is escount, which does what you expect: it counts the number of items in the provided index. You can also provide a query to limit the results, but we did not need that. The output of escount is a number. This is a problem when sending it to a markdown element, because the markdown element only accepts a datatable. Therefore we have to use the as function, which accepts a number and turns it into a datatable. The markdown element accepts a table and exposes it as rows. Therefore we use rows to obtain the first row and, of that row, the column num_tweets. When playing with this element it is easy to remove the markdown line; Canvas will then render the table by default. Below is the output for only the first two lines, as well as the output after adding the third line (as num_tweets).

200

num_tweets
200

Next up are the text and the photo belonging to the actual tweets. The photo is a bit different from the Twitter logo as it is a dynamic photo. In the code below you can see that the image element has a dataurl attribute. We can use this attribute to get one cell from the provided datatable. The getCell function has attributes for the row number as well as the name of the column.

esdocs index="twitter*" sort="@timestamp, desc" fields="media_url" count=5 query=""
 | image mode="contain" dataurl={getCell c="media_url" r=2}
 | render

With the text of the tweet, it is a bit different. Here we want to use the markdown element, however, it does not have the dataurl attribute. So we have to come up with a different strategy: if we want to obtain the third item, we select the top 3 and from those top 3 we take the last item.

filters 
| esdocs index="twitter*" sort="@timestamp, desc" fields="text, user.name, created_at" query="" 
| head 3 
| tail 1 
| mapColumn column=created_at_formatted fn=${getCell created_at | formatdate 'YYYY-MM-DD HH:mm:ss'} 
| markdown "{{#each rows}}
**{{'user.name'}}** 

(*{{created_at_formatted}}*)

{{text}}
{{/each}}" font={font family="'American Typewriter', 'Courier New', Courier, Monaco, mono" size=18 align="right" color="#b83c6f" weight="undefined" underline=false italic=false}

The row that starts with mapColumn is a way to format the date. mapColumn adds a new column with the name provided by the column attribute and the value as the result of a function. The function can be a chain of functions; in this case, we obtain the column created_at of the datatable and pass it to the formatdate function.

Creating the partly green smiley

The most complicated feature was the smiley that turns green the more positive tweets we see. The positiveness of the tweets was determined using the IBM Watson interface. In the end, it is the combination of two images, one grey smiley and one green smiley. The green smiley is only shown for a specific percentage; this is done with the revealImage function. First, we show the complete code.

esdocs index="twitter*" fields="sentiment_label" count=10000 
| ply by="sentiment_label" fn=${rowCount | as "row_count"} 
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="error"} then=false else=true}
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="neutral"} then=false else=true}
| staticColumn column="total" value={math "sum(row_count)"} 
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="positive"} then=true else=false}
| staticColumn column="percent" value={math "divide(row_count, total)"} 
| getCell "percent" 
| revealImage image={asset "asset-488ae09a-d267-4f75-9f2f-e8f7d588fae1"} emptyImage={asset "asset-0570a559-618a-4e30-8d8e-64c90ed91e76"}

The first line is like we have seen before: select all rows from the twitter index. The second line does a kind of grouping of the rows. It groups by the values of sentiment_label; the value is a row count that is specified by the function. If I remove all the other lines, we can see the output of just the ply function.

sentiment_label         row_count
negative                32
positive                73
neutral                 81
error                   14

The next steps filter out the rows for error and neutral, then we add a column for the total number of tweets with a positive or negative label. Now each row has this value. Check the following output.

sentiment_label         row_count       total
negative                32              105
positive                73              105

The next line removes the negative row, then we add a column with the percentage, obtain just one cell and call the revealImage function. This function has a number input and attributes for the image as well as the empty or background image.

That gives us all the different elements on the canvas.

Concluding

We really like the options you have with Canvas. You can easily create good-looking dashboards that contain static resources, help texts and images, combined with dynamic data coming from Elasticsearch and, in the future, most likely other resources.

There are some improvements possible of course. It would be nice if we could also select doc_value fields, and being able to use aggregations in a query would be nice as well.

We will closely monitor its progression, as we believe this is going to be a very interesting technology to keep using in the future.


Elasticon 2018 Day 1

The past few days have been fantastic. Together with Byron I am visiting San Francisco. We have seen amazing sights, but yesterday the reason why we came started: day one of Elasticon. It began with the keynote, showing us cool new features to come and some interesting news. In this blog post, I want to give you a short recap of the keynote and tell you what I think was important.

Elasticon opening

Rollups

With more and more data finding its way to elasticsearch, some indexes become too large for their purpose. We do not need to keep all data of the past weeks and months; we just want to keep the data needed for the aggregations we show on a dashboard. Think about an index containing logs of our web server. We have charts with HTTP status codes, response times, browsers, etc. Now you can create a rollup configuration providing the aggregations we want to keep, containing a cron expression telling it when to run and some additional information about how much data to keep. The result is a new index with a lot less data that you can keep for your dashboards.

More information about the rollup functionality can be found here.

Canvas

Last year at Elasticon, Canvas was already shown. Elastic continued with the idea and it is starting to look amazing. With Canvas you can create beautiful dashboards that go a big step further than the standard dashboards in Kibana. You can customise almost everything you see. It comes with options to put an image on the background, a lot of color options, and new sorts of data visualisation, integrated of course with elasticsearch. In a next blog post I’ll come up with a demo; it is looking very promising. Want to learn more about it? Check this blog post.

Kubernetes/Docker logging

One of the areas I still need to learn is the Docker / Kubernetes ecosystem. But if you are looking for a solution to monitor your complete Kubernetes platform, have a look at all the options that Elastic has these days. Elastic has impressive support for monitoring all the running images. It comes with standard dashboards in Kibana and now has a dedicated space in Kibana called the Infra tab. More information about the options and how to get started can be found here.

Presenting Geo/Gis data

A very cool demo was given on how to present data on a map. The demo showed where all Elasticon attendees were coming from. The visual component has an option for creating different layers. So you can add data to give the different countries a color based on the number of attendees, show the bigger cities where people are coming from as small circles in the next layer, use a logo of the venue in another layer, etc. Really interesting if you are into geodata. It all makes use of the Elastic Maps Service. If you want more information about this, you can find it here.

Elastic Site Search

Up till now the news was about new ways to handle your logs coming from application monitoring, infrastructure components and application logs. We did not hear about new things around search until the new product called Elastic Site Search was shown. This was previously known as Swiftype. With Google declaring its Google Search Appliance end-of-life, this is a very interesting replacement. Working with relevance, synonyms and search analytics becomes a lot easier with this new product. More information can be found here.

Elastic cloud sizing

If you previously looked at the cloud options Elastic offers, you might have noticed that choosing elastic nodes did not give you a lot of flexibility. When choosing the amount of required memory, you also got a fixed amount of disk space. With the upcoming release, you have a lot more flexibility when creating your cluster. You can configure different flavours of clusters, one of them being a hot-warm cluster: with dedicated master nodes, hot nodes for recent indexes with more RAM and faster disks, and warm nodes containing the older indices with bigger disks. This is a good improvement if you want to create a cluster in the cloud. More information can be found here.

Opening up X-Pack

Shay told a good story about building a company that supports an open source product. Creating a company only on support is almost impossible in the long run, therefore they started working on commercial additions, now called X-Pack. The problem with these products was that the code was not available, so working with Elastic to help them improve the product was not possible. They are now opening up these repositories. Beware, it is not becoming free software; you still need to pay, but it becomes a lot easier to interact with Elastic about how stuff works. Next to that, they are going to make it easier to work with the free parts of X-Pack: just ask for a license once instead of every year again. And if I understood correctly, the download will contain the free parts in easier installation packages. More information about the what and why can be found in this blog post from Shay Banon.

Conference Party

A nice party, but I had to sign a contract that prohibits me from telling stories about the party. I do plan on writing more blog posts in the coming two days.


Tracing API’s: Combining Spring’s Sleuth, Zipkin & ELK

Tracing bugs, errors and the cause of lagging performance can be cumbersome, especially when functionality is distributed over multiple microservices.
In order to keep track of these issues, the usage of an ELK stack (or any similar system) is already a big step forward in creating a clear overview of all the processes of a service and finding these issues.
Often bugs can be traced using ELK far more easily than with just a plain log file – if one is even available.
Further optimization of this approach may be desirable, as you may for example want to see the trace logging for one specific event only.


A fresh look at Logstash

Soon after the release of elasticsearch it became clear that elasticsearch was good at more than providing search. It turned out that it could also be used to store logs very effectively, and that is why Logstash started using elasticsearch. Logstash contained standard parsers for Apache httpd logs, file monitoring plugins to obtain the logs, plugins to extend and filter the content, and plugins to send the content to elasticsearch. That is Logstash in a nutshell, back in the day. Of course the logs had to be shown, and for that a tool called Kibana was created. Kibana was a nice tool to create highly interactive dashboards to show and analyse your data. Together they became the famous ELK suite. Nowadays we have a lot more options in all these tools. We have the ingest node in elasticsearch to pre-process documents before they are stored, we have Beats to monitor files, databases, machines, etc., and we have very nice new Kibana dashboards. Time to re-investigate what the combination of Logstash, Elasticsearch and Kibana can do. In this blog post I’ll focus on Logstash.

X-Pack

As the company elastic has to make some money as well, they have created a product called X-Pack. X-Pack has a lot of features that sometimes span multiple products. There is a security component; by using it you can make users log in when using Kibana and secure your content. Other interesting parts of X-Pack are machine learning, graph and monitoring. Parts of X-Pack can be used free of charge, although you do need a license; for other parts you need a paid license. I personally like the monitoring part, so I regularly install X-Pack. In this blog post I’ll also investigate the X-Pack features for Logstash. I’ll focus on out-of-the-box functionality and mostly on what all these nice new things like monitoring and pipeline viewing bring us.

Using the version 6 release candidate

As elastic has already given us an RC1 of their complete stack, I’ll use this one for the evaluation. Beware though, this is still a release candidate, so not production ready.

What does Logstash do

If you have never really heard about Logstash, let me give you a very short introduction. Logstash can be used to obtain data from a multitude of different sources, then filter, transform and enrich the data, and finally store it in, again, a multitude of data sources. Example data sources are relational databases, files, queues and websockets. Logstash ships with a large number of filter plugins; with these we can process data to exclude some fields. We can also enrich data, look up information about IP addresses, or look up records belonging to an id in, for instance, elasticsearch or a database. After the lookup we can add data to the document or event that we are handling before sending it to one or more outputs. Outputs can be elasticsearch or a database, but also queues like Kafka or RabbitMQ.

In later releases Logstash started to add more features that a tool handling large amounts of data over longer periods needs. Things like monitoring and clustering of nodes were introduced, as well as persisting incoming data to disk. By now Logstash, in combination with Kibana and Elasticsearch, is used by very large companies but also by a lot of start-ups to monitor their servers and handle all sorts of interesting data streams.

Enough of this talk, let us get our hands dirty. First step install everything on our developer machines.

Installation

I’ll focus on the developer machine, if you want to install it on a server please refer to the extensive logstash documentation.

First download the zip or tar.gz file and extract it to a convenient location. Now create a folder where you can store the configuration files. To keep the files small and to show you that you can split them, I create three different files in this folder: input.conf, filters.conf and output.conf. The most basic configuration is one with stdin for input, no filters and stdout for output. Below are the contents of the input and output files.

input {
	stdin{}
}
output { 
	stdout { 
		codec => rubydebug
	}
}

Time to start logstash. Step into the downloaded and extracted folder with the logstash binaries and execute the following command.

bin/logstash -r -f ../logstashblog/

The -r flag can be used during development to reload the configuration on change. Beware, this does not work with the stdin plugin. With -f we tell Logstash to load a configuration file or directory, in our case a directory containing the three mentioned files. When Logstash is ready it will print something like this:

[2017-10-28T19:00:19,511][INFO ][logstash.pipeline        ] Pipeline started {"pipeline.id"=>"main"}
The stdin plugin is now waiting for input:
[2017-10-28T19:00:19,526][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}

Now you can type something and the result is the created document or event that went through the almost empty pipeline. The thing to notice is that we now have a field called message containing the text we entered.

Just some text for input
{
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
    "@timestamp" => 2017-10-28T17:02:18.185Z,
       "message" => "Just some text for input"
}

Now that we know it is working, I want you to have a look at the monitoring options you have available using the REST endpoint.

http://localhost:9600/

{
"host": "Jettros-MBP.fritz.box",
"version": "6.0.0-rc1",
"http_address": "127.0.0.1:9600",
"id": "20290d5e-1303-4fbd-9e15-03f549886af1",
"name": "Jettros-MBP.fritz.box",
"build_date": "2017-09-25T20:32:16Z",
"build_sha": "c13a253bb733452031913c186892523d03967857",
"build_snapshot": false
}

You can use the same url with different endpoints to get information about the node, the plugins, stats and hot threads:
http://localhost:9600/_node
http://localhost:9600/_node/plugins
http://localhost:9600/_node/stats
http://localhost:9600/_node/hot_threads

It becomes a lot more fun if we have a UI, so let us install X-Pack into Logstash. Before we can run Logstash with monitoring on, we need to install Elasticsearch and Kibana with X-Pack installed into those as well. Refer to the X-Pack documentation on how to do it.

The basic commands to install X-Pack into Elasticsearch and Kibana are very easy. For now I disable security by adding the following line to both kibana.yml and elasticsearch.yml: xpack.security.enabled: false. After installing X-Pack into Logstash we have to add the following lines to the logstash.yml file in the config folder.

xpack.monitoring.elasticsearch.url: ["http://localhost:9200"] 
xpack.monitoring.elasticsearch.username:
xpack.monitoring.elasticsearch.password:

Notice the empty username and password; this is required when security is disabled. Now move over to Kibana, check the monitoring tab (the heart-shaped icon) and click on Logstash. In the first screen you can see the events; they could be zero, so please enter some events. Now move to the pipeline tab. Of course with our basic pipeline this is a bit silly, but imagine what it will show later on.


Time to get some real input.

Import the Signalmedia dataset

Signalmedia has provided a dataset you can use for research. More information about the dataset and how to obtain it can be found here. The dataset contains exactly 1 million news documents. You can download it as a file that contains one JSON document per line. The JSON documents have the following format:

{
   "id": "a080f99a-07d9-47d1-8244-26a540017b7a",
   "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
   "title": "Pay up or face legal action: DBKL",
   "media-type": "News",
   "source": "My Sinchew",
   "published": "2015-09-15T10:17:53Z"
}

We want to import this big file with all the JSON documents as separate documents into elasticsearch using Logstash. The first step is to create a Logstash input. We use the file plugin to load the file, point the path at the file, tell it to start at the beginning and mark each line as a JSON document. The file plugin has more options you can use; it can also handle rolling files, which are used a lot in logging.

input {
	file {
        path => "/Volumes/Transcend/signalmedia-1m.jsonl"
        codec => "json"
        start_position => beginning 
    }
}

That is it, with the stdout plugin and the rubydebug codec this would give the following output.

{
          "path" => "/Volumes/Transcend/signalmedia-1m.jsonl",
    "@timestamp" => 2017-10-30T18:49:45.948Z,
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
            "id" => "a080f99a-07d9-47d1-8244-26a540017b7a",
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Notice that besides the fields we expected (id, content, title, media-type, source and published) we also got some additional fields. Before sending this to elasticsearch we want to clean it up: we do not need path, host, @timestamp and @version. There is also something special about the id field. We want to use it to create the document in elasticsearch, but we do not want to add it to the document itself. If we need the value of id in the output plugin later on, without adding it as a field to the document, we can move it to the @metadata object. That is exactly what the first part of the filter does. The second part removes the fields we do not need.

filter {
	mutate {
		copy => {"id" => "[@metadata][id]"}
	}
	mutate {
		remove_field => ["@timestamp", "@version", "host", "path", "id"]
	}
}

With these filters in place the output of the same document would become:

{
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Now the content is ready to be sent to elasticsearch, so we need to configure the elasticsearch output plugin. When sending data to elastic you first need to think about creating the index and the mapping that goes with it. In this example I am going to use an index template. I am not going to explain a lot about the mappings as this is not an elasticsearch blog, but with the following code we install the mapping template when connecting to elasticsearch and can then insert all documents. Do look at the way the document_id is created. Remember we talked about @metadata and how we copied the id field into it? This is the reason why we did it: now we use that value as the id of the document when inserting it into elasticsearch.

output {
	elasticsearch {
		index => "signalmedia"
		document_id => "%{[@metadata][id]}"
		document_type => "doc"
		manage_template => "true"
		template => "./signalmedia-template.json"
		template_name => "signalmediatemplate"
	}
	stdout { codec => dots }
}

Notice there are two outputs configured: the elasticsearch output of course, but also a stdout. This time not with the rubydebug codec, which would be way too verbose, but with the dots codec. This codec prints a dot for each document it processes.

For completeness I also want to show the mapping template. In this case I positioned it in the root folder of the logstash binary, usually this would of course be an absolute path somewhere else.

{
  "index_patterns": ["signalmedia"],
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 3
  },
  "mappings": {
    "doc": {
      "properties": {
        "source": {
          "type": "keyword"
        },
        "published": {
          "type": "date"
        },
        "title": {
          "type": "text"
        },
        "media-type": {
          "type": "keyword"
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Now we want to import all the million documents and have a look at the monitoring along the way. Let’s do it.


Running a query

Of course we have to prove the documents are now available in elasticsearch. So let’s execute one of my favourite queries, one that makes use of the new significant text aggregation. First the request and then parts of the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "netherlands"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

This is just a very small part of the response; I stripped out a lot of the elements to make it easier to read. Good to see that we get dutch as a significant word when searching for the netherlands, and of course geenstijl.

"buckets": [
  {"key": "netherlands","doc_count": 527},
  {"key": "dutch","doc_count": 196},
  {"key": "mmsi","doc_count": 7},
  {"key": "herikerbergweg","doc_count": 4},
  {"key": "konya","doc_count": 14},
  {"key": "geenstijl","doc_count": 3}
]

Concluding

It is good to see the nice UI options in Kibana; the pipeline viewer is very useful. In a next blog post I’ll be looking at Kibana and all the new and interesting things in there.


Elasticsearch 6 is coming

For some time now, elastic has been releasing versions of the new major release, elasticsearch 6. At this moment the latest edition is already RC1, so it is time to start thinking about migrating to the latest and greatest. What backwards-compatibility issues will you run into and what new features can you start using? This blog post gives a summary of the items that are most important to me based on the projects that I do. First we’ll have a look at the breaking changes, then we move on to new features and interesting upgrades.

Breaking changes

Most of the breaking changes come from the elasticsearch documentation that you can of course also read yourself.

Migrating indexes from previous versions

As with all major releases, only indexes created in the prior major version can be migrated automatically. So if you have an index created in 2.x and migrated to 5.x, and now want to start using 6.x, you have to use the reindex API to first index it into a 5.x index before migrating.
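
Such a reindex into a new index is a single call. A minimal sketch, with index names that are just an example:

POST _reindex
{
  "source": { "index": "books-2x" },
  "dest": { "index": "books-5x" }
}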

Index types

In elasticsearch 6 the first step is taken towards indexes without types: new indexes can only contain a single type, while indexes migrated from 5.x can keep using multiple types. Starting with elasticsearch 5.6 you can already prevent people from creating indexes with multiple types, which will make it easier to migrate to 6.x when it becomes available. You can do this by applying the following configuration option:

index.mapping.single_type: true

More reasoning about why the types need to be removed can be found in the elasticsearch documentation on removal of types. Also, if you are into parent-child relationships in elasticsearch and are curious what the implications of not being able to use multiple types are, check the parent-join documentation page. Yes, we will get joins in elasticsearch :-), though with very limited use.

Java High Level REST Client

This was already introduced in 5.6, but it is still good to know, as this will be the replacement for the Transport Client. As you might know, I am also creating some code to use in Java applications on top of the Low Level REST Client for Java, which is also used by this new client. More information about my work can be found here: part 1 and part 2.

Uniform response for create/update and delete requests

At the moment a create request returns a created field with the value true or false, and a delete request returns a found field with true or false. If you are parsing the response and using these fields, you can no longer do so; use the result field instead. It will have the value created or updated in case of the create request, and deleted or not_found in case of the delete request.
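
As a trimmed sketch (made-up index and document, and I left out fields such as _shards), indexing a new document now returns something along these lines:

PUT books/book/1
{
  "title": "Origin"
}

{
  "_index": "books",
  "_type": "book",
  "_id": "1",
  "_version": 1,
  "result": "created"
}

Deleting the same document would return "result": "deleted".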

Translog change

The translog is used to keep operations that have not yet been flushed to disk by elasticsearch. In prior releases the translog files were removed as soon as elasticsearch performed a flush. However, due to optimisations made for recovery, keeping the translog around can speed up the recovery process. Therefore the translog is now kept, by default, for 12 hours or up to a maximum of 512 MB.
More information about the translog can be found here: Translog.
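
If the defaults do not fit your situation, the retention is configurable per index. A minimal sketch, assuming the 6.x setting names below are correct for your exact version (do check the documentation):

PUT my_index/_settings
{
  "index.translog.retention.age": "12h",
  "index.translog.retention.size": "512mb"
}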

Java Client

In a lot of Java projects the Java client is used; I have used it myself in numerous projects. However, with the introduction of the High Level REST Client, Java projects should move away from the Transport Client. If you want or need to keep using it for now, be aware that some packages have changed and some methods have been removed. The ones I used the most are the order classes for aggregations, such as Terms.Order and Histogram.Order; they have been replaced by BucketOrder.

Index mappings

There are two important changes that can affect your way of working with elastic. The first is the way booleans are handled. In indexes created in version 6, a boolean accepts only two values: true and false. All other values will result in an exception.
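
A small sketch to illustrate, with a made-up index and field. The first document is accepted, the second one is rejected with a mapping exception in an index created with 6.x:

PUT inventory
{
  "mappings": {
    "item": {
      "properties": {
        "in_stock": { "type": "boolean" }
      }
    }
  }
}

PUT inventory/item/1
{ "in_stock": true }

PUT inventory/item/2
{ "in_stock": 1 }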

The second change is the _all field. In prior versions, by default, an _all field was created into which the values of all fields were copied as strings and analysed with the standard analyser. This field was used by queries like the query_string query. There was, however, a performance penalty, as a potentially big extra field had to be analysed and indexed. Soon it became a best practice to disable the field. In elasticsearch 6 the field is disabled by default and it cannot be enabled for indices created with elasticsearch 6. If you still use the query_string query, it is now executed against each field. You should be very careful with the query_string query. It comes with a lot of power: users get a lot of options to create their own query. But with great power comes great responsibility. They can create very heavy queries, and they can create queries that break without a lot of feedback. More information can be found in the query_string documentation. If you still want to give your users more control, but the query_string query is one step too far, think about creating your own search DSL. Some ideas can be found in my previous blog posts: Creating a search DSL and Part 2 of creating a search DSL.
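
If you relied on the _all field, a common replacement is a self-made catch-all field using copy_to. A minimal sketch, with made-up index and field names; queries that used to hit _all can then target the everything field explicitly:

PUT catalog
{
  "mappings": {
    "book": {
      "properties": {
        "everything": { "type": "text" },
        "title":      { "type": "text", "copy_to": "everything" },
        "author":     { "type": "text", "copy_to": "everything" }
      }
    }
  }
}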

Booting elasticsearch

Some things changed in the startup options. You can no longer configure the user elasticsearch runs as if you use the deb or rpm packages, and the location of elasticsearch.yml is now configured differently. You now have to export the path where all configuration files (elasticsearch.yml, jvm.options and log4j2.properties) can be found, by setting an environment variable ES_PATH_CONF that contains the path to the config folder. I use this regularly on my local machine. As I have multiple projects running, often with different versions of elasticsearch, I have set up a structure where I keep my config files in folders separate from the elasticsearch distributable. You can find the structure in the image below. In the beginning I just copy the config files to my project-specific folder. When I start the project with the script startNode.sh, the following script is executed.

Elastic folder structure

#!/bin/bash

CURRENT_PROJECT=$(pwd)

export ES_PATH_CONF=$CURRENT_PROJECT/config

DATA=$CURRENT_PROJECT/data
LOGS=$CURRENT_PROJECT/logs
REPO=$CURRENT_PROJECT/backups
NODE_NAME=Node1
CLUSTER_NAME=playground

BASH_ES_OPTS="-Epath.data=$DATA -Epath.logs=$LOGS -Epath.repo=$REPO -Enode.name=$NODE_NAME -Ecluster.name=$CLUSTER_NAME"

ELASTICSEARCH=$HOME/Development/elastic/elasticsearch/elasticsearch-6.0.0-rc1

$ELASTICSEARCH/bin/elasticsearch $BASH_ES_OPTS

Now when you need additional configuration options, add them to the elasticsearch.yml. If you need more memory for the specific project, change the jvm.options file.

Plugins

For indexing pdf or word documents, a lot of you out there have been using the mapper-attachments plugin. This was already deprecated; now it has been removed. You can switch to the ingest attachment plugin. Never heard about ingest? Ingest can be used to pre-process documents before they are indexed by elasticsearch. It is a lightweight variant of Logstash, running within elasticsearch. Be warned, though, that plugins like the ingest attachment plugin can be heavy on your cluster, so it is wise to have a dedicated ingest node. Curious what it takes to ingest the contents of a pdf? The next few steps show you the commands to create the ingest pipeline, send a document through it and retrieve or query it afterwards.

First create the ingest pipeline:

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

Now, when indexing a document that contains the attachment as a base64 encoded string in the field data, we need to tell elasticsearch to use the pipeline. Check the parameter in the url: pipeline=attachment. This is the name we used when creating the pipeline.

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": ""
}

We could stop here, but how do you get base64 encoded input from, for instance, a pdf? On Linux and macOS you can use the base64 command for that. Below is a script that reads a specific pdf, creates a base64 encoded string out of it and then pushes that string to elasticsearch.

#!/bin/bash

pdf_doc=$(base64 ~/Desktop/Programma.pdf)

curl -XPUT "http://localhost:9200/my_index/my_type/my_id?pipeline=attachment" -H 'Content-Type: application/json' -d '{"data" : "'"$pdf_doc"'"}'

Scripts

If you are heavily into scripting in elasticsearch, you need to check a few things. Changes have been made to the use of the lang attribute when obtaining or updating stored scripts: you cannot provide it any more. Also, support for scripting languages other than painless has been removed.
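
In short, you now address a stored script by id only and declare painless inside the body. A minimal sketch with a made-up script name and field:

PUT _scripts/rating-boost
{
  "script": {
    "lang": "painless",
    "source": "doc['rating'].value * params.factor"
  }
}

GET _scripts/rating-boost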

Search and query DSL changes

Most of the changes in this area are very specific. I am not going to sum them all up; please check the original documentation. Some of them I do want to mention, as they are important to me.

  • If you are constructing queries and it can happen that you end up with an empty query, you can no longer provide an empty object { }. You will get an exception if you keep doing it.
  • Bool queries had a disable_coord parameter; with it you could tell the scoring function not to penalise the score for missing search terms. This option has been removed.
  • You could transform a match query into a match_phrase query by specifying a type. This is no longer possible; just use a match_phrase query when you need one, as in the sketch below. For the same reason the slop parameter has been removed from the match query.
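
For that last point the replacement is straightforward; a minimal sketch with made-up index and field names:

GET books/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "dan brown origin",
        "slop": 1
      }
    }
  }
}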

Calculating the score

In the beginning of elasticsearch, the score for a document matching a query was calculated using an adjusted TF/IDF formula. It turned out that TF/IDF was less ideal for fields containing smaller amounts of text, so the default scoring algorithm was replaced by BM25; moving from TF/IDF to BM25 was the main scoring topic of version 5. Now, with 6, two mechanisms have been removed from the scoring: query normalization and the coordination factor. Query normalization was always hard to explain during trainings. It was an attempt to normalise the scores of queries so they could be compared; however, it did not really work, and you still should not compare the scores of different queries. The coordination factor was a penalty applied when you searched for multiple terms and not all of them were found. You could easily see it at work when using the explain API.

That is it for the breaking changes. There are more changes that you might want to investigate if you are really into all the elasticsearch details; in that case, have a look at the original documentation.

Next up, cool new features

Now let us zoom in on some of the newer features or interesting upgrades.

Sequence Numbers

Sequence numbers are now assigned to all index, update and delete operations. Using these numbers, a shard that went offline for a moment can ask the primary shard for all operations after a certain sequence number. If the translog is still available (remember that the translog is now kept around for 12 hours or 512 MB by default), the missing operations can be sent to the shard, preventing a complete resync of the shard's contents.

Test Normalizer using analyse endpoint

One of the most important parts of elasticsearch is configuring the mapping for your documents: how do you turn the provided text into the terms you can search for? If you are not sure and want to try out a specific combination of tokeniser and filters, you can use the analyze endpoint. Have a look at the following code sample and response, where we try out a whitespace tokeniser with a lowercase filter.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "coenradie",
      "start_offset": 7,
      "end_offset": 16,
      "type": "word",
      "position": 1
    }
  ]
}

As you can see, we now get two tokens and the uppercase characters are replaced by their lowercase counterparts. But what if we do not want the text to become two terms and want it to stay as one term, while still replacing the uppercase characters with lowercase ones? This was not possible in the beginning. However, it became possible with the introduction of the normalizer, a special analyser for fields of type keyword. In elasticsearch 6 we can now use the analyze endpoint for normalizers as well. Check the following code block for an example.

PUT people
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "normalizer": {
        "name_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

GET people/_analyze
{
  "normalizer": "name_normalizer",
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro coenradie",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}

LogLog-Beta

Ever heard about HyperLogLog or even HyperLogLog++? Well, then you will be happy with LogLog-Beta. Some background: elasticsearch comes with a cardinality aggregation that can be used to calculate, or rather estimate, the number of distinct values. If we wanted an exact value, we would have to keep a map containing all unique values, which would require a large amount of memory. You can specify a threshold below which the count of unique values is close to exact; however, the maximum value for this threshold is 40000. Previously, elasticsearch used the HyperLogLog++ algorithm to estimate the number of unique values. The new algorithm, LogLog-Beta, gives better results with lower error margins at the same performance.
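
As a reminder of what the aggregation looks like, including the threshold mentioned above, here is a minimal sketch with made-up index and field names:

GET books/_search
{
  "size": 0,
  "aggs": {
    "unique_authors": {
      "cardinality": {
        "field": "author.keyword",
        "precision_threshold": 40000
      }
    }
  }
}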

Significant Text Aggregation

For some time the significant terms aggregation has been available. The idea behind this aggregation is to find terms that are common within a specific scope and less common within a more general scope. So imagine we want to find, from logs of page visits, the users of our website that place relatively many orders compared to the pages they visit. You cannot find them by just counting the number of orders; you need the users that are more common in the set of orders than in the set of page visits. In prior versions this was already possible with terms, so with non-analysed fields. By enabling fielddata or doc_values you could use small analysed fields, but for larger text fields this was a performance problem. Now, with the significant text aggregation, we can overcome this problem. It also comes with interesting functionality to deduplicate text (think about emails with the original text quoted in a reply, or retweets).

Sounds a bit too vague? OK, let’s look at an example. The elasticsearch documentation uses a dataset from Signal Media. As it is an interesting dataset to work with, I will use it as well, and you can try it out yourself too. I downloaded the file and imported it into elasticsearch using Logstash; this gist should help you. Now on to the query and the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "rain"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

So we are looking for documents containing the word rain. Within these documents we then look up terms that occur more often than they do in the global context.

{
  "took": 248,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 11722,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_sampler": {
      "doc_count": 600,
      "keywords": {
        "doc_count": 600,
        "bg_count": 1000000,
        "buckets": [
          {
            "key": "rain",
            "doc_count": 544,
            "score": 69.22167699861609,
            "bg_count": 11722
          },
          {
            "key": "showers",
            "doc_count": 164,
            "score": 32.66807368214775,
            "bg_count": 2268
          },
          {
            "key": "rainfall",
            "doc_count": 129,
            "score": 24.82562838569881,
            "bg_count": 1846
          },
          {
            "key": "thundery",
            "doc_count": 28,
            "score": 20.306396677050884,
            "bg_count": 107
          },
          {
            "key": "flooding",
            "doc_count": 153,
            "score": 17.767450110864743,
            "bg_count": 3608
          },
          {
            "key": "meteorologist",
            "doc_count": 63,
            "score": 16.498915662650603,
            "bg_count": 664
          },
          {
            "key": "downpours",
            "doc_count": 40,
            "score": 13.608547008547008,
            "bg_count": 325
          },
          {
            "key": "thunderstorms",
            "doc_count": 48,
            "score": 11.771851851851853,
            "bg_count": 540
          },
          {
            "key": "heatindex",
            "doc_count": 5,
            "score": 11.56574074074074,
            "bg_count": 6
          },
          {
            "key": "southeasterlies",
            "doc_count": 4,
            "score": 11.104444444444447,
            "bg_count": 4
          }
        ]
      }
    }
  }
}

Interesting terms when looking for rain: showers, rainfall, thundery, flooding, etc. These terms could now be returned to the user as possible candidates for improving their search results.

Concluding

That is it for now. I haven’t even scratched the surface of all the cool new stuff in the other components like X-Pack, Logstash and Kibana. More to come.


Part 2 of Creating a search DSL

In my previous blog post I wrote about creating your own search DSL using Antlr. In that post I discussed the Antlr language for constructs like AND/OR, multiple words and combining words. In this blog post I am showing how to use the visitor mechanism to write actual elasticsearch queries.

If you did not read the first post yet, please do so. It will make it easier to follow along. If you want the code, please visit the github page.

https://amsterdam.luminis.eu/2017/06/28/creating-search-dsl/
Github repository

What queries to use

In the previous blog post we ended up with some of the queries we want to support

  • apple
  • apple OR juice
  • apple raspberry OR juice
  • apple AND raspberry AND juice OR Cola
  • “apple juice” OR applejuice

Based on these queries we have some choices to make. The first query seems obvious: searching for one word becomes a match query. However, which field do you want to search in? In elasticsearch there is a special field called the _all field. In the examples we use the _all field; however, it would be easy to query a number of specific fields instead using a multi_match, as sketched below.
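
As a sketch of that alternative (the field names title and description are made up for the example), the generated query for a single word could look like this:

{
  "query": {
    "multi_match": {
      "query": "apple",
      "fields": ["title", "description"]
    }
  }
}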

In the second example we have two words with OR in between. The most basic implementation would again be a match query, since the match query uses OR by default if you supply multiple words. However, our DSL uses OR to combine terms as well as andQueries, and a term can itself be a quoted term. Therefore, to translate apple OR juice we need to create a bool query. Now look at the last example, where we use quotes. One would expect quotes to keep words together; in elasticsearch we use the phrase query to accomplish this.

As the current DSL is fairly simple, creating the queries is not that hard. But a lot more extensions are possible that can make use of more advanced query options. Using wildcards could result in fuzzy queries, using title:apple could look into one specific field, and using single quotes could mean an exact match, for which we would need the term query.

Now that you have an idea of the queries we need, let us have a look at the code and see the Antlr DSL in action.

Generate json queries

As mentioned in the introduction we are going to use the visitor to parse the tree. Of course we need to create the tree first. Below the code to create the tree.

static SearchdslParser.QueryContext createTreeFromString(String searchString) {
    CharStream charStream = CharStreams.fromString(searchString);
    SearchdslLexer lexer = new SearchdslLexer(charStream);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    SearchdslParser parser = new SearchdslParser(commonTokenStream);

    return parser.query();
}

As mentioned in the previous post, the parser and the visitor classes are generated by Antlr. Methods are generated for visiting the different nodes of the tree; check the class SearchdslBaseVisitor for the methods you can override.

    To understand what happens, it is best to have a look at the tree itself. Below is the image of the tree that we are going to visit.

    Antlr4 parse tree

    We visit the tree from the top. The first method or Node that we visit is the top level Query. Below the code of the visit method.

    @Override
    public String visitQuery(SearchdslParser.QueryContext ctx) {
        String query = visitChildren(ctx);
    
        return
                "{" +
                    "\"query\":" + query +
                "}";
    }
    

    Every visit method generates a string. For the query we visit all possible children and wrap the result in a json string containing a query element. In the image we only see an orQuery child, but it could also be a Term or an andQuery. By calling the visitChildren method we continue to walk the tree. The next step is visitOrQuery.

    @Override
    public String visitOrQuery(SearchdslParser.OrQueryContext ctx) {
        List<String> shouldQueries = ctx.orExpr().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", shouldQueries);
    
        return
                "{\"bool\": {" +
                        "\"should\": [" +
                            query +
                        "]" +
                "}}";
    }
    

    When creating an OR query we use the bool query with a should clause. Next we have to obtain the queries to include in the should clause: we obtain the orExpr items from the orQuery and for each orExpr we again call the visit method. This time we visit the orExpr node. This node does not contain important information for us, so we let the generated template method simply call visitChildren. orExpr nodes can contain a term or an andQuery. Let us have a look at visiting the andQuery first.

    @Override
    public String visitAndQuery(SearchdslParser.AndQueryContext ctx) {
        List<String> mustQueries = ctx.term().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", mustQueries);
        
        return
                "{" +
                        "\"bool\": {" +
                            "\"must\": [" +
                                query +
                            "]" +
                        "}" +
                "}";
    }
    

    Notice how closely this resembles the orQuery; the big difference is that we now use the bool query with a must part. We are almost there. The next step is the Term node. This node either contains words to transform into a match query, or it contains a quotedTerm. The next code block shows the visit method of a Term.

    @Override
    public String visitTerm(SearchdslParser.TermContext ctx) {
        if (ctx.quotedTerm() != null) {
            return visit(ctx.quotedTerm());
        }
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    
    private String obtainWords(List<TerminalNode> words) {
        if (words == null || words.isEmpty()) {
            return "";
        }
        List<String> foundWords = words.stream().map(TerminalNode::getText).collect(Collectors.toList());
        
        return String.join(" ", foundWords);
    }
    

    Notice that we first check whether the term contains a quotedTerm. If it does not, we obtain the words and combine them into one string. The final step is to visit the quotedTerm node.

    @Override
    public String visitQuotedTerm(SearchdslParser.QuotedTermContext ctx) {
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match_phrase\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    

    Notice that we parse this part into a match_phrase query; other than that it is almost the same as the term visitor. Finally we can generate the complete query.

    Example

    “multi search” && find && doit OR succeed && nothing

    {
      "query": {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  {
                    "match_phrase": {
                      "_all": "multi search"
                    }
                  },
                  {
                    "match": {
                      "_all": "find"
                    }
                  },
                  {
                    "match": {
                      "_all": "doit"
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "must": [
                  {
                    "match": {
                      "_all": "succeed"
                    }
                  },
                  {
                    "match": {
                      "_all": "nothing"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
    

    In the codebase on github there is also a Jackson JsonNode based visitor, if you don’t like the string based approach.

    That is about it. I am planning to extend the example further; if I add some interesting new concepts, I’ll get back to you with a part 3.


    Creating a search DSL

    As an (elastic)search expert, I regularly visit customers. For these customers I often do a short analysis of their search solution and I give advice about improvements they can make. It is always interesting to look at solutions customers come up with. At one of my most recent customers I noticed a search solution based on a very extensive search DSL (Domain Specific Language) created with Antlr. I knew about Antlr, but never thought about creating my own search DSL.

    To better understand the options of Antlr and to practice with creating my own DSL I started experimenting with it. In this blog post I’ll take you on my learning journey. I am going to create my own very basic search DSL.

    Specifying the DSL

    First we need to define the queries we would like our users to enter. Below are some examples:

    • tree – This is an easy one, just one word to search for
    • tree apple – Two words to look up
    • tree apple AND sell – Find matching content for tree apple, but also containing sell.
    • tree AND apple OR juice – Find matching content containing the terms tree and apple or containing the term juice.
    • “apple tree” OR juice – Find content having the terms apple and tree next to each other in the right order (Phrase query) or having the term juice.

    These are the combinations we need to make. In the next sections we setup our environment and I explain the basics of Antlr that you need to understand to follow along.

    Setting up Antlr for your project

    There are lots of resources about setting up your local Antlr environment. I personally learned most from tomassetti. I prefer to use Maven to gather the required dependencies. I also use the Maven Antlr plugin to generate the Java classes based on the lexer and grammar rules.

    I also installed Antlr using Homebrew, but you do not really need this for this blog post.

    You can find the project on Github: https://github.com/jettro/search-dsl

    I generally just load the Maven project into IntelliJ and get everything running from there. If you don’t want to use an IDE, you can also do this with pure Maven.

    proj_home #> mvn clean install
    proj_home #> mvn dependency:copy-dependencies
    proj_home #> java -classpath "target/search-dsl-1.0-SNAPSHOT.jar:target/dependency/*"  nl.gridshore.searchdsl.RunStep1
    

    Of course you can change the RunStep1 into one of the other three classes.

    Antlr introduction

    This blog post does not have the intention to explain all ins and outs of Antlr. But there are a few things you need to know if you want to follow along with the code samples.

    • Lexer – A program that takes a phrase and obtains tokens from it. Examples of lexer rules are AND, consisting of the characters ‘AND’ or the special characters ‘&&’, and WORD, consisting of upper or lowercase characters and numbers. Tokens coming out of a Lexer contain the type of the token as well as the characters matched by that token.
    • Grammar – Rules that make use of the Lexer to create the syntax of your DSL. The result is a parser that creates a ParseTree out of your phrase. For example, we have a grammar rule query that parses a phrase like tree AND apple into the following ParseTree. The Grammar rule is: query : term (AND term)+ ;.
    • ParseTree – Tree by Antlr using the grammar and lexer from the provided phrase. Antlr also comes with a tool to create a visual tree. See an example below. In this blog post we create our own parser of the tree, there are however two better alternatives. The first is using the classic Listener pattern. The other is the Visitor pattern.
      Antlr4 parse tree 1
    • Listener – Antlr generates some parent classes to create your own listener. The idea behind a listener is that you receive events when a new element is started and when the element is finished. This resembles how, for instance, the SAX parser works.
    • Visitor – Antlr generates some parent classes to create your own Visitors. With a visitor you start visiting your top level element, then you visit the children, that way you recursively go down the tree. In a next blog post we’ll discuss the visitor pattern in depth.

    Search DSL Basics

    In this section we are going to create the DSL in four small steps. For each step we have a StepXLexerRules.g4 and a StepXSearchDsl.g4 file containing the Antlr lexer and grammar rules. Each step also contains a Java file with the name RunStepX.

    Step 1

    In this step we want to write rules like:

    • apple
    • apple juice
    • apple1 juice
    lexer
    WORD        : ([A-z]|[0-9])+ ;
    WS          : [ \t\r\n]+ -> skip ;
    
    grammar
    query       : WORD+ ;
    

    In all the Java examples we’ll start the same. I’ll mention the rules here but will not go into depth in the other steps.

    Lexer lexer = new Step1LexerRules(CharStreams.fromString("apple juice"));
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    
    Step1SearchDslParser parser = new Step1SearchDslParser(commonTokenStream);
    Step1SearchDslParser.QueryContext queryContext = parser.query();
    
    handleWordTokens(queryContext.WORD());
    

    First we create the Lexer, which is generated by Antlr. The input is a stream of characters created using the class CharStreams. From the Lexer we obtain a stream of tokens, which is the input for the parser. The parser is also generated by Antlr. Using the parser we can obtain the queryContext. Notice the method query: this is the same name as the first grammar rule.

    In this basic example a query consists of at least one WORD and a WORD consists of upper and lower case characters and numbers. The output for the first step is:

    Source: apple
    WORDS (1): apple,
    Source: apple juice
    WORDS (2): apple,juice,
    Source: apple1 juice
    WORDS (2): apple1,juice,
    

    In the next step we are extending the DSL with an option to keep words together.

    Step 2

    In the previous step you got the option to search for one or multiple words. In this step we are adding the option to keep some words together by surrounding them with quotes. We add the following lines to the lexer and grammar.

    lexer
    QUOTE   : ["];
    
    grammar
    query               : term ;
    
    term                : WORD+|quotedTerm;
    quotedTerm          : QUOTE WORD+ QUOTE ;
    

    Now we can support queries like

    • apple
    • “apple juice”

    The addition to the lexer is QUOTE; the grammar becomes slightly more complex. The query is now a term, and a term can be multiple WORDs or a quotedTerm consisting of multiple WORDs surrounded by QUOTEs. In Java we have to check, on the termContext obtained from the queryContext, whether the term contains WORDs or a quotedTerm. That is what is shown in the next code block.

    Step2SearchDslParser.TermContext termContext = queryContext.term();
    handleTermOrQuotedTerm(termContext);
    
    private void handleTermOrQuotedTerm(Step2SearchDslParser.TermContext termContext) {
        if (null != termContext.quotedTerm()) {
            handleQuotedTerm(termContext.quotedTerm());
        } else {
            handleWordTokens(termContext.WORD());
        }
    }
    
    private void handleQuotedTerm(Step2SearchDslParser.QuotedTermContext quotedTermContext) {
        System.out.print("QUOTED ");
        handleWordTokens(quotedTermContext.WORD());
    }
    

    Notice how we check if the termContext contains a quotedTerm, just by checking if it is null. The output then becomes

    Source: apple
    WORDS (1): apple,
    Source: "apple juice"
    QUOTED WORDS (2): apple,juice,
    

    Time to take the next step; this time we make it possible to explicitly query for one term or another.

    Step 3

    In this step we make it optional for a term to match, as long as another term matches. Example queries are:

    • apple
    • apple OR juice
    • “apple juice” OR applejuice

    The change to the Lexer is just one new token type, OR. The grammar has to change as well: the query now needs to support either a term or an orQuery. The orQuery consists of a term followed, at least once, by OR and another term.

    lexer
    OR      : 'OR' | '||' ;
    
    grammar
    query   : term | orQuery ;
    orQuery : term (OR term)+ ;
    

    The handling in Java is straightforward now, again some null checks and handle methods.

    if (queryContext.orQuery() != null) {
        handleOrContext(queryContext.orQuery());
    } else {
        handleTermContext(queryContext.term());
    }
    

    The output of the program then becomes:

    Source: apple
    WORDS (1): apple,
    Source: apple OR juice
    Or query: 
    WORDS (1): apple,
    WORDS (1): juice,
    Source: "apple juice" OR applejuice
    Or query: 
    QUOTED WORDS (2): apple,juice,
    WORDS (1): applejuice,
    

    In the final step we want to make the OR complete by also adding an AND.

    Step 4

    In the final step for this blog post we introduce AND. With AND we can make more complicated combinations. What would you make of one AND two OR three OR four AND five? In my DSL the AND binds first, then the OR, so this becomes (one AND two) OR three OR (four AND five). A document therefore matches if it contains one and two, or four and five, or three. The Lexer only changes a bit; again we just add a token type for AND. The grammar has to introduce some new rules, so it is good to have an overview of the complete grammar.

    query               : term | orQuery | andQuery ;
    
    orQuery             : orExpr (OR orExpr)+ ;
    orExpr              : term|andQuery;
    
    andQuery            : term (AND term)+ ;
    term                : WORD+|quotedTerm;
    quotedTerm          : QUOTE WORD+ QUOTE ;
    

    As you can see, we introduced an orExpr, being a term or an andQuery. We changed an orQuery to become an orExpr followed by at least one combination of OR and another orExpr. The query now is a term, an orQuery or an andQuery. Some examples below.

    • apple
    • apple OR juice
    • apple raspberry OR juice
    • apple AND raspberry AND juice OR Cola
    • “apple juice” OR applejuice

    The Java code is getting a bit boring by now, so let us move straight to the output of the program.

    Source: apple
    WORDS (1): apple,
    Source: apple OR juice
    Or query: 
    WORDS (1): apple,
    WORDS (1): juice,
    Source: apple raspberry OR juice
    Or query: 
    WORDS (2): apple,raspberry,
    WORDS (1): juice,
    Source: apple AND raspberry AND juice OR Cola
    Or query: 
    And Query: 
    WORDS (1): apple,
    WORDS (1): raspberry,
    WORDS (1): juice,
    WORDS (1): Cola,
    Source: "apple juice" OR applejuice
    Or query: 
    QUOTED WORDS (2): apple,juice,
    WORDS (1): applejuice,
    

    Concluding

    That is it for now; of course this is not the most complicated search DSL, and you can most likely come up with other interesting constructs. The goal of this blog post was to get you underway. In the next blog post I intend to discuss and show how to create a visitor that builds a real elasticsearch query based on the DSL.