


Using Compound words in Elasticsearch

From one of our customers, we got a question about using compound words in queries. Compound words are essential in languages like Dutch and German. Some examples in Dutch are zomervakantie, kookboek and koffiekopje. When a user enters koffiekopje, we want to find documents containing koffie kop as well as koffiekop and of course koffiekopje. In Elasticsearch, and other Lucene based search engines, you can use the Compound Word Token Filter to accomplish this.

There are two different versions: the Hyphenation decompounder and the Dictionary decompounder. Reading the documentation, you’ll learn that you should always use the Hyphenation one. The hyphenation decompounder breaks terms up at their hyphenation points. By combining adjacent parts, we create candidate terms and check them against a provided dictionary. Each match gets added as an extra token, just like a synonym.

An example of hyphenation is: 
Koffiekopje -> kof - fie - kop - je
These hyphenation points could potentially result in the terms:
koffie, koffiekop, kop, kopje, koffiekopje.

As with other things involving the Dutch language, I was a bit skeptical about the results of the filter. Therefore, I decided to look at the implementation and try it out before using it for real. I had to create a class and do some subclassing to get access to protected fields and methods, but I was able to distill the mechanism used by the filter and the underlying Lucene classes.

The first step is the hyphenation part. For this part, you use the Lucene class HyphenationTree. The following piece of code shows the construction of the hyphenation tree using an XML file with hyphenation patterns from the Objects For Formatting Objects (OFFO) project.

public TryHyphenation(String hyphenRulesPath) {
    HyphenationTree tree = new HyphenationTree();
    try (InputStream hyphenRules = new FileSystemResource(hyphenRulesPath).getInputStream()) {
        InputSource source = new InputSource(hyphenRules);
        tree.loadPatterns(source);
    } catch (IOException e) {
        LOGGER.error("Problem while loading the hyphen file", e);
    }
    this.tree = tree;
}

The constructor gives us access to a HyphenationTree containing the rules. Now we can ask for the hyphenation points of any string we choose. The result is an array of numbers, each marking the start of a new part. The following code block turns these points into a list of strings containing the parts. Printing the hyphens is then just a matter of joining the strings with a separator.

public List<String> hyphenate(String sourceString) {
    Hyphenation hyphenator = this.tree.hyphenate(sourceString, 1, 1);
    int[] hyphenationPoints = hyphenator.getHyphenationPoints();
    List<String> parts = new ArrayList<>();
    for (int i = 1; i < hyphenationPoints.length; i++) {
        parts.add(sourceString.substring(hyphenationPoints[i-1], hyphenationPoints[i]));
    }
    return parts;
}

TryHyphenation hyphenation = new TryHyphenation(HYPHEN_CONFIG);
String sourceString = "Koffiekopje";
System.out.println("*** Find Hyphens:");
List<String> hyphens = hyphenation.hyphenate(sourceString);
String joinedHyphens = StringUtils.arrayToDelimitedString(
        hyphens.toArray(), " - ");
System.out.println(joinedHyphens);

Running the code produces the following output.

*** Find Hyphens:
Kof - fie - kop - je

The next step is finding the terms we want to search for based on the provided compound word. The Elasticsearch analyzer uses the Lucene class HyphenationCompoundWordTokenFilter to extract terms from compound words. We can use this class in our sample code as well, but we have to extend it to get access to the protected tokens variable. Therefore we create the following subclass.

private class AccessibleHyphenationCompoundWordTokenFilter extends HyphenationCompoundWordTokenFilter {
    public AccessibleHyphenationCompoundWordTokenFilter(TokenStream input, 
                                                        HyphenationTree hyphenator, 
                                                        CharArraySet dictionary) {
        super(input, hyphenator, dictionary);
    }

    public List<String> getTokens() {
        return tokens.stream().map(compoundToken -> compoundToken.txt.toString())
                .collect(Collectors.toList());
    }
}

With the following code, we can find the tokens from our dictionary that match the found parts or combinations of parts. The filter was not designed to be used this way, so the code looks a bit awkward, but it does help us understand what happens. We need a tokenizer; we use the standard tokenizer from Lucene. We also need a reader with access to the string that needs to be tokenized. Next, we create the CharArraySet containing our dictionary of terms to find. With the HyphenationTree, the tokenizer and the dictionary we create the AccessibleHyphenationCompoundWordTokenFilter. After calling the filter’s lifecycle methods, we can call our method that exposes the internal tokens variable.

public static final List<String> DICTIONARY = Arrays.asList("koffie", "kop", "kopje");
public List<String> findTokens(String sourceString) {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader(sourceString));

    CharArraySet charArraySet = new CharArraySet(DICTIONARY, true);
    AccessibleHyphenationCompoundWordTokenFilter filter = 
            new AccessibleHyphenationCompoundWordTokenFilter(tokenizer, tree, charArraySet);
    try {
        filter.reset();
        filter.incrementToken();
        filter.close();
    } catch (IOException e) {
        LOGGER.error("Could not tokenize", e);
    }
    return filter.getTokens();
}

Now we have the terms from the compound word that are also in our dictionary.

System.out.println("\n*** Find Tokens:");
List<String> tokens = hyphenation.findTokens(sourceString);
String joinedTokens = StringUtils.arrayToDelimitedString(tokens.toArray(), ", ");
System.out.println(joinedTokens);

*** Find Tokens:
Koffie, kop, kopje
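As an aside, the subclass is only needed to peek into the internal tokens list. The same terms can also be obtained by consuming the filter as a regular token stream, using only public Lucene API (TokenStream and CharTermAttribute). The sketch below reuses the tree and DICTIONARY fields from the sample above; note that, consumed this way, the stream also emits the original compound token itself.

public List<String> findTokensViaTokenStream(String sourceString) throws IOException {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader(sourceString));

    CharArraySet charArraySet = new CharArraySet(DICTIONARY, true);
    TokenStream stream = new HyphenationCompoundWordTokenFilter(tokenizer, tree, charArraySet);

    // Standard Lucene way of consuming a token stream.
    CharTermAttribute termAttribute = stream.addAttribute(CharTermAttribute.class);
    List<String> terms = new ArrayList<>();
    stream.reset();
    while (stream.incrementToken()) {
        terms.add(termAttribute.toString());
    }
    stream.close();
    return terms;
}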

Using this test class is nice, but now we want to use the filter within Elasticsearch. The following link is a reference to a gist containing the commands to try it out in the Kibana Console. Using this sample, you can play around and investigate the effect of the HyphenationCompoundWordTokenFilter. Don’t forget to install the Dutch hyphenation file in the config folder of Elasticsearch: Compound Word Token Filter installation

Gist containing the Java class and Kibana Console example


Migrating from GSA to Elasticsearch with ANTLR

Starting January 1st of 2019, the Google Search Appliance (GSA) is set to be end-of-life. For one of my clients, we have chosen Elasticsearch as the alternative from 2019 onwards. However, there was one problem: the current GSA solution is in use by several APIs that can’t be changed for various reasons. Therefore, it was up to us to replicate the GSA’s behaviour with the new Elasticsearch implementation and, come migration time, to swap out the GSA for Elasticsearch with as few functional changes as possible.

The GSA provides a range of functionality, some of which is easily implemented with other technologies. In our case, this of course included search functionality, but also other things such as website crawling. The part that I found most interesting, however, was the GSA’s ability to let users form queries in two small domain-specific languages (DSLs).

Queries in these two DSLs reach the GSA as query parameters on the GSA URL. The first DSL, specified with the query parameter q, has three “modes” of functionality:

  • Free text search, by simply putting some space separated terms
  • allintext search, a search query much like the free text search, but excluding fields such as metadata, anchors and URLs from the search
  • inmeta search, which can potentially do a lot, but in our case was restricted to searches on metadata of the form key=value.

The second DSL, specified with the query parameter partialFields, also provides searching on metadata. In this case, searches are of the form (key:value) and may be combined with three boolean operators:

  • .: AND
  • |: OR
  • -: NOT

An example query could then be (key1:value1)|(key2:value2).

In this blog, I will explain how to implement these two DSLs using ANTLR and I will show you how ANTLR enables us to separate the parsing of the DSL from our other application logic.

If this is your first time working with ANTLR, you may want to read two posts ([1], [2]) that have been posted on our blog earlier.

If you are looking for the complete implementation, then please refer to the Github repository.

Parsing the GSA DSL

Let us start with parsing the q DSL. I have split the ANTLR grammar into a separate parser and lexer file for readability.

The parser is as follows:

parser grammar GsaQueryParser;

options { tokenVocab=GsaQueryLexer; }

query   : pair (OR? pair)*  #pairQuery
        | TEXT+             #freeTextQuery;

pair    : IN_META TEXT+                 #inmetaPair
        | ALL_IN_TEXT TEXT (OR? TEXT)*  #allintextPair;

And the lexer is defined below:

lexer grammar GsaQueryLexer;

ALL_IN_TEXT : 'allintext:';
IN_META     : 'inmeta:';

OR          : 'OR';

TEXT        : ~(' '|'='|':'|'|'|'('|')')+;

WHITESPACE  : [ \t\r\n]+ -> skip;
IGNORED     : [=:|()]+ -> skip;

Note that the parser grammar reflects the two different ways that the q DSL can be used: by specifying pairs or by simply entering a free text query. The pairs can be separated by an OR operator. Furthermore, the terms following the allintext keyword may be separated by OR as well.

The definition of the partialFields DSL is somewhat different because it allows for query nesting and more boolean operators. Both the parser and the lexer are shown below, again in two separate files.

Parser:

parser grammar GsaPartialFieldsParser;

options { tokenVocab=GsaPartialFieldsLexer; }

query       : pair
            | subQuery;

subQuery    : LEFTBRACKET subQuery RIGHTBRACKET
            | pair (AND pair)+
            | pair (OR pair)+
            | subQuery (AND subQuery)+
            | subQuery (OR subQuery)+
            | subQuery AND pair
            | subQuery OR pair
            | pair AND subQuery
            | pair OR subQuery;

pair        : LEFTBRACKET KEYWORD VALUE RIGHTBRACKET        #inclusionPair
            | LEFTBRACKET NOT KEYWORD VALUE RIGHTBRACKET    #exclusionPair
            | LEFTBRACKET pair RIGHTBRACKET                 #nestedPair;

Lexer:

lexer grammar GsaPartialFieldsLexer;

AND         : '.';
OR          : '|';
NOT         : '-';

KEYWORD     : [A-z0-9]([A-z0-9]|'-'|'.')*;
VALUE       : SEPARATOR~(')')+;

SEPARATOR   : [:];
LEFTBRACKET : [(];
RIGHTBRACKET: [)];
WHITESPACE  : [\t\r\n]+ -> skip;

Note the usage of labels in both grammars, which in the above case allows me to easily distinguish the different types of key-value pairs: nested, inclusion or exclusion. Furthermore, there is a gotcha in the matching of the VALUE token: to make a clear distinction between KEYWORD and VALUE tokens, I’ve included the : as part of the VALUE token.

Creating the Elasticsearch query

Now that we have our grammars ready, it’s time to use the parse tree generated by ANTLR to construct corresponding Elasticsearch queries. I will post some source code snippets, but make sure to refer to the complete implementation for all details.

For both DSLs, I have chosen to walk the tree using the visitor pattern. We will start by reviewing the q DSL.

Creating queries from the q DSL

The visitor of the q DSL extends a BaseVisitor generated by ANTLR and will eventually return a QueryBuilder, as indicated by the generic type:

public class QueryVisitor extends GsaQueryParserBaseVisitor<QueryBuilder>

There are three cases that we can distinguish for this DSL: a free text query, an allintext query or an inmeta query. Implementing the free text and allintext query means extracting the TEXT token from the tree and then constructing a MultiMatchQueryBuilder, e.g.:

@Override
public QueryBuilder visitFreeTextQuery(GsaQueryParser.FreeTextQueryContext ctx) {
    String text = concatenateValues(ctx.TEXT());
    return new MultiMatchQueryBuilder(text, "album", "artist", "id", "information", "label", "year");
}

private String concatenateValues(List<TerminalNode> textNodes) {
    return textNodes.stream().map(ParseTree::getText).collect(joining(" "));
}

The fields that you use in this match query depend on the data that is in Elasticsearch – in my case some documents describing music albums.

An inmeta query requires us to extract both the field and the value, which we then use to construct a MatchQueryBuilder, e.g.:

@Override
public QueryBuilder visitInmetaPair(GsaQueryParser.InmetaPairContext ctx) {
    List<TerminalNode> textNodes = ctx.TEXT();

    String key = textNodes.get(0).getText().toLowerCase();
    textNodes.remove(0);
    String value = concatenateValues(textNodes);

    return new MatchQueryBuilder(key, value);
}

We can then combine multiple pairs by implementing the visitPairQuery method:

@Override
public QueryBuilder visitPairQuery(GsaQueryParser.PairQueryContext ctx) {
    BoolQueryBuilder result = new BoolQueryBuilder();
    ctx.pair().forEach(pair -> {
        QueryBuilder builder = visit(pair);
        if (hasOrClause(ctx, pair)) {
            result.should(builder);
            result.minimumShouldMatch(1);
        } else {
            result.must(builder);
        }
    });
    return result;
}

Based on the presence of OR clauses we either create a should or must boolean clause for our Elasticsearch query.
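The hasOrClause helper is not shown in the snippet above. A minimal sketch of how it could be implemented against the generated context classes might look like the method below; this is an assumption on my part, and the repository contains the actual implementation.

private boolean hasOrClause(GsaQueryParser.PairQueryContext ctx, GsaQueryParser.PairContext pair) {
    // Treat a pair as an OR clause when it is preceded by an OR token,
    // or when it is the first pair of a query that contains OR tokens at all.
    int index = ctx.children.indexOf(pair);
    if (index <= 0) {
        return !ctx.OR().isEmpty();
    }
    ParseTree previous = ctx.children.get(index - 1);
    return previous instanceof TerminalNode
            && ((TerminalNode) previous).getSymbol().getType() == GsaQueryLexer.OR;
}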

Creating queries from the partialFields DSL

The visitor of the partialFields DSL also extends a BaseVisitor generated by ANTLR and also returns a QueryBuilder:

public class PartialFieldsVisitor extends GsaPartialFieldsParserBaseVisitor<QueryBuilder>

There are three kinds of pairs that we can specify with this DSL (inclusion, exclusion or nested pair) and we can override a separate method for each option, because we labelled these alternatives in our grammar. A nested pair is simply unwrapped and then passed back to ANTLR for further processing:

@Override
public QueryBuilder visitNestedPair(GsaPartialFieldsParser.NestedPairContext ctx) {
    return visit(ctx.pair());
} 

The inclusion and exclusion query implementations are quite similar to each other:

@Override
public QueryBuilder visitInclusionPair(GsaPartialFieldsParser.InclusionPairContext ctx) {
    return createQuery(ctx.KEYWORD().getText(), ctx.VALUE().getText(), false);
}

@Override
public QueryBuilder visitExclusionPair(GsaPartialFieldsParser.ExclusionPairContext ctx) {
    return createQuery(ctx.KEYWORD().getText(), ctx.VALUE().getText(), true);
}

private QueryBuilder createQuery(String key, String value, boolean isExcluded) {       
    value = value.substring(1);

    if (isExcluded) {
        return new BoolQueryBuilder().mustNot(new MatchQueryBuilder(key, value));
    } else {
        return new MatchQueryBuilder(key, value).operator(Operator.AND);
    }
}

Remember that we included the : to help our token recognition? The code above is where we need to handle this by taking the substring of the value. What remains is to implement a way to handle the combinations of pairs and boolean operators. This is done by implementing the visitSubQuery method and you can view the implementation here. Based on the presence of an AND or OR operator, we apply must or should clauses, respectively.
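For reference, a minimal sketch of what visitSubQuery could look like is shown below. It relies on the fact that, within one subQuery context, the grammar only allows a single operator type; the implementation in the repository is the authoritative version.

@Override
public QueryBuilder visitSubQuery(GsaPartialFieldsParser.SubQueryContext ctx) {
    // A subquery wrapped in brackets is simply unwrapped.
    if (ctx.LEFTBRACKET() != null && ctx.pair().isEmpty() && ctx.subQuery().size() == 1) {
        return visit(ctx.subQuery(0));
    }

    // The presence of an OR token decides between should and must clauses.
    boolean isOr = !ctx.OR().isEmpty();
    BoolQueryBuilder result = new BoolQueryBuilder();
    List<ParseTree> children = new ArrayList<>();
    children.addAll(ctx.pair());
    children.addAll(ctx.subQuery());
    for (ParseTree child : children) {
        if (isOr) {
            result.should(visit(child)).minimumShouldMatch(1);
        } else {
            result.must(visit(child));
        }
    }
    return result;
}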

Examples

In my repository, I’ve included a REST controller that can be used to execute queries using the two DSL’s. Execute the following steps to follow along with the examples below:

  • Start an Elasticsearch instance at http://localhost:9200 (the application assumes v6.4.2)
  • Clone the repository: git clone https://github.com/markkrijgsman/migrate-gsa-to-elasticsearch.git && cd migrate-gsa-to-elasticsearch
  • Compile the repository: mvn clean install
  • Run the application: cd target && java -jar search-api.jar
  • Fill the Elasticsearch instance with some documents: http://localhost:8080/load
  • Start searching: http://localhost:8080/search

You can also use the Swagger UI to execute some requests: http://localhost:8080/swagger-ui.html. For each example I will list the URL for the request and the resulting Elasticsearch query that is constructed by the application.
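Before diving into the examples, this is roughly what happens under the hood when such a request comes in. The sketch below only shows the happy path for the q parameter and uses an illustrative class name (GsaQueryTranslator); assume the actual controller and service wiring in the repository differs in the details.

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class GsaQueryTranslator {

    public SearchRequest translate(String q) {
        // Lex and parse the raw query parameter with the generated classes.
        GsaQueryLexer lexer = new GsaQueryLexer(CharStreams.fromString(q));
        GsaQueryParser parser = new GsaQueryParser(new CommonTokenStream(lexer));

        // Walk the parse tree with the visitor to obtain an Elasticsearch query.
        QueryBuilder query = new QueryVisitor().visit(parser.query());

        // Wrap it in a search request against the index used in the examples.
        return new SearchRequest("rolling500")
                .source(new SearchSourceBuilder().query(query));
    }
}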

Get all albums mentioning Elton John
http://localhost:8080/search?q=Elton

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Elton",
            "fields": [
              "album",
              "artist",
              "id",
              "information",
              "label",
              "year"
            ],
            "type": "best_fields",
            "operator": "AND",
            "lenient": true,
          }
        }
      ]
    }
  }
}

Get all albums where Elton John or Frank Sinatra are mentioned
http://localhost:8080/search?q=allintext:Elton%20OR%20Sinatra

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "multi_match": {
                  "query": "Elton Sinatra",
                  "fields": [
                    "album",
                    "artist",
                    "id",
                    "information",
                    "label",
                    "year"
                  ],
                  "type": "best_fields",
                  "operator": "OR",
                  "lenient": true
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Note that the operator for the multi match query is now OR, where it was AND in the previous example.

Get all albums where the artist is Elton John
http://localhost:8080/search?partialFields=(artist:Elton)

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "artist": {
              "query": "Elton",
              "operator": "AND"
            }
          }
        }
      ]
    }
  }
}

Get all albums where Elton John is mentioned, but is not the artist
http://localhost:8080/search?partialFields=(-artist:Elton)&q=Elton

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must_not": [
              {
                "match": {
                  "artist": {
                    "query": "Elton",
                    "operator": "OR"
                  }
                }
              }
            ]
          }
        },
        {
          "multi_match": {
            "query": "Elton",
            "fields": [
              "album",
              "artist",
              "id",
              "information",
              "label",
              "year"
            ],
            "type": "best_fields",
            "operator": "AND",
            "lenient": true
          }
        }
      ]
    }
  }
}

Get all albums created by Elton John between 1972 and 1974 for the label MCA
http://localhost:8080/search?partialFields=(artist:Elton).(label:MCA)&q=inmeta:year:1972..1974

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "match": {
                  "artist": {
                    "query": "Elton",
                    "operator": "AND"
                  }
                }
              },
              {
                "match": {
                  "label": {
                    "query": "MCA",
                    "operator": "AND"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "range": {
                  "year": {
                    "from": "1972",
                    "to": "1974",
                    "include_lower": true,
                    "include_upper": true
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Please refer to the unit tests for a lot more examples.

Conclusions

As you can see, the usage of ANTLR allows us to specify fairly complex DSLs without compromising readability. We’ve cleanly separated the parsing of a user query from the actual construction of the resulting Elasticsearch query. All code is easily testable and makes little to no use of hard-to-understand regular expressions.

A good addition would be to add some integration tests to your implementation, which you can learn more about here. If you have any questions or comments, let me know!


Being relevant with Solr

OK, I have to admit, I am not (yet) a Solr expert. However, I have been working with Elasticsearch for years, and the fundamentals of obtaining relevant results are the same for Solr and Elasticsearch. For a customer, I am working on fine-tuning their search results using an outdated Solr version, 4.8.1 (yes, an upgrade is planned). Still, they want to improve their search results now. Using my search knowledge I started getting into Solr, and I liked what I saw: a query with matching algorithms, filters to limit the documents that need to be considered, and on top of that lots of boosting options. So many boosting options that I had to experiment a lot to get to the right results.

In this blog post, I am going to explain what I did with Solr, coming from an Elasticsearch background. I do not intend to create a complete guideline on how to use Solr; I’ll focus on the bits that surprised me and on tuning the edismax type of query.

A little bit of context

Imagine you are running a successful eCommerce website. Or even better, you have created a superb online shopping experience. With an excellent marketing strategy, you are generating lots of traffic to your website. But, yes there is a but, sales are not as expected. Maybe they are even a bit disappointing. You start evaluating the visits to your site using analytics. When going through your search analytics you notice that most of the executed searches do not result in a click, and therefore not in sales. It looks like the results shown to your visitors for their searches are far from optimal.

So we need better search results.

Getting to know your data

Before we start writing queries, we need to have an idea about our data. We need to create inverted indexes per field, or for combinations of fields, using different analyzers. We need to be able to search for parts of sentences, but also be able to boost matching sentences. We want to be able to find matches for combinations of fields. Imagine you sell books and you want people to be able to look for the latest book by Dan Brown, called Origin. Users might enter a search query like Dan Brown Origin. It might become a challenge if you have structured data like:

{
    "author": "Dan Brown",
    "title": "Origin"
}

How would you handle it if people want the latest Dan Brown? What if you want to help people choose by using the popularity of books based on ratings or sales? Or what if people want to look at all the books in the Harry Potter series? Of course, we need the right data to be able to serve our visitors with these new requirements. We also need a media_type field later on; with the media type we can, for example, filter on all ebooks. So the data becomes something like the following block.

{
    "id": "9781400079162",
    "author": "Dan Brown",
    "title": "Origin",
    "release_date": "2017-10-08T00:00:00Z",
    "rating": 4.5,
    "sales_pastyear": 239,
    "media_type": "ebook"
}

Ranking requirements

Based on analysis and domain knowledge we have the following thoughts translated into requirements for the ranking of search results:

  • Recent books are more important than older books
  • Books with a higher rating are more important than lower rated books
  • Unrated books are more important than low rated books
  • Books that are sold more often in the past year are more important than unsold books
  • Normal text matching rules should be applied

Mapping data to our index

In Solr, you create a schema.xml to map the expected data to specific types. You can also use the copyField functionality (comparable to Elasticsearch’s copy_to) to create new fields that are analyzed differently or that combine the provided fields. An example could be a field that contains all other searchable fields. In our case, we could create a field containing both the author and the title. This field is analyzed in the most optimal way for matching: we add a tokenizer, but also filters for lowercasing, stop words, diacritics, and compound words. We also have fields that are used mostly for boosting, with phrases, numbers or dates. We want fields like title and author to support phrase matches as well as full matches. With this, we get a few extra search requirements:

  • Documents of which the exact author or title matches the query should be more important
  • Documents of which the title contains the words in the query in the same order are more important

With these rules, we can start to create a query and apply our matching and boosting requirements.

The Query

Creating the query was my biggest surprise when moving to Solr. Another configuration mechanism is solrconfig.xml, the file that configures the Solr node. It gives you the option to create your own query endpoint that comes with lots of defaults. One thing we can do, for instance, is create an endpoint that automatically filters on ebooks only. Below you’ll find a sample of the config that does just this.

<requesthandler name="/ebook" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="wt">json</str> <!-- Return response as json -->
       <str name="fq">media_type:ebook</str> <!-- Filter on all items of media_type ebook -->
       <str name="qf">combined_author_title</str> <!-- Search in the field combined_author_title -->
     </lst>
  </requesthandler>

For our own query, we need other options that Solr provides, in the form of the edismax query parser. It comes by default with options to boost your results using phrases, but also with options for boosting on ratings, release dates, and so on.

Next, I’ll show you how this translates into the Solr configuration:

<requesthandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="wt">json</str>
       <str name="ps">2</str>
       <str name="mm">3</str>
       <str name="pf">author^4 title^2</str>
       <str name="pf2">author^4 title^2</str>
       <str name="pf3">author^4 title^2</str>
       <str name="bq">author_full^5 title_full^5</str>
       <str name="boost">product(div(def(rating,4),4),recip(ms(NOW/DAY,releasedate),3.16e-11,1,1),log(product(5,sales_pastyear)))</str>
       <str name="qf">combined_author_title</str>
       <str name="defType">edismax</str>
       <str name="lowercaseOperators">false</str>
     </lst>
  </requesthandler>

I am not going over all the different parameters. For multi-term queries we use phrase boosts; these are configured with pf, pf2, and pf3. The mm (minimum should match) parameter also applies to multi-term queries: it controls how many of the entered terms have to match. With mm set to 3, a query of three terms requires all of them to match. The edismax parser also supports AND/OR when you need more control over which terms should match; with lowercaseOperators set to false we prevent a lowercase and/or from being interpreted as a boolean operator.

With respect to boosting there is bq, whose score is added to the document score, while the function in the boost parameter is multiplied with it. Also notice that bq uses text-related boosts, while boost works with numeric fields.
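To get a feeling for what the boost function does: in Solr, recip(x,m,a,b) evaluates to a/(m*x+b), and with m around 3.16e-11 (roughly one divided by the number of milliseconds in a year) a book released today gets a recency factor of about 1, a one-year-old book about 0.5 and a two-year-old book about 0.33. The small, purely illustrative Java snippet below writes out the complete boost for values similar to the example document; it is only meant to show how the three factors behave and is of course not Solr code.

public class BoostIllustration {

    public static void main(String[] args) {
        // Values similar to the example ebook document; the age is an assumption.
        double rating = 4.5;                              // def(rating,4) falls back to 4 when missing
        double ageInMillis = 200L * 24 * 3600 * 1000;     // ms(NOW/DAY,releasedate), here ~200 days
        double salesPastYear = 239;

        double ratingFactor = rating / 4;                           // div(def(rating,4),4)
        double recencyFactor = 1 / (3.16e-11 * ageInMillis + 1);    // recip(ms(...),3.16e-11,1,1)
        double salesFactor = Math.log10(5 * salesPastYear);         // log(product(5,sales_pastyear))

        // The boost parameter multiplies the three factors into the score.
        double boost = ratingFactor * recencyFactor * salesFactor;
        System.out.printf("rating=%.2f recency=%.2f sales=%.2f boost=%.2f%n",
                ratingFactor, recencyFactor, salesFactor, boost);
    }
}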

That is about it for now. I think it is good to look at the differences between Solr and Elasticsearch. I like the way you define a query with Solr. Of course, you can do the same with Elasticsearch: its JSON API for creating a query is really flexible, but you have to build the constructs that Solr gives you out of the box yourself.


Elasticsearch instances for integration testing

In my latest project I have implemented all communication with my Elasticsearch cluster using the high level REST client. My next step was to set up and tear down an Elasticsearch instance automatically in order to facilitate proper integration testing. This article describes three different ways of doing so and discusses some of the pros and cons. Please refer to this repository for implementations of all three methods.

docker-maven-plugin

This generic Docker plugin allows you to bind the starting and stopping of Docker containers to Maven lifecycle phases. You specify two blocks within the plugin: configuration and executions. In the configuration block, you choose the image that you want to run (Elasticsearch 6.5.3 in this case), the ports that you want to expose, a health check and any environment variables. See the snippet below for a complete example:

<plugin>
    <groupId>io.fabric8</groupId>
    <artifactId>docker-maven-plugin</artifactId>
    <version>${version.io.fabric8.docker-maven-plugin}</version>
    <configuration>
        <imagePullPolicy>always</imagePullPolicy>
        <images>
            <image>
                <alias>docker-elasticsearch-integration-test</alias>
                <name>docker.elastic.co/elasticsearch/elasticsearch:6.5.3</name>
                <run>
                    <namingStrategy>alias</namingStrategy>
                    <ports>
                        <port>9299:9200</port>
                        <port>9399:9300</port>
                    </ports>
                    <env>
                        <cluster.name>integration-test-cluster</cluster.name>
                    </env>
                    <wait>
                        <http>
                            <url>http://localhost:9299</url>
                            <method>GET</method>
                            <status>200</status>
                        </http>
                        <time>60000</time>
                    </wait>
                </run>
            </image>
        </images>
    </configuration>
    <executions>
        <execution>
            <id>docker:start</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>start</goal>
            </goals>
        </execution>
        <execution>
            <id>docker:stop</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

You can see that I’ve bound the plugin to the pre- and post-integration-test lifecycle phases. By doing so, the Elasticsearch container will be started just before any integration tests are run and will be stopped after the integration tests have finished. I’ve used the maven-failsafe-plugin in order to trigger the execution of tests ending with *IT.java in the integration-test lifecycle phase.

Since this is a generic Docker plugin, there is no special functionality to easily install Elasticsearch plugins that may be needed during your integration tests. You could however create your own image with the required plugins and pull that image during your integration tests.

The integration with IntelliJ is also not optimal. When running an *IT.java class, IntelliJ will not trigger the correct lifecycle phases and will attempt to run your integration test without creating the required Docker container. Before running an integration test from IntelliJ, you need to manually start the container from the “Maven projects” view by running the docker:start command:

Maven Projects view in IntelliJ

After running, you will also need to run the docker:stop command to kill the container that is still running. If you forget to kill the running container and want to run a mvn clean install later on, it will fail, since the build will attempt to create a container on the same port – as far as I know, the plugin does not allow for random ports to be chosen.

Pros:

  • Little setup, only requires configuration of one Maven plugin

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • No out of the box functionality to install extra Elasticsearch plugins
  • Extra dependency in your build pipeline (Docker)
  • IntelliJ does not trigger the correct lifecycle phases

elasticsearch-maven-plugin

This second plugin does not require Docker and only needs some Maven configuration to get started. See the snippet below for a complete example:

<plugin>
    <groupId>com.github.alexcojocaru</groupId>
    <artifactId>elasticsearch-maven-plugin</artifactId>
    <version>${version.com.github.alexcojocaru.elasticsearch-maven-plugin}</version>
    <configuration>
        <version>6.5.3</version>
        <clusterName>integration-test-cluster</clusterName>
        <transportPort>9399</transportPort>
        <httpPort>9299</httpPort>
    </configuration>
    <executions>
        <execution>
            <id>start-elasticsearch</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>runforked</goal>
            </goals>
        </execution>
        <execution>
            <id>stop-elasticsearch</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Again, I’ve bound the plugin to the pre- and post-integration-test lifecycle phases in combination with the maven-failsafe-plugin.

This plugin provides a way of starting the Elasticsearch instance from IntelliJ in much the same way as the docker-maven-plugin: you can run the elasticsearch:runforked command from the “Maven projects” view. However, in my case this started the instance and then immediately exited. There is also no out-of-the-box possibility of setting a random port for your instance, although there are solutions to this at the expense of a somewhat more complex Maven configuration.

Overall, this is a plugin that seems to provide almost everything we need with a lot of configuration options. You can automatically install Elasticsearch plugins or even bootstrap your instance with data.

In practice I did have some problems using the plugin in my build pipeline. The build would sometimes fail while downloading the Elasticsearch ZIP, or in other cases while attempting to download a plugin. Your mileage may vary, but this was reason enough for me to keep looking for another solution, which brings me to plugin number three.

Pros:

  • Little setup, only requires configuration of one Maven plugin
  • No extra external dependencies
  • High amount of configuration possible

Cons:

  • No out of the box functionality to start the Elasticsearch instance on a random port
  • Poor integration with IntelliJ
  • Seems unstable

testcontainers-elasticsearch

This third plugin is different from the other two. It uses a Java testcontainer that you can configure through Java code. This gives you a lot of flexibility and requires no Maven configuration. Since there is no Maven configuration, it does require some work to make sure the Elasticsearch container is started and stopped at the correct moments.

In order to realize this, I have extended the standard SpringJUnit4ClassRunner class with my own ElasticsearchSpringRunner. In this runner, I have added a new JUnit RunListener named JUnitExecutionListener. This listener defines two methods testRunStarted and testRunFinished that enable me to start and stop the Elasticsearch container at the same points in time that the pre- and post-integration-test Maven lifecycle phases would. See the snippet below for the implementation of the listener:

public class JUnitExecutionListener extends RunListener {

    private static final String ELASTICSEARCH_IMAGE = "docker.elastic.co/elasticsearch/elasticsearch";
    private static final String ELASTICSEARCH_VERSION = "6.5.3";
    private static final String ELASTICSEARCH_HOST_PROPERTY = "spring.elasticsearch.rest.uris";
    private static final int ELASTICSEARCH_PORT = 9200;

    private ElasticsearchContainer container;
    private RunNotifier notifier;

    public JUnitExecutionListener(RunNotifier notifier) {
        this.notifier = notifier;
    }

    @Override
    public void testRunStarted(Description description) {
        try {
            if (System.getProperty(ELASTICSEARCH_HOST_PROPERTY) == null) {
                log.debug("Create Elasticsearch container");
                int mappedPort = createContainer();
                System.setProperty(ELASTICSEARCH_HOST_PROPERTY, "localhost:" + mappedPort);
                String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
                RestAssured.basePath = "";
                RestAssured.baseURI = "http://" + host.split(":")[0];
                RestAssured.port = Integer.parseInt(host.split(":")[1]);
                log.debug("Created Elasticsearch container at {}", host);
            }
        } catch (Exception e) {
            notifier.pleaseStop();
            throw e;
        }
    }

    @Override
    public void testRunFinished(Result result) {
        if (container != null) {
            String host = System.getProperty(ELASTICSEARCH_HOST_PROPERTY);
            log.debug("Removing Elasticsearch container at {}", host);
            container.stop();
        }
    }

    private int createContainer() {
        container = new ElasticsearchContainer();
        container.withBaseUrl(ELASTICSEARCH_IMAGE);
        container.withVersion(ELASTICSEARCH_VERSION);
        container.withEnv("cluster.name", "integration-test-cluster");
        container.start();
        return container.getMappedPort(ELASTICSEARCH_PORT);
    }
}

It will create an Elasticsearch Docker container on a random port for use by the integration tests. The best thing about having this runner is that it works perfectly fine in IntelliJ. Simply right-click and run your *IT.java classes annotated with @RunWith(ElasticsearchSpringRunner.class), and IntelliJ will use the listener to set up the Elasticsearch container. This allows you to automate your build pipeline while still keeping developers happy.
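An integration test using the runner could then look roughly like the sketch below. The class name, the Spring Boot test annotation and the health check are illustrative assumptions; the repository contains the actual tests. Because the listener points RestAssured at the started container, the test can call the Elasticsearch REST API directly.

@RunWith(ElasticsearchSpringRunner.class)
@SpringBootTest
public class ClusterHealthIT {

    @Test
    public void clusterIsReachable() {
        // The runner's listener has already started the container and configured
        // RestAssured, so we can simply call the Elasticsearch REST API.
        RestAssured.get("/_cluster/health")
                .then()
                .statusCode(200);
    }
}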

Pros:

  • Neat integration with both Java and therefore your IDE
  • Sufficient configuration options out of the box

Cons:

  • More complex initial setup
  • Extra dependency in your build pipeline (Docker)

In summary, all three of the above plugins are able to realize the goal of starting an Elasticsearch instance for your integration testing. For me personally, I will be using the testcontainers-elasticsearch plugin going forward. The extra Docker dependency is not a problem since I use Docker in most of my build pipelines anyway. Furthermore, the integration with Java allows me to configure things in such a way that it works perfectly fine from both the command line and the IDE.

Feel free to check out the code behind this article, play around with the integration tests that I’ve set up there and decide for yourself which plugin suits your needs best.


Setting up data analytics pipeline: the best practices

Data pipeline architecture example (the picture is courtesy of https://bit.ly/2K44Nk5) 1

In an analogy between data science and the automotive industry, data plays the role of crude oil, which is not yet ready for combustion. The data modeling phase is comparable to combustion in the engine, and data preparation is the refinery process turning crude oil into fuel that is ready for combustion. In this analogy, the data analytics pipeline includes all the steps from extracting the oil up to combustion, driving and reaching the destination (analogous to reaching the business goals). As you can imagine, the data (or oil in this analogy) goes through various transformations as it moves from one stage of the process to another. But the question is: what is the best practice in terms of data format and tooling? Although there are many tools, which can make the best practice very use-case specific, in general JSON is the best practice for the data format of communication, the lingua franca, and Python is the best practice for orchestration, data preparation, analytics and live production.

What is the common inefficiency and why does it happen?

The common inefficiency is overuse of tabular (CSV-like) data formats for communication, as a lingua franca. I believe data scientists still overuse structured data types for communication within the data analytics pipeline because of the standard dataframe-like formats offered by major analytics tools such as Python and R. Data scientists get used to the dataframe mentality and forget that tabular storage of data is a small-scale solution that is not optimized for communication; when it comes to bigger data sets, or the flexibility to add new fields to the data, dataframes and their tabular form are inefficient.

DataOps Pipeline and Data Analytics

A very important aspect of analytics that is ignored in some circumstances is going live and integrating with other systems. DataOps is about setting up the set of tools from capturing and storing data, up to analytics and integration; it falls into the interdisciplinary realm of DevOps, data engineering, analytics and software engineering. (Hereinafter I use data analytics pipeline and DataOps pipeline interchangeably.) The modeling part, and probably some parts of the data preparation phase, need a dataframe-like data format, but the rest of the pipeline is more efficient and robust if it is JSON native. JSON makes adding and removing fields easier and is a compact form for communication between modules.

The picture is courtesy of https://zalando-jobsite.cdn.prismic.io/zalando-jobsite/2ed778169b702ca83c2505ceb65424d748351109_image_5-0d8e25c02668e476dd491d457f605d89.jpg 2

The role of Python

Python is a great programming language used not only by the scientific community but also by application developers. It is ready to be used on the back-end, and by combining it with Django you can build full-stack web applications. Python has almost everything you need to set up a DataOps pipeline and is ready for integration and live production.

Python Example: transforming CSV to JSON and storing it in MongoDB

To show some capabilities of Python in combination with JSON, here is a simple example in which a dataframe is converted to JSON (a Python dictionary) and stored in MongoDB. MongoDB is an important database in today’s data storage landscape, as it is JSON native, storing data in a document format that brings high flexibility.

# Loading packages
from pymongo import MongoClient
import pandas as pd

# Connecting to the database
client = MongoClient('localhost', 27017)

# Creating database and schema
db = client.pymongo_test
posts = db.posts

# Defining a dummy dataframe
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])

# Transforming dataframe to a dictionary (JSON)
dic = df.to_dict()

# Writing to the database
result = posts.insert_one(dic)
print('One post: {0}'.format(result.inserted_id))

The above example shows Python’s ability to transform data from a dataframe to JSON and to connect to various tooling (MongoDB in this example) in a DataOps pipeline.

Recap

This article is an extension of my previous article on the future of data science (https://bit.ly/2sz8EdM), in which I sketched the future of data science and recommended that data scientists move towards full-stack. Once you have a full stack with various layers for DataOps / data analytics, JSON is the lingua franca between modules, bringing robustness and flexibility to this communication, and Python is the orchestrator of the various tools and techniques in the pipeline.


1: The picture is courtesy of https://cdn-images-1.medium.com/max/1600/1*8-NNHZhRVb5EPHK5iin92Q.png
2: The picture is courtesy of https://zalando-jobsite.cdn.prismic.io/zalando-jobsite/2ed778169b702ca83c2505ceb65424d748351109_image_5-0d8e25c02668e476dd491d457f605d89.jpg


Creating an Elastic Canvas for Twitter while visiting Elasticon 2018

The past week we visited Elasticon 2018 in San Francisco. In our previous blog post we wrote about the keynote and some of the more interesting new features of the Elastic Stack. In this blog post, we take one of the cool new products for a spin: Canvas. But what is Canvas?

Canvas is a composable, extendable, creative space for live data. With Canvas you can combine dynamic data, coming for instance from a query against Elasticsearch, with nice-looking graphs. You can also use tables and images and combine them with the data visualizations to create stunning, dynamic infographics. In this blog post, we create a Canvas about the tweets with the tag Elasticon during the last day of the conference last week.

Below is the canvas we are going to create. It contains a number of different elements. The top row contains a pie chart with the language of the tweets and a bar chart with the number of tweets per time unit, followed by the total number of tracked tweets during the second day of Elasticon. The next two elements use the sentiment of the tweets, which was obtained using IBM Watson. Byron wrote a basic integration with Watson; he will give more details in a next blog post. The pie chart shows the complete results, while the green smiley on the right shows the percentage of positive tweets out of all tweets that could be analyzed without an error and were not neutral.

Overview canvas

With the example in place, it is time to discuss how to create these canvases yourself: first some information about the installation, then a few of the general concepts, and finally sample code for the elements used.

Installing canvas

Canvas is part of Kibana; you have to install it as a plugin into Kibana. You do need to install X-Pack in Elasticsearch as well as in Kibana. The steps are well described on the installation page of Canvas. Beware though: installing the plugins in Kibana takes some time. They are working on improving this, but we have to deal with it for the moment.

Once everything is installed, open Kibana in your browser. At this point you could start creating the canvas, but you have no data yet, so you have to import some data first. We used Logstash with a Twitter input and an Elasticsearch output. I cannot go into too much detail or this blog post will be way too long; I might do that in a next blog post. For now, it is enough to know that we have an index called twitter that contains tweets.

Creating the canvas with the first element

When clicking on the Canvas tab we can create a new Workpad. A Workpad can contain one or more pages, and each page can contain multiple elements. A Workpad defines the size of the screen, so it is best to create it for a specific screen size. At Elasticon they had multiple monitors, some of them horizontal, others vertical. You can also choose a background color. These options can be found on the right side of the screen in the Workpad settings bar.

It is good to know that you can create a backup of your Workpad from the Workpads screen; there is a small download button on the right side. Restoring a Workpad is done by dropping the exported JSON into the dialog.

New work pad

Time to add our first element to the page. Use the plus sign at the bottom of the screen to add an element; you can choose from a number of element types. The first one we’ll try is the pie chart. When adding the pie chart, we immediately see data in the chart. Hmm, how come? We did not select any data. Canvas comes with a default data source that is used in all new elements, so you immediately see what the element looks like, which is ideal for playing around with all the options. Most options are available using the settings on the right. With the pie, you’ll see options for the slice labels and the slice angles. You can also see the Chart style and Element style. These configuration sections have a plus button, with which you can add options like the color palette and the text size and color. For the element, you can set a background color, border color, opacity and padding.

Add element

Next, we want to assign our own data source to the element. After adding our own data source we most likely have to change the parameters for the element as well; in this case, we have to change the slice labels and angles. Changing the data source is done using the Change Datasource button/link at the bottom. At the moment there are four data sources: demo data, demo prices, Elasticsearch query and Timelion query. I choose the Elasticsearch query, select the index, don’t use a specific query and select the fields I need. Selecting only the fields I need can speed up the element, as we only parse the data that we actually use. In this example, we only use the sentiment label.

Change data source

The last thing I want to mention here is the code view. After pushing the >_ Code button you’ll see a different view of your element, where you work with the underlying expression code. This is more powerful than the settings window, but with great power comes great responsibility: it is easy to break stuff here. The code is organized in different steps, and the output of each step is the input for the next step. In this specific example, there are five steps: first a filter step, next the data source, then a point series that is required for a pie chart, the pie itself, and finally the render step. If you change something using the settings, the code tab gets updated immediately. If I add a background color to the container, the render step becomes:

render containerStyle={containerStyle backgroundColor="#86d2ed"}

If you make changes in the code block, use the Run button to apply the changes. In the next sections, we will only work in this code tab, just because it is easier to show to you.

Code view

Adding more elements

The basics of the available elements and functions are documented here. We won’t go into detail for all the different elements we have added; some of them use the defaults, so you can easily add them yourself. The first one I do want to explain is the Twitter logo with the number of tweets in it. This is actually two different elements. The logo is a static image; the number is more interesting. It makes use of the escount function and the markdown element. Below is the code.

filters
 | escount index="twitter"
 | as num_tweets
 | markdown "{{rows.0.num_tweets}}" font={font family="'Open Sans', Helvetica, Arial, sans-serif" size=60 align="left" color="#ffffff" weight="undefined" underline=false italic=false}

The filters function is used to facilitate filtering (usually by time) using the special filter element. The next item is escount, which does what you expect: it counts the number of items in the provided index. You can also provide a query to limit the results, but we did not need that. The output of escount is a number, which is a problem when sending it to a markdown element: the markdown element only accepts a datatable. Therefore we have to use the as function, which accepts a number and changes it into a datatable. The markdown element accepts a table and exposes it as rows, so we use the rows to obtain the first row and, from that row, the column num_tweets. When playing with this element it is easy to remove the markdown line; Canvas will then render the table by default. Below is the output for only the first two lines, as well as the output after adding the third line (as num_tweets).

200

num_tweets
200

Next up are the text and the photo belonging to the actual tweets. The photo is a bit different from the Twitter logo as it is a dynamic photo. In the code below you can see that the image element does have a data URL attribute. We can use this attribute to get one cell from the provided data table. The getCell function has attributes for the row number as well as the name of the column.

esdocs index="twitter*" sort="@timestamp, desc" fields="media_url" count=5 query=""
 | image mode="contain" dataurl={getCell c="media_url" r=2}
 | render

With the text of the tweet, it is a bit different. Here we want to use the markdown element; however, we do not have a dataurl attribute, so we have to come up with a different strategy: if we want to obtain the third item, we select the top 3 and, from those, take the last item.

filters 
| esdocs index="twitter*" sort="@timestamp, desc" fields="text, user.name, created_at" query="" 
| head 3 
| tail 1 
| mapColumn column=created_at_formatted fn=${getCell created_at | formatdate 'YYYY-MM-DD HH:mm:ss'} 
| markdown "{{#each rows}}
**{{'user.name'}}** 

(*{{created_at_formatted}}*)

{{text}}
{{/each}}" font={font family="'American Typewriter', 'Courier New', Courier, Monaco, mono" size=18 align="right" color="#b83c6f" weight="undefined" underline=false italic=false}

The line that starts with mapColumn is a way to format the date. mapColumn adds a new column with the name provided by the column attribute and the value as the result of a function, which can be a chain of functions. In this case, we obtain the created_at column of the datatable and pass it to the formatdate function.

Creating the partly green smiley

The most complicated feature was the smiley that turns greener the more positive tweets we see. The positiveness of the tweets was determined using the IBM Watson interface. In the end, it is a combination of two images: one grey smiley and one green smiley. The green smiley is only revealed for a specific percentage; this is what the revealImage function does. First, we show the complete code.

esdocs index="twitter*" fields="sentiment_label" count=10000 
| ply by="sentiment_label" fn=${rowCount | as "row_count"} 
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="error"} then=false else=true}
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="neutral"} then=false else=true}
| staticColumn column="total" value={math "sum(row_count)"} 
| filterrows fn=${if {getCell "sentiment_label" | compare "eq" to="positive"} then=true else=false}
| staticColumn column="percent" value={math "divide(row_count, total)"} 
| getCell "percent" 
| revealImage image={asset "asset-488ae09a-d267-4f75-9f2f-e8f7d588fae1"} emptyImage={asset "asset-0570a559-618a-4e30-8d8e-64c90ed91e76"}

The first line is like we have seen before: select all documents from the twitter index. The second line does a kind of grouping of the rows: it groups by the values of sentiment_label, and the value per group is a row count, as specified by the function. If I remove all the other lines, we can see the output of just the ply function.

sentiment_label         row_count
negative                32
positive                73
neutral                 81
error                   14

The next steps filter out the rows for error and neutral, then we add a column for the total number of tweets with a positive or negative label. Now each row has this value. Check the following output.

sentiment_label         row_count       total
negative                32              105
positive                73              105

The next line removes the negative row; then we add a column with the percentage, obtain just one cell and call the revealImage function. This function takes a number as input and has attributes for the image as well as the empty or background image.

That gives us all the different elements on the canvas.

Concluding

We really like the options you have with Canvas. You can easily create good-looking dashboards that contain static resources, help texts and images, combined with dynamic data coming from Elasticsearch and, in the future, most likely other sources.

Of course, there are some improvements possible. It would be nice if we could also select doc_value fields, and being able to use aggregations in a query would be nice as well.

We will closely monitor its progress, as we believe this is going to be a very interesting technology to keep using in the future.


Elasticon 2018 Day 1

The past few days have been fantastic. Together with Byron I am visiting San Francisco. We have seen amazing sights, but yesterday the real reason we came got started: day one of Elasticon, which begins with the keynote showing us cool new features to come and sometimes some interesting news. In this blog post, I want to give you a short recap of the keynote and tell you what I think was important.

Elasticon opening

Rollups

With more and more data finding its way to elasticsearch, some indexes become too large for their purpose. We do not need to keep all the data of the past weeks and months; we just want to keep the data needed for the aggregations we show on a dashboard. Think about an index containing logs of our web server, with charts for HTTP status codes, response times, browsers, etc. Now you can create a rollup configuration that provides the aggregations we want to keep, a cron expression telling it when to run, and some additional information about how much data to keep. The result is a new index with a lot less data that you can keep around for your dashboards.
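
To give an idea, a rollup job configuration will look something like the sketch below. This is based on the keynote preview only; the index and field names are made up and the exact API may still change.

PUT _xpack/rollup/job/webserver_logs
{
  "index_pattern": "logs-*",
  "rollup_index": "logs_rollup",
  "cron": "0 0 2 * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "@timestamp", "interval": "1h" },
    "terms": { "fields": ["status_code", "browser"] }
  },
  "metrics": [
    { "field": "response_time", "metrics": ["avg", "max"] }
  ]
}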

More information about the rollup functionality can be found here.

Canvas

Last year at Elasticon, Canvas was already shown. Elastic continued with the idea and it is starting to look amazing. With Canvas you can create beautiful-looking dashboards that go a big step further than the standard dashboards in Kibana. You can customise almost everything you see. It comes with options to put an image on the background, a lot of color options and new sorts of data visualisations, integrated of course with elasticsearch. In a next blog post I’ll come up with a demo; it is looking very promising. Want to learn more about it? Check this blog post.

Kubernetes/Docker logging

One of the areas I still need to learn is the Docker / Kubernetes ecosystem. But if you are looking for a solution to monitor your complete Kubernetes platform, have a look at all the options that elastic has these days. Elastic has impressive support for monitoring all the running containers, and it comes with standard dashboards in Kibana. It now also has a dedicated space in Kibana called the Infra tab. More information about the options and how to get started can be found here.

Presenting Geo/Gis data

A very cool demo was given on how to present data on a map. The demo showed where all Elasticon attendees were coming from. The visual component offers the option of creating different layers. So you can add a layer that gives the different countries a color based on the number of attendees, another layer that shows the bigger cities people are coming from as small circles, yet another with a logo of the venue, and so on. Really interesting if you are into geodata. It all makes use of the Elastic Maps Service. If you want more information about this, you can find it here.

Elastic Site Search

Up till now the news was about new ways to handle your logs coming from application monitoring, infrastructure components and applications. We did not hear about new things around search until the new product called Elastic Site Search was shown, previously known as Swiftype. With Google declaring its Google Search Appliance end of life, this is a very interesting replacement. Working with relevance, synonyms and search analytics becomes a lot easier with this new product. More information can be found here.

Elastic cloud sizing

If you previously looked at the cloud options elastic offers, you might have noticed that choosing elastic nodes did not give you a lot of flexibility. When choosing the amount of required memory, you also got a fixed amount of disk space. With the upcoming release, you have a lot more flexibility when creating your cluster. You can configure different flavours of clusters, one of them being a hot-warm cluster: dedicated master nodes, hot nodes for recent indices with more RAM and faster disks, and warm nodes containing the older indices with bigger disks. This is a good improvement if you want to create a cluster in the cloud. More information can be found here.

Opening up X-Pack

Shay told a good story about creating a company that supports an open source product. Building a company on support alone is almost impossible in the long run, so they started working on commercial additions, now called X-Pack. The problem with these products was that the code was not available, which made it impossible to work with elastic on improving them. That is why they are now opening up their repositories. Beware, it is not becoming free software. You still need to pay, but it now becomes a lot easier to interact with elastic about how stuff works. Next to that, they are going to make it easier to work with the free parts of X-Pack: you just ask for a license once instead of every year again. And if I understood correctly, the download will contain the free features in easier installation packages. More information about the what and why in this blog post from Shay Banon.

Conference Party

Nice party, but I had to sign a contract that prohibits me from telling stories about it. I do plan on writing more blog posts the coming two days.


Tracing APIs: Combining Spring’s Sleuth, Zipkin & ELK

Tracing bugs, errors and the cause of lagging performance can be cumbersome, especially when functionality is distributed over multiple microservices.
In order to keep track of these issues, using an ELK stack (or any similar system) is already a big step forward in creating a clear overview of all the processes of a service and finding these issues.
Often bugs can be traced far more easily with ELK than with just a plain log file – if one is even available.
This approach can still be optimised; for example, you may want to see the trace logging for one specific event only. (more…)


A fresh look at Logstash

Soon after the release of elasticsearch it became clear that elasticsearch was good at more than providing search. It turned out that it could be used to store logs very effectively, and that is what logstash started using elasticsearch for. Logstash contained standard parsers for Apache httpd logs, file monitoring plugins to obtain the logs, plugins to extend and filter the content, and plugins to send the content to elasticsearch. That is Logstash in a nutshell, back in the day. Of course the logs also had to be shown, therefore a tool called Kibana was created. Kibana was a nice tool to create highly interactive dashboards to show and analyse your data. Together they became the famous ELK suite. Nowadays we have a lot more options in all these tools. We have the Ingest node in elasticsearch to pre-process documents before they move into elasticsearch, we have beats to monitor files, databases, machines, etc. And we have very nice new Kibana dashboards. Time to re-investigate what the combination of Logstash, Elasticsearch and Kibana can do. In this blog post I’ll focus on Logstash.

X-Pack

As the company elastic has to make some money as well, they have created a product called X-Pack. X-Pack has a lot of features that sometimes span multiple products. There is a security component; by using this you can make users log in when using Kibana and secure your content. Other interesting parts of X-Pack are machine learning, graph and monitoring. Parts of X-Pack can be used free of charge, although you do need a license. For other parts you need a paid license. I personally like the monitoring part, so I regularly install X-Pack. In this blog post I’ll also investigate the X-Pack features for Logstash. I’ll focus on out-of-the-box functionality and mostly on what all these nice new things like monitoring and pipeline viewing bring us.

Using the version 6 release candidate

As elastic has already given us an RC1 of their complete stack, I’ll use this one for the evaluation. Beware though, this is still a release candidate, so it is not production ready.

What does Logstash do

If you have never really heard about Logstash, let me give you a very short introduction. Logstash can be used to obtain data from a multitude of different sources, then filter, transform and enrich the data, and finally store it in again a multitude of destinations. Example data sources are relational databases, files, queues and websockets. Logstash ships with a large number of filter plugins; with these we can process data to exclude some fields. We can also enrich data, look up information about IP addresses, or look up records belonging to an id in for instance elasticsearch or a database. After the lookup we can add data to the document or event that we are handling before sending it to one or more outputs. Outputs can be elasticsearch or a database, but also queues like Kafka or RabbitMQ.

In later releases logstash started to add more features that a tool handling large amounts of data over longer periods needs. Things like monitoring and clustering of nodes were introduced, as well as persisting incoming data to disk. By now logstash, in combination with Kibana and Elasticsearch, is used by very large companies but also by a lot of start-ups to monitor their servers and handle all sorts of interesting data streams.

Enough of this talk, let us get our hands dirty. The first step is to install everything on our developer machines.

Installation

I’ll focus on the developer machine; if you want to install it on a server, please refer to the extensive Logstash documentation.

First download the zip or tar.gz file and extract it to a convenient location. Now create a folder where you can store the configuration files. To keep the files small and to show you that you can split them, I create three different files in this folder: input.conf, filters.conf and output.conf. The most basic configuration is one with stdin for input, no filters and stdout for output. Below are the contents of the input and output files; the filters file stays empty for now.

input {
	stdin{}
}
output { 
	stdout { 
		codec => rubydebug
	}
}

Time to start logstash. Step into the downloaded and extracted folder with the logstash binaries and execute the following command.

bin/logstash -r -f ../logstashblog/

The -r flag can be used during development to reload the configuration on change. Beware, this does not work with the stdin plugin. With -f we tell logstash to load a configuration file or directory, in our case a directory containing the three mentioned files. When logstash is ready it will print something like this:

[2017-10-28T19:00:19,511][INFO ][logstash.pipeline        ] Pipeline started {"pipeline.id"=>"main"}
The stdin plugin is now waiting for input:
[2017-10-28T19:00:19,526][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}

Now you can type something and the result is the created document or event that went through the almost empty pipeline. The thing to notice is that we now have a field called message containing the text we entered.

Just some text for input
{
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
    "@timestamp" => 2017-10-28T17:02:18.185Z,
       "message" => "Just some text for input"
}

Now that we know it is working, I want you to have a look at the monitoring options you have available using the rest endpoint.

http://localhost:9600/

{
  "host": "Jettros-MBP.fritz.box",
  "version": "6.0.0-rc1",
  "http_address": "127.0.0.1:9600",
  "id": "20290d5e-1303-4fbd-9e15-03f549886af1",
  "name": "Jettros-MBP.fritz.box",
  "build_date": "2017-09-25T20:32:16Z",
  "build_sha": "c13a253bb733452031913c186892523d03967857",
  "build_snapshot": false
}

You can use the same url with different endpoints to get information about the node, the plugins, stats and hot threads:
http://localhost:9600/_node
http://localhost:9600/_node/plugins
http://localhost:9600/_node/stats
http://localhost:9600/_node/hot_threads

It becomes a lot more fun if we have a UI, so let us install X-Pack into Logstash. Before we can run logstash with monitoring enabled, we need to install elasticsearch and kibana with X-Pack installed into them as well. Refer to the X-Pack documentation on how to do that.

The basic commands to install X-Pack into elasticsearch and kibana are very easy. For now I disable security by adding the following line to both kibana.yml and elasticsearch.yml: xpack.security.enabled: false.
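
For reference, the plugin install commands for the 6.0 release candidates look roughly like this (treat them as a sketch and check the X-Pack documentation for your exact version):

bin/elasticsearch-plugin install x-pack
bin/kibana-plugin install x-pack
bin/logstash-plugin install x-pack

After installing X-Pack into Logstash as well, we have to add the following lines to the logstash.yml file in the config folder: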

xpack.monitoring.elasticsearch.url: ["http://localhost:9200"] 
xpack.monitoring.elasticsearch.username:
xpack.monitoring.elasticsearch.password:

Notice the empty username and password; this is required when security is disabled. Now move over to Kibana, check the monitoring tab (the heart-shaped figure) and click on logstash. In the first screen you can see the events; the count could still be zero, so please enter some events. Now move to the pipeline tab. Of course with our basic pipeline this is a bit silly, but imagine what it will show later on.

[Screenshot: the Logstash pipeline viewer in the Kibana monitoring UI]

Time to get some real input.

Import the Signalmedia dataset

Signalmedia has provided a dataset you can use for research. More information about the dataset and how to obtain it can be found here. The dataset contains exactly 1 million news documents. You can download it as a single file that contains one JSON document per line. Each JSON document has the following format:

{
   "id": "a080f99a-07d9-47d1-8244-26a540017b7a",
   "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
   "title": "Pay up or face legal action: DBKL",
   "media-type": "News",
   "source": "My Sinchew",
   "published": "2015-09-15T10:17:53Z"
}

We want to import this big file with all the JSON documents as separate documents into elasticsearch using logstash. The first step is to create a logstash input. Use the path to point to the file. We can use the logstash file plugin to load the file, tell it to start at the beginning and mark each line as a JSON document. The file plugin has more options you can use. It can also handle rolling files that are used a lot in logging.

input {
	file {
        path => "/Volumes/Transcend/signalmedia-1m.jsonl"
        codec => "json"
        start_position => beginning 
    }
}

That is it, with the stdout plugin and the rubydebug codec this would give the following output.

{
          "path" => "/Volumes/Transcend/signalmedia-1m.jsonl",
    "@timestamp" => 2017-10-30T18:49:45.948Z,
      "@version" => "1",
          "host" => "Jettros-MBP.fritz.box",
            "id" => "a080f99a-07d9-47d1-8244-26a540017b7a",
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Notice that besides the fields we expected (id, content, title, media-type, source and published) we also got some additional fields. Before sending this to elasticsearch we want to clean it up; we do not need path, host, @timestamp and @version. There is also something special about the id field. We want to use it to create the document in elasticsearch, but we do not want to add it to the document itself. If we need the value of id in the output plugin later on, without adding it as a field to the document, we can move it to the @metadata object. That is exactly what the first part of the filter does. The second part removes the fields we do not need.

filter {
	mutate {
		copy => {"id" => "[@metadata][id]"}
	}
	mutate {
		remove_field => ["@timestamp", "@version", "host", "path", "id"]
	}
}

With these filters in place the output of the same document would become:

{
        "source" => "My Sinchew",
     "published" => "2015-09-15T10:17:53Z",
         "title" => "Pay up or face legal action: DBKL",
    "media-type" => "News",
       "content" => "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ..."
}

Now the content is ready to be sent to elasticsearch, so we need to configure the elasticsearch output plugin. When sending data to elasticsearch you first need to think about creating the index and the mapping that goes with it. In this example I am going to create an index template. I am not going to explain a lot about the mappings as this is not an elasticsearch blog, but with the following code we insert the mapping template when connecting to elasticsearch and we can insert all documents. Do look at the way the document_id is created. Remember we talked about @metadata and how we copied the id field into it? This is the reason why we did it: we now use that value as the id of the document when inserting it into elasticsearch.

output {
	elasticsearch {
		index => "signalmedia"
		document_id => "%{[@metadata][id]}"
		document_type => "doc"
		manage_template => "true"
		template => "./signalmedia-template.json"
		template_name => "signalmediatemplate"
	}
	stdout { codec => dots }
}

Notice there are two outputs configured: the elasticsearch output of course, but also a stdout. This time not with the rubydebug codec, as that would be way too verbose. We use the dots codec instead, which prints a dot for each document it processes.

For completeness I also want to show the mapping template. In this case I positioned it in the root folder of the logstash binary, usually this would of course be an absolute path somewhere else.

{
  "index_patterns": ["signalmedia"],
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 3
  },
  "mappings": {
    "doc": {
      "properties": {
        "source": {
          "type": "keyword"
        },
        "published": {
          "type": "date"
        },
        "title": {
          "type": "text"
        },
        "media-type": {
          "type": "keyword"
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Now we want to import all the million documents and have a look at the monitoring along the way. Let’s do it.

[Screenshots: Logstash monitoring and pipeline view in Kibana during the import]

Running a query

Of course we have to prove the documents are now available in elasticsearch. So let’s execute one of my favourite queries, one that makes use of the new significant text aggregation. First the request and then parts of the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "netherlands"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

Just a very small part of the response; I stripped out a lot of the elements to make it easier to read. Good to see dutch show up as a significant word when searching for the netherlands, and of course geenstijl.

"buckets": [
  {"key": "netherlands","doc_count": 527},
  {"key": "dutch","doc_count": 196},
  {"key": "mmsi","doc_count": 7},
  {"key": "herikerbergweg","doc_count": 4},
  {"key": "konya","doc_count": 14},
  {"key": "geenstijl","doc_count": 3}
]

Concluding

Good to see the nice UI options in Kibana; the pipeline viewer is very useful. In a next blog post I’ll be looking at Kibana and all the new and interesting things in there.


Elasticsearch 6 is coming

For some time now, elastic has been releasing pre-releases of the new major version, elasticsearch 6. At this moment the latest edition is already RC1, so it is time to start thinking about migrating to the latest and greatest. What backwards compatibility issues will you run into and what new features can you start using? This blog post gives a summary of the items that are most important to me, based on the projects that I do. First we’ll have a look at the breaking changes, then we move on to new features and interesting upgrades.

Breaking changes

Most of the breaking changes come from the elasticsearch documentation that you can of course also read yourself.

Migrating indexes from previous versions

As with all major releases, only indexes created in the prior major version can be migrated automatically. So if you have an index created in 2.x, migrated it to 5.x and now want to start using 6.x, you have to use the reindex API to first reindex it into a new 5.x index before migrating.
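
A minimal sketch of such a reindex call, run on the 5.x cluster with hypothetical index names:

POST _reindex
{
  "source": { "index": "myindex-from-2x" },
  "dest": { "index": "myindex-5x" }
}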

Index types

In elasticsearch 6 the first step is taken towards indexes without types: a new index can only contain a single type, while indexes migrated from 5.x can keep using multiple types. Starting with elasticsearch 5.6 you can already prevent people from creating indexes with multiple types, which will make it easier to migrate to 6.x when it becomes available. You do this by applying the following configuration option:

index.mapping.single_type: true

More reasoning about why the types need to be removed can be found in the elasticsearch documentation on the removal of types. Also, if you are into parent-child relationships in elasticsearch and are curious what the implications of not being able to use multiple types are, check the documentation page on parent-join. Yes, we will get joins in elasticsearch :-), though with very limited use.

Java High Level REST Client

This was already introduced in 5.6, but it is still good to know, as this will be the replacement for the Transport client. As you might know I am also creating some code to use in Java applications on top of the Low Level REST client for Java, which is also being used by this new client. More information about my work can be found here: part 1 and part 2.

Uniform response for create/update and delete requests

At the moment a create request returns a response field created with the value true/false, and a delete request returns found true/false. If you are parsing the response and using these fields, you can no longer do so. Use the result field instead. This will have the value created or updated in case of a create request and deleted or not_found in case of a delete request.
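
For illustration, a hypothetical index request in 6.x and the abbreviated response showing the result field:

PUT my_index/doc/1
{
  "title": "some document"
}

{
  "_id": "1",
  "_version": 1,
  "result": "created"
}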

Translog change

The translog is used to keep documents that have not yet been flushed to disk by elasticsearch. In prior releases the translog files were removed once elasticsearch had performed a flush. However, due to optimisations made for recovery, having the translog around can speed up the recovery process. Therefore the translog is now kept for 12 hours by default, with a maximum size of 512 MB.
More information about the translog can be found here: Translog.
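
If you want to tune this per index, the relevant settings look like the sketch below; the values shown are the defaults.

PUT my_index/_settings
{
  "index.translog.retention.age": "12h",
  "index.translog.retention.size": "512mb"
}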

Java Client

In a lot of Java projects the Java client is used; I have used it myself for numerous projects. However, with the introduction of the High Level REST client, Java projects should move away from the Transport Client. If you want or need to keep using it for now, be aware that some packages have changed and some methods have been removed. The one I used the most is the order for aggregations, think about Terms.Order and Histogram.Order. They have been replaced by BucketOrder.

Index mappings

There are two important changes that can affect your way of working with elastic. The first is the way booleans are handled. In indexes created in version 6, a boolean accepts only two values: true and false. All other values will result in an exception.

The second change is the _all field. In prior versions, by default an _all field was created in which the values of all fields were copied as strings and analysed with the standard analyser. This field was used by queries like the query_string. There was however a performance penalty, as we had to analyse and index a potentially big field. Soon it became a best practice to disable the field. In elasticsearch 6 the field is disabled by default and it cannot be enabled for indices created with elasticsearch 6. If you still use the query_string query, it is now executed against each field individually. You should be very careful with the query_string query. It comes with a lot of power: users get a lot of options to create their own query. But with great power comes great responsibility. They can create very heavy queries, and they can create queries that break without a lot of feedback. More information about the query_string. If you still want to give your users more control, but the query_string query is one step too far, think about creating your own search DSL. Some ideas can be found in my previous blog posts: Creating a search DSL and Part 2 of creating a search DSL.
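
As a small, hypothetical example of a query_string query in 6.x, which now expands over the individual fields instead of using _all:

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "(elasticsearch OR logstash) AND title:release"
    }
  }
}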

Booting elasticsearch

Some things changed with the startup options. You can no longer configure the user elasticsearch runs with if you use the deb or rpm packages, and the location of the configuration files is now set differently. You now have to export the path where all configuration files (elasticsearch.yml, jvm.options and log4j2.properties) can be found, by exposing an environment variable ES_PATH_CONF containing the path to the config folder. I use this regularly on my local machine. As I often have multiple projects running with different versions of elasticsearch, I have set up a structure where I put my config files in folders separate from the elasticsearch distributable. Find the structure in the image below. In the beginning I just copy the config files to my project-specific folder. When I start the project with the script startNode.sh, the following script is executed.

Elastic folder structure

#!/bin/bash

CURRENT_PROJECT=$(pwd)

export ES_PATH_CONF=$CURRENT_PROJECT/config

DATA=$CURRENT_PROJECT/data
LOGS=$CURRENT_PROJECT/logs
REPO=$CURRENT_PROJECT/backups
NODE_NAME=Node1
CLUSTER_NAME=playground

BASH_ES_OPTS="-Epath.data=$DATA -Epath.logs=$LOGS -Epath.repo=$REPO -Enode.name=$NODE_NAME -Ecluster.name=$CLUSTER_NAME"

ELASTICSEARCH=$HOME/Development/elastic/elasticsearch/elasticsearch-6.0.0-rc1

$ELASTICSEARCH/bin/elasticsearch $BASH_ES_OPTS

Now when you need additional configuration options, add them to the elasticsearch.yml. If you need more memory for the specific project, change the jvm.options file.

Plugins

When indexing PDF or Word documents, a lot of you out there have been using the mapper-attachments plugin. This was already deprecated, and now it has been removed. You can switch to the ingest attachment plugin. Never heard about Ingest? Ingest can be used to pre-process documents before they are indexed by elasticsearch. It is a lightweight variant of Logstash, running within elasticsearch. Be warned though that plugins like the ingest attachment processor can be heavy on your cluster, so it is wise to have a separate node for Ingest. Curious about what you can do to ingest the contents of a PDF? The next few steps show you the commands to create the ingest pipeline, send a document to it and obtain it again or create a query.

First create the ingest pipeline.

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

Now when indexing a document containing the attachment as a base64 encoded string in the field data we need to tell elasticsearch to use a pipeline. Check the parameter in the url: pipeline=attachment. This is the name used when creating the pipeline.

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": ""
}

We could stop here, but how do you get base64 encoded input from, for instance, a PDF? On Linux and the Mac you can use the base64 command for that. Below is a script that reads a specific PDF and creates a base64 encoded string out of it. This string is then pushed to elasticsearch.

#!/bin/bash

pdf_doc=$(base64 ~/Desktop/Programma.pdf)

curl -XPUT "http://localhost:9200/my_index/my_type/my_id?pipeline=attachment" -H 'Content-Type: application/json' -d '{"data" : "'"$pdf_doc"'"}'

Scripts

If you are heavy into scripting in elasticsearch you need to check a few things. Changes have been made to the use of the lang attribute when obtaining or updating scripts: you cannot provide it any more. Also, support for languages other than Painless has been removed.

Search and query DSL changes

Most of the changes in this area are very specific. I am not going to sum them all up; please check the original documentation. Some of them I do want to mention, as they are important to me.

  • If you are constructing queries and can end up with an empty query, you can no longer provide an empty object { }. You will get an exception if you keep doing it.
  • Bool queries had a disable_coord parameter; with this you could tell the scoring function not to penalise the score for missing search terms. This option has been removed.
  • You could transform a match query into a match_phrase query by specifying a type. This is no longer possible; you should just create a match_phrase query if you need it (see the sketch right after this list). The slop parameter has therefore also been removed from the match query.
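
A minimal sketch of the replacement, with hypothetical index and field names:

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "elastic stack",
        "slop": 1
      }
    }
  }
}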

Calculating the score

In the beginning of elasticsearch, the score for a document for a given query was calculated using an adjusted formula for TF/IDF. It turned out that for fields containing smaller amounts of text TF/IDF was less than ideal, therefore the default scoring algorithm was replaced by BM25. Moving away from TF/IDF to BM25 was the topic for version 5. Now with 6 they have removed two mechanisms from the scoring: Query Normalization and Coordination Factors. Query Normalization was always hard to explain during trainings. It was an attempt to normalise scores so that different queries could be compared. However, it did not work, and you still should not compare scores of different queries. The Coordination Factor was more of a penalty: when searching for multiple terms and not all of them were found, the coordination factor lowered the score. You could easily see this when using the explain API.

That is it for the breaking changes. Again, there are more changes that you might want to investigate if you are really into all the elasticsearch details; have a look at the original documentation.

Next up, cool new features

Now let us zoom in on some of the newer features or interesting upgrades.

Sequence Numbers

Sequence Numbers are now assigned to all index, update and delete operations. Using this number, a shard that went offline for a moment can ask the primary shard for all operations after a certain sequence number. If the translog is still available (remember that we mentioned in the beginning that the translog is now kept around for 12 hours or up to 512 MB by default), the missing operations can be sent to the shard, preventing a complete copy of all the shard’s contents.

Test Normalizer using analyse endpoint

One of the most important parts of elasticsearch is configuring the mapping for your documents: how do you adjust the terms that you can search for based on the provided text? If you are not sure and you want to try out a specific combination of tokeniser and filters, you can use the analyze endpoint. Have a look at the following code sample and response, where we try out a whitespace tokeniser with a lowercase filter.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "coenradie",
      "start_offset": 7,
      "end_offset": 16,
      "type": "word",
      "position": 1
    }
  ]
}

As you can see we now get two tokens, and the uppercase characters are replaced by their lowercase counterparts. What if we do not want the text to become two terms, but want it to stay one term, while still replacing the uppercase characters with their lowercase counterparts? This was not possible in the beginning. However, with the introduction of the normalizer, a special analyser for fields of type keyword, it became possible. In elasticsearch 6 we now have the functionality to use the analyze endpoint for normalizers as well. Check the following code block for an example.

PUT people
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "normalizer": {
        "name_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

GET people/_analyze
{
  "normalizer": "name_normalizer",
  "text": ["Jettro Coenradie"]
}

{
  "tokens": [
    {
      "token": "jettro coenradie",
      "start_offset": 0,
      "end_offset": 16,
      "type": "word",
      "position": 0
    }
  ]
}

LogLog-Beta

Ever heard about HyperLogLog or even HyperLogLog++? Well then you must be happy with LogLog-Beta. Some background: elasticsearch comes with a Cardinality Aggregation, which can be used to calculate, or better, estimate the number of distinct values. If we wanted an exact value, we would have to keep a map containing all unique values, which would require an extensive amount of memory. You can specify a threshold under which the count of unique values will be close to exact, but the maximum value for this threshold is 40000. Before, elasticsearch used the HyperLogLog++ algorithm to estimate the number of unique values. With the new algorithm, called LogLog-Beta, you get better results with lower error margins and still the same performance.
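
For reference, a cardinality aggregation with the optional precision_threshold; the index and field names are hypothetical:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 3000
      }
    }
  }
}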

Significant Text Aggregation

For some time the Significant Terms Aggregation has been available. The idea behind this aggregation is to find terms that are common to a specific scope and less common in a more general scope. Imagine we want to find, from logs of page visits, those users of our website that are more common in the set of orders than in the overall set of page visits; you cannot find them by just counting the number of orders. In prior versions this was already possible on terms, that is, not-analysed fields. By enabling fielddata or doc_values you could use small analysed fields, but for larger text fields this was a performance problem. Now with the Significant Text Aggregation we can overcome this problem. It also comes with interesting functionality to deduplicate text (think about emails with the original text in a reply, or retweets).

Sounds a bit too vague? OK, let’s look at an example. The elasticsearch documentation uses a dataset from Signal Media. As it is an interesting dataset to work with, I will also use it, and you can try it out yourself as well. I downloaded the file and imported it into elasticsearch using logstash; this gist should help you. Now on to the query and the response.

GET signalmedia/_search
{
  "query": {
    "match": {
      "content": "rain"
    }
  },
  "aggs": {
    "my_sampler": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  },
  "size": 0
}

So we are looking for documents with the word rain. Within these documents we are going to look up terms that occur more often than in the global context.

{
  "took": 248,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 11722,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_sampler": {
      "doc_count": 600,
      "keywords": {
        "doc_count": 600,
        "bg_count": 1000000,
        "buckets": [
          {
            "key": "rain",
            "doc_count": 544,
            "score": 69.22167699861609,
            "bg_count": 11722
          },
          {
            "key": "showers",
            "doc_count": 164,
            "score": 32.66807368214775,
            "bg_count": 2268
          },
          {
            "key": "rainfall",
            "doc_count": 129,
            "score": 24.82562838569881,
            "bg_count": 1846
          },
          {
            "key": "thundery",
            "doc_count": 28,
            "score": 20.306396677050884,
            "bg_count": 107
          },
          {
            "key": "flooding",
            "doc_count": 153,
            "score": 17.767450110864743,
            "bg_count": 3608
          },
          {
            "key": "meteorologist",
            "doc_count": 63,
            "score": 16.498915662650603,
            "bg_count": 664
          },
          {
            "key": "downpours",
            "doc_count": 40,
            "score": 13.608547008547008,
            "bg_count": 325
          },
          {
            "key": "thunderstorms",
            "doc_count": 48,
            "score": 11.771851851851853,
            "bg_count": 540
          },
          {
            "key": "heatindex",
            "doc_count": 5,
            "score": 11.56574074074074,
            "bg_count": 6
          },
          {
            "key": "southeasterlies",
            "doc_count": 4,
            "score": 11.104444444444447,
            "bg_count": 4
          }
        ]
      }
    }
  }
}

Interesting terms when looking for rain: showers, rainfall, thundery, flooding, etc. These terms could now be returned to the user as possible candidates for improving their search results.

Concluding

That is it for now. I haven’t even scratched the surface of all the new cool stuff in the other components like X-Pack, Logstash and Kibana. More to come.