Part 2 of Creating a search DSL

Posted on 2017-07-21 by

In my previous blog post I wrote about creating your own search DSL using Antlr. In that post I discussed the Antlr language for constructs like AND/OR, multiple words and combining words. In this blog post I am showing how to use the visitor mechanism to write actual elasticsearch queries.

If you did not read the first post yet, please do so. It will make it easier to follow along. If you want the code, please visit the github page.

https://amsterdam.luminis.eu/2017/06/28/creating-search-dsl/
Github repository

What queries to use

In the previous blog post we ended up with some of the queries we want to support

  • apple
  • apple OR juice
  • apple raspberry OR juice
  • apple AND raspberry AND juice OR Cola
  • “apple juice” OR applejuice

Based on these queries we have some choices to make. The first query seems obvious, searching for one word would become a match query. However, in which field do you want to search? In Elasticsearch there is a special field called the _all field. In the example we are using the _all field, however it would be easy to create a query against a number of specific fields using a multi_match.

In the second example we have two words with OR in between. The most basic implementation would again be a match query, since the match query by default uses OR if you supply multiple words. However, the DSL uses OR to combine terms as well as and queries. A term in itself can be a quoted term as well. Therefore, to translate the apple OR juice we need to create a boolean query. Now look at the last example, here we use quotes. One would expect quotes to keep words together. In elasticsearch we would use the Phrase query to accomplish this.

As the current DSL is fairly simple, creating the queries is not that hard. But a lot more extensions are possible that can make use of more advance query options. Using wildcards could result in fuzzy queries, using title:apple could look into one specific field and using single quotes could mean an exact match, so we would need to use the term query.

Now you should have an idea of the queries we would need, let us have a look at the code and see Antlr DSL in action.

Generate json queries

As mentioned in the introduction we are going to use the visitor to parse the tree. Of course we need to create the tree first. Below the code to create the tree.

static SearchdslParser.QueryContext createTreeFromString(String searchString) {
    CharStream charStream = CharStreams.fromString(searchString);
    SearchdslLexer lexer = new SearchdslLexer(charStream);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    SearchdslParser parser = new SearchdslParser(commonTokenStream);

    return parser.query();
}

AS mentioned in the previous posts, the parser and the visitor classes get generated by Antlr. Methods are generated for visiting the different nodes of the tree. Check the class

  • SearchdslBaseVisitor
  • for the methods you can override.

    To understand what happens, it is best to have a look at the tree itself. Below the image of the tree that we are going to visit.

    Antlr4 parse tree

    We visit the tree from the top. The first method or Node that we visit is the top level Query. Below the code of the visit method.

    @Override
    public String visitQuery(SearchdslParser.QueryContext ctx) {
        String query = visitChildren(ctx);
    
        return
                "{" +
                    "\"query\":" + query +
                "}";
    }
    

    Every visitor generates a string, with the query we just visit all the possible children and create a json string with a query in there. In the image we see only a child orQuery, but it could also be a Term or andQuery. By calling the visitChildren method we continue to walk the tree. Next step is the visitOrQuery.

    @Override
    public String visitOrQuery(SearchdslParser.OrQueryContext ctx) {
        List<String> shouldQueries = ctx.orExpr().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", shouldQueries);
    
        return
                "{\"bool\": {" +
                        "\"should\": [" +
                            query +
                        "]" +
                "}}";
    }
    

    When creating an OR query we use the bool query with the should clause. Next we have to obtain the queries to include in the should clause. We obtain the orExpr items from the orQuery and for each orExpr we again call the visit method. This time we will visit the orExpr Node, this node does not contain important information for us, therefore we let the template method just call the visitChildren method. orExpr nodes can contain a term or an andQuery. Let us have a look at visiting the andQuery first.

    @Override
    public String visitAndQuery(SearchdslParser.AndQueryContext ctx) {
        List<String> mustQueries = ctx.term().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", mustQueries);
        
        return
                "{" +
                        "\"bool\": {" +
                            "\"must\": [" +
                                query +
                            "]" +
                        "}" +
                "}";
    }
    

    Notice how closely this resembles the orQuery, big difference in the query is that we now use the bool query with a must part. We are almost there. The next step is the Term node. This node contains words to transform into a match query, or it contains a quotedTerm. The next code block shows the visit method of a Term.

    @Override
    public String visitTerm(SearchdslParser.TermContext ctx) {
        if (ctx.quotedTerm() != null) {
            return visit(ctx.quotedTerm());
        }
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    
    private String obtainWords(List<TerminalNode> words) {
        if (words == null || words.isEmpty()) {
            return "";
        }
        List<String> foundWords = words.stream().map(TerminalNode::getText).collect(Collectors.toList());
        
        return String.join(" ", foundWords);
    }
    

    Notice we first check if the term contain a quotedTerm. If it does not contain a quotedTerm we obtain the words and combine them into one string. The final step is to visit the quotedTerm node.

    @Override
    public String visitQuotedTerm(SearchdslParser.QuotedTermContext ctx) {
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match_phrase\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    

    Notice we parse this part into a match_phrase query, other than that it is almost the same as the term visitor. Finally we can generate the complete query.

    Example

    “multi search” && find && doit OR succeed && nothing

    {
      "query": {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  {
                    "match_phrase": {
                      "_all": "multi search"
                    }
                  },
                  {
                    "match": {
                      "_all": "find"
                    }
                  },
                  {
                    "match": {
                      "_all": "doit"
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "must": [
                  {
                    "match": {
                      "_all": "succeed"
                    }
                  },
                  {
                    "match": {
                      "_all": "nothing"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
    

    In the codebase on github there is also a Jackson JsonNode based visitor, if you don’t like the string based approach.

    That is about it, I am planning on extending the example further. If I have added some interesting new concepts I’ll get back to you with a part 3

    About Jettro Coenradie

    I am a Software Developer / Architect with a lot of hands on experience in Java, AngularJS, Elasticsearch and lots of others tools. I like to use these technologies to help customers with the business challenges. On top of that I like to gather and share knowledge related to data analytics. I have experience with importing and transforming data as well as presenting and visualising the data. Currently I am working with tools like elasticsearch, logstash and Kibana but also D3 and C3 for graphics and other presentations.


    Leave a Reply

    Your email address will not be published. Required fields are marked *