Tag Archive: antlr



Part 2 of Creating a search DSL

In my previous blog post I wrote about creating your own search DSL using Antlr. In that post I discussed the Antlr language for constructs like AND/OR, multiple words and combining words. In this blog post I am showing how to use the visitor mechanism to write actual elasticsearch queries.

If you did not read the first post yet, please do so. It will make it easier to follow along. If you want the code, please visit the github page.

https://amsterdam.luminis.eu/2017/06/28/creating-search-dsl/
Github repository

What queries to use

In the previous blog post we ended up with some of the queries we want to support

  • apple
  • apple OR juice
  • apple raspberry OR juice
  • apple AND raspberry AND juice OR Cola
  • “apple juice” OR applejuice

Based on these queries we have some choices to make. The first query seems obvious, searching for one word would become a match query. However, in which field do you want to search? In Elasticsearch there is a special field called the _all field. In the example we are using the _all field, however it would be easy to create a query against a number of specific fields using a multi_match.

In the second example we have two words with OR in between. The most basic implementation would again be a match query, since the match query by default uses OR if you supply multiple words. However, the DSL uses OR to combine terms as well as and queries. A term in itself can be a quoted term as well. Therefore, to translate the apple OR juice we need to create a boolean query. Now look at the last example, here we use quotes. One would expect quotes to keep words together. In elasticsearch we would use the Phrase query to accomplish this.

As the current DSL is fairly simple, creating the queries is not that hard. But a lot more extensions are possible that can make use of more advance query options. Using wildcards could result in fuzzy queries, using title:apple could look into one specific field and using single quotes could mean an exact match, so we would need to use the term query.

Now you should have an idea of the queries we would need, let us have a look at the code and see Antlr DSL in action.

Generate json queries

As mentioned in the introduction we are going to use the visitor to parse the tree. Of course we need to create the tree first. Below the code to create the tree.

static SearchdslParser.QueryContext createTreeFromString(String searchString) {
    CharStream charStream = CharStreams.fromString(searchString);
    SearchdslLexer lexer = new SearchdslLexer(charStream);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    SearchdslParser parser = new SearchdslParser(commonTokenStream);

    return parser.query();
}

AS mentioned in the previous posts, the parser and the visitor classes get generated by Antlr. Methods are generated for visiting the different nodes of the tree. Check the class

  • SearchdslBaseVisitor
  • for the methods you can override.

    To understand what happens, it is best to have a look at the tree itself. Below the image of the tree that we are going to visit.

    Antlr4 parse tree

    We visit the tree from the top. The first method or Node that we visit is the top level Query. Below the code of the visit method.

    @Override
    public String visitQuery(SearchdslParser.QueryContext ctx) {
        String query = visitChildren(ctx);
    
        return
                "{" +
                    "\"query\":" + query +
                "}";
    }
    

    Every visitor generates a string, with the query we just visit all the possible children and create a json string with a query in there. In the image we see only a child orQuery, but it could also be a Term or andQuery. By calling the visitChildren method we continue to walk the tree. Next step is the visitOrQuery.

    @Override
    public String visitOrQuery(SearchdslParser.OrQueryContext ctx) {
        List<String> shouldQueries = ctx.orExpr().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", shouldQueries);
    
        return
                "{\"bool\": {" +
                        "\"should\": [" +
                            query +
                        "]" +
                "}}";
    }
    

    When creating an OR query we use the bool query with the should clause. Next we have to obtain the queries to include in the should clause. We obtain the orExpr items from the orQuery and for each orExpr we again call the visit method. This time we will visit the orExpr Node, this node does not contain important information for us, therefore we let the template method just call the visitChildren method. orExpr nodes can contain a term or an andQuery. Let us have a look at visiting the andQuery first.

    @Override
    public String visitAndQuery(SearchdslParser.AndQueryContext ctx) {
        List<String> mustQueries = ctx.term().stream().map(this::visit).collect(Collectors.toList());
        String query = String.join(",", mustQueries);
        
        return
                "{" +
                        "\"bool\": {" +
                            "\"must\": [" +
                                query +
                            "]" +
                        "}" +
                "}";
    }
    

    Notice how closely this resembles the orQuery, big difference in the query is that we now use the bool query with a must part. We are almost there. The next step is the Term node. This node contains words to transform into a match query, or it contains a quotedTerm. The next code block shows the visit method of a Term.

    @Override
    public String visitTerm(SearchdslParser.TermContext ctx) {
        if (ctx.quotedTerm() != null) {
            return visit(ctx.quotedTerm());
        }
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    
    private String obtainWords(List<TerminalNode> words) {
        if (words == null || words.isEmpty()) {
            return "";
        }
        List<String> foundWords = words.stream().map(TerminalNode::getText).collect(Collectors.toList());
        
        return String.join(" ", foundWords);
    }
    

    Notice we first check if the term contain a quotedTerm. If it does not contain a quotedTerm we obtain the words and combine them into one string. The final step is to visit the quotedTerm node.

    @Override
    public String visitQuotedTerm(SearchdslParser.QuotedTermContext ctx) {
        List<TerminalNode> words = ctx.WORD();
        String termsAsText = obtainWords(words);
    
        return
                "{" +
                        "\"match_phrase\": {" +
                            "\"_all\":\"" + termsAsText + "\"" +
                        "}" +
                "}";
    }
    

    Notice we parse this part into a match_phrase query, other than that it is almost the same as the term visitor. Finally we can generate the complete query.

    Example

    “multi search” && find && doit OR succeed && nothing

    {
      "query": {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  {
                    "match_phrase": {
                      "_all": "multi search"
                    }
                  },
                  {
                    "match": {
                      "_all": "find"
                    }
                  },
                  {
                    "match": {
                      "_all": "doit"
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "must": [
                  {
                    "match": {
                      "_all": "succeed"
                    }
                  },
                  {
                    "match": {
                      "_all": "nothing"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
    

    In the codebase on github there is also a Jackson JsonNode based visitor, if you don’t like the string based approach.

    That is about it, I am planning on extending the example further. If I have added some interesting new concepts I’ll get back to you with a part 3


    Creating a search DSL

    As an (elastic)search expert, I regularly visit customers. For these customers I often do a short analysis of their search solution and I give advice about improvements they can make. It is always interesting to look at solutions customers come up with. At one of my most recent customers I noticed a search solution based on a very extensive search DSL (Domain Specific Language) created with Antlr. I knew about Antlr, but never thought about creating my own search DSL.

    To better understand the options of Antlr and to practice with creating my own DSL I started experimenting with it. In this blog post I’ll take you on my learning journey. I am going to create my own very basic search DSL.

    Specifying the DSL

    First we need to define the queries we would like our users to enter. Below are some examples:

    • tree – This is an easy one, just one word to search for
    • tree apple – Two words to look up
    • tree apple AND sell – Find matching content for tree apple, but also containing sell.
    • tree AND apple OR juice – Find matching content containing the terms tree and apple or containing the term juice.
    • “apple tree” OR juice – Find content having the terms apple and tree next to each other in the right order (Phrase query) or having the term juice.

    These are the combinations we need to make. In the next sections we setup our environment and I explain the basics of Antlr that you need to understand to follow along.

    Setting up Antlr for your project

    There are lots of resources about setting up your local Antlr environment. I personally learned most from tomassetti. I prefer to use Maven to gather the required dependencies. I also use the Maven Antlr plugin to generate the Java classes based on the Lexar and Grammar rules.

    I also installed Antlr using Homebrew, but you do not really need this for this blog post.

    You can find the project on Github: https://github.com/jettro/search-dsl

    I generally just load the Maven project into IntelliJ and get everything running from there. If you don’t want to use an IDE, you can also do this with pure Maven.

    proj_home #> mvn clean install
    proj_home #> mvn dependency:copy-dependencies
    proj_home #> java -classpath "target/search-dsl-1.0-SNAPSHOT.jar:target/dependency/*"  nl.gridshore.searchdsl.RunStep1
    

    Of course you can change the RunStep1 into one of the other three classes.

    Antlr introduction

    This blog post does not have the intention to explain all ins and outs of Antlr. But there are a few things you need to know if you want to follow along with the code samples.

    • Lexer – A program that takes a phrase and obtains tokens from the phrase. Examples of lexers are: AND consisting of the characters ‘AND’ but also the specials characters ‘&&’. Another example is a WORD consisting of upper or lowercase characters and numbers. Tokens coming out of a Lexer contain the type of the token as well as the matched characters by that token.
    • Grammar – Rules that make use of the Lexer to create the syntax of your DSL. The result is a parser that creates a ParseTree out of your phrase. For example, we have a grammar rule query that parses a phrase like tree AND apple into the following ParseTree. The Grammar rule is: query : term (AND term)+ ;.
    • ParseTree – Tree by Antlr using the grammar and lexer from the provided phrase. Antlr also comes with a tool to create a visual tree. See an example below. In this blog post we create our own parser of the tree, there are however two better alternatives. The first is using the classic Listener pattern. The other is the Visitor pattern.
      Antlr4 parse tree 1
    • Listener – Antlr generates some parent classes to create your own listener. The idea behind a a listener is that you receive events when a new element is started and when the element is finished. This resembles how for instance the SAX parser works.
    • Visitor – Antlr generates some parent classes to create your own Visitors. With a visitor you start visiting your top level element, then you visit the children, that way you recursively go down the tree. In a next blog post we’ll discuss the visitor pattern in depth.

    Search DSL Basics

    In this section we are going to create the DSL in four small steps. For each step we have a StepXLexerRules.g4 and a StepXSearchDsl.g4 file containing the Antlr lexer and grammar rules. Each step also contains a Java file with the name RunStepX.

    Step 1

    In this step we want to write rules like:

    • apple
    • apple juice
    • apple1 juice
    lexer
    WORD        : ([A-z]|[0-9])+ ;
    WS          : [ \t\r\n]+ -&amp;gt; skip ;
    
    grammar
    query       : WORD+ ;
    

    In all the Java examples we’ll start the same. I’ll mention the rules here but will not go into depth in the other steps.

    Lexer lexer = new Step1LexerRules(CharStreams.fromString("apple juice"));
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    
    Step1SearchDslParser parser = new Step1SearchDslParser(commonTokenStream);
    Step1SearchDslParser.QueryContext queryContext = parser.query();
    
    handleWordTokens(queryContext.WORD());
    

    First we create the Lexer, the Lexer is generated by Antlr. The input is a stream of characters created using the class CharStreams. From the Lexer we obtain a stream of Tokens, which is the input for the parser. The parser is also generated by Antlr. Using the parser we can obtain the queryContext. Notice the method query. This is the same name as the first grammar rule.

    In this basic example a query consists of at least one WORD and a WORD consists of upper and lower case characters and numbers. The output for the first step is:

    Source: apple
    WORDS (1): apple,
    Source: apple juice
    WORDS (2): apple,juice,
    Source: apple1 juice
    WORDS (2): apple1,juice,
    

    In the next step we are extending the DSL with an option to keep words together.

    Step 2

    In the previous step you got the option to search for one or multiple words. In this step we are adding the option to keep some words together by surrounding them with quotes. We add the following lines to the lexer and grammar.

    lexer
    QUOTE   : ["];
    
    grammar
    query               : term ;
    
    term                : WORD+|quotedTerm;
    quotedTerm          : QUOTE WORD+ QUOTE ;
    

    Now we can support queries like

    • apple
    • “apple juice”

    The addition to the lexer is QUOTE, the grammar becomes slightly more complex. The query now is a term, a term can be multiple WORDs or a quoted term consisting of multiple WORDs surrounded by QUOTEs. In Java we have to check from the termContext that is obtained from the queryContext if the term contains WORDs or a quotedTerm. That is what is shown in the next code block.

    Step2SearchDslParser.TermContext termContext = queryContext.term();
    handleTermOrQuotedTerm(termContext);
    
    private void handleTermOrQuotedTerm(Step2SearchDslParser.TermContext termContext) {
        if (null != termContext.quotedTerm()) {
            handleQuotedTerm(termContext.quotedTerm());
        } else {
            handleWordTokens(termContext.WORD());
        }
    }
    
    private void handleQuotedTerm(Step2SearchDslParser.QuotedTermContext quotedTermContext) {
        System.out.print("QUOTED ");
        handleWordTokens(quotedTermContext.WORD());
    }
    

    Notice how we check if the termContext contains a quotedTerm, just by checking if it is null. The output then becomes

    Source: apple
    WORDS (1): apple,
    Source: "apple juice"
    QUOTED WORDS (2): apple,juice,
    

    Time to take the next step, this time we make it possible to specify to make it explicit to query for one term or the other.

    Step 3

    In this step we make it possible to make it optional for a term to match as long as another term matches. Example queries are:

    • apple
    • apple OR juice
    • “apple juice” OR applejuice

    The change to the Lexer is just one type OR. The grammar has to change, now the query needs to support a term or an orQuery. The orQuery consists of a term extended with OR and a term, at least once.

    lexer
    OR      : 'OR' | '||' ;
    
    grammar
    query   : term | orQuery ;
    orQuery : term (OR term)+ ;
    

    The handling in Java is straightforward now, again some null checks and handle methods.

    if (queryContext.orQuery() != null) {
        handleOrContext(queryContext.orQuery());
    } else {
        handleTermContext(queryContext.term());
    }
    

    The output of the program then becomes:

    Source: apple
    WORDS (1): apple,
    Source: apple OR juice
    Or query: 
    WORDS (1): apple,
    WORDS (1): juice,
    Source: "apple juice" OR applejuice
    Or query: 
    QUOTED WORDS (2): apple,juice,
    WORDS (1): applejuice,
    

    In the final step we want to make the OR complete by also adding an AND.

    Step 4

    In the final step for this blog we are going to introduce AND. With the combination of AND we can make more complicated combinations. What would you make from one AND two OR three OR four AND five. In my DSL I first do the AND, then the OR. So this would become (one AND two) OR three OR (four AND five). So a document would match if it contains one and two, or four and five, or three. The Lexer does change a bit, again we just add a type for AND. The grammar has to introduce some new terms. It is good to have an overview of the complete grammar.

    query               : term | orQuery | andQuery ;
    
    orQuery             : orExpr (OR orExpr)+ ;
    orExpr              : term|andQuery;
    
    andQuery            : term (AND term)+ ;
    term                : WORD+|quotedTerm;
    quotedTerm          : QUOTE WORD+ QUOTE ;
    

    As you can see, we introduced an orExpr, being a term or an andQuery. We changed an orQuery to become an orExpr followed by at least one combination of OR and another orExpr. The query now is a term, an orQuery or an andQuery. Some examples below.

    • apple
    • apple OR juice
    • apple raspberry OR juice
    • apple AND raspberry AND juice OR Cola
    • “apple juice” OR applejuice

    The java code becomes a bit boring by now, so let us move to the output of the program immediately.

    Source: apple
    WORDS (1): apple,
    Source: apple OR juice
    Or query: 
    WORDS (1): apple,
    WORDS (1): juice,
    Source: apple raspberry OR juice
    Or query: 
    WORDS (2): apple,raspberry,
    WORDS (1): juice,
    Source: apple AND raspberry AND juice OR Cola
    Or query: 
    And Query: 
    WORDS (1): apple,
    WORDS (1): raspberry,
    WORDS (1): juice,
    WORDS (1): Cola,
    Source: "apple juice" OR applejuice
    Or query: 
    QUOTED WORDS (2): apple,juice,
    WORDS (1): applejuice,
    

    Concluding

    That is it for now, of course this is not the most complicated search DSL. You can most likely come up with other interesting constructs. The goal for this blogpost was to get you underway. In the next blog post I intend to discuss and show how to create a visitor that makes a real elasticsearch query based on the DSL.