Migrating from GSA to Elasticsearch with ANTLR

On January 1st, 2019, the Google Search Appliance (GSA) reaches its end of life. For one of my clients, we have chosen Elasticsearch as our alternative from 2019 onwards. However, there was one problem: the current GSA solution is in use by several APIs that can’t be changed for various reasons. Therefore, it was up to us to replicate the GSA’s behaviour with the new Elasticsearch implementation and, come migration time, to swap out the GSA for Elasticsearch with as few functional changes as possible.

The GSA provides a range of functionality, some of which is easily implemented with other technologies. In our case, this of course included search functionality, but also other things such as website crawling. The part that I found most interesting, however, was the GSA’s ability to let users form queries in two small domain specific languages (DSLs).

Queries in these two DSLs reach the GSA as query parameters on the GSA URL. The first DSL, specified with query parameter q, has three “modes” of functionality:

  • Free text search, by simply putting in some space-separated terms
  • allintext search, a search query much like the free text search, but excluding fields such as metadata, anchors and URLs from the search
  • inmeta search, which can potentially do a lot, but in our case was restricted to searches on metadata of the form key=value.
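Concretely, the three modes look as follows on the request URL (the query values are taken from the examples later in this post):

```
q=Elton                       (free text search)
q=allintext:Elton OR Sinatra  (allintext search)
q=inmeta:year:1972..1974      (inmeta search)
```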

The second DSL, specified with query parameter partialFields, also provides searching on metadata. In this case, searches are of the form (key:value) and may be combined with three boolean operators:

  • .: AND
  • |: OR
  • -: NOT

An example query could then be (key1:value1)|(key2:value2).

In this blog, I will explain how to implement these two DSLs using ANTLR, and I will show you how ANTLR enables us to separate the parsing of the DSLs from our other application logic.

If this is your first time working with ANTLR, you may want to read two posts ([1], [2]) that have been posted on our blog earlier.

If you are looking for the complete implementation, then please refer to the Github repository.

Parsing the GSA DSL

Let us start with parsing the q DSL. I have split the ANTLR grammar into a separate parser and lexer file for readability.

The parser is as follows:

parser grammar GsaQueryParser;

options { tokenVocab=GsaQueryLexer; }

query   : pair (OR? pair)*  #pairQuery
        | TEXT+             #freeTextQuery;

pair    : IN_META TEXT+                 #inmetaPair
        | ALL_IN_TEXT TEXT (OR? TEXT)*  #allintextPair;

And the lexer is defined below:

lexer grammar GsaQueryLexer;

ALL_IN_TEXT : 'allintext:';
IN_META     : 'inmeta:';

OR          : 'OR';

TEXT        : ~(' '|'='|':'|'|'|'('|')')+;

WHITESPACE  : [ \t\r\n]+ -> skip;
IGNORED     : [=:|()]+ -> skip;

Note that the parser grammar reflects the two different ways that the q DSL can be used: by specifying pairs or by simply putting in a free text query. The pairs can be separated by an OR operator. Furthermore, the allintext keyword may separate terms with OR as well.

The definition of the partialFields DSL is somewhat different because it allows for query nesting and more boolean operators. Both the parser and the lexer are shown below, again in two separate files.

Parser:

parser grammar GsaPartialFieldsParser;

options { tokenVocab=GsaPartialFieldsLexer; }

query       : pair
            | subQuery;

subQuery    : LEFTBRACKET subQuery RIGHTBRACKET
            | pair (AND pair)+
            | pair (OR pair)+
            | subQuery (AND subQuery)+
            | subQuery (OR subQuery)+
            | subQuery AND pair
            | subQuery OR pair
            | pair AND subQuery
            | pair OR subQuery;

pair        : LEFTBRACKET KEYWORD VALUE RIGHTBRACKET        #inclusionPair
            | LEFTBRACKET NOT KEYWORD VALUE RIGHTBRACKET    #exclusionPair
            | LEFTBRACKET pair RIGHTBRACKET                 #nestedPair;

Lexer:

lexer grammar GsaPartialFieldsLexer;

AND         : '.';
OR          : '|';
NOT         : '-';

KEYWORD     : [A-Za-z0-9]([A-Za-z0-9]|'-'|'.')*;
VALUE       : SEPARATOR~(')')+;

SEPARATOR   : [:];
LEFTBRACKET : [(];
RIGHTBRACKET: [)];
WHITESPACE  : [\t\r\n]+ -> skip;

Note the usage of labels in both grammars, which, in the above case, allows me to easily distinguish different types of key-value pairs: nested, inclusive or exclusive. Furthermore, there is a gotcha in the matching of the VALUE token. To make a clear distinction between KEYWORD and VALUE tokens, I’ve included the : as part of a VALUE token.

Creating the Elasticsearch query

Now that we have our grammars ready, it’s time to use the parse tree generated by ANTLR to construct corresponding Elasticsearch queries. I will post some source code snippets, but make sure to refer to the complete implementation for all details.

For both DSLs, I have chosen to walk the tree using the visitor pattern. We will start by reviewing the q DSL.

Creating queries from the q DSL

The visitor of the q DSL extends a BaseVisitor generated by ANTLR and will eventually return a QueryBuilder, as indicated by the generic type:

public class QueryVisitor extends GsaQueryParserBaseVisitor<QueryBuilder>

There are three cases that we can distinguish for this DSL: a free text query, an allintext query or an inmeta query. Implementing the free text and allintext queries means extracting the TEXT tokens from the tree and then constructing a MultiMatchQueryBuilder, e.g.:

@Override
public QueryBuilder visitFreeTextQuery(GsaQueryParser.FreeTextQueryContext ctx) {
    String text = concatenateValues(ctx.TEXT());
    return new MultiMatchQueryBuilder(text, "album", "artist", "id", "information", "label", "year");
}

private String concatenateValues(List<TerminalNode> textNodes) {
    return textNodes.stream().map(ParseTree::getText).collect(joining(" "));
}

The fields that you use in this match query depend on the data that is in Elasticsearch – in my case some documents describing music albums.

An inmeta query requires us to extract both the field and the value, which we then use to construct a MatchQueryBuilder, e.g.:

@Override
public QueryBuilder visitInmetaPair(GsaQueryParser.InmetaPairContext ctx) {
    List<TerminalNode> textNodes = ctx.TEXT();

    String key = textNodes.get(0).getText().toLowerCase();
    textNodes.remove(0);
    String value = concatenateValues(textNodes);

    return new MatchQueryBuilder(key, value);
}

We can then combine multiple pairs by implementing the visitPairQuery method:

@Override
public QueryBuilder visitPairQuery(GsaQueryParser.PairQueryContext ctx) {
    BoolQueryBuilder result = new BoolQueryBuilder();
    ctx.pair().forEach(pair -> {
        QueryBuilder builder = visit(pair);
        if (hasOrClause(ctx, pair)) {
            result.should(builder);
            result.minimumShouldMatch(1);
        } else {
            result.must(builder);
        }
    });
    return result;
}

Based on the presence of OR clauses we either create a should or must boolean clause for our Elasticsearch query.
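To make that should/must decision concrete, here is a dependency-free sketch of the combination step. The Elasticsearch BoolQueryBuilder is replaced by plain JSON strings so the example runs standalone; the class and method names are mine, not part of the repository.

```java
import java.util.List;

// Standalone sketch of the combination step in visitPairQuery: clauses for
// pairs preceded by OR go into "should", the rest into "must". This is a
// hypothetical helper; the repository's actual code uses BoolQueryBuilder.
public class PairCombiner {

    public static String combine(List<String> mustClauses, List<String> shouldClauses) {
        StringBuilder bool = new StringBuilder("{\"bool\":{");
        bool.append("\"must\":[").append(String.join(",", mustClauses)).append("]");
        if (!shouldClauses.isEmpty()) {
            // mirrors result.minimumShouldMatch(1) in the visitor above
            bool.append(",\"should\":[").append(String.join(",", shouldClauses)).append("]")
                .append(",\"minimum_should_match\":1");
        }
        return bool.append("}}").toString();
    }
}
```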

Creating queries from the partialFields DSL

The visitor of the partialFields DSL also extends a BaseVisitor generated by ANTLR and also returns a QueryBuilder:

public class PartialFieldsVisitor extends GsaPartialFieldsParserBaseVisitor<QueryBuilder>

There are three kinds of pairs that we can specify with this DSL (inclusion, exclusion or nested pair) and we can override a separate method for each option, because we labelled these alternatives in our grammar. A nested pair is simply unwrapped and then passed back to ANTLR for further processing:

@Override
public QueryBuilder visitNestedPair(GsaPartialFieldsParser.NestedPairContext ctx) {
    return visit(ctx.pair());
} 

The inclusion and exclusion query implementations are quite similar to each other:

@Override
public QueryBuilder visitInclusionPair(GsaPartialFieldsParser.InclusionPairContext ctx) {
    return createQuery(ctx.KEYWORD().getText(), ctx.VALUE().getText(), false);
}

@Override
public QueryBuilder visitExclusionPair(GsaPartialFieldsParser.ExclusionPairContext ctx) {
    return createQuery(ctx.KEYWORD().getText(), ctx.VALUE().getText(), true);
}

private QueryBuilder createQuery(String key, String value, boolean isExcluded) {       
    value = value.substring(1);

    if (isExcluded) {
        return new BoolQueryBuilder().mustNot(new MatchQueryBuilder(key, value));
    } else {
        return new MatchQueryBuilder(key, value).operator(Operator.AND);
    }
}

Remember that we included the : to help our token recognition? The code above is where we need to handle this by taking the substring of the value. What remains is to implement a way to handle the combinations of pairs and boolean operators. This is done by implementing the visitSubQuery method and you can view the implementation here. Based on the presence of an AND or OR operator, we apply must or should clauses, respectively.
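The operator handling can be illustrated without ANTLR or the Elasticsearch client: given two already-built clauses and the operator between them, AND ('.') yields a must clause and OR ('|') a should clause. This is a hypothetical standalone sketch, not the repository's actual visitSubQuery implementation.

```java
// Dependency-free sketch of the AND/OR handling described above. The real
// visitSubQuery works on ANTLR contexts and returns QueryBuilders; here the
// clauses are plain JSON strings so the example runs standalone.
public class SubQueryCombiner {

    public static String combine(String left, char operator, String right) {
        String clause = operator == '.' ? "must" : "should"; // '.' is AND, '|' is OR
        return "{\"bool\":{\"" + clause + "\":[" + left + "," + right + "]}}";
    }
}
```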

Examples

In my repository, I’ve included a REST controller that can be used to execute queries using the two DSLs. Execute the following steps to follow along with the examples below:

  • Start an Elasticsearch instance at http://localhost:9200 (the application assumes v6.4.2)
  • Clone the repository: git clone https://github.com/markkrijgsman/migrate-gsa-to-elasticsearch.git && cd migrate-gsa-to-elasticsearch
  • Compile the repository: mvn clean install
  • Run the application: cd target && java -jar search-api.jar
  • Fill the Elasticsearch instance with some documents: http://localhost:8080/load
  • Start searching: http://localhost:8080/search

You can also use the Swagger UI to execute some requests: http://localhost:8080/swagger-ui.html. For each example I will list the URL for the request and the resulting Elasticsearch query that is constructed by the application.

Get all albums mentioning Elton John
http://localhost:8080/search?q=Elton

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Elton",
            "fields": [
              "album",
              "artist",
              "id",
              "information",
              "label",
              "year"
            ],
            "type": "best_fields",
            "operator": "AND",
            "lenient": true
          }
        }
      ]
    }
  }
}

Get all albums where Elton John or Frank Sinatra are mentioned
http://localhost:8080/search?q=allintext:Elton%20OR%20Sinatra

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "multi_match": {
                  "query": "Elton Sinatra",
                  "fields": [
                    "album",
                    "artist",
                    "id",
                    "information",
                    "label",
                    "year"
                  ],
                  "type": "best_fields",
                  "operator": "OR",
                  "lenient": true
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Note that the operator for the multi match query is now OR, where it was AND in the previous example.

Get all albums where the artist is Elton John
http://localhost:8080/search?partialFields=(artist:Elton)

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "artist": {
              "query": "Elton",
              "operator": "AND"
            }
          }
        }
      ]
    }
  }
}

Get all albums where Elton John is mentioned, but is not the artist
http://localhost:8080/search?partialFields=(-artist:Elton)&q=Elton

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must_not": [
              {
                "match": {
                  "artist": {
                    "query": "Elton",
                    "operator": "OR"
                  }
                }
              }
            ]
          }
        },
        {
          "multi_match": {
            "query": "Elton",
            "fields": [
              "album",
              "artist",
              "id",
              "information",
              "label",
              "year"
            ],
            "type": "best_fields",
            "operator": "AND",
            "lenient": true
          }
        }
      ]
    }
  }
}

Get all albums created by Elton John between 1972 and 1974 for the label MCA
http://localhost:8080/search?partialFields=(artist:Elton).(label:MCA)&q=inmeta:year:1972..1974

GET rolling500/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "match": {
                  "artist": {
                    "query": "Elton",
                    "operator": "AND"
                  }
                }
              },
              {
                "match": {
                  "label": {
                    "query": "MCA",
                    "operator": "AND"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "range": {
                  "year": {
                    "from": "1972",
                    "to": "1974",
                    "include_lower": true,
                    "include_upper": true
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
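Note that this last example turned the inmeta value 1972..1974 into a range query rather than a match query, so the application has to recognise such range values. A minimal sketch of what that detection could look like (a hypothetical helper; the real logic lives in the repository):

```java
// Hypothetical helper splitting an inmeta value of the form "1972..1974"
// into its lower and upper bound. Returns null for plain values, which
// would be handled as a normal match query instead.
public class RangeDetector {

    public static String[] bounds(String value) {
        int idx = value.indexOf("..");
        if (idx < 0) {
            return null;
        }
        return new String[] { value.substring(0, idx), value.substring(idx + 2) };
    }
}
```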

Please refer to the unit tests for many more examples.

Conclusions

As you can see, the usage of ANTLR allows us to specify fairly complex DSLs without compromising readability. We’ve cleanly separated the parsing of a user query from the actual construction of the resulting Elasticsearch query. All code is easily testable and makes little to no use of hard-to-understand regular expressions.

A good addition would be to add some integration tests to your implementation, which you can learn more about here. If you have any questions or comments, let me know!


Part 2 of Creating a search DSL

In my previous blog post I wrote about creating your own search DSL using Antlr. In that post I discussed the Antlr language for constructs like AND/OR, multiple words and combining words. In this blog post I show how to use the visitor mechanism to write actual Elasticsearch queries.

If you did not read the first post yet, please do so. It will make it easier to follow along. If you want the code, please visit the github page.

https://amsterdam.luminis.eu/2017/06/28/creating-search-dsl/
Github repository

What queries to use

In the previous blog post we ended up with some of the queries we want to support:

  • apple
  • apple OR juice
  • apple raspberry OR juice
  • apple AND raspberry AND juice OR Cola
  • “apple juice” OR applejuice

Based on these queries we have some choices to make. The first query seems obvious: searching for one word would become a match query. However, in which field do you want to search? In Elasticsearch there is a special field called the _all field. In the example we are using the _all field; however, it would be easy to create a query against a number of specific fields using a multi_match.

In the second example we have two words with OR in between. The most basic implementation would again be a match query, since the match query by default uses OR if you supply multiple words. However, the DSL uses OR to combine terms as well as AND queries, and a term in itself can be a quoted term as well. Therefore, to translate apple OR juice we need to create a boolean query. Now look at the last example, where we use quotes. One would expect quotes to keep words together; in Elasticsearch we would use the phrase query to accomplish this.

As the current DSL is fairly simple, creating the queries is not that hard. But a lot more extensions are possible that can make use of more advanced query options. Using wildcards could result in fuzzy queries, using title:apple could look into one specific field, and using single quotes could mean an exact match, for which we would need the term query.
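As a taste of such an extension, a single-quoted exact match could be rendered as a term query. This is only a sketch of the JSON that would be produced, in the same string-based style as the visitors in this post; the extension is not part of the example repository.

```java
// Sketch of the term-query extension mentioned above: an exact, unanalyzed
// match on one field. Hypothetical helper, not implemented in the
// accompanying repository.
public class TermQueryJson {

    public static String termQuery(String field, String value) {
        return "{\"term\":{\"" + field + "\":\"" + value + "\"}}";
    }
}
```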

Now that you have an idea of the queries we need, let us have a look at the code and see the Antlr DSL in action.

Generate JSON queries

As mentioned in the introduction, we are going to use the visitor to parse the tree. Of course we need to create the tree first. Below is the code to create the tree.

static SearchdslParser.QueryContext createTreeFromString(String searchString) {
    CharStream charStream = CharStreams.fromString(searchString);
    SearchdslLexer lexer = new SearchdslLexer(charStream);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);
    SearchdslParser parser = new SearchdslParser(commonTokenStream);

    return parser.query();
}

As mentioned in the previous post, the parser and the visitor classes are generated by Antlr. Methods are generated for visiting the different nodes of the tree. Check the SearchdslBaseVisitor class for the methods you can override.

To understand what happens, it is best to have a look at the tree itself. Below is an image of the tree that we are going to visit.

Antlr4 parse tree

We visit the tree from the top. The first method, or node, that we visit is the top-level query. Below is the code of the visit method.

@Override
public String visitQuery(SearchdslParser.QueryContext ctx) {
    String query = visitChildren(ctx);

    return
            "{" +
                "\"query\":" + query +
            "}";
}

Every visit method returns a string; for the query we just visit all the possible children and create a JSON string with the query inside. In the image we see only an orQuery child, but it could also be a term or an andQuery. By calling the visitChildren method we continue to walk the tree. The next step is visitOrQuery.

@Override
public String visitOrQuery(SearchdslParser.OrQueryContext ctx) {
    List<String> shouldQueries = ctx.orExpr().stream().map(this::visit).collect(Collectors.toList());
    String query = String.join(",", shouldQueries);

    return
            "{\"bool\": {" +
                    "\"should\": [" +
                        query +
                    "]" +
            "}}";
}

When creating an OR query, we use the bool query with a should clause. Next we have to obtain the queries to include in the should clause: we obtain the orExpr items from the orQuery and for each orExpr we again call the visit method. This time we visit the orExpr node. This node does not contain important information for us, so we let the template method just call visitChildren. orExpr nodes can contain a term or an andQuery. Let us have a look at visiting the andQuery first.

@Override
public String visitAndQuery(SearchdslParser.AndQueryContext ctx) {
    List<String> mustQueries = ctx.term().stream().map(this::visit).collect(Collectors.toList());
    String query = String.join(",", mustQueries);

    return
            "{" +
                    "\"bool\": {" +
                        "\"must\": [" +
                            query +
                        "]" +
                    "}" +
            "}";
}

Notice how closely this resembles the orQuery; the big difference is that we now use the bool query with a must clause. We are almost there. The next step is the term node. This node contains words to transform into a match query, or it contains a quotedTerm. The next code block shows the visit method of a term.

@Override
public String visitTerm(SearchdslParser.TermContext ctx) {
    if (ctx.quotedTerm() != null) {
        return visit(ctx.quotedTerm());
    }
    List<TerminalNode> words = ctx.WORD();
    String termsAsText = obtainWords(words);

    return
            "{" +
                    "\"match\": {" +
                        "\"_all\":\"" + termsAsText + "\"" +
                    "}" +
            "}";
}

private String obtainWords(List<TerminalNode> words) {
    if (words == null || words.isEmpty()) {
        return "";
    }
    List<String> foundWords = words.stream().map(TerminalNode::getText).collect(Collectors.toList());

    return String.join(" ", foundWords);
}

Notice that we first check whether the term contains a quotedTerm. If it does not, we obtain the words and combine them into one string. The final step is to visit the quotedTerm node.

@Override
public String visitQuotedTerm(SearchdslParser.QuotedTermContext ctx) {
    List<TerminalNode> words = ctx.WORD();
    String termsAsText = obtainWords(words);

    return
            "{" +
                    "\"match_phrase\": {" +
                        "\"_all\":\"" + termsAsText + "\"" +
                    "}" +
            "}";
}

Notice that we parse this part into a match_phrase query; other than that, it is almost the same as the term visitor. Finally we can generate the complete query.

Example

“multi search” && find && doit OR succeed && nothing

{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "match_phrase": {
                  "_all": "multi search"
                }
              },
              {
                "match": {
                  "_all": "find"
                }
              },
              {
                "match": {
                  "_all": "doit"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "_all": "succeed"
                }
              },
              {
                "match": {
                  "_all": "nothing"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

In the codebase on Github there is also a Jackson JsonNode-based visitor, if you don’t like the string-based approach.

That is about it. I am planning on extending the example further; if I add some interesting new concepts, I’ll get back to you with a part 3.


Creating a search DSL

As an (elastic)search expert, I regularly visit customers. For these customers I often do a short analysis of their search solution and give advice about improvements they can make. It is always interesting to look at the solutions customers come up with. At one of my most recent customers I noticed a search solution based on a very extensive search DSL (Domain Specific Language) created with Antlr. I knew about Antlr, but had never thought about creating my own search DSL.

To better understand the options of Antlr and to practice creating my own DSL, I started experimenting with it. In this blog post I’ll take you on my learning journey. I am going to create my own very basic search DSL.

Specifying the DSL

First we need to define the queries we would like our users to enter. Below are some examples:

  • tree – This is an easy one, just one word to search for
  • tree apple – Two words to look up
  • tree apple AND sell – Find matching content for tree apple, but also containing sell.
  • tree AND apple OR juice – Find matching content containing the terms tree and apple, or containing the term juice.
  • “apple tree” OR juice – Find content having the terms apple and tree next to each other in the right order (phrase query), or having the term juice.

These are the combinations we need to make. In the next sections we set up our environment and I explain the basics of Antlr that you need to understand to follow along.

Setting up Antlr for your project

There are lots of resources about setting up your local Antlr environment. I personally learned most from tomassetti. I prefer to use Maven to gather the required dependencies. I also use the Maven Antlr plugin to generate the Java classes based on the lexer and grammar rules.

I also installed Antlr using Homebrew, but you do not really need this for this blog post.

You can find the project on Github: https://github.com/jettro/search-dsl

I generally just load the Maven project into IntelliJ and get everything running from there. If you don’t want to use an IDE, you can also do this with pure Maven.

proj_home #> mvn clean install
proj_home #> mvn dependency:copy-dependencies
proj_home #> java -classpath "target/search-dsl-1.0-SNAPSHOT.jar:target/dependency/*" nl.gridshore.searchdsl.RunStep1

Of course you can change RunStep1 into one of the other three classes.

Antlr introduction

This blog post does not have the intention to explain all the ins and outs of Antlr. But there are a few things you need to know if you want to follow along with the code samples.

  • Lexer – A program that takes a phrase and obtains tokens from it. Examples of lexer rules are AND, matching the characters ‘AND’ but also the special characters ‘&&’, and WORD, matching upper- or lowercase characters and numbers. Tokens coming out of a lexer contain the type of the token as well as the characters matched by that token.
  • Grammar – Rules that make use of the lexer to create the syntax of your DSL. The result is a parser that creates a ParseTree out of your phrase. For example, we have a grammar rule query that parses a phrase like tree AND apple into the following ParseTree. The grammar rule is: query : term (AND term)+ ;.
  • ParseTree – The tree built by Antlr from the provided phrase, using the grammar and lexer. Antlr also comes with a tool to create a visual tree; see an example below. In this blog post we create our own parser of the tree; there are however two better alternatives. The first is the classic Listener pattern. The other is the Visitor pattern.
    Antlr4 parse tree 1
  • Listener – Antlr generates some parent classes to create your own listener. The idea behind a listener is that you receive events when a new element is started and when the element is finished. This resembles how, for instance, the SAX parser works.
  • Visitor – Antlr generates some parent classes to create your own visitors. With a visitor you start by visiting your top-level element, then you visit its children; that way you recursively go down the tree. In the next blog post we’ll discuss the visitor pattern in depth.
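As a small end-to-end illustration of these concepts, the query rule from the Grammar bullet turns the phrase tree AND apple into a tree that Antlr's toStringTree renders roughly as follows (assuming a term rule that matches WORDs):

```
query : term (AND term)+ ;
term  : WORD+ ;

"tree AND apple"  ->  (query (term tree) AND (term apple))
```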

Search DSL Basics

In this section we are going to create the DSL in four small steps. For each step we have a StepXLexerRules.g4 and a StepXSearchDsl.g4 file containing the Antlr lexer and grammar rules. Each step also contains a Java file with the name RunStepX.

Step 1

In this step we want to write queries like:

  • apple
  • apple juice
  • apple1 juice

lexer
WORD        : ([A-Za-z]|[0-9])+ ;
WS          : [ \t\r\n]+ -> skip ;

grammar
query       : WORD+ ;

All the Java examples start the same way. I’ll mention the rules here but will not go into depth in the other steps.

Lexer lexer = new Step1LexerRules(CharStreams.fromString("apple juice"));
CommonTokenStream commonTokenStream = new CommonTokenStream(lexer);

Step1SearchDslParser parser = new Step1SearchDslParser(commonTokenStream);
Step1SearchDslParser.QueryContext queryContext = parser.query();

handleWordTokens(queryContext.WORD());

First we create the lexer, which is generated by Antlr. The input is a stream of characters created using the class CharStreams. From the lexer we obtain a stream of tokens, which is the input for the parser. The parser is also generated by Antlr. Using the parser we can obtain the queryContext. Notice the method query: this is the same name as the first grammar rule.

In this basic example a query consists of at least one WORD, and a WORD consists of upper- and lowercase characters and numbers. The output for the first step is:

Source: apple
WORDS (1): apple,
Source: apple juice
WORDS (2): apple,juice,
Source: apple1 juice
WORDS (2): apple1,juice,

In the next step we extend the DSL with an option to keep words together.

Step 2

In the previous step you got the option to search for one or multiple words. In this step we add the option to keep some words together by surrounding them with quotes. We add the following lines to the lexer and grammar.

lexer
QUOTE   : ["];

grammar
query               : term ;

term                : WORD+|quotedTerm;
quotedTerm          : QUOTE WORD+ QUOTE ;

Now we can support queries like:

  • apple
  • “apple juice”

The addition to the lexer is QUOTE; the grammar becomes slightly more complex. The query now is a term, and a term can be multiple WORDs or a quoted term consisting of multiple WORDs surrounded by QUOTEs. In Java we have to check, on the termContext obtained from the queryContext, whether the term contains WORDs or a quotedTerm. That is what is shown in the next code block.

Step2SearchDslParser.TermContext termContext = queryContext.term();
handleTermOrQuotedTerm(termContext);

private void handleTermOrQuotedTerm(Step2SearchDslParser.TermContext termContext) {
    if (null != termContext.quotedTerm()) {
        handleQuotedTerm(termContext.quotedTerm());
    } else {
        handleWordTokens(termContext.WORD());
    }
}

private void handleQuotedTerm(Step2SearchDslParser.QuotedTermContext quotedTermContext) {
    System.out.print("QUOTED ");
    handleWordTokens(quotedTermContext.WORD());
}

Notice how we check whether the termContext contains a quotedTerm, just by checking if it is null. The output then becomes:

Source: apple
WORDS (1): apple,
Source: "apple juice"
QUOTED WORDS (2): apple,juice,
    

Time to take the next step; this time we make it explicit whether to query for one term or the other.

Step 3

In this step we make it optional for a term to match, as long as another term matches. Example queries are:

  • apple
  • apple OR juice
  • “apple juice” OR applejuice

The change to the lexer is just one new token type, OR. The grammar has to change: now the query needs to support a term or an orQuery. The orQuery consists of a term extended with OR and another term, at least once.

lexer
OR      : 'OR' | '||' ;

grammar
query   : term | orQuery ;
orQuery : term (OR term)+ ;

The handling in Java is straightforward now; again some null checks and handle methods.

if (queryContext.orQuery() != null) {
    handleOrContext(queryContext.orQuery());
} else {
    handleTermContext(queryContext.term());
}

The output of the program then becomes:

Source: apple
WORDS (1): apple,
Source: apple OR juice
Or query:
WORDS (1): apple,
WORDS (1): juice,
Source: "apple juice" OR applejuice
Or query:
QUOTED WORDS (2): apple,juice,
WORDS (1): applejuice,
    

In the final step we want to make the OR complete by also adding an AND.

Step 4

In the final step for this blog we are going to introduce AND. With AND we can make more complicated combinations. What would you make of one AND two OR three OR four AND five? In my DSL I first do the AND, then the OR, so this would become (one AND two) OR three OR (four AND five). A document would then match if it contains one and two, or four and five, or three. The lexer changes only slightly: again we just add a type, for AND. The grammar has to introduce some new terms, so it is good to have an overview of the complete grammar.

query               : term | orQuery | andQuery ;

orQuery             : orExpr (OR orExpr)+ ;
orExpr              : term|andQuery;

andQuery            : term (AND term)+ ;
term                : WORD+|quotedTerm;
quotedTerm          : QUOTE WORD+ QUOTE ;

As you can see, we introduced an orExpr, being a term or an andQuery. We changed the orQuery to become an orExpr followed by at least one combination of OR and another orExpr. The query now is a term, an orQuery or an andQuery. Some examples below.

  • apple
  • apple OR juice
  • apple raspberry OR juice
  • apple AND raspberry AND juice OR Cola
  • “apple juice” OR applejuice
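The (one AND two) OR three OR (four AND five) grouping described in this step can be illustrated without Antlr at all: split on OR first, then on AND. This is only a toy sketch of the precedence, not how the generated parser works internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of the precedence in this step: AND binds tighter than
// OR, so splitting on OR first and then on AND yields the same grouping
// that the grammar produces.
public class PrecedenceDemo {

    public static List<List<String>> group(String query) {
        List<List<String>> orGroups = new ArrayList<>();
        for (String orPart : query.split(" OR ")) {
            orGroups.add(Arrays.asList(orPart.split(" AND ")));
        }
        return orGroups;
    }
}
```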

The Java code becomes a bit boring by now, so let us move to the output of the program immediately.

Source: apple
WORDS (1): apple,
Source: apple OR juice
Or query:
WORDS (1): apple,
WORDS (1): juice,
Source: apple raspberry OR juice
Or query:
WORDS (2): apple,raspberry,
WORDS (1): juice,
Source: apple AND raspberry AND juice OR Cola
Or query:
And Query:
WORDS (1): apple,
WORDS (1): raspberry,
WORDS (1): juice,
WORDS (1): Cola,
Source: "apple juice" OR applejuice
Or query:
QUOTED WORDS (2): apple,juice,
WORDS (1): applejuice,

Concluding

That is it for now; of course this is not the most complicated search DSL. You can most likely come up with other interesting constructs. The goal of this blog post was to get you underway. In the next blog post I intend to discuss and show how to create a visitor that makes a real Elasticsearch query based on the DSL.