Faceted search with Elasticsearch

Posted on 2016-06-21 by

In this blog, I will present two strategies for implementing faceted search with Elasticsearch. A few days back I had a discussion with my colleague Byron about implementing faceted search when Elasticsearch is used to serve the search results, and this blog is the culmination of that discussion. In Elasticsearch, the aggregation framework provides support for facets and for executing metrics and scripts over those facet results. The following is a simple example wherein each Elasticsearch document contains a field color and we execute a terms aggregation on the document set:

{
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    }
  }
}

And we get the following


"buckets": [
    {
       "key": "red",
       "doc_count": 2
    },
    {
       "key": "green",
       "doc_count": 2
    },
    {
       "key": "blue",
       "doc_count": 1
    },
    {
       "key": "yellow",
       "doc_count": 1
    }
]

So we get each unique color as a bucket key, and doc_count is the number of documents that have the corresponding value in their color field.
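To make the mechanics concrete, the terms aggregation can be emulated over an in-memory document set. This is only a hypothetical Python sketch; the sample data is chosen to reproduce the buckets above.

```python
from collections import Counter

# Hypothetical sample documents matching the buckets shown above.
docs = [{"color": c} for c in ["red", "red", "green", "green", "blue", "yellow"]]

# A terms aggregation counts documents per distinct field value and
# returns the counts in descending order, like most_common() does here.
buckets = [
    {"key": key, "doc_count": count}
    for key, count in Counter(d["color"] for d in docs).most_common()
]
```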

One option is to keep nesting sub-aggregations, which is also called path hierarchy sub-aggregation, but that is very expensive and also not feasible beyond a certain hierarchy depth, as discussed here.

First approach 

The first approach to faceted search applies when each document has a separate field for each hierarchy level. For example, if we have documents pertaining to products of an online webshop and 3 levels of hierarchy, then a product document would look something like –

{
  ...

  "categoryOneLevel": [
    "8299"
  ],
  "categoryTwoLevel": [
    "8299-3131"
  ],
  "categoryThreeLevel": [
    "8299-3131-2703",
    "8299-3131-2900"
  ]
}
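At indexing time, these per-level fields can be derived from a product's full leaf paths. A minimal sketch, assuming level ids are joined with "-" as in the example document (build_level_fields is a hypothetical helper name):

```python
# Sketch: derive the per-level facet fields from full category id paths.
# build_level_fields is a hypothetical helper; the field names match the
# example document above.
def build_level_fields(leaf_paths):
    """leaf_paths: id paths like "8299-3131-2703" (level ids joined by '-')."""
    names = ["categoryOneLevel", "categoryTwoLevel", "categoryThreeLevel"]
    fields = {name: set() for name in names}
    for path in leaf_paths:
        ids = path.split("-")
        for level in range(len(ids)):
            # each level stores the full prefix path down to that level
            fields[names[level]].add("-".join(ids[: level + 1]))
    return {name: sorted(values) for name, values in fields.items()}
```

For the example product, build_level_fields(["8299-3131-2703", "8299-3131-2900"]) yields exactly the three arrays shown above.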

On the UI we can visualize it as something like Computer (8299) (level 1) > Laptop (3131) (level 2) > Linux (2703) and Mac (2900) (level 3), and further on. The reason each deeper level stores the ids of the levels above it is that a product can exist in multiple categories. If we take this approach, the sample queries would be something like –

When the user is on the landing page (no faceting yet):

GET document/_search
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "categoryOneLevel"
      }
    }
  }
}

Now when the user clicks on a category, say Computer (8299), the next query fired would be:

GET document/_search?size=0
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "categoryOneLevel": "8299"
        }
      }
    }
  },
  "aggs": {
    "categories": {
      "terms": {
        "field": "categoryTwoLevel",
        "include": "8299-.*"
      }
    }
  }
}

Here the filter query selects all the documents in categoryOne 8299, and the aggregation then runs over only those categoryTwo values that are directly linked to it, because we set the include: "8299-.*" filter. You can extend this approach further so that when the user clicks on a categoryTwo value, results from categoryThree are returned. In this way we can get faceted search results from Elasticsearch.
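Since the query shape is the same at every level, the drill-down request body can be generated. A sketch, assuming the field names from the example documents (drill_down_query and LEVEL_FIELDS are hypothetical names, not Elasticsearch APIs):

```python
# Sketch: build the drill-down request body for the next hierarchy level.
# Field names follow the example documents above; the returned dict is the
# JSON body of the filter + terms-aggregation query shown earlier.
LEVEL_FIELDS = ["categoryOneLevel", "categoryTwoLevel", "categoryThreeLevel"]

def drill_down_query(selected_path):
    """selected_path like "8299" or "8299-3131"; facets the level below it."""
    level = selected_path.count("-")  # number of '-' gives the 0-based level
    return {
        "query": {
            "bool": {"filter": {"term": {LEVEL_FIELDS[level]: selected_path}}}
        },
        "aggs": {
            "categories": {
                "terms": {
                    "field": LEVEL_FIELDS[level + 1],
                    # only sub-categories directly under the selected path
                    "include": selected_path + "-.*",
                }
            }
        },
    }
```

drill_down_query("8299") reproduces the query body above, while drill_down_query("8299-3131") would facet categoryThreeLevel.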

Second Approach

Now, let's look at the second approach, which comes somewhat out of the box: using the built-in path hierarchy tokenizer. Here a single field holds the hierarchy values, and while indexing we specify a "path tokenizer" analyzer in the mapping for that field. For example, a single field would hold a comma-separated value and the path tokenizer would produce the token values as follows –

Computer,Laptop,Mac

And produces tokens:

Computer
Computer,Laptop
Computer,Laptop,Mac
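The tokenizer's behavior is easy to emulate. A sketch of what path_hierarchy with delimiter "," emits for a single value (path_hierarchy_tokens is a hypothetical helper, not an Elasticsearch API):

```python
# Sketch: emulate the path_hierarchy tokenizer with delimiter ",".
# It emits every prefix of the path, from the root down to the full value.
def path_hierarchy_tokens(value, delimiter=","):
    parts = value.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]

path_hierarchy_tokens("Computer,Laptop,Mac")
# → ["Computer", "Computer,Laptop", "Computer,Laptop,Mac"]
```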

In this approach I am using the actual human-readable category values instead of ids, but only for the example's sake; using ids gives you the flexibility to update the category names like Computer, Laptop, etc. associated with an id.

Let’s look at some elastic queries –


PUT blog_index/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path-tokenizer"
        }
      },
      "tokenizer": {
        "path-tokenizer": {
          "type": "path_hierarchy",
          "delimiter": ","
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "dynamic": "strict",
      "properties": {
        "hierarchy_path": {
          "type": "string",
          "analyzer": "path-analyzer",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}

Our index is ready with the field "hierarchy_path" having the path tokenizer set as its analyzer, so the terms of this field will now be tokenized by the path_hierarchy tokenizer.

Now let's add a document to the index:


POST blog_index/my_type/1
{
  "hierarchy_path": ["Computer,Laptop,Mac", "Home,Kitchen,Cookware"]
}

We have added a document whose hierarchy_path field holds two sets of categories, each set being a comma-separated list of values.

Now let's execute a terms aggregation on the field "hierarchy_path":


GET blog_index/my_type/_search?search_type=count
{
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0
      }
    }
  }
}

We get the following buckets


"buckets": [
  {
    "key": "Computer",
    "doc_count": 1
  },
  {
    "key": "Computer,Laptop",
    "doc_count": 1
  },
  {
    "key": "Computer,Laptop,Mac",
    "doc_count": 1
  },
  {
    "key": "Home",
    "doc_count": 1
  },
  {
    "key": "Home,Kitchen",
    "doc_count": 1
  },
  {
    "key": "Home,Kitchen,Cookware",
    "doc_count": 1
  }
]

From the above results, we can see that the path tokenizer has split the comma-separated values of the "hierarchy_path" field into one term per hierarchy prefix.

So now, based on the user's activity, we can fire queries to select the documents pertaining to the category the user is looking at. The query for selecting the top-level categories would be:


GET blog_index/my_type/_search?search_type=count
{
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0,
        "exclude": ".*\\,.*"
      }
    }
  }
}

and we get


"buckets": [
  {
    "key": "Computer",
    "doc_count": 1
   },
 {
   "key": "Home",
   "doc_count": 1
  }
]

We have used the regular expression exclude: ".*\\,.*", which excludes every term containing a comma, i.e. all sub-levels; thus we get only the top of the hierarchy.
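Note that the terms aggregation applies include/exclude patterns to the whole term, i.e. the regex is anchored at both ends. Python's re.fullmatch mimics that, so the effect of the exclude can be sketched like this (the bucket list is the sample from the document indexed earlier):

```python
import re

# The term keys produced by the sample document indexed above.
buckets = ["Computer", "Computer,Laptop", "Computer,Laptop,Mac",
           "Home", "Home,Kitchen", "Home,Kitchen,Cookware"]

# exclude ".*,.*": drop every term containing a comma; fullmatch mimics
# Elasticsearch's fully anchored regex matching on terms.
top_level = [b for b in buckets if not re.fullmatch(r".*,.*", b)]
# → ["Computer", "Home"]
```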

If the user wants only the second level, then the query fired would be:


GET blog_index/my_type/_search?search_type=count
{
  "query": {
    "bool": {
      "filter": {
        "prefix": { "hierarchy_path": "Computer" }
      }
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0,
        "include": "Computer\\,.*",
        "exclude": ".*\\,.*\\,.*"
      }
    }
  }
}

Here the include regex selects all terms that are part of the Computer hierarchy, while the exclude regex drops everything at the third level and deeper, so the result contains only the second level of the hierarchy.

"buckets": [
  {
    "key": "Computer,Laptop",
    "doc_count": 1
  }
]

When the user's activity requires the third level of the hierarchy, the query fired would be:


GET blog_index/my_type/_search?search_type=count
{
  "query": {
    "bool": {
      "filter": {
        "prefix": { "hierarchy_path": "Computer" }
      }
    }
  },
  "aggs": {
    "category": {
      "terms": {
        "field": "hierarchy_path",
        "size": 0,
        "include": "Computer\\,.*\\,.*"
      }
    }
  }
}

Based on the include regex "Computer\\,.*\\,.*" we will get only the terms that have the third level of hierarchy as well:


"buckets": [
  {
    "key": "Computer,Laptop,Mac",
    "doc_count": 1
  }
]
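The include/exclude pair can be generalized: to facet terms exactly n levels below a selected prefix, one regex per depth suffices. A sketch (level_pattern and select_level are hypothetical helper names; re.fullmatch stands in for Elasticsearch's anchored term matching):

```python
import re

def level_pattern(prefix, depth_below):
    """Regex matching terms exactly depth_below levels under prefix."""
    # one ",segment" group per extra level; a segment contains no comma
    return re.escape(prefix) + r"(,[^,]*){%d}" % depth_below

def select_level(buckets, prefix, depth_below):
    pattern = level_pattern(prefix, depth_below)
    return [b for b in buckets if re.fullmatch(pattern, b)]

buckets = ["Computer", "Computer,Laptop", "Computer,Laptop,Mac"]
select_level(buckets, "Computer", 1)  # → ["Computer,Laptop"]
select_level(buckets, "Computer", 2)  # → ["Computer,Laptop,Mac"]
```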

In this way, based on the user's activity in our application, we can fetch the corresponding results from Elasticsearch. While indexing, we need to make sure each product document has the relevant values in its "hierarchy_path" field, based on the hierarchy levels in which that product is present.


4 Comments

  • Thanks for the post! It really addresses a significant detail on the implementation of search engines for e-commerce applications. Though, I have two remarks on the content:

    1. In terms of overall consistency of the post, it might be better to stick to the category id’s (rather than the category names) in the second approach as well, like you did in the first one.

    2. While you propose two valid solutions for the problem, examination of their pros/cons are missing from many aspects such as performance and future maintenance. For instance: How do they compare in terms of read and update performance? What future challenges does each approach imply while updating the index? What happens when a product is moved to a different category? What happens when a category is moved/renamed/deleted? What extra measure would you recommend to speed up the regex searches?

  • Harold

    The second method will fail if you have facets with the same prefix. For example: ‘kind’ and ‘kinderen’.

    I tried to force exact matching by adding the ‘glue’ as suffix (‘kind,’ and ‘kinderen,’), but the path_hierarchy tokenizer will add an empty path element at the end as a result.

    I guess the first approach has the same issue, but i haven’t tested that one yet.

    • The solution was easy, and took me way too much time to figure out … as usual 😉

      The ‘path-tokenizer’ splits the path into these parts:
      * Computer,Laptop,Mac
      * Computer,Laptop
      * Computer

      So we don’t need to do a prefix query, but an (exact) term query.

"term" : { "hierarchy_path" : "Computer" }

      This will include all ‘Computer,*’ items, but exclude all ‘Computers’ (mind the ‘s’) since that’s not an exact match.
