## Anomaly detection over a normally distributed aggregation in Elasticsearch

In my first month at Luminis I got the chance to dive into a data set stored in Elasticsearch. My assignment was easily phrased: detect the **outliers** in this data set. Anomaly detection is an interesting technique that we would like to apply to different data sets. It is useful in several business cases, for example to investigate whether something is wrong wherever an anomaly shows up.

This blog post is about detecting the outlying bucket(s) in an aggregation. The assumption is that the data is normally distributed, which can be checked with normality tests. There is, however, no strict mathematical definition of what constitutes an outlier.

The data set contains messages to and from different suppliers. The point of interest is whether the number of messages linked to a supplier in a given week differs from that of all the other weeks. The same can be done with other time intervals or other aggregations, as long as the values are normally distributed. A numeric field can also be used instead of the number of messages in a bucket (the bucket document count). And if the success status of a message is stored, that can be used as a boolean or an enumeration.

The first step is to choose an interesting aggregation on the data that gives buckets with normally distributed sizes. In the example below I chose a Date Histogram Aggregation with a weekly interval, so the data is divided into buckets of one week. The messages to and from the suppliers were sent fairly evenly over the weeks, and as a result the sizes of the buckets are normally distributed. The following mathematical theory can then be applied to determine the anomalies.

The second step is to determine the lower and upper bound of the interval in which most of the *normally distributed* data lies. This is done by calculating the mean and the standard deviation of the number field. In this case these are calculated over the document count of the weekly buckets.
In empirical sciences it is conventional to consider points that lie more than three standard deviations from the mean to be outliers, because 99.73% of normally distributed points lie within that interval (the 3-sigma rule). The lower bound of the interval is calculated by subtracting three times the standard deviation from the mean, and the upper bound by adding it to the mean. In Elasticsearch this can be done with the Extended Stats Bucket Aggregation, a Pipeline Aggregation that can be connected to the previous aggregation. We set the parameter sigma to 3.0. Note that Elasticsearch calculates the population standard deviation by taking the square root of the population variance. This is mathematically correct here, because the entire population of messages is stored. (If there were more data than we have and we wanted to determine the outliers in our sample, then merely applying Bessel’s correction would not be enough; an unbiased estimate of the standard deviation would be needed.) Both steps can be done in the same GET request in Sense.

    GET indexWithMessages/_search
    {
      "size": 0,
      "aggs": {
        "byWeeklyAggregation": {
          "date_histogram": {
            "field": "datumOntvangen",
            "interval": "week"
          }
        },
        "extendedStatsOfDocCount": {
          "extended_stats_bucket": {
            "buckets_path": "byWeeklyAggregation>_count",
            "sigma": 3.0
          }
        }
      }
    }
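
To make the arithmetic behind the bounds concrete, here is a minimal Python sketch of the same calculation done by hand: the mean, the population standard deviation (dividing by n, without Bessel's correction) and the interval of three standard deviations around the mean. The function name `three_sigma_bounds` and the weekly counts are made up for illustration; they are not the real data set and not part of Elasticsearch.

```python
import math


def three_sigma_bounds(counts, sigma=3.0):
    """Return (mean, population std dev, lower bound, upper bound)."""
    n = len(counts)
    mean = sum(counts) / n
    # Population variance: divide by n, not n - 1 (no Bessel's correction).
    variance = sum((c - mean) ** 2 for c in counts) / n
    std = math.sqrt(variance)
    return mean, std, mean - sigma * std, mean + sigma * std


# Hypothetical weekly document counts with one very quiet week at the end.
weekly_counts = [40, 42, 38, 41, 39] * 4 + [5]

mean, std, lower, upper = three_sigma_bounds(weekly_counts)

# A bucket is an outlier when its count falls outside [lower, upper].
outliers = [c for c in weekly_counts if c < lower or c > upper]
print(f"mean={mean:.2f} std={std:.2f} bounds=({lower:.2f}, {upper:.2f})")
print("outliers:", outliers)
```

With only a handful of buckets a single extreme value inflates the standard deviation so much that it can never be more than three standard deviations away, so this approach needs a reasonable number of weeks to be meaningful.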

The third step is to run the same query with the same aggregation, but this time select only the buckets with a document count outside the calculated interval. This can be done by connecting a different Pipeline Aggregation, namely the Bucket Selector Aggregation. The *lower* and *upper bound* need to be taken from the *std_deviation_bounds* of the *extendedStatsOfDocCount* aggregation in the response. In my case these are *11107.573992258556* and *70207.24418955963*, respectively. Unfortunately the buckets path syntax doesn’t allow going up in the aggregation tree; otherwise the two requests could be combined. It is possible to get the lower bound and the upper bound out of the Extended Stats Aggregation in a buckets path, but the syntax is not intuitive. See my question on discuss and the issue raised from it.

    GET indexWithMessages/_search
    {
      "size": 0,
      "aggs": {
        "byWeeklyAggregation": {
          "date_histogram": {
            "field": "datumOntvangen",
            "interval": "week"
          },
          "aggs": {
            "outlierSelector": {
              "bucket_selector": {
                "buckets_path": {
                  "doc_count": "_count"
                },
                "script": "doc_count < 11107.573992258556 || 70207.24418955963 < doc_count"
              }
            }
          }
        }
      }
    }
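
Since the two requests cannot be combined, the bounds have to be copied from the first response into the second request by hand. The following Python sketch shows one way that glue step could look. It only builds the request body; it does not talk to Elasticsearch. The helper names `bounds_from_response` and `build_outlier_request` are hypothetical, and `first_response` is a trimmed, hand-written stand-in for the real response that contains only the two bounds quoted above.

```python
import json


def bounds_from_response(response, stats_name="extendedStatsOfDocCount"):
    """Read the lower and upper bound from the extended stats result."""
    bounds = response["aggregations"][stats_name]["std_deviation_bounds"]
    return bounds["lower"], bounds["upper"]


def build_outlier_request(lower, upper, field="datumOntvangen"):
    """Build the body of the second request with the bounds filled in."""
    # Same script as in the request above, with the numbers substituted.
    script = f"doc_count < {lower!r} || {upper!r} < doc_count"
    return {
        "size": 0,
        "aggs": {
            "byWeeklyAggregation": {
                "date_histogram": {"field": field, "interval": "week"},
                "aggs": {
                    "outlierSelector": {
                        "bucket_selector": {
                            "buckets_path": {"doc_count": "_count"},
                            "script": script,
                        }
                    }
                },
            }
        },
    }


# Stand-in for the first response: only the part that is needed here.
first_response = {
    "aggregations": {
        "extendedStatsOfDocCount": {
            "std_deviation_bounds": {
                "upper": 70207.24418955963,
                "lower": 11107.573992258556,
            }
        }
    }
}

lower, upper = bounds_from_response(first_response)
request_body = build_outlier_request(lower, upper)
print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted after `GET indexWithMessages/_search` in Sense, or sent with whatever HTTP client is at hand.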

Now we have the weeks (buckets) that are outliers. With this information, further investigation can be done using domain knowledge. In this case the cause could, for example, be a vacation, or a supplier that was less or more active than usual in a certain week.