Simple methods for anomaly detection in e-commerce

An anomaly is a deviation from the expected value. The main challenges are (a) how large a deviation must be to count as an anomaly, and (b) which time frame or subset of the data to examine.

The simplest way to answer those questions is to use your marketer’s intuition. As an e-commerce manager, you have an idea of how large a change constitutes an anomaly for your business. For example, a 5% change in sales in a daily year-on-year comparison would not typically be an anomaly in e-commerce, because purchase patterns naturally deviate this much or even more. However, if your business is growing much faster and suddenly drops from 20% y-o-y growth to 5%, you could consider that shift an anomaly.

So, the first step is to decide which metrics are most important to track. Then you define a threshold value for each metric and the time period over which to compare. In e-commerce, we could define, for example, the following metrics and thresholds:

  • Bounce Rate – 50% Increase
  • Branded (Non-Paid) Search Visits – 25% Decrease
  • CPC Bounce – 50% Increase
  • CPC Visit – 33% Decrease
  • Direct Visits – 25% Decrease
  • Direct Visits – 25% Increase
  • Ecommerce Revenue – 25% Decrease
  • Ecommerce Transactions – 33% Decrease
  • Internal Search – 33% Decrease
  • Internal Search – 50% Increase
  • Non-Branded (Non-Paid) Search Visits – 25% Decrease
  • Non-Paid Bounces – 50% Increase
  • Non-Paid Visits – 25% Decrease
  • Pageviews – 25% Decrease
  • Referral Visits – 25% Decrease
  • Visits – 33% Decrease

As you can see, this is rule-based anomaly detection: once the observed change exceeds the threshold in a given time period (say, daily or weekly tracking), the system alerts the e-commerce manager.
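A minimal sketch of such a rule-based check could look like the following. The rule table copies a few thresholds from the list above; the function name `check_rules`, the data layout, and the choice to compare against a single baseline value (for example, the same day last week) are assumptions made for illustration.

```python
# Hypothetical rule table: metric name -> list of (direction, fractional change).
# The values are a subset of the thresholds listed above.
RULES = {
    "Bounce Rate": [("increase", 0.50)],
    "Direct Visits": [("decrease", 0.25), ("increase", 0.25)],
    "Ecommerce Revenue": [("decrease", 0.25)],
    "Visits": [("decrease", 0.33)],
}


def check_rules(metric, baseline, observed, rules=RULES):
    """Return the rules for `metric` that the observed value triggers."""
    if baseline == 0:
        # A relative change is undefined for a zero baseline; report nothing.
        return []
    change = (observed - baseline) / baseline
    triggered = []
    for direction, threshold in rules.get(metric, []):
        if direction == "increase" and change >= threshold:
            triggered.append((direction, threshold))
        elif direction == "decrease" and change <= -threshold:
            triggered.append((direction, threshold))
    return triggered


# Revenue fell from 1000 to 700 (a 30% decrease), so the 25% rule fires.
alerts = check_rules("Ecommerce Revenue", baseline=1000, observed=700)
```

In this sketch an empty list means no alert, and a metric without an entry in the table is simply ignored.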

The difficulty, of course, lies in defining the threshold values. Due to changing baseline values, they need to be constantly updated. Thus, there should be better ways to detect anomalies.

Another simple method is a sliding window algorithm. This algorithm can (a) update the baseline value automatically based on data, and (b) identify anomalies based on a statistical property rather than the marketer’s intuition. The parameters for such an algorithm are:

  • frequency: how often the algorithm runs, e.g. daily, weekly, or monthly. Even intra-day runs are possible, but in most e-commerce cases they are not necessary (an exception could be technical metrics such as server response time).
  • window size: the period over which the baseline is computed. For example, if the window size is 7 days and the algorithm runs daily, it always uses data from the past seven days, shifting the start and end dates forward by one day on each run.
  • statistical threshold: the logic for detecting anomalies. A typical approach is to (a) compute the mean and standard deviation of each metric over the window, and (b) compare each new value to the mean, so that a difference of more than 2 or 3 standard deviations indicates an anomaly.

Thus, the threshold values adjust automatically to the moving baseline, because the mean is recalculated each time the window moves.
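The sliding window logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production implementation: the function name `sliding_window_anomalies`, the default of 2 standard deviations, and the use of the sample standard deviation are choices made here, and the daily figures are made-up example data.

```python
from statistics import mean, stdev


def sliding_window_anomalies(values, window=7, k=2.0):
    """Flag values more than k standard deviations from the mean
    of the preceding `window` observations.

    Returns a list of (index, value) pairs.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]   # the window slides forward by one each step
        mu = mean(history)
        sigma = stdev(history)           # sample standard deviation of the window
        # With a perfectly flat window (sigma == 0) no judgment is made.
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            anomalies.append((i, values[i]))
    return anomalies


# Made-up daily visit counts with one obvious spike at index 8.
daily_visits = [100, 102, 98, 101, 99, 103, 97, 100, 160, 101]
flagged = sliding_window_anomalies(daily_visits, window=7, k=2.0)
```

Note that in this simple form a flagged value stays in the window for the following runs, where it inflates the standard deviation and can hide smaller anomalies for a while. Excluding flagged points from the baseline is one possible refinement.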

How to interpret anomalies?

Note that an anomaly is not necessarily a bad thing. Positive anomalies occur e.g. when a new campaign kicks off, or when the company achieves some form of viral marketing. Anomalies can also arise when a new season begins. To keep such seasonal effects from triggering alerts, one can configure the baseline to use year-on-year data instead of historical data from the current year. Regardless of whether the change is positive or negative, it is useful for a marketer to know that momentum has shifted. This helps restructure campaigns, allocate resources properly, and become aware of external effects on key performance indicators.
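As a rough illustration of a year-on-year baseline, the sketch below compares the observed y-o-y growth against the growth the business expects. The function name `yoy_deviation` and the `expected_growth` parameter are hypothetical; the figures mirror the earlier example of a drop from 20% to 5% y-o-y growth.

```python
def yoy_deviation(current, same_period_last_year, expected_growth=0.0):
    """Return observed year-on-year growth minus the expected growth.

    A strongly negative result means the metric grew much less than expected,
    a strongly positive one that it grew much more.
    """
    if same_period_last_year == 0:
        raise ValueError("year-on-year growth is undefined for a zero baseline")
    growth = current / same_period_last_year - 1.0
    return growth - expected_growth


# Sales of 1050 against 1000 a year earlier is 5% growth; with 20% expected,
# the deviation is roughly -0.15 (15 percentage points below expectation).
gap = yoy_deviation(1050, 1000, expected_growth=0.20)
```

The resulting gap can then be checked against a fixed threshold or fed into a sliding window, depending on which of the two methods is in use.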