Anomaly Detection – A Novel Approach


One of the harder things to do in monitoring system health, or even brand health, is detecting anomalies: “events” that may be out of the ordinary. Detection gets harder when your data fluctuates frequently, or when you’re trying to build a model that can be applied to dramatically different datasets. The topic has been debated many times, with different theories, mathematics, and approaches aimed at avoiding “alert fatigue”. Of course, I had to try to do it differently! Let’s talk about the approach I’ve been testing out.

TL;DR – I’m testing a model that looks at the velocity vector moving average and the derivative moving average. By looking at 3 time series data points of the derivatives in the past and extrapolating into the future, paired with the velocity vector, we get a good idea of when an anomaly may be happening.

I’ve explored many different approaches, including sophisticated machine learning methods. However, one afternoon I had a thought about looking at the problem a different way, borrowing methods from day trading, physics, and calculus. The approach is simple enough: look at the change in slope against the moving average. The reality is that there is a lot more to getting it to work. And now, for the deconstruction…

Acceleration Moving Average

The first portion of this theory is the acceleration moving average. This is often found in day trading as an indicator of a dramatic shift in direction that outpaces prior accelerations. In this portion of the formula, we use the acceleration formula:

a = ∆v/∆t

For each time series increment, we store the calculated a. From there, we compare it against the moving average. Internally, we have tested a 14 day moving average on a 10 minute time series. So for each 10 minute increment, we compare the current acceleration against the moving average. However, as you can imagine, this can fluctuate quite dramatically and cause alerts to be sent that shouldn’t be. The risk of looking specifically at this is that you set a static threshold – i.e. if the current acceleration is greater than the acceleration moving average by 20%, send an alert. Where this really breaks down is when you get multiple spikes over the course of a day, with each subsequent spike being smaller in volume (but still notable). Since the moving average increases to account for the most recent spike, you lose out on the subsequent spikes. Example below.
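To make the static-threshold pitfall concrete, here’s a minimal Python sketch of the idea described above. The function name, the 20% threshold, and the bucket math are illustrative, not our production code:

```python
import numpy as np

def acceleration_alerts(volumes, window=2016, threshold=1.2):
    """Flag buckets where the current acceleration exceeds its moving
    average by a static threshold (e.g. 20%).

    volumes: counts per 10-minute bucket; window=2016 buckets ~= 14 days.
    """
    velocity = np.diff(volumes)            # v = dx/dt per bucket
    accel = np.abs(np.diff(velocity))      # |a| = |dv/dt| per bucket
    alerts = []
    for i in range(window, len(accel)):
        ma = accel[i - window:i].mean()    # acceleration moving average
        if accel[i] > max(ma, 1e-9) * threshold:
            alerts.append(i)
    return alerts
```

Note how a single large spike inflates `ma` for the next 14 days, which is exactly why the later, smaller spikes get missed.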


If you look at the large red line right around 12/1/15, you’ll notice that a moving average line would be pulled up dramatically. This causes the subsequent events at around 12/15/15 and 12/18/15 to be missed. While the acceleration moving average is a novel approach, we’ve actually found that it isn’t as useful as we’d like. It is often led astray by wild fluctuations in volume and has a high propensity to trigger alerts that are not actually needed – such as the above. This led us to look at a different approach.

Velocity Vector

Vectors allow us to quantify an object’s direction and magnitude. When looking at an anomaly, we want to understand its direction of movement on an x,y axis and pair that with the magnitude of volume. We could arguably drop the acceleration moving average at this point, as the two effectively become the same thing once we look at the moving average. The velocity vector gives us a real-time sense of what is happening to our volume. See example below.
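As a rough illustration, a velocity vector for each increment can be derived from consecutive counts, treating time as the x axis and volume as the y axis. This is a simplified sketch; the helper name is made up:

```python
import math

def velocity_vectors(counts, dt=1.0):
    """For each consecutive pair of volume counts, return the
    (magnitude, direction) of the velocity vector on an x=time,
    y=volume plane. dt is one time-series increment."""
    vectors = []
    for prev, cur in zip(counts, counts[1:]):
        dv = cur - prev
        magnitude = math.hypot(dt, dv)                 # length of the vector
        direction = math.degrees(math.atan2(dv, dt))   # 0 = flat, +90 = straight up
        vectors.append((magnitude, direction))
    return vectors
```

A flat series yields direction 0; a sharp rise pushes the direction toward +90 degrees while the magnitude grows with the size of the jump.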


When analyzing Twitter volume, the data can be sporadic. Even when reviewing the velocity vector moving average against the current value, we still find that alarms are triggered more frequently than we’d like. This is primarily because the data is not smoothed: we get snapshots of volume at different time frames as whole numbers, such as 10, 50, 34, etc. This makes it hard to discern the significance of a change in the vector portion of the velocity vector. This brings us to the third portion of the formula.

Fourier Smoothing

Since Twitter volume data comes in as chunks of whole numbers, our vectors change dramatically, which renders the prior step useless. Velocity vectors appear to be useful only when the data is smoothed out between the actual time series counts. For example, if we have the two data points 1 and 5, we’d actually want to fill in the difference with 1.1, 1.2, 1.3, 1.4, etc. In an interesting way, Twitter volume data can sometimes look like audio signal data, in the sense that it can be incredibly choppy. To smooth it out, we can use Fourier smoothing to create a nice looking dataset as the Twitter volume count comes in. Below is an example of Fourier smoothing, where we take discrete daily temperature values and smooth out the data using this technique.
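One common way to implement this kind of smoothing is a low-pass filter in the frequency domain: transform the series with an FFT, zero out the high-frequency components, and transform back. The `keep` fraction below is an illustrative knob, not a tuned value:

```python
import numpy as np

def fourier_smooth(values, keep=0.1):
    """Low-pass Fourier smoothing: move to the frequency domain,
    keep only the lowest `keep` fraction of frequencies, and
    transform back to a smooth time-domain series."""
    spectrum = np.fft.rfft(values)
    cutoff = max(1, int(len(spectrum) * keep))
    spectrum[cutoff:] = 0                    # drop high-frequency "chop"
    return np.fft.irfft(spectrum, n=len(values))
```

Because the DC component (bin 0) is always kept, the smoothed series preserves the original mean while the choppy high-frequency noise is removed.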


Now when we look at the velocity vector moving average, the value becomes more stable and doesn’t change nearly as much as it did when no smoothing was applied. If we look at the velocity vector on 10 minute increments as a 14 day moving average, we get some nice insight into the different fluctuations happening. However, we’re still looking at the current state, and we still don’t have a good way of letting the machine tell us not only when to trigger something, but when something might happen. To solve the predictive portion of that problem, we looked to derivatives.


Since Fourier smoothing of the dataset produces nice smooth curves, we can easily calculate the derivative at any data point at any given time. In our environment, we have tested taking the derivative at each 10 minute increment. Since the derivative gives us a tangent line that theoretically extends into both the past and the future, we actually look up to 3 time series increments into the future and the past. From there, we calculate the change in the y axis of the derivatives. See example below.
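Here’s one simplified interpretation of that calculation, assuming the smoothed series arrives as an array of 10 minute buckets. The central-difference derivative and the 3-step tangent extrapolation are a sketch of the description above, not an exact reproduction of it:

```python
def extrapolated_slope_change(smoothed, i, steps=3):
    """Take the derivative (slope) at point i of the smoothed series,
    extend its tangent line `steps` increments into the past and
    future, and return the change in y along that line."""
    slope = (smoothed[i + 1] - smoothed[i - 1]) / 2.0   # central difference
    y_past = smoothed[i] - slope * steps
    y_future = smoothed[i] + slope * steps
    return y_future - y_past                            # = 2 * steps * slope
```

With 10 minute buckets, `steps=3` means the tangent line reaches 30 minutes into the future, which is where the prediction window in the next paragraph comes from.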


By doing this, we can predict the change in the derivative up to 30 minutes before we get to that point. This is key because we’re looking specifically at the slope of an extrapolated derivative. But how do we know when an anomaly may happen? We look at the moving average of the past 14 days of the change in derivative slope. If the current change in slope exceeds the moving average, we likely have an anomaly on our hands. However, we have found this to also be a bit too sensitive by itself, which led to combining the velocity vector moving average and the derivative slope moving average.

By combining both, we force a decision to be made. If the velocity vector is within the moving average but the derivative slope isn’t, it is most likely not an anomaly; the converse also applies. What I did find, though, is that if both the derivative slope and the velocity vector exceed the moving average, it’s a strong indication that an anomaly is happening or will happen. I’ve also tried pairing this with 1 standard deviation from the moving average as a dynamic threshold. Adding this in creates a system that only pulls out the most extreme anomalies. In further tests, I’ll probably try using units of standard deviation as a way to create a more or less sensitive alerting system – almost like a user-driven knob or refinement method.
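Putting the pieces together, the combined check can be sketched as follows. The histories stand in for the past 14 days of values, and `k` is the standard-deviation knob mentioned above; the function name and signature are illustrative:

```python
import numpy as np

def is_anomaly(vv_current, vv_history, ds_current, ds_history, k=1.0):
    """Flag an anomaly only when BOTH the velocity-vector value and
    the derivative-slope change exceed their own moving average plus
    k standard deviations (the dynamic threshold)."""
    vv_hist = np.asarray(vv_history, dtype=float)
    ds_hist = np.asarray(ds_history, dtype=float)
    vv_limit = vv_hist.mean() + k * vv_hist.std()
    ds_limit = ds_hist.mean() + k * ds_hist.std()
    return bool(vv_current > vv_limit and ds_current > ds_limit)
```

Raising `k` makes the alerting less sensitive (only the extreme cases fire); lowering it makes the system more sensitive, which is the user-driven knob idea.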

Is this a finished approach? Absolutely not. There are a lot of challenges in getting this to work 100% properly, to the point that it meets some sort of statistical rigor. I’ve been encouraged by the early results looking at real examples of events happening with our customers in the Twitter sphere. So far, I’ve seen a decent amount of success in predicting when an anomaly may be happening. There are other methods we could use to refine the model, such as computing an F-score from precision and recall for better accuracy on the prediction front.

Changes in User Behavior: How 3D Graphs can Provide Deep User Insight


During my time at Localytics there was a drastic movement towards getting deeper insight into the customer lifecycle (acquisition to engagement). It’s the holy grail for any marketer to be able to understand where the user is in their journey, where they might go next, and when they may potentially fall off. This made me construct a theory around how we look at user data so that we can better understand what forces push a user in one direction or pull them in another. The theory breaks down into 2 major components: tracking user behavior in 3 dimensions instead of two, and utilizing deep learning networks to provide insight into where a user’s behavior is moving.


Many (if not all) of the current platforms on the market that offer some level of personalization or optimization view a single marketing channel in a two dimensional space. What I mean is that, for example, we would look at something like conversion rate vs. time on a website. We could even get a little more technical and say conversion rate by a unique user over time to get more granular. There are very few platforms providing a holistic (barf, buzzword) user centric view of how individual users are interacting with each individual channel.

As enterprises shift dramatically towards what everyone is calling “digital transformation”, a specific trend is being surfaced: brands need to be wherever their users are, across all distribution mediums. What this means is that you’re not mobile first, web first, or social first, but rather taking the stance that your users will engage with your brand on one of many different channels, and on those channels in many different forms. For example, a user engaging with the “social” channel could mean Twitter, Pinterest, Facebook, or some random forum, with their action being a like, a comment, or a share.

If we go back to viewing marketing channels as two dimensional spaces, we can start to get some idea of what this looks like on the X and Y axes. For example:

  • X-Axis = Time
  • Y-Axis = Conversion Rate
  • Graph = Individual User Level


Today, you would do this for many different channels and then try to discern what is actually happening. I’ve seen this manifest in Excel spreadsheets where the rows look something like:

  • Table = Campaign
  • Time Series = Week over Week on 6 month basis
  • Mobile App = 3.5% conversion rate on “X” event/trigger
  • Website = 3.2 minute avg. session length
  • Twitter = 2 #’s or @’s referencing “X” brand
  • Email = .03% open rate
  • etc.
            Campaign ID   Mobile App CR%   Website   Twitter          Email     Push
This Week   1995803       3.5%             3.2 min   2 Interactions   .03% OR   10% CR
Last Week   1994068       3.2%             3 min     1 Interaction    1.5% OR   9% CR

Listings of metrics like this raise questions: Was this campaign successful? Did our messaging turn off our users or increase retention? Did any users churn week over week? What if some users under a campaign show signs of churning but, in reality, just aren’t engaging with your emails? Were there outside factors influencing these campaigns, such as weather or holidays? Were the results statistically significant enough to trust?

There are a ton of open ended questions that marketers get stuck with, with no clue what any of it means. It’s complete data overload, and what they end up doing is surfacing the deltas between weeks as their KPIs, cherry picking the best, etc. It’s not a good situation overall.


The theory goes like this: when doing aggregate metrics or trying to understand multi-channel marketing efforts, viewing users in 3 dimensions instead of 2 surfaces much deeper insights. This manifests as 4 quadrants that generalize the overall user behavior that can occur on any channel:

  • Highly Engaged (top right)
  • Engaged but Not Responding (bottom right)
  • Responding but Not Engaged (top left)
  • Not Engaged and Not Responding (bottom left)

We want to keep these as generalized quadrants because there needs to be ample room for interpretation based on channel, vertical, or business. In my experiments, it’s easiest to contain the graph on a scale of 0-100. We’ll come back to this later.

Here’s the hypothesis we’ll work off of, viewing the above theory from an app perspective:

If we think of audiences/users in 3 dimensions with more aggregate-type metrics, we can provide a more prescriptive insight into audiences/users based on many summary metrics. Additionally, we can surface audience/user movements automatically without the need for customer input.

We have our hypothesis, but now we need to add the general titles for the X and Y axis. This looks something like:


On our X-Axis we have App Engagement, which could be an aggregated metric based on many different individual values. On the Y-Axis we have Marketing Engagement, which could be attributed to all the different mediums a user might interact with on their mobile device. We also have our 4 quadrants, which help us position users. For the above graph, I swapped out “Not Engaged and Not Responding” for “Risk of Churn”, which can be viewed as synonymous.

In this world, new users would be placed onto the chart in the very center and, based on their actions, start to form a vector for their behavior. So once you start to have users on the chart, it may look something like:


With many analytics or marketing platforms, there are APIs that allow you to pull data on a scheduled basis. In our scenario here, we would want to pull data from many different data sources within our ecosystem into a nightly or weekly snapshot. This could be a super flattened JSON file that stores performance data for campaigns, segments, audiences, or individual users. We don’t want to pull very specific data, such as when a user engaged with a campaign, but rather aggregate or composite metrics (i.e. 7 day retention).
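A flattened snapshot row might look something like the following. Every field name here is hypothetical, chosen only to illustrate the “composite metrics only, no raw events” idea:

```python
import json

# Hypothetical shape of one nightly snapshot row: aggregate/composite
# metrics only, no event-level data. All field names are illustrative.
snapshot = {
    "snapshot_date": "2015-12-01",
    "user_id": "u-1995803",
    "app_engagement": 71.5,        # composite score
    "marketing_engagement": 42.0,  # composite score
    "retention_7d": 0.64,          # i.e. 7 day retention
    "email_open_rate": 0.03,
    "avg_session_minutes": 3.2,
}

# One JSON line per user/segment per snapshot keeps the file "super flat".
line = json.dumps(snapshot)
restored = json.loads(line)
```

Because each row is self-contained, computing day-over-day or week-over-week deltas is just a matter of diffing two snapshot files.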

This nightly or weekly snapshot of composite metrics allows us to identify anomalies, user behavior shifts over time, and assign a “vector” of movement. From there, the vectors of movement can be charted to look something like this:


Now we’re getting somewhere interesting. On each of the axes you can see different types of metrics that may contribute to the aggregate or composite metrics. Since we now have day over day or week over week deltas of user behavior, we can plot vectors. Different vector deltas may have different positive, neutral, or negative indicators associated with them. In the above example, we have 2 green users in the “Risk of Churn” section, but they may have both elicited a significant change in user behavior based on the aggregated metrics. These changes may signify that the users are being recovered and coming back towards a healthier state with regards to App Engagement and Marketing Engagement.

By doing this, we can identify anomalies and thresholds where we may want to intervene with outbound marketing outreach, predict where an audience/user may be moving based on their vector, and get a better overall view into where users are in their lifecycle journey. Up until now, however, we’ve viewed this in 2 dimensions. If we add a 3rd dimension to our chart, it becomes much more interesting. Let’s assign an App Goal to the Z-Axis, which could be something like “Increase App Engagement”, where the metric treats any increase in engagement as a positive influence. This starts to look like the following:


Now that we’re viewing this in 3d, we can see where a specific user is on each of the axes we care about. For example, we may notice that the blue user in the top right has about a 7.5 on Marketing Engagement, 7 on App Engagement, and 7 on App Goal completion. This user would be considered safe and stable in what we’d consider the “Highly Engaged” quadrant. However, on the bottom with our turquoise user, we’re seeing trouble. The user is low on App Goals, very low on Marketing Engagement, but has decent App Engagement. If we had a vector assigned to this user, we could see the direction in which they are heading to determine whether they are moving into the “Risk of Churn” quadrant.

How A Deep Learning Neural Network Fits

Here’s where we will go down the rabbit hole. In the above image, our users were defined as “spheres” visually. What if we actually viewed them mathematically as spheres? Follow me on this one.

A user has a sphere around them that defines the behavior quadrants and, in any direction, has a global maximum value of 1. The user initially starts out with a neutral value of 0 (the center of the sphere). When the user performs an action, such as having 3 sessions in an app in 1 week, we see that as a positive behavior and assign it to “Highly Engaged”. On our graph above, the user started at the center of the graph (5, 5, 5) with the user behavior plotted at 0, 0, 0. Now, with the new positive session count towards “Highly Engaged”, we weight this direction in order to produce its vector movement.
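A heavily simplified sketch of that weighting step, on a 0-10 version of the chart with the center at (5, 5, 5). The quadrant direction, step size, and function name are all made-up illustrations, not a worked-out model:

```python
def nudge_user(position, quadrant_direction, weight, step=0.5):
    """Move a user's behavior point toward a quadrant's unit
    direction, scaled by the behavior's weight (0..1). Coordinates
    are clamped to the 0-10 chart whose center is (5, 5, 5)."""
    return tuple(
        min(10.0, max(0.0, p + step * weight * d))
        for p, d in zip(position, quadrant_direction)
    )

# "Highly Engaged" pulls up and to the right on all three axes.
HIGHLY_ENGAGED = (1.0, 1.0, 1.0)
```

Each positive behavior (like those 3 sessions in a week) nudges the point further toward its quadrant; a sequence of nudges traces out the user’s vector of movement.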

Metaphorically, this is similar to having a sheet that has our above quadrants on it that is pulled tight. You drop a weighted ball in the “Highly Engaged” quadrant and the sheet is pulled in that direction. If you wrap that sheet into a sphere, you are doing the same thing except that each drop (or throw) of the ball is pushing the entire sphere in the direction of the quadrant (in a 3 dimensional space).


Credit to Michael Nielsen and his blog on Neural Networks and Deep Learning

We do this through the use of a deep learning neural network. This is described superbly in detail by Michael Nielsen on his blog, which I highly recommend reading. In essence, the neural network ingests many different data sources and is trained via a mechanism called gradient descent acting on each “perceptron” or “neuron”. These neurons work on a sigmoid function, with their output being a value between 0 and 1. As the output values come through to our “user behavior sphere”, they weight the sphere in the direction of the quadrants that the user behavior is attributing towards, moving the “user behavior sphere” into that quadrant.
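For reference, a single sigmoid neuron of the kind Nielsen describes is only a few lines:

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """A single sigmoid neuron: weighted sum of inputs plus a bias,
    passed through the sigmoid. Output is always between 0 and 1."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)
```

It is these bounded (0, 1) outputs that make the neuron’s signal usable directly as a weight on the user behavior sphere.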

For example, positive App Engagement and Marketing Engagement pushes the user sphere towards the “Highly Engaged” quadrant, while a day over day or week over week reduction in interaction on either channel conversely weights the user sphere down towards the likely-to-churn quadrant.

With all of the different channels a user may be interacting with a brand on, a neural network can ingest this data, decipher it, and provide the proper vector for the user with respect to our 3d visualization.

Bring it back together

If we climb out of the rabbit hole and away from an app specific example, we can start to see that a 3 dimensional view of users can be beneficial for synthesizing and expressing user behavior across many channels as one unified view. It allows us to do interesting things with the concepts of explicit vs. implicit user behavior and plot that behavior in a way that makes sense to marketers as well as machines. To be clear, I don’t think it’s beneficial to surface the above graph visualizations to marketers. What I do think is that, behind the scenes, we view the user behavior in this format mathematically, but surface at the dashboard level only the interesting movements, such as “X number of users are potentially moving from one quadrant to another”.

This theory could be useful for complex ecosystems that have disparate metrics and data sources. However, it is important to note that I believe this is only useful when the metrics are in an aggregated format. If you did this on an individual metric basis, I believe it would dilute the value, since we wouldn’t be taking in the full picture of the user or audience across all channels.

I still have many questions personally around how this can be applied to different business models, given that metrics like engagement mean completely different things in different verticals (media vs. travel). Additionally, this may only be useful at an audience level instead of the user level due to the graph’s nature of simplifying and abstracting away the metrics.

We currently do similar types of thinking around users when it comes to radar graphs. We assign a metric and value ranges to each of the points on a radar graph and track user behavior as they progress towards “who they are”. This is especially useful when we’re trying to understand what persona a user may fit into. However, the drawback is that this is still 2-dimensional thinking and has flaws, such as not being able to effectively measure goal completion or aggregate metrics (7 day engagement, weekly conversion rate, etc.). There’s a world where we could add a “depth” factor to a radar graph, but that is a theory for another day.


Radar Chart example from Chris Zhou

Additional/Upcoming Ideas

I have more thoughts around how this can be morphed into interesting things like open graphs for further automation across the entire marketing ecosystem. For example, if we have a user who is starting to fall off on certain marketing engagement efforts, could the system automatically create a retargeted marketing campaign on Facebook as an exploitation test to further optimize the system?

A very open area that I need to consider more is specifically around downward/negative weighting. In reality, users don’t necessarily perform events that show they are likely to churn, with some exceptions like opting out of push notifications. Users often elect to “churn” simply by no longer using a channel. This is problematic for this model, as we don’t have a proper way of giving the graph empirical data to weight it down. My hypothesis is that we use a time decay model that looks at the deltas of values day over day or week over week to identify shifts in engagement specifically. This decay model would take into account metrics like the % difference in engagement, session length decrease, and the number of days/weeks of downward interaction. The output would be a weighted value pulling the user sphere towards a non-favorable quadrant.
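My working sketch of such a decay model looks like the following. The half-life and the specific formula are guesses to be tuned, not settled choices; the only real idea here is that recent declines pull harder than old ones:

```python
def churn_weight(weekly_engagement, half_life_weeks=2.0):
    """Hypothetical decay model: each week of declining engagement
    adds downward pull toward the churn quadrant, discounted by how
    long ago the decline happened (recent weeks count more)."""
    weight = 0.0
    weeks_ago = len(weekly_engagement) - 1
    for prev, cur in zip(weekly_engagement, weekly_engagement[1:]):
        weeks_ago -= 1
        if prev > 0 and cur < prev:
            drop = (prev - cur) / prev                    # % decline that week
            decay = 0.5 ** (weeks_ago / half_life_weeks)  # time discount
            weight += drop * decay
    return min(weight, 1.0)   # capped pull toward a non-favorable quadrant
```

A user with flat engagement gets zero downward weight; a user whose engagement halves in the most recent week gets a much larger pull than one whose decline happened a month ago.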

Another area of exploration is doing further clustering within the quadrants so that marketers can target certain sections of them. For example, a marketer may see that one of their users in the “Highly Engaged” category is starting to move towards the “Engaged but Not Responding” quadrant. If we view the graph on a scale of 0-100, the marketer may want to intervene when the user has moved into the region of:

  • X-Axis = 60-65
  • Y-Axis = 55-65
  • Z-Axis = 50-60

In an ideal world, either the system or the user would set these quadrant sub-parameters and, when they log into their dashboard each day, get a notification of how many users have a “drift vector” showing that they may move quadrants. This would allow the marketer to get ahead of the changes more quickly.

Yet another area of contention is the quadrant naming. I think there’s a good argument to be had on whether the quadrant naming, nomenclature, and conventions make sense.

It’s a crazy theory but, based on the work that I’ve done in my past with different teams, there is some validity with regards to how marketers view users and how we view data. More thoughts to come.

Love the idea? Think I’m crazy and full of shit? Leave a comment and tell me what you think!

Different Methods of Testing, Optimizing, Predicting and Personalizing


I had a conversation with my dad today that sparked me to write this. He wrote me an email talking about how it seems like some products just “know who he is” and appear to know what he’s likely to do next. He’s not the only one who has asked me about this in the past few weeks, so I figured I would try to shed some light from my point of view.

There always seems to be a lot of talk about how important A/B testing is, or new buzzwords like “optimization” and “personalization”. Many of these trends start from a key blog or industry expert calling out the ROI they can provide to digital marketers. I’m all for testing and personalization, but it’s key to understand what exactly each is and what the benefits are. This is my attempt to bring it all back down to the layman.

Simple A/B Testing

There are many different types of A/B testing, but what we call “simple A/B testing” is the most common example. Simple A/B testing is the method of testing different variations against a control group. The variations have equal weighting, meaning they are shown at random to users equally. Often, tests are set up with 95% of users receiving the test (the variations) and 5% receiving the control (either nothing or what was there before). Simple A/B tests are most effective when run for a minimum of 7 days in order to gain sufficient data; the big key with run time is setting a determined end date for the test beforehand.

Simple A/B testing is a basic method and often not the best option. The problem is that the variations are equally weighted. This means that if a variation is not performing well, it will still get served up to users, which may cause them to bounce. You effectively lose out on the potential to convert those users. This is called the opportunity cost. Simple A/B testing should be reserved for basic things like button color, minor layout variants, etc., should you not have any other tool to use.

Bandit Testing Algorithm

This is where things get more interesting. A bandit testing algorithm is a more sophisticated version of simple A/B testing. With simple A/B testing, the weight of each variation is the same, meaning that if one variation isn’t performing well, you still serve it up. With bandit testing, as you explore the variations and their performance, you gain feedback on how each variation does. The bandit will start to weight the winning variation more, which means it will serve up that variation more often (this is called exploiting). However, it will continue to test the lower-performing variations in order to explore the potential that they could still perform better (this is called exploring). After the test has run for some time, it often becomes clear which variation is the winner. Most testing platforms implement this type of testing method.

The two most common versions of the bandit algorithm are Epsilon Greedy and Bayesian Bandits. Many argue that Bayesian Bandits are the most sophisticated method for testing, as they employ various statistical methods, such as probability distributions and probability densities, to find the best variation faster and more reliably.

The reason bandits are often considered better is that they dynamically update themselves based on prior testing knowledge. As the test runs and collects conversion data for the different variations, it starts to see which variation is performing better. It then enforces what we call “exploit & explore”. Exploiting is the system intelligently serving up the winning variation more frequently to find the winner faster. However, the system will continue to explore losing variations every so often to ensure that it isn’t experiencing an anomaly and that the variation that appears to perform better actually does. You can think of it as a self-check.
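For the curious, the epsilon-greedy flavor of exploit & explore is only a few lines. The 10% explore rate below is illustrative, and the class name is made up:

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy bandit: exploit the best-performing variation
    most of the time, explore the others at rate epsilon."""

    def __init__(self, n_variations, epsilon=0.1):
        self.epsilon = epsilon
        self.shows = [0] * n_variations   # times each variation was served
        self.wins = [0] * n_variations    # conversions per variation

    def choose(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.shows))       # explore
        rates = [w / s if s else 0.0 for w, s in zip(self.wins, self.shows)]
        return rates.index(max(rates))                     # exploit

    def record(self, variation, converted):
        self.shows[variation] += 1
        self.wins[variation] += converted
```

After a few hundred rounds against a genuinely better variation, the bandit ends up serving the winner far more often than the losers, which is exactly the opportunity-cost saving over a simple A/B split.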


With any product or service, retention of customers is always a key metric. It is much more expensive to acquire a new customer than to get an existing customer to return – and even more expensive when they churn! Since retention is a key metric for all companies, it’s important to really understand what it is. Retention is how often a user comes back to the product or service in a given time frame. In my opinion, retention is especially critical when paired with A/B testing, as you get to measure the “stickiness” of your testing methods.

Monte Carlo Simulation

While not as widely used, Monte Carlo simulations help sophisticated testers predict the future with reasonable confidence. A Monte Carlo simulation draws out trends at random, which is useful for running simple tests. In industry, many will run daily Monte Carlos that update themselves based on historical data. For example, I may have a Monte Carlo that simulates the potential trend for the next 30 days by aggregating the averages of the past 30 days. This gives me a decent look at what the average trend might be, and I may even be able to parse out an upper or lower limit from it as well. The testing industry uses Monte Carlo simulations to predict how a variation may perform in the future, how a particular segment group may change over time, and a few other unique prediction methods.
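A toy version of that daily Monte Carlo might resample the day-over-day changes seen in the last 30 days and replay them forward; the percentile summary stands in for the upper and lower limits mentioned above. The function name and defaults are illustrative:

```python
import random
import statistics

def monte_carlo_forecast(history, days_ahead=30, runs=1000, seed=42):
    """Simulate the next `days_ahead` daily values by resampling the
    day-over-day changes observed in `history`, then summarize the
    simulated endpoints (mean plus rough upper/lower limits)."""
    random.seed(seed)
    changes = [b - a for a, b in zip(history, history[1:])]
    endpoints = []
    for _ in range(runs):
        value = history[-1]
        for _ in range(days_ahead):
            value += random.choice(changes)   # replay a random observed change
        endpoints.append(value)
    endpoints.sort()
    return {
        "mean": statistics.mean(endpoints),
        "p05": endpoints[int(0.05 * runs)],   # lower limit
        "p95": endpoints[int(0.95 * runs)],   # upper limit
    }
```

Re-running this each day with a fresh 30-day window is what makes the simulation “update itself” on historical data.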

Regression Analysis

Understanding trends is a big part of analytics and testing. Trends help create insight into how an overall group may be transforming over time, and a regression analysis can help identify that trend. Regression analysis is a method for taking a group of data points or variables and estimating the relationships between them. The most basic method is called linear regression, where you fit a straight line that best approximates the dataset. Regression analysis is useful for understanding the state of the dataset while being able to see how it may progress over time. It’s a very useful tool for analytics, providing insight based on history.
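A minimal ordinary-least-squares fit, written out by hand to show what “estimating the relationship” means in the linear case:

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The fitted slope is the trend: extend the line past the last data point and you have a first-order guess at how the group progresses over time.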

Segment Personalization

The ability to personalize based on a segment can be very powerful when used properly. Segment personalization means delivering relevant content to a group of people who share similar attributes or interests. For example, let’s say that I’ve identified that my travel website has two types of visitors: Beach Going Travelers and Weekend Getaway Travelers. I have two segments with very different interests, buying cycles, content consumption, and more. Through segment personalization, I’m able to deliver relevant content about different beach vacation packages to my Beach Going segment. These segments are often a mixture of known and unknown users. Segment personalization is incredibly important since you’re delivering content that is actually relevant to a group, which can reduce the likelihood that users in that segment churn. The more aligned content is with a group’s interests, the higher the potential affinity to a brand, site, or app.

User Personalization

While segment personalization happens at a higher level, user personalization is much more prescriptive. With user personalization, we’re personalizing content based on intimately known information about a user through different methods, such as dynamic messaging. For example, we might use dynamic messaging to send an email that says “Hey Ryan, you added this item to your cart. Check out these recommendations.” We’re using explicitly known data to interact with the user based on known attributes. This level of personalization is often referred to as a 1:1 conversation. We want to provide unique content and a different experience for each user so that they don’t have to sift through weeds of information to get something relevant.


Taking personalization a step further is contextualization. Contextualization is delivering personal content within the context of the user. Much of the industry is turning to this; however, it is also one of the hardest things to accomplish. For example, Starbucks knows that I’m a really heavy coffee drinker. They know that I usually start work at 9am based on how many times my device passes a beacon (a location identifier). It’s 8:30am and I’m walking to work. As I get close to a Starbucks, I receive a push notification through my Starbucks app that says “Hey Ryan, it’s early and cold out. Come wake up and warm up with 10% off a mocha today!” The message knows my name, my potential buying pattern, and the local weather. This is a very context driven message that starts to tug at different levels of consumer buying behavior. We get this level of sophisticated personalization through 1st and 3rd party data sources collected from many different platforms. As we collect data from many platforms, we perform a method called “identity merge”: if you’re unknown in one platform but known in another, when you identify yourself on the unknown platform we’re able to merge all of the known data about you into a unified user profile.

If you find this useful, feel free to comment!