Harry's Engineering



Democratizing Kaplan-Meier

By: Michael Kaminsky

TL;DR: We've used Looker on PostgreSQL to allow anyone in the company to perform Kaplan-Meier analysis on our retention data to drive insights and make decisions.

At Harry’s, we strive to have a robust understanding of our customers, and we’re particularly interested in identifying features of customers and their behavior that increase retention and reduce churn. One of the key difficulties in these time-to-event analyses is that measures of central tendency often give a very incomplete picture of customer behavior and so more sophisticated methods are often required.

For example, if you want to understand customer retention, you might ask ‘of those who have made one purchase from us, how many have made two’? Unfortunately, this has a lot of potential to be misleading because not all of your customers have the same ‘exposure’. That is, customers acquired 3 years ago have had 3 years to come back, but customers acquired last week probably won’t have already made another purchase. You can fine-tune this line of questioning quite a bit to drive at something that’s pretty reasonable, but any single number is going to be missing key information because time-to-event analysis is fundamentally a two-dimensional problem – you need to know both how many events happened, but also when they are happening.

Enter the retention curves, which often looks something like this:

Example Curve

This curve is showing the cumulative share of customers who have made a second purchase among those who made a first purchase in Q1 or Q2 respectively. The higher your curves, the more customers are returning to buy your products, and the better your business is doing.

The catch with the way this is generally done in Excel is that you can only show the curve for as long as you have complete information. That is, if you want to show a retention curve for 52 weeks, everyone in the cohort has to have made their first purchase at least 52 weeks ago. This is sort of a bummer, because you are throwing away good information you have about the people who started at the beginning of the quarter and have made it to 52 weeks. Additionally, these curves don’t generalize well to comparing cohorts that aren’t based on time (e.g., gender) because it’s difficult to know what the correct denominator is for any given week since your cohorts didn’t all make their first purchases at the same time.

It turns out that this problem can be solved with Kaplan Meier analysis. Kaplan Meier analysis is a well-known statistical tool for understanding differences in survival rates across cohorts. Kaplan Meier is a well-studied technique and is very commonly used in medical research to assess the differential impact of treatment interventions on patient outcomes. As it turns out, Kaplan Meier analysis can also be used to effectively understand retention and churn among customers.

The key feature of Kaplan Meier analysis is that it is well suited to handling a phenomenon known as data ‘censoring’. Censoring occurs when you don’t have the same amount of observation time for all of the individuals in your study. In clinical trials, for example, this happens when patients ‘drop out’ of the study early: they move, or they stop being interested, or simply get sick of participating.

Imagine a year long study of 100 people with terminal cancer who are given a new treatment. The ‘event’ of interest in this study is death, so the doctors are interested in whether or not there is a difference in the survival rates of those who get the treatment vs those who received a placebo. Now imagine one of the patients receiving the treatment drops out of the study after 6 months. On the one hand, the researchers can’t assumed that the patient survived a whole year because that might over-state the impact of the treatment. On the other hand, if the researchers throw out that patient’s data they might accidentally under-estimate the impact of the treatment because that patient did survive for at least 6 months.

Kaplan Meier then actually takes into account two types of events (events of study and censoring events) when generating survival curves. It turns out that we can generalize this technique to our analysis of the behavior of customers by treating the current date as the censoring date for all retrospective analysis. That is, we say that customers who made their first purchase 30 days ago but haven’t repurchased yet as being censored at 30 days.

Time Realignment

Circles are ‘censoring events’, and Xs are ‘events of study’

Once you have your data re-aligned to use analytic time, and you’ve identified censored data appropriately, applying the Kaplan Meier computation is actually pretty straightforward. The key insight of Kaplan Meier is that cumulative probability of an event having occurred by time T is the cumulative product of the marginal probabilities of the events occurring from time 0 to time T-1. That is, if you know the probability of customers returning between weeks 0 and 1, and between weeks 1 and 2, and between weeks 2 and 3, and … it’s trivial to calculate the cumulative probability of a customer having returned by week N, even though you may have a different sample of customers in each of those marginal probability calculations.

Python Lifelines

The thing that’s a bummer about Kaplan Meier analysis is that it’s typically been limited to use by data scientists and analysts via packages in R or Python. This goes against Harry’s philosophy of democratizing data analysis and causes a bottleneck in the agile analytics feedback loop. I, personally, don’t want to have to write a python script any time our business analyst wants to compare retention rates between two cohorts.

Enter Looker, our BI tool. Looker is a nifty interface for allowing data exploration for business users via parameterized SQL in the backend. As it turns out, the Kaplan Meier calculation can actually be written in SQL 1. We created a user-defined function in our PostgreSQL data warehouse that will calculate Kaplan Meier curves for any groupings of customers, and plugged this into the Looker backend. With a little bit 2 of hacking, we were able to get the Looker interface to generate Kaplan Meier curves for arbitrary cohort definitions which analysts and business users can the explore via Looker’s drag-and-drop style GUI.

Retention by Entry Cohort (The lines that are lower than all of the others are holiday gift-giving cohorts)

Retention by Shaving Frequency

(Unsurprisingly, people who shave more tend to retain at higher levels!)

This is just a small sample of all of the possible analyses this opens up for business owners and less-technical analysts. Most importantly, it frees up our data science and analytics team to focus on harder problems while the rest of the company is able to ask and answer their own questions quickly and easily.


1: Assuming your SQL flavor can perform window functions, and you recognize that a cumulative product can be calculated for a series x using exp(log(sum(x)). You can see our implementation in this gist. We based this off of the content of this talk by Gleb Arshinov and a number of other people who have written about this problem.

2: A lot. A lot of hacking