In an ideal world, we would know exactly when a user needs to be reminded of our product, and we would remind them in a way that leads to a new payment rather than a cancellation. Carpet bombing, that is, sending reminders to every customer, can annoy them and is not cheap if you use SMS alerts. Approaches that predict churn for each individual customer are great options, but they require time and resources to research and develop.
If you do have the time and resources to estimate, for each user, the probability that they are about to leave your service, then I highly recommend starting with the excellent lifetimes library. It is quite simple and well documented, and there are many excellent articles on Medium that show step by step how to use it, for example, here.
But what if you currently have neither the time to develop complex approaches nor even a rough understanding of how long your average client stays with your product, and the marketing team is asking you for at least a rough estimate?
In this article, I will describe a fairly simple approach that uses Bayesian modeling with the PyMC3 library to get a rough picture of customer lifetime in your product. That picture can help answer basic business questions and prepare the ground for more accurate, higher-quality models (if they are required😉).
Let’s assume for the example under consideration that we own a certain online service through which our users can generate music according to a text description (yes, neural networks could get to that too😁). Let’s assume that each new generated composition has a fixed cost.
Thus, all we know about our customers is historical data on how often they used our services over time.
First, let’s import all the necessary libraries:
On your online service, you can offer customers several subscription or purchase options that differ in cost. As a general rule, you would expect customers who make more expensive purchases to be more loyal, so their average lifetime should be higher than the average across all your customers. However, in the real world, things can be a little different. For example, those who immediately spend a large amount on your product may have a much shorter lifetime, because they get the required result much faster.
In any case, this is not hard to research if you have enough historical data. I just want to advise you not to neglect segmenting your customers into specific groups, if possible, because such segments can sometimes reveal a lot of information that is interesting for the business. Users can be divided by gender, age, country of registration, subscription type, and so on. The main thing is that the split makes sense.
Why do this? For example, the lifetime of iOS users on your platform may be slightly different from the lifetime of Android users, which is generally absolutely normal. But imagine that your analysis shows that the lifetimes of iOS and Android users differ greatly. Sometimes the cause of such behavior is an annoying bug overlooked by the development and testing teams.
Let’s say we have historical data for 389 customers:
For this research we need two pieces of information about each user. The first is the maximum number of consecutive days the client has gone without using our online service (not counting the period from the last active day to the current moment). The second is the number of days from the last use to the present moment. Thus, the entire dataframe that we need will have approximately this structure:
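If you start from a raw activity log rather than a ready-made table, the two metrics can be derived with a few lines of pandas. This is only a sketch under stated assumptions: the `events` frame, its column names, the dates, and the `today` timestamp are all hypothetical, and a gap is measured as the difference in days between consecutive active dates.

```python
import pandas as pd

# Hypothetical activity log: one row per (client, active day)
events = pd.DataFrame({
    "client_id": [111, 111, 111, 112, 112, 112],
    "active_date": pd.to_datetime([
        "2023-01-01", "2023-01-05", "2023-01-20",
        "2023-01-01", "2023-02-15", "2023-02-20",
    ]),
})
today = pd.Timestamp("2023-03-22")  # assumed "current moment"

events = events.sort_values(["client_id", "active_date"])
# Days between consecutive active days of the same client
# (NaN for each client's first active day)
events["gap_days"] = events.groupby("client_id")["active_date"].diff().dt.days

summary = events.groupby("client_id").agg(
    max_inactive_days=("gap_days", "max"),  # longest past gap, current period excluded
    last_active=("active_date", "max"),
)
# Days from the last active day to the current moment
summary["current_inactive_days"] = (today - summary["last_active"]).dt.days
summary = summary.drop(columns="last_active").reset_index()
print(summary)
```

Note that a client with a single active day has no past gap at all, so `max_inactive_days` would be `NaN` for them; how to treat such clients is a separate decision.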
Let’s make a very simple but important assumption: if the number of days a client has been inactive since their last active day is less than their maximum number of inactive days in the past (importantly, excluding the current period), we consider that the user has not yet abandoned our online service. Otherwise, we consider the client to have left us.
An active day is a day when a client has completed some targeted action in your online service, i.e. made a purchase, logged in, or performed any other action that you define.
The logic is as follows: if a particular user has not logged into our online service for 30 days, but we know that they have had longer periods of downtime before, then there is a reasonable chance that such behavior is normal for that user. The more users you have, the less your results will be affected by random outliers in the data.
Why, in the table above, do we conclude that the user with id 111 has most likely stopped using our product while the user with id 112 has not, although both currently have exactly the same number of days of inactivity? Because for client 112 we have already seen a longer period of downtime, and we cannot say the same for client 111. This is exactly the logic described in the paragraph above.
So, now that we understand the data format, let’s generate a sample to analyze. Since the two main metrics, the maximum number of days of inactivity and the current number of days of inactivity, are integers, we will draw them from a discrete distribution; in this case, a discrete uniform distribution:
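The comparison of clients 111 and 112 can be written as a tiny worked example. The function name `has_churned` and the numbers below are hypothetical and chosen only to mirror the situation described: both clients have the same current inactivity, but only one of them has a longer gap in their history.

```python
def has_churned(max_inactive_days: int, current_inactive_days: int) -> bool:
    """A client counts as still active only while the current inactivity
    streak is shorter than their longest past one."""
    return current_inactive_days >= max_inactive_days

# Hypothetical numbers: both clients have been inactive for 30 days
clients = {
    111: {"max_inactive_days": 20, "current_inactive_days": 30},
    112: {"max_inactive_days": 45, "current_inactive_days": 30},
}
labels = {cid: has_churned(**vals) for cid, vals in clients.items()}
print(labels)  # {111: True, 112: False}
```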
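A minimal sketch of such a sample, assuming NumPy's `Generator.integers` as the discrete uniform sampler. The column names, the seed, and the value ranges (up to 60 days) are assumptions made for illustration, not values taken from real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
n_clients = 389

df = pd.DataFrame({
    "client_id": np.arange(1, n_clients + 1),
    # rng.integers samples a discrete uniform; the upper bound is exclusive
    "max_inactive_days": rng.integers(1, 61, size=n_clients),      # assumed range 1..60
    "current_inactive_days": rng.integers(0, 61, size=n_clients),  # assumed range 0..60
})
print(df.head())
```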
And now, if possible, let’s make a few assumptions about the target users of our online service. Sometimes the business can help formulate such assumptions, but even if such a hint is not possible in your case, you can try to formulate some ideas logically.
For example, suppose your service involves small but frequent purchases, and one of your ways of attracting customers is paying for a targeted action (for example, the first purchase, subscription or registration). Then there may be quite a lot of new customers who are interested only in performing that targeted action, simply because the traffic acquisition channels pay them more than they spend on your product. The lifetime of such clients does not exceed one or two days, so they introduce a strong bias into the data you are researching and should be excluded from the analysis.
Also, in your case, you might want to exclude clients whose total number of active days does not exceed a few days.
Moreover, you may not be interested in researching customers whose maximum number of days of inactivity is below a certain threshold. If someone has not used your online service for 7 days, maybe the client just went on vacation, and this is not a cause for concern.
Specifically, in our case, we will exclude everyone whose maximum number of days of inactivity in their history is less than 10:
And now let’s label churn for our hypothetical customers:
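A sketch of this filter on a small hand-made frame. The column names and values are hypothetical; only the threshold of 10 days comes from the text.

```python
import pandas as pd

# Hypothetical clients, for illustration only
df = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "max_inactive_days": [3, 10, 25, 9],
    "current_inactive_days": [1, 12, 7, 40],
})

MIN_MAX_INACTIVE_DAYS = 10  # threshold from the text

# Keep only clients whose longest past gap is at least 10 days
filtered = df[df["max_inactive_days"] >= MIN_MAX_INACTIVE_DAYS].reset_index(drop=True)
print(filtered)
```

Here clients 1 and 4 are dropped, because their longest past gaps (3 and 9 days) are below the threshold.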
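Applied to a whole dataframe, the rule formulated earlier becomes a single vectorized comparison. The frame and the `churned` column name below are hypothetical; the rule itself (current inactivity not shorter than the longest past gap) is the assumption stated in the text.

```python
import pandas as pd

# Hypothetical clients that already passed the 10-day filter
df = pd.DataFrame({
    "client_id": [111, 112, 113],
    "max_inactive_days": [20, 45, 30],
    "current_inactive_days": [30, 30, 30],
})

# 1 = considered churned, 0 = considered still active
df["churned"] = (df["current_inactive_days"] >= df["max_inactive_days"]).astype(int)

churn_share = df["churned"].mean()  # fraction of clients labeled as churned
print(df)
```

Client 113 shows the boundary case: a current gap equal to the longest past one is labeled as churn, because the text keeps a client active only while the current gap is strictly shorter.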
Great, we have the data for the study ready. Let’s first visually assess how customers who have most likely left our online service differ from those who have not, based on the current number of days of inactivity:
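A histogram per group is the natural picture here. As a plotting-free sketch of the same comparison, the counts behind such a histogram can be tabulated directly. Everything below (seed, value ranges, column names, the 10-day bin width) is an assumption for illustration on synthetic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_clients = 389

# Synthetic sample; max inactivity starts at 10 to mimic the filter above
df = pd.DataFrame({
    "max_inactive_days": rng.integers(10, 61, size=n_clients),
    "current_inactive_days": rng.integers(0, 61, size=n_clients),
})
df["churned"] = (df["current_inactive_days"] >= df["max_inactive_days"]).astype(int)

# Shared 10-day bins [0, 10), [10, 20), ..., [60, 70) so both groups are comparable
bins = np.arange(0, 71, 10)
df["inactivity_bin"] = pd.cut(df["current_inactive_days"], bins=bins, right=False)

# Rows: bins of current inactivity; columns: churn label
counts = pd.crosstab(df["inactivity_bin"], df["churned"])
print(counts)
```

By construction, nobody with fewer than 10 days of current inactivity can be labeled as churned in this sample, so the two groups separate most clearly at the low end of the scale.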