Measuring long-term outcomes using short-term data and surrogates

Image courtesy of Cai et al. (2023)

Solo post: Director's cut

When measuring the outcomes of an intervention, organizations usually observe and quantify immediate or short-term results. For example, marketing could drive additional traffic, a discounted shipping rate could increase conversion rates, and a price promotion or a loyalty program could drive sales. In most cases, however, these interventions also have effects that materialize over a longer period of time. After being exposed to a promotion, customers may become more price sensitive and start buying cheaper products, or strategically time their purchases to take advantage of the next promotion. Companies rarely conduct multi-month (or even multi-year) experiments to compare alternatives and find the option that optimizes long-term return on investment (ROI). Decisions must be made in the absence of long-term results.

To address this shortcoming, Susan Athey et al. published a paper in 2019 on combining short-term proxies (aka surrogates or surrogate indices) to estimate long-term treatment effects [1]. The Athey et al. paper is difficult to read for a non-technical audience, but offers interesting insights into this problem. Below I will explain what a surrogate index is and when it can be used.

What is a surrogate and a surrogate index?

A surrogate is a short-term outcome that is used as a proxy to measure a long-term outcome. For example, in a marketing campaign, website traffic might be used as a surrogate for long-term sales.

A surrogate index is simply a combination of multiple surrogates aggregated into a single metric. Having an index of multiple surrogates is better than relying on a single surrogate. The index reduces the variability associated with a single surrogate, resulting in more reliable predictions. Multiple surrogates ensure that the surrogate index covers a broader range of causal pathways to the long-term outcome by capturing different aspects of the treatment effect. This reduces the risk of omitting important intermediate effects that a single surrogate might miss.

The causal effect of treatment is measured on short-term outcomes (surrogates). These surrogates are then used to estimate the average treatment effect on long-term outcomes. This is particularly useful in scenarios where it is not possible to wait for long-term outcomes.

When can you use a surrogate to measure long-term outcomes?

Imagine that we have two samples of data: an observational sample and an experimental sample. For the observational sample, we have access to some pretreatment covariates, short-term secondary outcomes, and a long-term outcome. For the experimental sample, we can observe whether a subject is assigned to a binary treatment, that subject's pretreatment covariates, and short-term secondary outcomes. The long-term outcome is not observed in the experimental sample.
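One way to write down the two samples is sketched below. The symbols are my own shorthand for this post and are assumptions, not taken from the text above: X for pretreatment covariates, S for the short-term secondary outcomes (surrogates), Y for the long-term outcome, W for the binary treatment, and N_O and N_E for the two sample sizes.

```latex
\begin{aligned}
\text{Observational sample:}\quad & (X_i,\, S_i,\, Y_i), & i &= 1, \dots, N_O \\
\text{Experimental sample:}\quad  & (W_i,\, X_i,\, S_i), & i &= 1, \dots, N_E, \quad W_i \in \{0, 1\}
\end{aligned}
```

Note what is missing in each row: the observational sample has no treatment indicator, and the experimental sample has no long-term outcome.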

For example, if the treatment is a discounted shipping offer sent to a subset of customers, pretreatment covariates might include the customer's age, marital status, household size, geographic location, and income. The long-term outcome of interest might be purchase frequency, and short-term secondary outcomes (surrogates) might include bounce rate, number of pages per session, cart abandonment rate, and average order value.
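The two-sample idea can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not the estimator from the paper: the data-generating process, the coefficients, the variable names (income, pages, order_value, purchases), and the use of a plain linear regression as the surrogate index are all assumptions made for the example. The simulation is built so that the treatment affects purchases only through the two surrogates, with a true long-term effect of 2.5.

```python
import random


def fit_ols(rows, targets):
    """Return least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * t for d, t in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def simulate_customer(rng, treated):
    """Hypothetical process: the treatment acts only through the two surrogates."""
    income = rng.gauss(0, 1)                                        # pretreatment covariate
    pages = 0.5 * income + 1.0 * treated + rng.gauss(0, 1)          # surrogate 1
    order_value = -0.3 * income + 0.5 * treated + rng.gauss(0, 1)   # surrogate 2
    purchases = 2.0 * pages + 1.0 * order_value + 0.5 * income + rng.gauss(0, 1)
    return income, pages, order_value, purchases


def surrogate_index_effect(observational, experimental):
    # Step 1: learn the index (predicted long-term outcome) on the observational sample.
    coefs = fit_ols(
        [(s1, s2, x) for x, s1, s2, _ in observational],
        [y for *_, y in observational],
    )
    # Step 2: compute the index for each experimental unit and compare the arms.
    treated, control = [], []
    for w, x, s1, s2 in experimental:
        index = coefs[0] + coefs[1] * s1 + coefs[2] * s2 + coefs[3] * x
        (treated if w else control).append(index)
    return sum(treated) / len(treated) - sum(control) / len(control)


rng = random.Random(0)
# Observational sample: covariate, surrogates, and long-term outcome; no treatment.
observational = [simulate_customer(rng, treated=0) for _ in range(4000)]
# Experimental sample: treatment, covariate, and surrogates; long-term outcome discarded.
experimental = []
for _ in range(4000):
    w = rng.randint(0, 1)
    x, s1, s2, _unobserved = simulate_customer(rng, treated=w)
    experimental.append((w, x, s1, s2))

estimate = surrogate_index_effect(observational, experimental)
print(f"estimated long-term effect: {estimate:.2f} (simulated truth: 2.50)")
```

In practice the index could be any predictive model of the long-term outcome given the surrogates and covariates. A linear regression is used here only to keep the sketch short.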

Under three assumptions, the average effect of treatment on the longer-term outcome can be estimated from the average effect of treatment on surrogates:

Unconfoundedness: Treatment assignment should be independent of potential outcomes, given the pretreatment covariates. In the example above, the discounted shipping offers should ideally be randomly assigned to customers, independent of their demographics and of their pre-existing Net Promoter Scores, bounce rates, pages per session, cart abandonment rates, and average order values. Unconfoundedness ensures that any observed effects are due to the treatment and not to other variables.

Surrogacy: Surrogacy refers to the idea that the causal path from a treatment to a long-term outcome runs entirely through the surrogates. In simple terms, this means that the treatment affects the long-term outcome only by affecting the short-term outcomes (surrogates).

Comparability: Comparability refers to the requirement that the relationship between the surrogate and the long-term outcome be consistent across different samples. This assumption ensures that the surrogate index developed in one setting can be applied to another without bias.
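For readers who prefer symbols, the three assumptions can be written roughly as follows. The notation is an assumption introduced for this sketch: W is the treatment, X the pretreatment covariates, S the surrogates, Y the long-term outcome, Y(0) and Y(1) (and S(0), S(1)) the potential outcomes, and P an indicator of whether a unit belongs to the observational (O) or experimental (E) sample. The last two lines define the surrogate index h and show how it yields the long-term effect when treatment is randomly assigned.

```latex
\begin{aligned}
\text{Unconfoundedness:}\quad & W \perp\!\!\!\perp \bigl(Y(0),\, Y(1),\, S(0),\, S(1)\bigr) \mid X,\; P = E \\
\text{Surrogacy:}\quad        & W \perp\!\!\!\perp Y \mid S,\, X,\; P = E \\
\text{Comparability:}\quad    & P \perp\!\!\!\perp Y \mid S,\, X \\[6pt]
\text{Surrogate index:}\quad  & h(s, x) = \mathbb{E}\left[\, Y \mid S = s,\, X = x,\, P = O \,\right] \\
\text{Long-term effect:}\quad & \tau = \mathbb{E}\left[\, h(S, X) \mid W = 1,\, P = E \,\right] - \mathbb{E}\left[\, h(S, X) \mid W = 0,\, P = E \,\right]
\end{aligned}
```

Surrogacy lets us drop W once we condition on S and X, and comparability lets us replace the experimental-sample conditional mean of Y with h, which is estimated on the observational sample.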

Even if the surrogacy assumption partially fails, a surrogate index may still be useful for estimating a well-defined causal effect (if not the precise effect on the long-term outcome). However, if the comparability assumption fails, the results may not generalize well to other contexts.
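A small simulation can show what a partial failure of surrogacy looks like. Everything here is hypothetical: the treatment affects purchases through two channels (pages and order_value), but the index is built on pages alone. The simulated truth is 2.5, of which 2.0 runs through pages. In this setup the two surrogates are independent given treatment, so the index recovers the part of the effect that flows through the included surrogate and misses the rest.

```python
import random


def slope_and_intercept(xs, ys):
    """Simple least-squares fit of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def simulate(rng, treated):
    """Hypothetical process with two causal channels; true effect = 2.0*1.0 + 1.0*0.5 = 2.5."""
    pages = 1.0 * treated + rng.gauss(0, 1)
    order_value = 0.5 * treated + rng.gauss(0, 1)
    purchases = 2.0 * pages + 1.0 * order_value + rng.gauss(0, 1)
    return pages, order_value, purchases


rng = random.Random(1)

# Observational sample: fit the index on pages only, omitting order_value.
observational = [simulate(rng, treated=0) for _ in range(6000)]
slope, intercept = slope_and_intercept(
    [pages for pages, _, _ in observational],
    [purchases for _, _, purchases in observational],
)

# Experimental sample: compare the one-surrogate index across arms.
treated_index, control_index = [], []
for _ in range(6000):
    w = rng.randint(0, 1)
    pages, _, _ = simulate(rng, treated=w)
    (treated_index if w else control_index).append(intercept + slope * pages)

partial_effect = sum(treated_index) / len(treated_index) - sum(control_index) / len(
    control_index
)
print(f"index on pages only: {partial_effect:.2f} (simulated truth: 2.50)")
```

The estimate lands near 2.0 rather than 2.5: a well-defined quantity (the effect mediated by pages), but not the full long-term effect.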

Applications

The following applications demonstrate the value of surrogates in measuring long-term outcomes:

Instacart's economics team uses surrogate indices to estimate the long-run heterogeneous treatment effects of membership incentives. This approach helps them determine which users should receive incentives such as free trials or discounted memberships, and at what value, in order to maximize long-run value (LTV).

Instacart's particular challenge was the comparability assumption: they wanted to calculate long-term treatment effects for periods where they didn't have comparable data. To overcome this challenge, Instacart developed a parametric projection approach that can be validated for periods with comparable data, which provides confidence in estimates for periods without such data. They also created an experiment library that stored data from previous incentive experiments in a unified way, and trained their surrogate index on experimental rather than purely observational data. See the details here.

Netflix compares decisions made with a surrogate index using 14 days of data to those made with 63 days of direct measurements, using data from 200 A/B tests. The results show that surrogate index models achieve ~95% consistency with long-term direct measurements, suggesting that shorter test cycles using surrogate indices can potentially increase experimentation capacity without significantly compromising decision quality. You can find more details here.
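To make "consistency" concrete, here is one way such a comparison could be computed. This is a sketch under my own assumptions, not Netflix's method: the decision rule (a 95% confidence interval that excludes zero), the AbTest fields, and the four example tests are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AbTest:
    short_effect: float  # effect on the surrogate index after the short window
    short_se: float      # its standard error
    long_effect: float   # directly measured effect after the long window
    long_se: float       # its standard error


def decision(effect, se, z=1.96):
    """Hypothetical rule: act only when the confidence interval excludes zero."""
    if effect - z * se > 0:
        return "ship"
    if effect + z * se < 0:
        return "do not ship"
    return "inconclusive"


def consistency(tests):
    """Share of tests where the short-window and long-window decisions agree."""
    agree = sum(
        decision(t.short_effect, t.short_se) == decision(t.long_effect, t.long_se)
        for t in tests
    )
    return agree / len(tests)


# Made-up results for four tests.
tests = [
    AbTest(0.30, 0.10, 0.25, 0.08),    # ship in both windows
    AbTest(-0.40, 0.10, -0.35, 0.09),  # do not ship in both windows
    AbTest(0.05, 0.10, 0.02, 0.08),    # inconclusive in both windows
    AbTest(0.25, 0.10, 0.10, 0.08),    # ship (short) vs. inconclusive (long)
]
print(f"decision consistency: {consistency(tests):.0%}")
```

With these four tests, three of the four decisions agree.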

Wayfair's A/B testing often requires waiting 30-60 days or more to measure lift (delayed rewards), creating a trade-off between speed and optimizing for long-term metrics. To solve this problem, they developed Demeter, an experimental analysis platform that uses surrogate index methodology to combine multiple short-term outcomes to predict long-term outcomes and provide unbiased estimates of treatment impact.

Demeter requires two datasets - an observational dataset of delayed reward measurements and an experimental dataset from an A/B test - to train a model that predicts delayed rewards based on leading indicators and historical context. Through validation testing, Wayfair demonstrated that Demeter could accurately predict 60-day sales increases with only 14 days of observational data, significantly reducing the time needed to make business decisions. You can find out more about it here.

Bottom line

The surrogate index method works by combining multiple short-term outcomes into a predicted value of the long-term outcome, allowing long-term treatment effects to be estimated. The feasibility and usefulness of this approach depend on the quality and selection of surrogates. If key surrogates are missing, the method performs poorly and can lead to significant bias in estimating long-term effects. This is also in line with our discussion of data centricity, where we underline the importance of having the right kind of data to make a model useful in practice.


[1] Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely (No. w26463). National Bureau of Economic Research.


