Does Lord approve of your experimental analysis?


Director's cut

I will leave Lord's (1967) judgment to the next section and get to the data. After running a randomized controlled trial (or experiment), modeling is the next natural step. Here is a list of some commonly used methods (among many others):

  1. Paired comparison tests
  2. Repeated measures ANOVA
  3. Analysis of covariance (ANCOVA)
  4. Regression using a differences in differences (diff-in-diff) setup, and
  5. Regression using a diff-in-diff setup with two-way fixed effects (TWFE) added

What would be the differences in estimated effect size across these methods given there is time- and subject-level variation? We asked this question and used the methods listed above for comparison. The results are as follows:

 

| Model | Average Treatment Effect |
| --- | --- |
| Paired t-test | -0.9019*** |
| Repeated measures ANOVA | -0.3786*** |
| ANCOVA | -0.3184*** |
| Differences in differences | -0.3717*** |
| Two-way fixed effects diff-in-diff | -0.3849*** |

*** p < 0.001, ** p < 0.01, * p < 0.05.

Directionally, all models show a negative and statistically significant average treatment effect. That's good. However, the coefficients are not the same and this has implications for our business decisions. So, we need to know: what is the correct effect size? The answer depends on the question and data, so it is probably a good idea to clarify them at this point.

This is a price elasticity test. Our goal is to measure the unit sales impact of a price increase on a single product: what is the expected change in sales after increasing the price? The price increase is around 15%. The data comes from about 400-500 retail stores (half test, half control) for a single product, and covers 90 days before and after the treatment.  

Model 1: Paired T-Test

This paired t-test compares only the before and after unit sales for the test stores. The unit sales are averaged over time during the pre and post periods separately for each store i (so this test does not really help achieve our goal).

Even if the test stores perform better or worse overall after the treatment, the difference could be due to other exogenous factors (such as weather, change in the marketing spend, or something else). Without a control group, it is not possible to attribute the change in sales to the price increase. 
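For concreteness, here is a minimal sketch of this comparison in Python (not the original analysis code), assuming a hypothetical daily panel DataFrame with columns store, post, test, and units; the names are illustrative only:

```python
import pandas as pd
from scipy import stats

def paired_t_test(store_day: pd.DataFrame):
    """Paired t-test on pre/post average unit sales, test stores only.

    Assumes one row per store per day with columns:
    store, post (0/1), test (0/1), units (daily unit sales).
    """
    test_stores = store_day[store_day["test"] == 1]
    # Average unit sales per store, separately for the pre and post periods.
    pre = test_stores[test_stores["post"] == 0].groupby("store")["units"].mean()
    post = test_stores[test_stores["post"] == 1].groupby("store")["units"].mean()
    pre, post = pre.align(post, join="inner")  # keep stores observed in both periods
    return stats.ttest_rel(post, pre)  # note: no control stores enter this test
```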

Model 2: Repeated Measures ANOVA

Repeated measures ANOVA allows the sales in the test and control stores to be compared before and after the treatment. 

$Y_{i,t} = \alpha + \beta_1Post_{i,t}* Test_{i} + \varepsilon$

$Y_{i,t}$ is the average unit sales of store i, pre and post the treatment. $Post_{i,t}$ is a dummy variable equal to 1 if t is post treatment and 0 otherwise, and $Test_{i}$ is a dummy variable equal to 1 for test stores and 0 for control stores.
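As a rough sketch (again, not the author's code), the equation above can be fit as a plain regression with statsmodels, assuming a hypothetical DataFrame with two rows per store (pre and post averages); a classical repeated measures ANOVA would additionally account for the within-store correlation across the two periods, which this sketch ignores:

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_anova_regression(store_period: pd.DataFrame) -> float:
    """Regression form of the equation above: intercept plus Post x Test interaction.

    Assumes two rows per store (pre and post) with columns:
    units (average unit sales in the period), post (0/1), test (0/1).
    """
    fit = smf.ols("units ~ post:test", data=store_period).fit()
    return fit.params["post:test"]  # estimated average treatment effect
```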

In measuring experimental effects, controlling for covariates is an important step to achieve balance between test and control stores. Paired t-test and repeated measures ANOVA do not control for whether the unit sales in test and control stores are balanced before treatment. ANCOVA allows us to add control variables to the model.

Model 3: ANCOVA

ANCOVA is an extension of ANOVA. The model allows the inclusion of covariates (in this case, sales in pretreatment periods) as controls for the differences across stores as follows: 

$Y_{i, post} = \alpha + \beta_1 Test_i + \beta_2Y_{i,pre} + \varepsilon$

$Y_{i,pre}$ and $Y_{i,post}$ are the average unit sales in store i before and after the treatment. $Test_i$ is a dummy variable equal to 1 for test stores and 0 for control stores.
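A hedged Python sketch of this model, assuming a hypothetical store-level DataFrame with one row per store and illustrative column names:

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_ancova(store_level: pd.DataFrame) -> float:
    """ANCOVA: post-period sales regressed on treatment, controlling for pre-period sales.

    Assumes one row per store with columns:
    units_post, units_pre (average unit sales after/before), test (0/1).
    """
    fit = smf.ols("units_post ~ test + units_pre", data=store_level).fit()
    return fit.params["test"]  # estimated average treatment effect
```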

Model 4: Differences in Differences 

The fourth model is a classical differences in differences model. This model compares the change in sales for the test and control stores. Unlike the previous three models, the daily sales observations enter the model directly rather than being averaged over the pre and post periods:

$Y_{i, t} = \alpha + \beta_1 Post_{i,t} + \beta_2Test_{i} +  \beta_3Post_{i,t}* Test_{i} + \varepsilon$  [Canonical diff-in-diff coding] 

OR

$Y_{i, t} = \alpha + \beta_1 Treated_{i,t} + \varepsilon$   [Generalized diff-in-diff coding]

$Y_{i, t}$ is the unit sales of store i on day t:

Canonical: $Post_{i,t}$ is a dummy variable equal to 1 if day t is post treatment and 0 otherwise, and $Test_{i}$ is a dummy variable equal to 1 for test stores and 0 for control stores.

Generalized: $Treated_{i,t}$ is a dummy variable equal to 1 for test stores on days when the treatment is in effect, and 0 otherwise.
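A minimal Python sketch of the canonical coding, assuming the same hypothetical daily panel as before (the generalized coding is noted in a comment):

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_diff_in_diff(store_day: pd.DataFrame) -> float:
    """Canonical diff-in-diff on the daily panel.

    Assumes one row per store per day with columns:
    units (daily unit sales), post (0/1), test (0/1).
    Generalized coding: smf.ols("units ~ treated", ...) with treated = post * test.
    """
    fit = smf.ols("units ~ post + test + post:test", data=store_day).fit()
    return fit.params["post:test"]  # diff-in-diff estimate of the treatment effect
```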

The models I have used so far still have limitations:

I did not control for time

"Pre" and "post" treatment observations are collected over time, and things change over time. For example:

  • On the first day of the experiment, a competitor launches a promotion,
  • On the second day, another competitor matches the same promotion,
  • On the third day, both competitors end their promotions,
  • On the fourth day, a winter storm hits the Northeast region, 
  • On the fifth day, the storm moves into the Midwest region.

Assuming all periods are the same and not controlling for time would bias the average treatment effect. What if customers stockpiled on the third day in anticipation of the upcoming winter storm? What if some stores didn't receive any inventory during the storm?

I did not control for differences between stores

The following are some characteristics that may explain such differences: 

  • Not all stores are in the same location; some stores are surrounded by competition,
  • Some stores have public parking and/or are accessible by public transportation,
  • Some stores are closer to universities and surrounded by a younger population, 
  • Some stores are in suburban locations and surrounded by families.

Some of these may be observed and can be explicitly controlled in the model. Some others may not be observed. If such differences exist and are not controlled for, the average treatment effect would again be biased.

The following model controls for time- and store-level differences. Instead of including them explicitly in the model, it uses a fixed-effects estimator. This is a magical way to control for such differences without measuring them explicitly. For more information on the fixed-effects estimator, see the Wikipedia entry.

Model 5: Two way fixed effects 

Finally, the fifth model uses a two-way fixed effects differences in differences setting.

$Y_{i, t} = \alpha + \beta_1 Treated_{i,t} + Day_{t}  + Store_{i}+ \varepsilon$

$Treated_{i,t}$ is equal to 1 for treated stores during days when treatment is in effect. This model is called "two-way fixed effects" because it has two sets of fixed effects: one for time ($Day_{t}$) and the other one for store ($Store_{i}$). 
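A hedged sketch of this specification in Python, assuming the same hypothetical daily panel; the clustered standard errors are a common convention for store-level panels, not necessarily what was used here:

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_twfe(store_day: pd.DataFrame) -> float:
    """Two-way fixed effects diff-in-diff on the daily panel.

    Assumes one row per store per day with columns:
    units, treated (1 for test stores on post-treatment days, else 0), day, store.
    C() expands day and store into dummy variables (the two sets of fixed effects).
    """
    fit = smf.ols("units ~ treated + C(day) + C(store)", data=store_day).fit(
        cov_type="cluster", cov_kwds={"groups": store_day["store"]}
    )
    return fit.params["treated"]  # TWFE estimate of the treatment effect
```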

After modeling the data using five different methods and getting a different result from each model, which one is the correct average treatment effect in this experiment? Is the decrease in sales 0.3 units or 0.4 units (both rounded)?


Academic's take

This is an interesting exercise. Depending on the heterogeneity across stores and time, the estimated effect size could differ more (or less for that matter). Building on the work in the first section, I would like to focus on Lord's paradox (1967).

Needless to say, even though the same data are modeled for the same treatment, the assumptions behind the methods are different. Perhaps even more importantly, the question and conceptual model also vary across the approaches. In the models above, for example, the paired t-test is a clear example of where a different question is being answered.

The paired t-test is not answering the question asked: "What is the average unit sales impact of a 15% price increase on the product?" Instead, it answers a different question: "What is the average sales difference between the treated (test) stores before and after the price increase?" In addition, the paired t-test assumes independence between stores. I don't have any information on the nature of the data used in the analysis. However, this is likely problematic in most retail settings where, for example, some of the stores would be in close proximity to each other while others would be more distant. Other dependencies may also exist. I will leave these aside and focus on implicitly varying conceptual models.

Lord's paradox

When modeling the same question on the same data, and using the same controls, a (somewhat naïve) expectation is that the repeated measures ANOVA and ANCOVA would yield identical results. Why? Let's see.

In the first section, ANOVA is coded a bit differently from conventional norms (and they say academics are the ones who complicate things!). Let's assume the coding of ANOVA is as follows:

$(Y_{i,post} - Y_{i,pre})  = \alpha + \beta_1Test_{i} + \varepsilon$

$(Y_{i,post} - Y_{i,pre})$ is the difference in the average unit sales of store i before and after treatment. $Test_{i}$ is a dummy that is equal to 1 for test stores and 0 for control stores.

This is basically a simple linear regression model, where the difference in unit sales before and after the treatment is regressed on whether a store is treated.

ANCOVA, on the other hand, is still coded as follows:

$Y_{i,post} = \alpha + \beta_1Test_i + \beta_2Y_{i,pre} + \varepsilon$

$Y_{i,pre}$ and $Y_{i,post}$ are the average unit sales of store i pre and post the treatment. $Test_{i}$ is again a dummy that is equal to 1 for test stores and 0 for control stores.

This is basically a multiple linear regression model, where the unit sales (not the difference in unit sales) is regressed on whether a store is treated, controlling for unit sales in the pretreatment period.

In other words, in the first regression, the difference in unit sales (before and after) is the outcome variable. In the second regression, the unit sales following the treatment is the outcome variable, and the pretreatment sales is included in the model as a control. These two models look quite similar, right? So why are the results different? The first model estimates a decrease of about 0.4 units, and the second one estimates a decrease of about 0.3 units in sales.
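One way to make the difference concrete (a standard algebraic observation, not from the original discussion): the change-score regression is the ANCOVA regression with the coefficient on pretreatment sales forced to equal one,

$Y_{i,post} - Y_{i,pre} = \alpha + \beta_1 Test_i + \varepsilon \;\Longleftrightarrow\; Y_{i,post} = \alpha + \beta_1 Test_i + 1 \cdot Y_{i,pre} + \varepsilon$

whereas ANCOVA estimates that coefficient ($\beta_2$) from the data. When the estimated $\beta_2$ differs from one and the test and control stores differ in pretreatment sales, the two estimates of $\beta_1$ will generally diverge.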

Well, that's because the conceptual models underlying the two models are different.

In causal inference, data alone cannot tell the whole story. Let's make the assumptions explicit here:

The first solution, the repeated measures ANOVA (simple linear regression), assumes that the proportion of stores with similar sales levels is the same in the treated and control groups. In other words, the proportion of control stores with high/low sales is assumed to be roughly equal to the proportion of treated stores with high/low sales, and so on.

Another way to put this is to assume that the pretreatment unit sales in a store have no effect on whether the store is treated (e.g., whether the price of the product is increased in that store). That would be a completely randomized design. What if this is not the case?

Lord (1967) argues that the two solutions (ANOVA and ANCOVA) are both correct. That is probably true based on the data alone, without a conceptual model. If the conceptual model says, however, that the pretreatment sales have an effect on the treatment, then the plot thickens. This would mean that the price was increased, perhaps, in stores with high unit sales (say, to minimize the impact on revenue).

Another reasonable expectation would be that posttreatment sales are affected by pretreatment sales (say, stores that sell more before the treatment tend to sell more after it). The following directed acyclic graph (DAG) summarizes the conceptual model: pretreatment sales affect both the treatment (price increase) and posttreatment sales, and the treatment affects posttreatment sales.

If this is the conceptual model, the results of the ANOVA model would be incorrect. Why? Because pretreatment sales are a confounder in this conceptual model, and the confounder must be controlled for to isolate the effect of the treatment.

As you may have guessed, the estimated effect size in the ANCOVA model would be more correct under the conceptual model above. More technically, adding the pretreatment sales to the model blocks the backdoor path between the treatment and the posttreatment sales (through pretreatment sales), so that the estimated effect is the causal effect of the treatment on posttreatment sales (Pearl, 2000).

If you like, you can read a summary of Lord's paradox on Wikipedia here (which is a fantastic wiki entry by the way) or delve into the details of the discussion around the paradox by starting with Lord (1967, 1969, 1975) and Holland and Rubin (1983), and continuing with Pearl (2016) and Pearl and Mackenzie (2018), among several others.

On a final note, the diff-in-diff models estimate an effect size that is closer to the effect from the ANOVA model by focusing on the differences (comparing the difference in unit sales between treated and control stores before and after the treatment). If the conceptual model I speculated about in the DAG holds, a lagged regression model with fixed effects can be used here.* If the problem follows a different conceptual model without such a confounder, the results of the TWFE diff-in-diff model would be preferable to those of the ANCOVA model, given the additional controls for time and for store-level time-invariant heterogeneity (provided the parallel pretreatment trends assumption is not violated).

* For a discussion on the differences between a diff-in-diff and lagged regression model, see Ding and Li (2019). Estimating a lagged regression model with fixed effects can be challenging, though. See Angrist and Pischke (2009) for a discussion.


Implications for data centricity

Data centricity is staying true to the data. In this example, the estimated effect size (change in sales due to the price increase) is different depending on the underlying assumptions about the data generation process and the resulting data set. Specifically, staying true to the data requires understanding whether the price increase was influenced in any way by prior sales. The data generated in such a scenario (where past sales dictate the price increase) would be different from the data generated in the scenario where the price increase is applied regardless of past sales. In this case, data centricity is established by correctly identifying how the data was generated, leading up to and including the treatment (price increase).

References

  • Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.
  • Ding, P., & Li, F. (2019). A bracketing relationship between difference-in-differences and lagged-dependent-variable adjustment. Political Analysis, 27(4), 605–615.
  • Lord, F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin, 68, 304–305.
  • Lord, F. M. (1969). Statistical adjustments when comparing preexisting groups. Psychological Bulletin, 72, 336–337.
  • Lord, F. M. (1975). Lord's paradox. In S. B. Anderson, S. Ball, R. T. Murphy, & Associates, Encyclopedia of Educational Evaluation (pp. 232–236). San Francisco, CA: Jossey-Bass.
  • Holland, P. W., & Rubin, D. B. (1983). On Lord's paradox. In Principals of modern psychological measurement (pp. 3–25).
  • Pearl, J. (2000). Causality: Models, Reasoning and Inference. Cambridge University Press.
  • Pearl, J. (2016). Lord's Paradox Revisited – (Oh Lord! Kumbaya!). Journal of Causal Inference, 4(2).
  • Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. New York, NY: Basic Books.
