How a supposedly data-centric decision cost Walgreens $200 million and how to avoid it

Image courtesy of Rob Klas - bloomberg.com

Introduction to the business case

The business case discussed in this post was reported by Bloomberg under the title "Walgreens Replaced Fridge Doors With Smart Screens. It’s Now a $200 Million Fiasco". You can find the article here. In summary, a startup promised Walgreens that its high-tech fridges would track shoppers and spark an in-store advertising revolution. The project then failed miserably for a number of reasons. Most importantly, Walgreens faced a backlash when customers ended up staring at their own reflections in dark screens instead of seeing the drinks through glass doors. Store associates rushed to put signs on the coolers to explain which drinks were in which one. The project went so badly that it ended in a lawsuit between the startup and Walgreens; not only did it fail to deliver business value, it also resulted in losses in customer satisfaction, employee morale, and revenue. But why was this allowed to happen? The question we ask and answer for this case is:

What went wrong, and how could the disaster have been avoided?

Director's cut 

An experiment likely designed wrong

“‘We’re not tech guys,’ Avakian remembers the Walgreens team saying. ‘Prove it to us.’ He and Wasson say that based on their PowerPoint presentation, the company approved a six-store pilot program for 2018.”

Launching a pilot program (or experiment) to prove the financial return on an investment is a sound decision only if the experiment is designed properly. The pilot program initiated by Walgreens and Cooler Screens raises several concerns that may have affected the validity of the results. A well-designed retail experiment:

  1. Selects test (pilot) stores based on demographic, geographic and economic factors to ensure they are truly representative of the broader chain. 
  2. Pairs test stores with control stores that have similar sales patterns and customer demographics. 
  3. Maintains sufficient geographic separation between the test and control stores to prevent customer spillover between locations. 
  4. Has a sample size and test duration informed by a power analysis, so that genuine performance improvements can be distinguished from random noise; the limited scope of six stores alone raises questions about statistical significance (a sketch of such a calculation follows this list).
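
To make the power-analysis point concrete, here is a minimal sketch of the kind of back-of-envelope calculation that should precede a pilot. The baseline sales and variability numbers are entirely assumed (the article reports none), and treating store-weeks as independent observations is itself a generous simplification.

```python
# Minimal power-analysis sketch with assumed numbers (not Walgreens data):
# how many store-weeks per arm would we need to reliably detect a 5% lift
# in cooler-category sales, given week-to-week noise?
from statsmodels.stats.power import TTestIndPower

baseline_weekly_sales = 10_000  # assumed average cooler sales per store-week ($)
weekly_sales_sd = 2_500         # assumed week-to-week standard deviation ($)
expected_lift = 0.05            # the 5% effect the pilot supposedly showed

# Standardized effect size (Cohen's d) = 500 / 2500 = 0.2, i.e. a "small" effect
effect_size = (expected_lift * baseline_weekly_sales) / weekly_sales_sd

n_per_arm = TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"Store-weeks needed per arm: {n_per_arm:.0f}")  # roughly 390 under these assumptions
```

Under these (made-up) assumptions, detecting a 5% lift would take on the order of 400 store-weeks per arm, far more than six pilot stores can deliver in a short pre-season window.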

In this case, it's unclear whether the size and duration of the experiment were dictated by Cooler Screens (if there was a proper experiment at all!), but the story sounds familiar. New product or fixture introductions typically require an estimate of the return on investment. Vendors generally can't afford to run large-scale experiments, so they propose a limited number and list of stores where they can ship the product or install the fixture free of charge. These decisions are usually made, and the experiment run, before the peak of the season, when the number of transactions is relatively low. It is not unusual for data science teams to be informed only after the experiment has launched. They then end up trying to fit a (relatively) sound methodological framework around it to get a statistically significant read (if they can).

Assuming the design takes all of the above into account, I would still be suspicious of the 5% lift. New fixture installations typically involve a complete restocking, which could temporarily inflate sales regardless of the effectiveness of the technology. With only six stores, even the attention paid (and drinks grabbed) by Cooler Screens employees may have increased sales!

Operational dependencies and missed opportunities

Cooler Screens' digital fridge doors, if they worked as intended, had benefits beyond digitizing the planogram and displaying advertisements. They would also track whether shelves were stocked and allow prices to be changed dynamically. However, the value of these capabilities does not seem to have been realized, most likely because the supporting infrastructure and processes were inadequate.

Based on the article, the digital doors paradoxically made basic stocking tasks more difficult. Shelves were either empty or stocked with the wrong product. That's likely because digital doors required store associates to physically open each door and inspect the shelves, whereas glass doors would allow for a quick visual assessment. That’s another red flag. 

Similarly, while the digital price tags were technologically advanced, the article mentions that a dynamic pricing capability was not provided by Cooler Screens. Without an underlying dynamic pricing algorithm, the digital displays simply replicated traditional price tags. To change prices on these fancy displays, Walgreens likely had to develop an alternative process, separate from its standard labeling process. For reasons unknown, the focus was on ads and ad revenue, and the potential value of pricing was overlooked, which seems like a missed opportunity.
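
To illustrate what that overlooked capability might have looked like, here is a deliberately simple, hypothetical rule-based price adjustment. This is not Cooler Screens' or Walgreens' actual logic; every threshold below is invented for illustration.

```python
# Hypothetical rule-based dynamic pricing sketch (all thresholds invented).
from dataclasses import dataclass

@dataclass
class ShelfState:
    base_price: float       # standard shelf price ($)
    stock_fraction: float   # 0.0 (empty) to 1.0 (fully stocked)
    hours_to_expiry: float  # remaining shelf life for perishables

def dynamic_price(state: ShelfState, floor: float = 0.85, ceiling: float = 1.10) -> float:
    """Nudge the displayed price within [floor, ceiling] x base price."""
    multiplier = 1.0
    if state.stock_fraction < 0.2:   # nearly sold out: small premium
        multiplier += 0.10
    if state.hours_to_expiry < 24:   # close to expiry: mark down to move stock
        multiplier -= 0.15
    multiplier = max(floor, min(ceiling, multiplier))
    return round(state.base_price * multiplier, 2)

print(dynamic_price(ShelfState(base_price=2.49, stock_fraction=0.1, hours_to_expiry=48)))  # 2.74
```

Even a rule this crude would only create value if the price shown on the screen, the price charged at the register, and the shelf-label process all stayed in sync, which brings us back to infrastructure and process.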

A comprehensive digitization strategy would be essential for Walgreens to fully leverage the capabilities of Cooler Screens' doors. This would include: 

  • Implementation of advanced pricing algorithms, and 
  • Development of real-time inventory monitoring capabilities. 

But none of this would work without streamlined processes for store associates.

Data science or AI solutions that are not integrated into business processes tend to end up as fancy and expensive gizmos with low adoption rates. A successful implementation addresses existing pain points rather than creating new ones, and involves the end user in the solution design. 

The key lesson is that technology solutions require a robust supporting infrastructure and well-designed processes to deliver their intended benefits.


Academic's take

To me, blocking the view of soft drinks, which customers can literally touch anyway, with a digital screen sounds like a terrible idea. If the screens are offline for any reason, the contents become completely invisible (that’s why they had to put signs on the doors explaining what’s inside!).

But why was this idea even executed in the first place? Apparently, Walgreens signed a 10-year contract and initially had 10,000 smart doors installed. So why go beyond a limited experiment? My answer is bad data and bad analysis: a poor understanding of causal modeling and data centricity.

I will use four direct quotes from the article to explain what I think may have gone wrong here:

Why did the expected 5% sales growth not happen? 

“Pilot data showed the screens resulting in more than a 5% incremental sales jump, and Walgreens committed to installing them in an additional 50 stores the next year as part of a decade-long deal.”
“Walgreens says each smart door ended up bringing in just $215 that year, or a mere 59¢ a day, about half the contractual minimum and a pittance when measured against the thousands of dollars each door cost to build and install.”

A 5% increase in sales just because the fridges display images of the drinks instead of the drinks themselves does not sound like a plausible effect size. What is the underlying mechanism? Attention? Is it plausible to expect a 5% increase in sales just because the fridges attract more attention from customers who are already in the store? This part needs more convincing.

Also, what kind of model produced the estimated 5% increase here? Was it a predictive model? If so, shouldn't Walgreens have used a causal model instead, say, estimating an average treatment effect in the pilot stores vs. the rest of the stores with "dumb" fridges? So many things can go wrong in estimating such an effect size. With only six stores in the initial test, how did the company ensure that the effect was measured against comparable stores? That requires some serious matching. Otherwise, the assumption of ignorability (or exchangeability) would be violated, casting further doubt on the estimated effect size. In addition, were the installations consistent across the pilot stores? This assumption is also difficult to meet because Walgreens stores vary in size and layout, which leads to variation in the placement and digitization of the fridges. All in all, there are many potential flaws here.
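
To make that concrete, here is a toy sketch of the kind of estimate the 5% claim should rest on: match each pilot store to a comparable control store on pre-period sales (a stand-in for proper matching on demographics, layout, and traffic), then compare the change in sales over the pilot period, a simple matched difference-in-differences. All numbers below are invented.

```python
# Toy matched difference-in-differences sketch; data are invented for illustration.
import pandas as pd

stores = pd.DataFrame({
    "store_id":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "treated":    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # 1 = digital doors installed
    "pre_sales":  [100, 120, 90, 102, 118, 92, 150, 60, 130, 95],  # weekly $k, pre-pilot
    "post_sales": [106, 125, 93, 103, 119, 91, 152, 61, 131, 96],  # weekly $k, during pilot
})

treated = stores[stores.treated == 1]
controls = stores[stores.treated == 0]

# 1) Nearest-neighbour matching on pre-period sales.
pairs = []
for _, t in treated.iterrows():
    c = controls.iloc[(controls.pre_sales - t.pre_sales).abs().argsort().iloc[0]]
    pairs.append((t, c))

# 2) Difference-in-differences on the matched pairs.
did = [(t.post_sales - t.pre_sales) - (c.post_sales - c.pre_sales) for t, c in pairs]
att = sum(did) / len(did)
print(f"Estimated average effect: {att:.2f}k per store-week")
```

Even this toy version makes the fragility obvious: with only a handful of treated stores, a single restocking event or local promotion in one control store can swing the estimate, and nothing in the matching guarantees that ignorability actually holds.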

Why was $33 million in ad revenue expected?

“Cooler Screens had outsourced sales of available advertising slots for its fridges to Yahoo, then a subsidiary of its investor Verizon. But Yahoo barely topped $3 million in sales for the fridges in 2021, 91% lower than projected, a Cooler Screens court filing said.”

This part is even more interesting because my guess is that they could only have used a predictive model here, and it's not clear where the training data for such a model would have come from. Apparently, convenience stores in the US don't have digitized refrigerator doors, so there was no historical data to learn from. I would be curious to know what methodology the startup used and on what assumptions it arrived at a figure of roughly $33 million in ad revenue for Yahoo. Also, how did Yahoo not question that number?!
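
As a purely illustrative back-of-envelope check (not the startup's actual methodology), consider what the projection implies if we take the startup's own claim of almost 100 million monthly impressions (quoted below) at face value, together with the roughly $33 million annual figure implied by "91% lower than projected":

```python
# Back-of-envelope only: implied ad rate under the startup's own claims.
projected_annual_revenue = 33_000_000      # $, implied by "$3M ... 91% lower than projected"
claimed_monthly_impressions = 100_000_000  # the startup's own (disputed) claim

annual_impressions = claimed_monthly_impressions * 12
implied_cpm = projected_annual_revenue / annual_impressions * 1_000  # $ per 1,000 impressions
print(f"Implied CPM: ${implied_cpm:.2f}")  # about $27.50
```

Whether a rate like that is attainable for screens on fridge doors, and whether the impression counts were even real, are exactly the assumptions that needed validation before signing a decade-long deal.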

One more thing

The startup “claimed that its displays garnered almost 100 million monthly impressions and gave brands a healthy sales bounce, but these people doubted the math, which was tracked in spreadsheets.”

Did no one tell Walgreens and its contractor not to use a spreadsheet for important work?! So many things can go wrong (and clearly did go wrong) in a spreadsheet handling large amounts of data and many variables of different types.

Bottom line

The failure of the project offers some important lessons:

  1. Predictions are not estimated effects of interventions. If the objective is to measure the potential effect of an intervention, such as changing the fridges or running promotions, causal models should replace predictive models. To do this, data must be generated through an experiment, since there is no observational data on the problem that can be used for causal inference in a quasi-experimental design (see the toy simulation after this list).
  2. Sloppy work on the data and the modeling leads to sloppy results, which in turn leads to sloppy decisions.
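
A toy simulation (not a claim about Walgreens' actual data) shows why the distinction in point 1 matters. Suppose, hypothetically, that digital doors were installed in high-traffic stores; traffic drives sales, and the doors themselves do nothing. A naive comparison of the two groups still reports a large "lift":

```python
# Toy simulation: confounding makes a predictive comparison look like an effect.
import numpy as np

rng = np.random.default_rng(0)
n_stores = 1_000
traffic = rng.normal(1_000, 200, n_stores)            # daily store traffic (the confounder)
has_digital_doors = (traffic > 1_100).astype(int)     # doors installed where traffic is high
sales = 5.0 * traffic + rng.normal(0, 300, n_stores)  # sales depend on traffic only; true effect = 0

naive_lift = sales[has_digital_doors == 1].mean() / sales[has_digital_doors == 0].mean() - 1
print(f"Naive 'lift' from comparing the two groups: {naive_lift:.1%}")  # large and entirely spurious
```

A randomized assignment of doors to stores, or a properly matched quasi-experimental design, would recover the true effect of zero.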

Podcast-style discussion of the article

The raw/unedited podcast discussion produced by NotebookLM (proceed with caution):

Other popular articles

How to (and not to) log transform zero

What if parallel trends are not so parallel?

Explaining the unexplainable Part II: SHAP and SAGE