Deploy, but don't drift (away from the data)

 


Image courtesy of censius.ai

Director's cut*

What determines the success of a data science project? Is it the company's data, organizational structure, or culture? Joshi et al. (2021) identified the top five reasons for failure as (i) misapplication of analytical techniques, (ii) unrecognized sources of bias, (iii) misalignment between business objectives and data science, (iv) lack of design thinking (designing the solution for the wrong user), and (v) diversion of responsibilities (such as expecting data scientists to champion the project).

My shorter list includes two issues that are related to model deployment and monitoring in one way or another: (i) ambitious questions with extremely low return on investment, and (ii) lack of infrastructure.

I have seen models consume the data science team's time, other data science resources, and computing power for a negligible return. Sometimes we would solve a problem on a smaller scale and stop there. Lack of infrastructure also played a role. In most cases, model results were not integrated with existing business processes or tools; silos prevailed. We had to report results to business stakeholders for manual implementation, so the models provided value only if and when decision makers took them into account. Because deployment was never integrated into an existing operational flow, the model results were easily overridden by decision makers, and these projects were doomed to fail.

To capture the value, models must be deployed as a part of business processes. As simple as it sounds, there are barriers to successfully deploying and integrating the models into an application, such as:

1) Lack of formal training

Most data science and analytics programs focus on methodology and inference. The learning experience typically mimics the academic research process: data cleaning, model training, model evaluation, and prediction are done in stand-alone notebooks. A single data set is typically used for both training and evaluation, and the data is not expected to change over time. As a result, the concepts of model drift and model performance monitoring are often not well understood.

This may sound like a trivial issue, but it is not. Model drift can be mitigated, but left unchecked it can significantly degrade a model's performance and therefore the quality of its insights. I have personally taken over predictive models that had been abandoned due to low performance and accuracy. They were state-of-the-art models, but they had been trained years earlier, at the start of the project. My teams were able to get them back on track and improve accuracy with nothing more than retraining.

These problems usually stem from a lack of understanding of software engineering best practices. For example, model versioning is a standard software engineering practice. Similarly, versioning the data used to train the model helps identify why a model's performance may have degraded over time. If the goal is not to explain the reasons but simply to keep the model effective, a straightforward solution is a model monitoring pipeline that tracks accuracy and triggers retraining when performance falls below a threshold. However, data scientists rarely have the software engineering training or hands-on experience needed to build these pipelines. The skills required to package, deploy, and maintain a model in production are either learned on the job (if the data science team is full stack) or not learned at all, in which case full support from software engineering and/or MLOps teams is required.
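To make this concrete, here is a minimal sketch of such a monitoring step, assuming a scikit-learn-style classifier and a recent batch of labeled production data; the function name, metric, and threshold are placeholders to be adapted to the actual use case.

```python
# Minimal monitoring-and-retraining sketch (illustrative, not production code).
# Assumes a scikit-learn-style classifier and a recent batch of labeled
# production data held in NumPy arrays; the threshold is a placeholder.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # acceptable performance floor, set per use case


def monitor_and_retrain(model, X_train, y_train, X_recent, y_recent):
    """Score the deployed model on recent data; retrain if accuracy degrades."""
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if accuracy >= ACCURACY_THRESHOLD:
        return model, accuracy  # performance is acceptable, keep the model

    # Performance has degraded: refit on the original training data extended
    # with the recent observations so the model reflects the current data.
    refreshed = clone(model).fit(
        np.vstack([X_train, X_recent]),
        np.concatenate([y_train, y_recent]),
    )
    return refreshed, accuracy
```

In practice, a check like this would be scheduled by whatever orchestrator the team already uses, with the model and its training data versioned on every retrain.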

2) Lack of infrastructure and shifting roles in data teams

Data science projects also get stuck in the deployment phase because there is no accessible and intuitive infrastructure for building machine learning pipelines. After building a model either locally or in a development environment, data scientists typically work with data engineers to deploy, orchestrate, schedule, and serve their models. This process presents another significant challenge.

Instead of focusing on maintaining existing data pipelines, data engineers are being asked to start deploying and maintaining machine learning models. But a data pipeline is fundamentally different from a machine learning pipeline. Data pipelines are more static than machine learning pipelines because models need to be retrained as the underlying data changes. This requires a business process to monitor model performance and trigger retraining when model performance degrades, decays, or drifts. To build a model monitoring pipeline and trigger model retraining, the data engineer must have a good understanding of performance metrics and modeling concepts. This may be too much to expect of data engineering teams without modeling expertise.

3) Lack of business processes and product teams with expertise

Most traditional organizations also lack the teams to architect, develop, and deploy data science models from start to finish. Data engineering, data science, business analytics, business operations, and software engineering are siloed. There may or may not be a fully supported product management team to drive the long-term product vision and bring all these stakeholders together for a common purpose. As a result, no one group really understands how the data, model, process, and software come together in the big picture. Without the full support of a dedicated product owner, the forest is sacrificed for the trees.

To address these issues, it is necessary to establish effective practices and processes for designing, building, and deploying machine learning models into production. This requires allocating more resources to the invisible parts of the model deployment and monitoring process. Raising awareness of MLOps best practices and taking a more holistic approach to deployment and monitoring can help achieve this goal.

* Views are my own. See full disclaimer on the About page.


Academic's take

Model deployment should take into account the likely possibility of data or concept drift. In most cases, the extent of data change is strongly correlated with the time since deployment. Data and concept drift can then lead to poorer insights and business decisions ("insight drift"). Therefore, how data-change scenarios will be handled should be built into decisions made at deployment (and during subsequent monitoring).

There are several reasons why data may change. The distribution of the data can change because edge cases become more frequent, growing the tails. The data population can change because the business model changes or because of a micro- or macroeconomic shock.

Why does this matter? Because we make assumptions when we model data. If the data changes but the assumptions behind the model remain the same, then the insights from the model are no longer reliable and the subsequent decisions are likely to be poor.

Let's look at the possible types of drift. I will divide them into two conceptually distinct groups:

  1. Over time. This is the more commonly studied case and is discussed in more detail below.
  2. Across units of analysis. This is a very practical but less explored problem. What happens when a model developed for data on a subset of units (products, stores) is scaled to all units?

Over time, the data may change in four ways (Lu et al., 2018): sudden, gradual, incremental, and recurring drift.

An example of sudden drift is an exogenous shock: the demand distribution changes abruptly due to, say, a supply chain disruption or a macroeconomic shock. In gradual drift, the signs of change are small at first and grow over time. An example is a feedback loop: customer demand begins to shift as a limited number of customers change their preferences first, and the rest follow suit. Incremental drift is when the whole population shifts together in small steps rather than one segment switching at a time. Finally, recurring drift is seasonal or otherwise periodic, such as cyclical changes in consumer demand and behavior.

Across units of analysis, the data can also vary significantly. Models are typically developed using data from a subset of units (products, stores) and then scaled to the remaining units. This is an interesting case and another way in which drift can occur and model assumptions can be violated. The result is a model with incorrect insights. There is little literature on this type of drift and its implications. We discuss it further from a practical perspective in the concluding section on data centricity below.

While the reasons for drift may differ conceptually, in practice the data can change in the following ways:

  1. Covariate shift: $P(X)$ changes (drifts) but $P(Y|X)$ remains the same.
    • This refers to the first decomposition of the joint distribution, $P(X, Y) = P(Y|X)P(X)$: the distribution of the inputs changes, but the conditional probability of the output given the inputs remains the same.
  2. Prior (target) shift: $P(Y)$ changes (drifts) but $P(X|Y)$ remains the same.
    • This refers to the second decomposition, $P(X, Y) = P(X|Y)P(Y)$. It is also known as prior probability shift, label shift, or target shift.
  3. Concept shift: $P(Y|X)$ changes (drifts), typically together with $P(X)$.
    • Here the relationship between inputs and output itself changes, so both decompositions are affected. This is also known as real concept drift or posterior probability shift.
Regardless of the cause and nature of the drift, detecting and mitigating drift is a similarity-testing exercise: compare the new data with the old data (using all available methods, including basic comparison tests), and intervene when the difference between old and new data is outside acceptable limits.[1] In the previous section, the Director discusses how this should be done and who should do it in a data science organization.
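As a minimal sketch of this similarity-testing idea, assuming tabular numeric features held in pandas DataFrames, the check below compares each feature's new distribution with its reference (training) distribution using a two-sample Kolmogorov-Smirnov test and flags features that differ beyond an arbitrary limit.

```python
# Sketch of a basic similarity test for covariate drift (illustrative only).
# Compares each numeric feature in the new data against the reference
# (training) data using a two-sample Kolmogorov-Smirnov test.
import pandas as pd
from scipy.stats import ks_2samp

P_VALUE_LIMIT = 0.01  # "acceptable limit"; tune to the tolerance for false alarms


def drifted_features(reference: pd.DataFrame, new: pd.DataFrame) -> list[str]:
    """Return the columns whose distributions differ beyond the limit."""
    flagged = []
    for column in reference.columns:
        _, p_value = ks_2samp(reference[column], new[column])
        if p_value < P_VALUE_LIMIT:
            flagged.append(column)
    return flagged
```

Any flagged feature would then trigger the intervention described above: investigate, retrain, or re-examine the model's assumptions.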

In closing... From data and concept drift to "insight drift"

Data and concept drift must be part of the model deployment and monitoring plan. More importantly, insight drift should be a primary concern. It is not just the data that should be monitored: the assumptions of the existing model, and how well those assumptions fit the new data, should be monitored as well. Drifting away from the actual data calls for retuning a predictive model's hyperparameters. It also calls for re-examining model assumptions, especially for causal problems and parametric solutions. If the model's assumptions are violated in the new data, the model's insights, and the business decisions based on them, will be poor.
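To make the distinction concrete, here is a small simulation of my own (not from the article or from Lu et al.): a correctly specified linear model keeps its error near the noise floor when only $P(X)$ shifts, but degrades when $P(Y|X)$ changes, because the assumption linking inputs to output no longer holds even though the inputs look unchanged.

```python
# Simulated covariate shift vs. concept shift (my own illustration).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)


def make_data(n, x_mean, coef):
    """y depends linearly on x; x_mean controls P(X), coef controls P(Y|X)."""
    x = rng.normal(loc=x_mean, scale=1.0, size=(n, 1))
    y = coef * x[:, 0] + rng.normal(scale=0.5, size=n)
    return x, y


# Fit on the original ("old") data.
x_old, y_old = make_data(5_000, x_mean=0.0, coef=2.0)
model = LinearRegression().fit(x_old, y_old)

# Covariate shift: P(X) moves, P(Y|X) unchanged -> error stays near the noise floor.
x_cov, y_cov = make_data(5_000, x_mean=3.0, coef=2.0)

# Concept shift: P(Y|X) changes -> error grows even though P(X) is unchanged.
x_con, y_con = make_data(5_000, x_mean=0.0, coef=1.0)

print("MSE under covariate shift:", mean_squared_error(y_cov, model.predict(x_cov)))
print("MSE under concept shift:  ", mean_squared_error(y_con, model.predict(x_con)))
```

The same logic applies to richer parametric and causal models: it is the violated assumption, not the shifted inputs per se, that corrupts the insights.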

[1] Lu et al. (2018) offer some methodological ideas on how to detect drift; I won't go into them here, but see their article for the details.


Implications for data centricity

Remember, data centricity is about staying true to the actual data, so drift is the opposite of data centricity. Whether it occurs over time or across units, it needs to be mitigated. Drift over time is widely discussed and has established solutions; drift across units of analysis much less so, despite being a major challenge whenever a model developed for a specific use case needs to be scaled to other product categories, business units, or geographies. For example, a model developed to predict demand on the East Coast may perform poorly when scaled nationally, or a model developed to predict demand for pet accessories may fail when scaled to pet toys, often because the original assumptions no longer fit. To remain data-centric, models should not only be retrained; their assumptions must also be reconsidered. Staying true to the data often comes at the cost of rebuilding models (e.g., updating features and assumptions) in addition to retraining the original models on new data.
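One lightweight way to operationalize this, sketched below with hypothetical column names and an arbitrary tolerance, is to evaluate the existing model separately on each new segment (region, category, store) and compare its error with the error on the units it was originally built for; segments that fall well short are candidates for rebuilding rather than mere retraining.

```python
# Sketch: check an existing model segment by segment before scaling it out.
# The segment/feature/target column names and the tolerance are hypothetical.
import pandas as pd
from sklearn.metrics import mean_absolute_error

TOLERANCE = 1.25  # allow errors up to 25% worse than on the original units


def segments_needing_rebuild(model, data: pd.DataFrame, feature_cols, target_col,
                             segment_col, baseline_mae):
    """Return the segments where the scaled model performs materially worse."""
    flagged = []
    for segment, group in data.groupby(segment_col):
        mae = mean_absolute_error(group[target_col], model.predict(group[feature_cols]))
        if mae > TOLERANCE * baseline_mae:
            flagged.append(segment)
    return flagged
```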

References

  • Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2018). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346-2363.
  • Joshi, M. P., Su, N., Austin, R. D., & Sundaram, A. K. (2021). Why so many data science projects fail to deliver. MIT Sloan Management Review, 62(3).
