Deploy, but don't drift (away from the data)
Director's cut*
What determines the success of a data science project? Is it the company's data, organizational structure, or culture? Joshi et al. (2021) identified the top five reasons these projects fail as (i) misapplication of analytical techniques, (ii) unrecognized sources of bias, (iii) misalignment between business objectives and data science, (iv) lack of design thinking (designing the solution for the wrong user), and (v) diversion of responsibilities (such as data scientists being expected to champion the project).
My shorter list includes two issues that are related to model deployment and monitoring in one way or another: (i) ambitious questions with extremely low return on investment, and (ii) lack of infrastructure.
I have seen models that consumed the data science team's time, other data science resources, and computing power for a negligible return. Sometimes we would solve a problem at a smaller scale and stop. Lack of infrastructure also played a role. In most cases, model results were not integrated with existing business processes or tools; silos prevailed. We had to report results to business stakeholders for manual implementation, and the models provided value only if and when decision makers took them into account. These projects were doomed to fail: because model deployment was not integrated into an existing operational flow, the results were easily overridden by decision makers.
To capture the value, models must be deployed as a part of business processes. As simple as it sounds, there are barriers to successfully deploying and integrating the models into an application, such as:
1) Lack of formal training
Most data science and analytics programs focus on methodology and inference. The learning experience typically mimics the academic research process: data cleaning, model training, model evaluation, and prediction are done in stand-alone notebooks. There is typically a single data set used for both training and evaluation, and the data is not expected to change over time. As a result, the concepts of model drift and model performance monitoring are often not well understood.
This may sound like a trivial issue, but it is not. Although model drift can be mitigated, if left unchecked it can significantly degrade a model's performance and therefore the quality of its insights. I have personally taken over predictive models that had been abandoned due to low performance and accuracy. These were state-of-the-art models, but they had been trained years earlier, at the beginning of the project. My teams were able to get them back on track and improve accuracy with nothing more than retraining.
These problems usually stem from a lack of understanding of software engineering best practices. For example, model versioning is a standard software engineering practice. Similarly, versioning the data used to train the model helps identify the reasons why a model's performance may have degraded over time. If the goal is not to explain the reasons, but to ensure that the model remains effective, a simple solution is to build a model monitoring pipeline that tracks accuracy and triggers model training when performance falls below a threshold. However, data scientists rarely have software engineering training or hands-on experience to build these pipelines. The skills required to package, deploy, and maintain a model in production are either learned on the job (if the data science team is full stack) or not learned at all and end up requiring the full support of the software engineering and/or MLOps teams.
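To make this concrete, here is a minimal sketch of the kind of monitoring-and-retraining check described above. It assumes a scikit-learn-style classifier and pandas data; the function name, variable names, and accuracy threshold are illustrative, not a reference to any particular library or production setup.

```python
# Minimal sketch: check recent accuracy and retrain if it falls below a threshold.
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # illustrative; set from business requirements

def monitor_and_maybe_retrain(model, X_train, y_train, X_recent, y_recent):
    """Score the model on a recent labeled batch and retrain if accuracy drops."""
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if accuracy >= ACCURACY_THRESHOLD:
        return model, accuracy  # still healthy; keep serving the current version
    # Performance fell below the threshold: retrain on the combined data and
    # hand the new version to whatever registry/deployment step is in place.
    retrained = clone(model)
    retrained.fit(pd.concat([X_train, X_recent]), pd.concat([y_train, y_recent]))
    return retrained, accuracy
```

In practice this check would run on a schedule, the metric and threshold would be chosen per use case, and both the model and the training data would be versioned so that a performance drop can be traced back to a specific change.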
2) Lack of infrastructure and shifting roles in data teams
Data science projects also get stuck in the deployment phase because there is no accessible and intuitive infrastructure for building machine learning pipelines. After building a model either locally or in a development environment, data scientists typically work with data engineers to deploy, orchestrate, schedule, and serve their models. This process presents another significant challenge.
Instead of focusing on maintaining existing data pipelines, data engineers are being asked to start deploying and maintaining machine learning models. But a data pipeline is fundamentally different from a machine learning pipeline. Data pipelines are comparatively static, whereas machine learning pipelines must accommodate retraining as the underlying data changes. This requires a business process to monitor model performance and trigger retraining when it degrades, decays, or drifts. To build a model monitoring pipeline and trigger retraining, the data engineer must have a good understanding of performance metrics and modeling concepts. This may be too much to expect of data engineering teams without modeling expertise.
3) Lack of business processes and product teams with expertise
Most traditional organizations also lack the teams to architect, develop, and deploy data science models from start to finish. Data engineering, data science, business analytics, business operations, and software engineering are siloed. There may or may not be full support for product management teams to drive the long-term product vision and bring all these stakeholders together for a common purpose. As a result, no one group really understands how the data, model, process, and software come together in the big picture. Without the full support of a dedicated product owner, the forest is sacrificed for the trees.
To address these issues, it is necessary to establish effective practices and processes for designing, building, and deploying machine learning models into production. This requires allocating more resources to the invisible parts of the model deployment and monitoring process. Raising awareness of MLOps best practices and taking a more holistic approach to deployment and monitoring can help achieve this goal.
* Views are my own. See full disclaimer on the About page.↩
Academic's take
Model deployment should take into account the likely possibility of data or concept drift. In most cases, changes in the data are strongly correlated with the time elapsed since the model was trained. Data and concept drift can then lead to poorer insights and business decisions ("insight drift"). Therefore, how data-change scenarios will be handled should be incorporated into decisions made during deployment (and subsequent monitoring).
There are several reasons why data may change. The distribution of the data can change because edge cases become more frequent, growing the tails of the distribution. The data population can change due to a change in the business model or a micro- or macroeconomic shock.
Why does this matter? Because we make assumptions when we model data. If the data changes but the assumptions behind the model remain the same, then the insights from the model are no longer reliable and the subsequent decisions are likely to be poor.
Let's look at the possible types of drift. I will divide them into two conceptually distinct groups:
- Over time. This is the more commonly studied case and is discussed in more detail below.
- Across units of analysis. This is a very practical but less explored problem. What happens when a model developed for data on a subset of units (products, stores) is scaled to all units?
Over time, the data may change in the following ways (Lu et al., 2018):
An example of sudden drift is an exogenous shock: say the demand distribution changes abruptly due to a supply chain disruption or a macroeconomic shock. In gradual drift, the signs of the change are small at first and grow over time; a feedback loop is one example, where a limited number of customers change their preferences first and the rest follow suit. Incremental drift is when the shift happens across all customers simultaneously, in small steps, so the distribution moves steadily from the old state to the new one. Finally, recurring concepts are seasonal or otherwise periodic, such as cyclical changes in consumer demand and behavior.
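To make the four patterns concrete, here is a small simulation sketch of a daily demand series under each pattern; the functional forms and numbers are purely illustrative.

```python
# Illustrative simulation of the four drift patterns on a daily demand series.
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(365)                      # one year of daily observations
noise = rng.normal(0, 1, size=t.size)

# Sudden drift: the mean jumps after an exogenous shock on day 180.
sudden = 10 + 5 * (t >= 180) + noise

# Gradual drift: each day is drawn from the new regime with increasing probability
# (a few customers switch first, then the rest follow).
p_new = np.clip((t - 120) / 200, 0, 1)
gradual = np.where(rng.random(t.size) < p_new, 15, 10) + noise

# Incremental drift: the mean itself moves steadily in small steps.
incremental = 10 + 5 * p_new + noise

# Recurring concepts: a periodic (e.g., seasonal) pattern that keeps returning.
recurring = 10 + 3 * np.sin(2 * np.pi * t / 90) + noise
```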
Across units of analysis, the data can also vary significantly. Models are typically developed using data from a subset of units (products, stores) and then scaled to the remaining units. This is an interesting case and another way in which drift can occur and model assumptions can be violated. The result is a model with incorrect insights. There is little literature on this type of drift and its implications. We discuss it further from a practical perspective in the concluding section on data centricity below.
While the reasons for drift may differ conceptually, in practice the data can change in the following ways (a short detection sketch follows the list):
- Covariate shift: When $P(X)$ changes (drifts) but $P(Y|X)$ remains the same.
- This refers to the first decomposition of the joint distribution, i.e., the distribution of the inputs changes, but the conditional probability of an output given an input remains the same.
- Target shift: When $P(Y)$ changes (drifts) but $P(X|Y)$ remains the same.
- This refers to the second decomposition of the joint distribution, i.e., the distribution of the outputs changes, but the conditional probability of an input given an output remains the same. This is also known as prior shift, prior probability shift, or label shift.
- Concept shift: When $P(Y|X)$ changes (drifts), usually together with $P(X)$.
- Here the relationship between inputs and outputs itself changes, so the mapping the model has learned breaks down even when the marginal distributions look familiar. This is also known as concept drift.
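As a rough illustration of how the first two can be monitored in practice, the sketch below compares training data against recent production data, assuming pandas DataFrames and Series; the two-sample Kolmogorov-Smirnov test and the significance level are one possible choice among many, and the function names are hypothetical.

```python
# Rough drift checks: covariate shift per feature, target shift on label frequencies.
from scipy.stats import ks_2samp

def detect_covariate_shift(X_train, X_prod, alpha=0.01):
    """Flag numeric features whose marginal distribution P(X) differs between
    the training data and recent production data (two-sample KS test)."""
    drifted = {}
    for col in X_train.columns:
        _, p_value = ks_2samp(X_train[col], X_prod[col])
        if p_value < alpha:
            drifted[col] = p_value
    return drifted

def detect_target_shift(y_train, y_prod):
    """Compare label frequencies P(Y) between training and recent labeled data."""
    train_freq = y_train.value_counts(normalize=True)
    prod_freq = y_prod.value_counts(normalize=True)
    return train_freq.subtract(prod_freq, fill_value=0).abs().sort_values(ascending=False)
```

Concept shift is harder to see from the inputs alone; it typically surfaces as degrading accuracy on recently labeled data, which is why a monitoring pipeline should track model performance in addition to the data itself.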
[1] Lu et al. (2018) offer some methodological ideas on how to detect drift, but I won't go into them here; see the article for details.↩
Implications for data centricity
References
- Joshi, M. P., Su, N., Austin, R. D., & Sundaram, A. K. (2021). Why so many data science projects fail to deliver. MIT Sloan Management Review, 62(3).
- Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2018). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346-2363.