Data Centricity

This article builds on the data centricity concept outlined at datacentricity.org and extends it to decision making.

Assumptions simplify the complex world and make it easier to understand. The art in scientific thinking ― whether in physics, biology, or economics ― is deciding which assumptions to make. ― N. G. Mankiw

What is data centricity?

Data centricity is staying true to the data. Staying true to the data is not limited to prioritizing data over its applications or improving data quality. It is also about strengthening the path from the data to the models used to derive insights, and from those models to decisions. This path is defined by assumptions: assumptions must be made about the data (and the underlying data generation processes) to connect the data to models, and then to decisions.

Assumptions link data to models and then to decisions in business, statistics, engineering, and computer science. In a 1987 book, the British statistician George Box, together with Norman Draper, famously wrote: “Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” This is because all models simplify reality to uncover associations or causal relationships, or to make predictions about the future.

We can divide the assumptions used to link data to models into two main categories: Method-based and Model-based. Method-based assumptions come with the statistical or machine learning method being used. Model-based assumptions are specific to the problem and solution at hand.

  1. Method-based: Different types of methods require different types and levels of assumptions. This first category has a well-established typology. Briefly, we can talk about three groups of assumptions here:
    1. Fully parametric: The distributions describing the data generation process are assumed to follow a family of probability distributions with a finite number of parameters. A common assumption is a normal distribution with unknown mean and variance, with the data obtained through simple random sampling.

      Example: As a parametric method, a linear regression model assumes, among other things, that the error term follows a normal distribution. That is, if the outcome is sales, we expect the differences between estimated and actual sales to cluster around their mean, with deviations below and above the mean of roughly equal size and frequency (a code sketch after this list illustrates how to check this assumption).

    2. Non-parametric: Far fewer assumptions are made about the process generating the data than in parametric methods, but assumptions remain, including random sampling. Depending on the specific method, observations may also need to be independent and identically distributed.

      Example: As a nonparametric method, a decision-tree-based XGBoost model does not assume any particular family of probability distributions, but random sampling remains a critical assumption.

    3. Semi-parametric: Mixed assumptions ― While the mean of the outcome may be assumed to have a linear relationship with some explanatory variables (a parametric assumption), the variance around that mean need not be assumed to follow any particular distribution. Semiparametric models can often be divided into parametric and nonparametric parts (e.g., structural and random variation).

      Example: Gaussian Mixture Models (GMMs) are often treated as semi-parametric. A GMM can be used to cluster sales data (e.g., from different stores). In such a model, sales within each cluster are assumed to follow a normal distribution (the parametric component), while cluster membership is not fixed in advance: it is assigned probabilistically, with the fit updated iteratively until convergence (the more flexible, nonparametric-like component); see the clustering sketch after this list.

  2. Model-based: Different objectives and identification strategies require different types and levels of assumptions. Unlike the differences among the three broad types of methods above, the differences due to modeling objectives are less structured. We broadly divide models into two groups:
    1. Causal modeling: Both experimental and observational data can be used in causal inference. Depending on the specific modeling approach, the assumptions required for causal modeling may differ, but we can list three fundamental assumptions needed to identify the average causal effect: positivity, consistency, and exchangeability.
      1. Positivity: Each subject (store, customer, etc.) has a positive probability of receiving each level of the treatment variable (difficult, if not impossible, to meet when the treatment is continuous).

        Example: If there is a promotion, each customer should have a nonzero probability of receiving the promotional offer and a nonzero probability of not receiving it (see the propensity-score sketch after this list).

      2. Consistency: The treatment is assumed to be well defined and to lead to a potential outcome that equals the observed outcome. In other words, hypothetically assigning the treatment to a customer should have the same effect as the customer actually receiving the offer.

        Example: This assumption would be violated if some customers received physical coupons while others received coupons by email, because the treatment would then not be well defined. The potential sales for a customer who actually received the promotional offer should also equal the observed sales for that customer.

      3. Exchangeability: The treatment and control groups are comparable, so that the two groups could be switched without changing the results; formally, treatment assignment is independent of the potential outcomes (possibly conditional on covariates). This assumption is also referred to as independence, conditional ignorability, or unconfoundedness.

        Example: Conditional on the observed covariates (e.g., customer type, past sales, seasonality), the potential sales for customers who receive the promotion are comparable to those who do not receive the promotion. That is, the allocation of the promotion is independent of potential sales.

      In addition, (1) if compliance with the treatment is an issue (compliance varies between subjects), the estimated effect may be limited to the local average treatment effect (the effect among compliers).

      Example: The compliance assumption would require that when customers are assigned to receive a coupon, they actually receive and redeem the coupon.

      (2) There must be no interference between subjects (the treatment of one customer should not affect the outcome of another customer). Together with consistency, this last assumption forms the Stable Unit Treatment Value Assumption (SUTVA = consistency + no interference).

      Example: When a customer receives a coupon, their purchasing behavior should not be influenced by whether their friends or family members also received the coupon. This ensures no interference. The coupon must also be consistent in its terms and conditions. This ensures consistency.

    2. Predictive modeling: The basic assumption of a predictive model is that the training set is a good representation of the test set. In other words, past data is a useful predictor of future data. The data in the training and test samples must also be representative of the population. These assumptions must hold in addition to the method-based assumptions of the underlying methods discussed above.

      Example: The relationship between the promotion and sales observed in the past should continue to hold in the future. In addition, both the historical and the future samples of customer data should be representative of the customer population (see the distribution-comparison sketch after this list).
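
To make these assumptions more concrete, here is a minimal sketch, in Python, for the fully parametric example above: it fits a linear regression of sales on advertising spend and checks whether the residuals look roughly normal. The simulated data, variable names, and the choice of the Shapiro-Wilk test are illustrative assumptions, not prescriptions.

    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LinearRegression

    # Simulated example: sales as a linear function of ad spend plus normal noise.
    rng = np.random.default_rng(42)
    ad_spend = rng.uniform(0, 100, size=500)
    sales = 50 + 2.5 * ad_spend + rng.normal(0, 10, size=500)

    # Fit the linear regression (the parametric method).
    X = ad_spend.reshape(-1, 1)
    model = LinearRegression().fit(X, sales)
    residuals = sales - model.predict(X)

    # Check the normality assumption on the residuals (the error term).
    result = stats.shapiro(residuals)
    print(f"Shapiro-Wilk p-value: {result.pvalue:.3f}")
    # A very small p-value would cast doubt on the normal-errors assumption.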
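
For the semi-parametric example, the sketch below clusters simulated store-level sales with scikit-learn's GaussianMixture: the normal components are the parametric part, while cluster membership is assigned probabilistically and refined until convergence. The two-component setup and the simulated numbers are assumptions made for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Simulated store-level sales drawn from two groups of stores.
    rng = np.random.default_rng(0)
    low_volume = rng.normal(loc=20_000, scale=3_000, size=300)
    high_volume = rng.normal(loc=60_000, scale=8_000, size=200)
    sales = np.concatenate([low_volume, high_volume]).reshape(-1, 1)

    # Parametric part: each mixture component is assumed to be normal.
    # Cluster membership is assigned probabilistically and refined by EM until convergence.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(sales)
    soft_assignments = gmm.predict_proba(sales)   # probability of belonging to each cluster
    hard_labels = gmm.predict(sales)              # most likely cluster per store
    print(gmm.means_.ravel().round(0), soft_assignments[:3].round(2))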
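
For the positivity assumption, a common diagnostic is to estimate each customer's probability of receiving the promotion (a propensity score) from observed covariates and check that the estimates stay away from 0 and 1. The covariates, coefficients, and logistic-regression setup below are assumed for illustration only.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Simulated customer covariates and promotion assignment.
    rng = np.random.default_rng(1)
    n = 2_000
    past_sales = rng.gamma(shape=2.0, scale=100.0, size=n)
    loyalty_years = rng.integers(0, 10, size=n)
    X = np.column_stack([past_sales, loyalty_years])

    # Assignment depends on the covariates, but every customer keeps some chance either way.
    logits = 0.002 * past_sales + 0.2 * loyalty_years - 1.5
    promo = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

    # Estimate propensity scores and inspect their range (positivity check).
    propensity = LogisticRegression(max_iter=1000).fit(X, promo).predict_proba(X)[:, 1]
    print(f"propensity scores: min={propensity.min():.3f}, max={propensity.max():.3f}")
    # Estimates piling up near 0 or 1 would signal a potential positivity violation.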
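
For the predictive-modeling assumption, one quick check is to compare the distribution of a key variable between the historical (training) window and a more recent (test) window, for example with a two-sample Kolmogorov-Smirnov test. The simulated sales and the particular test used below are illustrative choices under assumed data.

    import numpy as np
    from scipy import stats

    # Simulated weekly sales: an older (training) window and a more recent (test) window.
    rng = np.random.default_rng(7)
    train_sales = rng.normal(loc=1_000, scale=150, size=400)   # historical window
    test_sales = rng.normal(loc=1_150, scale=150, size=100)    # recent window, shifted upward

    # Two-sample KS test: could both windows plausibly come from the same distribution?
    result = stats.ks_2samp(train_sales, test_sales)
    print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")
    # A very small p-value suggests the training window may not represent the test period.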

From models to decisions

Transforming data into useful models requires careful consideration of the underlying assumptions. Is that enough? While models are essential intermediaries, the ultimate goal is decision making. Even models built on sound assumptions do not automatically generate effective decisions. A critical link between data models and effective decisions is usually optimization, especially when dealing with scenarios involving multiple constraints.

Optimization consists of three components: decision variables that represent possible choices, constraints that define the feasible solutions, and an objective function that measures the quality of a solution. The role of assumptions extends beyond the modeling of the data into the optimization step. The assumptions that bind data to models, both method-based and model-based, are carried forward into optimization. For example, if the objective function or constraints are not convex, contrary to what may have been assumed earlier in the modeling step, optimization may yield multiple local optima, which in turn requires further assumptions about how to search the solution space. Model-based assumptions matter just as much: in a budget constraint, for instance, incorrect assumptions about cost estimates can distort the feasible region, leading to practically infeasible or economically suboptimal decisions.
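
As a minimal sketch of these three components, assume an upstream model has produced expected incremental sales per promotional dollar for three customer segments, with a fixed overall budget and a cap per segment. The numbers and the linear-programming formulation below are illustrative assumptions, not a prescribed setup.

    import numpy as np
    from scipy.optimize import linprog

    # Decision variables: promotion spend allocated to each of three customer segments.
    # Objective: maximize the expected incremental sales predicted by an upstream model.
    predicted_lift_per_dollar = np.array([1.8, 1.3, 0.9])   # model output (assumed)

    # linprog minimizes, so negate the objective to maximize expected lift.
    c = -predicted_lift_per_dollar

    # Constraints: total spend within budget, and a per-segment spending cap.
    A_ub = np.array([[1.0, 1.0, 1.0]])        # total spend across segments
    b_ub = np.array([100_000.0])              # overall promotion budget
    bounds = [(0.0, 60_000.0)] * 3            # each segment capped at 60k

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    print(result.x, -result.fun)              # optimal allocation and expected total lift
    # If the lift estimates or cost figures feeding c, A_ub, or b_ub rest on bad assumptions,
    # the "optimal" allocation can be practically infeasible or economically suboptimal.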

Bottom line

Being data-centric means making good assumptions and deriving the right insights from the data. This is not a trivial task. Assumptions multiply as we move from the raw data to modeling the data and then to optimizing the model output: errors that originate in the method-based and model-based assumptions are amplified through the optimization step, potentially leading to severely suboptimal decisions. This cascading effect makes it all the more important to make the right assumptions and stay true to the data throughout the modeling and optimization journey. But in the fast-paced world of data science, assumptions often go unchecked. Can we then trust the decisions? Certainly not.

The truth is infinitely complex and a model is merely an approximation to the truth. If the approximation is poor or misleading, then the model is useless. ― T. Tarpey
