Posts

Deploy, but don't drift (away from the data)

Image courtesy of censius.ai

Director's cut

What determines the success of a data science project? Is it the company's data, organizational structure, or culture? Joshi et al. (2021) identified the top five reasons as (i) misapplication of analytical techniques, (ii) unrecognized sources of bias, (iii) misalignment between business objectives and data science, (iv) lack of design thinking (designing the solution for the wrong user), and (v) diversion of responsibilities (such as expecting data scientists to champion the project). My shorter list includes two issues that are related to model deployment and monitoring in one way or another: (i) ambitious questions with extremely low return on investment, and (ii) lack of infrastructure. I have seen models that consumed the data science team's time, other data science resources, and computing power for a negligible return. Sometimes we would solve problems on a smaller scale and stop. Lack of infrastruct

Explaining the unexplainable Part I: LIME

Image courtesy of finalyse.com

Academic's take

Every model is a simplified version of reality, as it must be. That is fine as long as we know and understand how reality is simplified and reduced to the parameters of a model. In predictive analytics, where nonparametric models are heavily used with a kitchen-sink approach of adding any and all features to improve predictive performance, we don't even know how a model simplifies reality. So, what if we use another model to simplify and explain the nonparametric predictive model? This other model is called a surrogate model, and it is designed to be interpretable. In short, surrogate models explore the boundary conditions of decisions made by a predictive model. What is a surrogate model? Surrogate models can help us understand (i) the average prediction of a model (global surrogate) or (ii) a single prediction (local surrogate). The quest then becomes finding surrogate models that can explain the predictions (
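To make the local-surrogate idea concrete, here is a minimal sketch of the LIME recipe built from scratch with scikit-learn: perturb the instance of interest, get the black-box predictions for the perturbations, weight them by proximity, and fit an interpretable weighted linear model. The black-box model, kernel width, and data below are illustrative placeholders, not anything from the post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# A black-box model we want to explain (stand-in for any predictor).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def local_surrogate(x0, model, feature_scale, n_samples=1000, kernel_width=1.0, seed=0):
    """LIME-style local surrogate: perturb x0, weight by proximity,
    and fit a weighted linear model to the black-box predictions."""
    rng = np.random.default_rng(seed)
    # Perturb around the instance of interest.
    Z = x0 + rng.normal(scale=feature_scale, size=(n_samples, x0.shape[0]))
    # Black-box predictions for the perturbed points (probability of class 1).
    preds = model.predict_proba(Z)[:, 1]
    # Proximity weights: closer perturbations count more.
    dist = np.linalg.norm(Z - x0, axis=1)
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # Interpretable surrogate: weighted ridge regression on the perturbations.
    surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return surrogate.coef_  # local feature attributions around x0

print(local_surrogate(X[0], black_box, feature_scale=X.std(axis=0)))
```

The coefficients of the weighted linear model are the local explanation: they describe how the black box behaves in the neighborhood of that one instance, not globally.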

How to (and not to) log transform zero

Image courtesy of the authors: Survey results of the papers with log zero in the American Economic Review

Academic's take

Log transformation is widely used in linear models for several reasons: making data "behave" or conform to parametric assumptions, calculating elasticities, and so on. The figure above shows that nearly 40% of the empirical papers in a selected journal used a log specification and 36% had the log-of-zero problem. When an outcome variable naturally has zeros, however, log transformation is tricky. In most cases, the instinctive solution is to add a positive constant to each value of the outcome variable. One popular idea is to add 1 so that raw zeros remain zeros after the transformation. Another is to add a very small constant, especially if the scale of the variable is small. Well, the bad news is that these are all arbitrary choices that bias the resulting estimates. If a model is correlational, a small bias due to the transformation may not be a big conc
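To see why the choice of constant matters, here is a quick, hypothetical simulation: the data-generating process and the candidate constants are made up for illustration, but they show how the slope estimate from a log(y + c) specification moves with c when the outcome has many zeros.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data-generating process with a sizable share of raw zeros.
rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)
latent = np.exp(0.5 * x + rng.normal(scale=0.5, size=n))
y = np.where(rng.uniform(size=n) < 0.3, 0.0, latent)  # roughly 30% zeros

X = sm.add_constant(x)
for c in (1.0, 0.1, 0.001):
    # Same data, same model, different "small" constant added before the log.
    slope = sm.OLS(np.log(y + c), X).fit().params[1]
    print(f"log(y + {c}): estimated slope = {slope:.3f}")
```

The point of the sketch is simply that the estimate is not invariant to c, so the constant is not an innocent bookkeeping device.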

Are fixed effects really fixed?

Image courtesy of the authors: Bias vs. the standard deviation of temporal unobserved heterogeneity, where the heterogeneity follows a random walk

Academic's take

An interesting recent paper titled "Fixed Effects and Causal Inference" by Millimet and Bellemare (2023) discusses the feasibility of assuming fixed effects are fixed over long periods in causal models. The paper highlights the rather obvious but usually overlooked fact that fixed effects may fail to control for unobserved heterogeneity over long time periods. This makes perfect sense, since any effects that are assumed to be fixed (firm characteristics, store attributes, consumer demographics, artistic talent) are more likely to be constant over shorter periods but may well vary over longer periods. The paper refers to a critical point made by Mundlak (1978): "It would be unrealistic to assume that the individuals do not change in a differential way as the model assumes [...] It is more realistic to ass
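A quick simulation makes the point of the figure: when the "fixed" unit effect drifts as a random walk that is correlated with the regressor, the within (fixed-effects) estimator becomes biased, and the bias grows with the innovation standard deviation. The data-generating process below is my own illustrative sketch, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 500, 30, 1.0

def fe_estimate(sd_walk):
    """Within (fixed-effects) estimate of beta when the unit effect
    drifts as a random walk with innovation sd `sd_walk`."""
    alpha0 = rng.normal(size=(N, 1))
    drift = np.cumsum(rng.normal(scale=sd_walk, size=(N, T)), axis=1)
    alpha = alpha0 + drift                   # unit effect, possibly time-varying
    x = alpha + rng.normal(size=(N, T))      # regressor correlated with the effect
    y = beta * x + alpha + rng.normal(size=(N, T))
    # Within transformation: demean by unit, then run pooled OLS on the residuals.
    xd = x - x.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    return (xd * yd).sum() / (xd ** 2).sum()

for sd in (0.0, 0.1, 0.5):
    print(f"innovation sd = {sd}: FE estimate = {fe_estimate(sd):.3f} (true beta = 1)")
```

With no drift the demeaning wipes out the unit effect entirely; as the drift grows, the part of the effect that survives demeaning stays correlated with the regressor and the estimate moves away from the truth.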

What if parallel trends are not so parallel?

Image courtesy of github.com/asheshrambachan/honestdid

Academic's take

In difference-in-differences models, parallel trends are often treated as a make-or-break assumption, and the assumption is not even testable. Why not? One misconception is that comparing the treated and control units before treatment is a test of the assumption. In reality, the assumption is not limited to the pretreatment period but covers the entire counterfactual. Comparing trends in the pretreatment period is only a plausibility check. That is, "parallel pretreatment trends" and "parallel counterfactual trends" are not the same. We can observe the former, but we need the latter. So, what we have is not what we want, as is usually the case. Even testing for parallel pretreatment trends alone is tricky because of potential power issues (absence of evidence is not evidence of absence!) and other reasons, including a potential sensitivity to the choice of functional form / transformations (
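For what the plausibility check looks like in practice, here is a sketch of an event-study regression with lead and lag dummies (period -1 as the omitted reference) on simulated data. The data-generating process, variable names, and number of periods are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel: 200 units, periods -4..3, treatment switches on at period 0.
rng = np.random.default_rng(1)
units, periods = 200, np.arange(-4, 4)
df = pd.DataFrame([(i, t) for i in range(units) for t in periods],
                  columns=["unit", "period"])
df["treated"] = (df["unit"] < units // 2).astype(int)
effect = 1.0 * ((df["period"] >= 0) & (df["treated"] == 1))
df["y"] = 0.2 * df["period"] + effect + rng.normal(size=len(df))

# Lead/lag dummies by hand, leaving period -1 as the reference category.
for k in periods:
    if k != -1:
        name = f"lead_lag_{k}".replace("-", "m")
        df[name] = ((df["treated"] == 1) & (df["period"] == k)).astype(int)
rhs = " + ".join(c for c in df.columns if c.startswith("lead_lag_")) + " + C(period)"
model = smf.ols("y ~ " + rhs, data=df).fit()

# Lead coefficients (event time < 0) near zero are consistent with parallel
# pretreatment trends -- but, as noted above, that is only a plausibility check.
print(model.params.filter(like="lead_lag_"))
```

Even when the leads look flat, low power and functional-form choices can hide meaningful pretreatment divergence, which is exactly the post's warning.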

Does Lord approve of your experimental analysis?

Image courtesy of en.wikipedia.org

Director's cut

I will leave Lord's judgment (1967) to the next section and get to the data. After running a randomized controlled trial (or experiment), modeling is the next natural step. Here is a list of some commonly used methods, among many others:

Paired comparison tests
Repeated measures ANOVA
Analysis of covariance (ANCOVA)
Regression using a difference-in-differences (diff-in-diff) setup, and
Regression using a diff-in-diff setup with two-way fixed effects (TWFE) added

What would be the differences in estimated effect size across these methods, given there is time- and subject-level variation? We asked this question and used the methods listed above for comparison. The results are as follows:

                          Paired T-Test   Repeated Measures ANOVA   ANCOVA   Differences in Differences   Two-Way Fixed Effects Diff-in-Diff
Average Treatment Effect  -0.9019***      -0.378
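To give a flavor of how such a comparison might be set up, here is an illustrative simulation (not the post's data) that lines up three of the listed estimators on the same pre/post outcomes: a paired comparison within the treated group, ANCOVA, and a diff-in-diff regression on the change score. The effect sizes and noise levels are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Simulated pre/post experiment with subject-level variation and a secular time change.
rng = np.random.default_rng(7)
n = 400
treat = rng.integers(0, 2, size=n)
subject_effect = rng.normal(size=n)                      # subject-level variation
pre = subject_effect + rng.normal(scale=0.5, size=n)
post = subject_effect - 0.3 - 0.5 * treat + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"treat": treat, "pre": pre, "post": post, "gain": post - pre})

# Paired comparison within the treated group: mean post-minus-pre gain and its t-test.
gain_treated = df.loc[df.treat == 1, "gain"]
t, p = stats.ttest_rel(df.loc[df.treat == 1, "post"], df.loc[df.treat == 1, "pre"])

# ANCOVA: post-treatment outcome on treatment, adjusting for the baseline.
ancova = smf.ols("post ~ treat + pre", data=df).fit()

# Diff-in-diff on the change score: gain on treatment.
did = smf.ols("gain ~ treat", data=df).fit()

print("paired comparison (treated gain):", round(gain_treated.mean(), 3), "p =", round(p, 4))
print("ANCOVA treatment effect:         ", round(ancova.params["treat"], 3))
print("diff-in-diff treatment effect:   ", round(did.params["treat"], 3))
```

The paired comparison conflates the secular time change with the treatment effect, while ANCOVA and the change-score regression target the treatment contrast, so the estimates differ, which is the pattern the table above hints at.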

Synthetic control method in the wild

Image courtesy of dataingovernment.blog.gov.uk

Academic's take

Synthetic data is increasingly popular across the board. From deep learning to econometrics, artificially generated data is used for a number of purposes. In deep learning, one such use case is training neural network models on artificially generated data. In econometrics, synthetic data has recently found another use case: creating control groups in observational studies for causal inference. Synthetic data + control groups: synthetic controls. This is too generic a name for a specific method. In this post, I will focus on the synthetic control approach developed by Abadie et al. (2010, 2014) and previously Abadie & Gardeazabal (2003). Athey and Imbens (2017) describe Abadie et al.'s work as "arguably the most important innovation in the policy evaluation literature in the last 15 years." Why is it needed? Measuring the causal effect of a treatment requires a counterfactual (what would've h
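At its core, constructing the synthetic control is a constrained optimization: find non-negative donor weights that sum to one and make the weighted donor pool track the treated unit before treatment. Below is a minimal sketch of that step, ignoring the covariates and the V-matrix weighting of the full Abadie et al. procedure; the array names and toy data are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def synth_weights(y_treated_pre, Y_donors_pre):
    """Donor weights w >= 0 with sum(w) = 1 that minimize the
    pre-treatment gap between the treated unit and the weighted donors."""
    J = Y_donors_pre.shape[1]                       # number of donor units

    def loss(w):
        return np.sum((y_treated_pre - Y_donors_pre @ w) ** 2)

    res = minimize(
        loss,
        x0=np.full(J, 1.0 / J),                     # start from equal weights
        bounds=[(0.0, 1.0)] * J,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
        method="SLSQP",
    )
    return res.x

# Toy example: 10 pre-treatment periods, 5 donor units, treated unit is a known mix.
rng = np.random.default_rng(3)
Y_donors_pre = rng.normal(size=(10, 5))
y_treated_pre = Y_donors_pre @ np.array([0.5, 0.3, 0.2, 0.0, 0.0])
print(np.round(synth_weights(y_treated_pre, Y_donors_pre), 2))
```

The synthetic control's post-treatment path is then the same weighted combination of the donors, and the treatment-effect estimate is the gap between that path and the treated unit's observed outcomes.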