Causal inference is the process of using data to make claims about causal relationships, which makes it one of the core tasks of data science. Behind this process are two distinct concepts: identification and estimation. Only by mastering both can we get better at establishing causality from the associations we observe in data.

However, as new estimation methods continue to emerge, data scientists tend to equate method complexity with strength in causal inference. Unfortunately, without well-defined identification, no amount of sophisticated modeling or estimation can help us establish causality from data. In this article, we will discuss in detail why identification takes precedence over estimation and why causal inference fails without it.

If you are new to causal inference and/or are more familiar with machine learning, you can think of identification in causal inference as the counterpart to fundamental machine learning concepts such as regularization and cross-validation. Mastering these concepts is essential for succeeding in prediction tasks because any algorithm is effective only insofar as they are applied correctly during training. A simple regularized regression, applied properly without data leakage, can perform better on unseen cases than a state-of-the-art algorithm that suffers from overfitting and data leakage.

Another, albeit less obvious, reason for mastering these principles in machine learning is to gain trust in our work. When we want to deploy a prediction model, one of the first things we do is convince our stakeholders that the model is not merely memorizing what it saw during training but actually learning from the data and therefore able to generalize. We describe in detail how we applied cross-validation, how we handled overfitting through regularization, and why the test error is a reliable estimate of the model’s performance on unseen data. This builds trust in our model and earns the support of our stakeholders.

Similarly, in causal inference, identification not only sets the stage correctly for estimation but also builds trust in our work. As such, being able to conduct an identification analysis and communicate it clearly is an underappreciated yet powerful skill that we must master if we want to improve our causal IQ and build trust in our causal inference.

### Potential outcomes

To understand identification, it is useful to start with the potential outcomes framework. Let’s say we are interested in whether becoming a Prime member causes customers to spend more on Amazon’s online store. Because this is a simple case with two treatment conditions, we can describe the treatment, Prime membership or not, by a binary random variable, $$T_i \in \{0,1\}$$. The outcome we are interested in is the value of purchases made within 12 months of joining Prime, denoted by $$Y_i$$.

To answer this question, we assume we can imagine what might have happened to someone who joined Prime if that person had not done so, and vice versa. Therefore, for each customer there are two potential outcomes, one if the customer is a member, and the other if not. A causal effect is the difference between the two potential outcomes, but only one of them is observed. Let $$Y_{i1}$$ denote the potential outcome for customer $$i$$ if they are a member and $$Y_{i0}$$ denote the potential outcome for customer $$i$$ if they are not. The causal effect of Prime membership for customer $$i$$ is defined by:

$\tau_i = Y_{i1} - Y_{i0}$

Because we never observe both $$Y_{i1}$$ and $$Y_{i0}$$ at the same time, we are faced with the fundamental problem of causal inference [1], which states that causal inference at the individual level is impossible.

The observed outcome for customer $$i$$, $$Y_i$$, in our data can be connected to the potential outcomes as follows:

$Y_i = Y_{i1}T_i + Y_{i0}(1-T_i)$

In general, we focus on the average treatment effect (ATE), which is the difference between the expected values of the two potential outcomes:
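This connection between observed and potential outcomes can be sketched with a tiny simulation. All numbers below are made up for illustration; in real data we would never observe both potential outcomes for the same customer:

```python
import numpy as np

# Hypothetical potential outcomes for 5 customers (both are visible only in a simulation).
y0 = np.array([100., 120., 80., 150., 90.])  # spend if NOT a Prime member
y1 = y0 + 30.                                # spend if a Prime member (assumed effect: +30)
t = np.array([1, 0, 1, 1, 0])                # treatment indicator T_i

# Observed outcome: Y_i = Y_i1 * T_i + Y_i0 * (1 - T_i)
y_obs = y1 * t + y0 * (1 - t)

print(y_obs)  # [130. 120. 110. 180.  90.]
```

For each customer, only the potential outcome matching their treatment status shows up in `y_obs`; the other one is lost.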

$\tau = E[Y_{i1}] - E[Y_{i0}]$

Herein lies the challenge: to obtain the ATE, we need the unconditional expectations $$E[Y_{i1}]$$ and $$E[Y_{i0}]$$, that is, the difference in expected outcomes if everyone in the population became a Prime member versus if no one did. However, we only observe the conditional expectations $$E[Y_{i1}|T_i=1]$$ and $$E[Y_{i0}|T_i=0]$$, which are the expected outcomes among the members and the nonmembers, respectively. So, unless we have a reason to believe that $$E[Y_{i1}|T_i=1]=E[Y_{i1}]$$ and $$E[Y_{i0}|T_i=0]=E[Y_{i0}]$$, we cannot obtain the ATE.
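A minimal simulation (with invented numbers) shows when the conditional and unconditional expectations coincide: if assignment is randomized, and hence independent of the potential outcomes, the two agree up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Hypothetical potential outcome under membership, drawn independently of assignment.
y1 = rng.normal(130, 20, n)
t = rng.binomial(1, 0.5, n)  # randomized assignment, independent of y1

# Under randomization, E[Y_i1 | T_i = 1] ≈ E[Y_i1] (and likewise for Y_i0).
print(abs(y1[t == 1].mean() - y1.mean()))  # small sampling error, close to 0
```

With self-selected membership instead of randomization, this gap would not vanish no matter how large `n` gets.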

Another way to look at this challenge is by decomposing the ATE:

\begin{align*}\tau &= E[Y_{i1} - Y_{i0}] \\ &= E[E[Y_{i1} - Y_{i0}|T_i]] \\ &= P(T_i=1)(E[Y_{i1}|T_i=1] -E[Y_{i0}|T_i=1]) +(1-P(T_i=1))(E[Y_{i1}|T_i=0] -E[Y_{i0}|T_i=0]) \end{align*}

Here, the ATE is a function of five quantities, and all we can estimate from the observed data are the following three: $$P(T_i=1)$$ using the proportion assigned to the treatment condition, $$E[Y_{i1}|T_i=1]$$ using $$E[Y_i|T_i=1]$$, and $$E[Y_{i0}|T_i=0]$$ using $$E[Y_i|T_i=0]$$. The other two quantities are $$E[Y_{i0}|T_i=1]$$, the average outcome under control for those in the treatment condition, and $$E[Y_{i1}|T_i=0]$$, the average outcome under treatment for those in the control condition. Notice that these are unobserved counterfactuals, and we have no way to estimate them from the data without making assumptions.
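The decomposition can be checked numerically in a simulation where, unlike with real data, both potential outcomes are visible. The numbers are hypothetical, with a constant effect of +30 baked in:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Simulate both potential outcomes (possible only in a simulation).
y0 = rng.normal(100, 10, n)
y1 = y0 + 30                   # assumed constant effect of +30
t = rng.binomial(1, 0.4, n)

# ATE as the weighted sum of the within-group differences from the decomposition.
p1 = t.mean()
ate_decomposed = (
    p1 * (y1[t == 1].mean() - y0[t == 1].mean())
    + (1 - p1) * (y1[t == 0].mean() - y0[t == 0].mean())
)
print(round(ate_decomposed, 2))  # 30.0, matching (y1 - y0).mean()

# But y0[t == 1] and y1[t == 0] are unobserved counterfactuals: with real
# data we only see y1 among the treated and y0 among the controls.
```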

### Identification

So, how do we proceed? How do we show that the conditional expectations are equivalent to the unconditional ones, so that their difference is indeed the ATE? Or what do we do with the unobserved counterfactuals in the alternate expression? The answer is that we make untestable assumptions and advocate for them.

This is exactly where identification comes into the picture. Essentially, identification means laying out the assumptions needed for a statistical estimate obtained from the data to be given a causal interpretation [3]. However, it does not stop there. It also means making a case for why the assumptions are plausible and, therefore, why the association we find in the data identifies the causal estimand, i.e., the ATE, that we are after and can be trusted as a causal relationship. As such, identification forces us to not only make the assumptions needed for causality explicit but also defend them in our analysis.

Now, I hope you won’t be disappointed to learn that every approach to causal inference, including randomized experiments, requires untestable assumptions to establish causality. Yup, that’s correct. Even the gold standard of causal inference can’t give us causality without assumptions. The thing is, not all assumptions are equal. Some are more plausible than others, and when we and our audience are clear about the assumptions guiding our causal inference, we can look for ways to evaluate them.

The fact that untestable assumptions are required for causal inference does not mean that causal inference is impossible. It does mean, however, that it comes with a high degree of uncertainty, and a clear identification strategy goes a long way toward reducing that uncertainty.

This also makes it clear why identification takes precedence over estimation. Simply put, if identification fails, in other words if the assumptions behind our causal inference are not plausible, then no modeling or estimation approach can take us beyond association. On the other hand, if identification is valid, we can seek to improve estimation by leveraging a variety of tools, ranging from non-parametric to fully parametric methods.

### Bias

Here is what I mean. Suppose that, to find the effect of Prime membership on customer purchases at Amazon, I told you I collected data on the historical purchases of members and nonmembers and will use it to estimate the ATE. Clearly, this means I am assuming I can obtain $$E[Y_{i1}]$$ using $$E[Y_{i1}|T_{i}=1]$$ and $$E[Y_{i0}]$$ using $$E[Y_{i0}|T_{i}=0]$$, which is my identifying assumption. Now, before looking at the analysis or data, we should ask whether this assumption is plausible.

To make a judgment, let’s look at the following decomposition obtained from the conditional expectations:

\begin{align*}E[Y_{i}|T_i=1] - E[Y_{i}|T_i=0] &= E[Y_{i1}|T_i=1] - E[Y_{i0}|T_i=0]\\ &= E[Y_{i1}|T_i=1] - E[Y_{i0}|T_i=0] + E[Y_{i0}|T_i=1] - E[Y_{i0}|T_i=1] \\ &= \underbrace{E[Y_{i1}|T_i=1] - E[Y_{i0}|T_i=1]}_{ATT} + \underbrace{E[Y_{i0}|T_i=1] - E[Y_{i0}|T_i=0]}_{Bias} \end{align*}

What we end up with under this approach is not the ATE, but an association that combines two things: the effect of Prime membership on the members, i.e., the average treatment effect on the treated (ATT), and a bias term. Simply put, the bias term tells us what the difference in purchases between members and nonmembers would have been if the members had not joined Prime.

As in many business contexts, we expect those who voluntarily subscribe to a service or product to have different purchase behavior than those who do not. In our example, we can imagine that those who join Prime do so because they already use Amazon quite often and expect to continue doing so, which makes Prime a good deal for them. Essentially, members would have higher purchases than nonmembers even if they had not joined Prime, suggesting a positive bias, $$E[Y_{i0}|T_i=1]>E[Y_{i0}|T_i=0]$$. This means the assumption I made for causal inference is not valid, and we cannot identify the ATE.

Now, it does not matter how many observations we have in the data or whether we use a simple difference in means estimator or run a regression. Because our identification is not valid, in the end we will have an association and not a causal effect.
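A quick simulation (with invented numbers) makes this concrete: when heavier spenders self-select into membership, the naive difference in means recovers the ATT plus a positive bias, no matter how large the sample:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000

# Baseline spend; hypothetically, heavy users are more likely to join Prime.
y0 = rng.normal(100, 20, n)
p_join = 1 / (1 + np.exp(-(y0 - 100) / 10))  # joining probability rises with baseline spend
t = rng.binomial(1, p_join)
y1 = y0 + 30                                 # assumed true effect of membership: +30
y_obs = np.where(t == 1, y1, y0)

naive = y_obs[t == 1].mean() - y_obs[t == 0].mean()
att = (y1[t == 1] - y0[t == 1]).mean()       # 30 by construction
bias = y0[t == 1].mean() - y0[t == 0].mean() # positive: selection on baseline spend

print(round(naive, 1), round(att + bias, 1))  # identical, and well above the true effect of 30
```

Collecting more data only makes the estimate of `naive` more precise; it does nothing to shrink `bias`.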

### Identification Strategies

Great causal works in social science and business are those that include a dedicated section where the identification strategy is explicitly described before any modeling or estimation is discussed. By doing this, the authors of these works not only communicate confidence in their findings but also convince their audience that the findings can be interpreted as causal under the maintained assumptions. The most credible causal inferences also tend not to require heavy modeling and estimation methods. In fact, when a causal inference is credible, it is usually because most of the challenges were addressed during identification, before the statistical analysis.

So, what are the main identification strategies for causal inference? Based on their identifying assumptions, they can be classified as follows:

1. Randomized Experiments
2. Natural Experiments
3. Instrumental Variables
4. Regression Discontinuity Designs
5. Selection on Observables
6. Selection on Observables with Temporal Data

Basically, every estimation method we encounter, ranging from a simple difference in means to the latest causal machine learning, relies on one of these identification strategies [2]. For example, the selection-on-observables strategy encompasses everything from regression adjustment to double machine learning, including every propensity score and matching algorithm in between. Hence, practitioners who want to get better at causal inference are better served by first understanding this strategy and its assumptions in detail before studying the various estimation methods.

In the next series of articles, we will do exactly that, starting with randomized experiments. We will examine each identification strategy in detail by discussing the identifying assumptions behind it.

Let’s conclude this piece by reiterating that causal inference starts with identification: the most credible causal inferences are those with a clear and convincing identification strategy, not those with the most sophisticated estimation methods. Data science practitioners who want their causal inferences to be taken seriously need to master identification.

### References

[1] P. Holland, Statistics and Causal Inference. (1986), Journal of the American Statistical Association.

[2] L. Keele, The Statistics of Causal Inference: A View from Political Methodology. (2015), Political Analysis.

[3] A. Lewbel, The Identification Zoo — Meanings of Identification in Econometrics. (2019), Journal of Economic Literature.

Posted on:
February 22, 2023