Common DAG Structures: Confounding, Collider Bias, and Mediation


Justin Belair

Biostatistician in Science & Tech | Consultant | Causal Inference Specialist


Introduction

Directed Acyclic Graphs (DAGs) are powerful tools for visualizing and understanding causal relationships. In this blog post, we’ll explore common DAG structures that frequently appear in causal inference problems, simulate data according to these structures, and demonstrate how different analytical approaches can lead to correct or incorrect causal estimates. If you want to begin your journey of learning causal inference and don’t know where to start, visit our Causal Inference Guide: Books, Courses, and More.

If you’re interested in obtaining the R code for this blog post, consider purchasing my upcoming book, Causal Inference in Statistics, with Exercises, Practice Projects, and R Code Notebooks. Each chapter contains a complete case study with an extensive code notebook that you can use to grasp the principles through code. There are also exercises and practice projects to help you solidify your understanding of the material.

Let’s jump in!

Confounding

One of the most basic causal structures is confounding, where a third variable affects both the treatment and the outcome. Here, \(W\) is the treatment, \(Y\) is the outcome, and \(Z\) is a confounder that affects both \(W\) and \(Y\).

Let’s simulate 200 data points that follow this structure and see what happens when we analyze them. The true treatment effect is set to 5. Since we generate the data ourselves, we know the ground truth and can assess the bias of each method, i.e., the difference between an estimate and the true value of 5.
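The full notebook is in the book, but a minimal sketch of this kind of simulation looks as follows. The specific values here (intercept 3, confounder effect 20, and the logistic treatment-assignment model) are illustrative assumptions, not the exact parameters behind the outputs shown below.

```r
set.seed(42)

n <- 200
true_effect <- 5  # ground-truth treatment effect

# Confounder Z affects both treatment assignment W and outcome Y
Z <- rnorm(n, mean = 5, sd = 1)
W <- rbinom(n, size = 1, prob = plogis(Z - 5))  # assumed: treatment probability increases with Z
Y <- 3 + true_effect * W + 20 * Z + rnorm(n)    # assumed: intercept 3, confounder effect 20

dat <- data.frame(W, Z, Y)

# Model 1: unadjusted -- biased, because the back-door path through Z stays open
coef(lm(Y ~ W, data = dat))
# Model 2: adjusted for the confounder -- recovers an estimate close to 5
coef(lm(Y ~ W + Z, data = dat))
```

Because \(Z\) raises both the probability of treatment and the outcome, the unadjusted \(W\) coefficient absorbs part of \(Z\)'s effect; adding \(Z\) as a covariate blocks that back-door path.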

Here is a snapshot of what the dataset looks like.

##   W        Z         Y
## 1 1 4.782948 102.18330
## 2 1 6.872517 145.53130
## 3 0 5.750960 113.37947
## 4 1 6.264373 132.30732
## 5 0 3.044272  64.73679
## 6 1 6.677193 143.98164

Now, let’s fit two different models to this data and compare the results.

  • Model 1: We fit a simple linear regression of the outcome on the treatment, without adjusting for the confounder: \[Y \sim W\]
  • Model 2: We fit a linear regression of the outcome on the treatment, adjusting for the confounder by adding it as a covariate: \[Y \sim W + Z\]
##           Intercept         W        Z
## Y ~ W     90.023053 29.332428       NA
## Y ~ W + Z  2.943876  4.945199 19.74877

We see that when we correctly specify the model, the \(W\) coefficient is close to the true treatment effect of 5. It is not exactly 5 due to sampling variability. However, when we fail to adjust for the confounder, we get a biased estimate.

It is not possible to determine the direction and magnitude of the bias from the DAG alone. However, the DAG can reveal the presence of bias and point the analysis in the right direction. Additional structural knowledge about how the confounder relates to the treatment and outcome can help us assess the magnitude and direction of the bias we would incur by failing to adjust for it, e.g., if it were unmeasured.

Collider Bias

Another important structure is the collider, where a variable is influenced by both the treatment and the outcome. Formally, the definition of a collider can be a bit tricky1. Informally, a collider is a variable with two arrowheads pointing into it (see the illustration below, where \(Z\) is now a collider).

Different selection-bias mechanisms, such as differential loss-to-follow-up and convenience sampling, can all be represented as bias induced by conditioning on a collider (or one of its descendants) in a DAG2. One common example that is not always easy to identify arises when a sample is selected based on some of its characteristics. For example, when assessing the correlation between athletic ability and intellectual ability, selecting a sample of students from highly selective universities can induce a spurious correlation, leading to the false belief that intellectual ability leads students to achieve higher athletic ability, or vice versa. See my previous blog post on selection bias for a detailed illustration of this example.

Let’s simulate data and see what happens when we condition on a collider. The data looks like this.

##   W         Z         Y
## 1 0  81.94203  3.847860
## 2 0 145.72867  7.258959
## 3 1  68.89566  2.530837
## 4 1 128.56243  5.526076
## 5 1 176.21011  7.905835
## 6 0 -43.43318 -2.080502
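A hedged sketch of a collider simulation: here \(Y\) depends only on \(W\), while \(Z\) is generated downstream of both. The coefficients in the \(Z\) equation are illustrative assumptions, not the exact values used to produce the table above.

```r
set.seed(7)

n <- 200
true_effect <- 5

W <- rbinom(n, size = 1, prob = 0.5)
Y <- 2 + true_effect * W + rnorm(n, sd = 2)  # Y depends only on W
Z <- 15 * W + 20 * Y + rnorm(n, sd = 5)      # collider: Z is caused by both W and Y (assumed weights)

dat <- data.frame(W, Z, Y)

# Model 1: correctly ignores the collider -- estimate close to 5
coef(lm(Y ~ W, data = dat))
# Model 2: conditions on the collider -- opens a spurious path, badly biased
coef(lm(Y ~ W + Z, data = dat))
```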

We then fit 2 models:

  • Model 1: \(Y \sim W\), correctly ignoring the collider
  • Model 2: \(Y \sim W + Z\), erroneously adjusting for the collider
##            Intercept          W          Z
## Y ~ W     2.00674769  4.5691843         NA
## Y ~ W + Z 0.01273163 -0.9853293 0.04979243

Looking at these results, we see that the estimate from the model that does not adjust for \(Z\) is close to 5, as expected, whereas the model that adjusts for \(Z\) gives a badly biased estimate. Conditioning on a collider opens a spurious, non-causal path between treatment and outcome. This can be counterintuitive: controlling for more variables doesn’t always improve your analysis!

Mediators

A mediator is a variable that lies on the causal pathway between exposure and outcome, such as \(M\) in the DAG below.

When working with mediators, we can decompose the total effect into direct and indirect effects. When the model is linear (as we assume in this example), these effects combine additively along distinct paths. That is, \[\text{Total effect} = \text{Direct Effect} + \text{Indirect Effect}.\]

In this example, the direct effect of \(W\) on \(Y\) is 5, the effect of \(W\) on \(M\) is 2, and the effect of \(M\) on \(Y\) is 3. The indirect effect works multiplicatively along the path \(W \rightarrow M \rightarrow Y\)3, i.e., it is the product of the path coefficients. Thus, the total effect is given by \[\begin{align*} \text{Total Effect} &= \text{Direct Effect} + \text{Indirect Effect} \\ &= 5 + 2 \times 3 \\ &= 11. \end{align*}\]
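A sketch of the mediation simulation, using the effect sizes stated above (direct effect 5, \(W \rightarrow M\) effect 2, \(M \rightarrow Y\) effect 3); the intercepts and noise scales are illustrative assumptions.

```r
set.seed(123)

n <- 200

W <- rbinom(n, size = 1, prob = 0.5)
M <- 2 * W + rnorm(n, sd = 1.5)            # W -> M, effect 2 (assumed noise scale)
Y <- 2 + 5 * W + 3 * M + rnorm(n, sd = 2)  # direct effect 5, M -> Y effect 3

dat <- data.frame(W, M, Y)

# Total effect: close to 5 + 2 * 3 = 11
coef(lm(Y ~ W, data = dat))["W"]
# Direct effect (close to 5) and M -> Y effect (close to 3)
coef(lm(Y ~ W + M, data = dat))[c("W", "M")]
# W -> M effect: close to 2
coef(lm(M ~ W, data = dat))["W"]
```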

The data looks like this.

##   W           M         Y
## 1 1  0.08795477  4.523313
## 2 1  2.24083179 13.661702
## 3 1  5.33561097 20.571514
## 4 0  1.56165450  4.252607
## 5 0  0.27217363  3.801606
## 6 1 -1.28309216  8.492330

We then fit 3 models:

  • Model 1: \(Y \sim W\), ignoring the mediation component
  • Model 2: \(Y \sim W + M\), incorporating an adjustment for the mediator
  • Model 3: \(M \sim W\), the mediation model, where we model the relationship between the mediator and the treatment indicator
##             Intercept         W        M
## Y ~ W      2.14594107 10.493410       NA
## Y ~ W + M  2.18439123  5.342696 2.830629
## M ~ W     -0.01358361  1.819635       NA

We see that when we regress \(Y\) on \(W\), we get an estimate close to the total effect of 11, as expected. The direct effect of \(W\) on \(Y\) is obtained from the regression adjusted for \(M\), which blocks the effect passing through the mediator; we obtain an estimate close to 5, as expected. The effect of \(W\) on \(M\) is close to 2, as given by the \(W\) coefficient in the \(M \sim W\) regression, and the effect of \(M\) on \(Y\) is close to 3, as given by the \(M\) coefficient in the \(Y \sim W + M\) regression. Multiplying these last two estimates gives an estimate of the indirect effect (true value \(2 \times 3 = 6\)).
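For OLS fit to the same data set, this decomposition is not just approximate but exact: the \(W\) coefficient in \(Y \sim W\) equals the \(W\) coefficient in \(Y \sim W + M\) plus the product of the two mediation coefficients. We can check this directly with the printed estimates:

```r
direct <- 5.342696  # W coefficient in Y ~ W + M
w_to_m <- 1.819635  # W coefficient in M ~ W
m_to_y <- 2.830629  # M coefficient in Y ~ W + M

indirect <- w_to_m * m_to_y  # estimated indirect effect, about 5.15
total    <- direct + indirect
total                        # about 10.4934, matching the W coefficient in Y ~ W
```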

Conclusion

Understanding these common DAG structures is crucial for accurate causal inference:

  • Confounding: Requires adjustment for common causes of treatment and outcome
  • Collider bias: Avoid adjusting for variables affected by both treatment and outcome
  • Mediation: Be clear about whether you’re estimating direct, indirect, or total effects. In cases with linear models, path analysis rules can be used to quickly decompose the total effect into direct and indirect effects

DAGs provide a powerful visual language for communicating causal assumptions and guiding proper statistical analysis. By understanding these common structures, researchers can better design studies, analyze data, and interpret results.

If you want to receive monthly insights about Causal Inference in Statistics, please consider subscribing to my newsletter. You will receive updates about my upcoming book, blog posts, and other exclusive resources to help you learn more about causal inference in statistics.

Join the stats nerds🤓!


  1. See Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, for a formal definition of a collider. The previous link is an affiliate link and we may earn a small commission on a purchase. I also discuss this idea in detail with examples, exercises, data, and code in my upcoming book Causal Inference in Statistics, with Exercises, Practice Projects, and R Code Notebooks↩︎

  2. See Hernán, Hernández-Díaz, Robins (2004). A structural approach to selection bias. Epidemiology, 15(5), 615-625.↩︎

  3. This technique is known as Path Analysis. I discuss it in detail in Part II of my upcoming book Causal Inference in Statistics, with Exercises, Practice Projects, and R Code Notebooks.↩︎
