
A compilation of the issues that you kept skipping over, and how to do it right

Photo by Dawid Liberadzki on Unsplash

Why do we need Causal Inference?

Turning measurement into action is the heart of any intelligent system. Data Science is centrally about using numerous quantitative measurements to take action. In the private tech sector, these measurements are the activity of users, and the actions we can take are business actions. Our business actions range from internal decisions of general strategic approach, to decisions that directly affect customers, like new or modified products, features, recommendations, or pricing.

The art and science of Data Science is in how to properly use our measurements to infer the right actions. If you’ve ever done this, you’ve seen the leaps of faith that we must take. It often looks something like, “Well, we know that this set of customers had this outcome, so let’s put more customers in similar circumstances, in order to achieve similar outcomes.” When does this leap of faith work out? When do the actions that Data Scientists recommend actually lead to the intended outcomes?

To be confident in the leap of faith, we have to accept that our old ways of turning data analysis into business recommendations were a bit naive. We had heard that correlation is not causation, but we effectively ignored that distinction. At best, we admitted that we were uncertain, dug a little more, or compared our results to intuition to build confidence in our conclusions. But this unstructured wild west of drawing conclusions is subject to biases, especially if we don’t exactly know what we are doing.

What is Causal Inference?

Fortunately, people have been working on this fundamental question: What is causation, and how do we know when A causes B?

One of the key players in this field, Judea Pearl, often reiterates that understanding causality is key to good decision-making. He has developed a three-level system that relates simple observation to a true understanding of causality:

The first level, association, covers all the associations that statistics can measure: conditional probabilities, correlations, and even all of Machine Learning, including neural networks, that allow us to “predict” some data from other data.

Historically, people in every field have used these models to make decisions. Customers who finish onboarding in the first week are said to be “more likely” to avoid churning for the first year, when all that was measured was a correlation or conditional probability, not a causal effect. Most people are still doing this.

Judea Pearl’s second level, intervention, holds the questions that we want to answer. If we take X business action, then what will happen? If we give customers a discount, how much longer will they stay? If we add this new feature, how much will the customer LTV change?

If we knew all the level 2 answers, we could make all the optimal decisions (within the choices that we think to consider!)

Finally, the level 3 questions are a full understanding of a system’s causality. Even if I knew how much the LTV would change, would I understand why? While level 2 was about the effect of any cause, level 3 is about the cause of any effect. If we can look at anything that happened and fully understand every reason why it happened, then we can easily answer any level 2 question, rather than only having the answers that we sought to estimate.

And indeed, answering level 2 questions is typically the direct goal of our causal inference in the technology sector. So how should we do it?

The General Approach of Causal Inference

Every project secretly approximates a causal inference problem. Everything a Data Scientist works on is some form of determining what change in the system should be made to achieve the best outcomes with users/clients.

In the level 1 approach, we look at the association between the action and the outcome metric through some quantification, without taking other factors into account. Some call it a rough estimate, or directionally right. Of course, as the spooky Simpson’s Paradox shows us, it might be exactly wrong.

The key fact of causal inference is this: for any covariates in the system that can affect the measured outcome (directly or indirectly), you have to be sure that your treatment and control groups have the same distribution of those covariates.

For example, if you want to know the effect of a drug on a disease, you’ll of course need one group of people who take the drug while another group who doesn’t. Whether this is through a randomized trial or just observing the population’s current behavior, when you compare these two groups of people, you need them to be as identical as possible in every way except for the fact of whether they took the drug.

In a randomized trial, you’d give the other group a placebo so that both groups equally think they took the drug, since their beliefs can affect their outcomes (whether directly or through their actions). Whether or not it’s a randomized trial, you’d have to ensure that both groups have the same proportion of each gender (if gender somehow affects the measurable outcome), genetic predispositions, and what have you.

Now, to do this right, as Judea Pearl would remind us, we really would like a model of our entire system. We would like to know all of the factors that influence who gets the treatment, what the outcome is, and everything related. The structure for that model is a DAG (Directed Acyclic Graph), which can show us the relevant variables, and the direction of the edges that relate them (e.g. the coffee maker’s pressure affects the barometer gauge, but the barometer gauge doesn’t affect the coffee maker’s pressure).

We at least want to note which of our variables are confounders (cause the treatment and the outcome either directly or through other relationships), mediators (are a path by which the treatment causes the outcome), and those that are caused by the treatment or outcome.

We need this information because we do want to control for direct and indirect confounders, since failing to control for them biases our results, as in Simpson’s paradox. However, we don’t want to control for mediators or effects, because controlling for those would bias our results.

These rules of correctly controlling variables in DAGs, called d-separation, cause issues when the real causal model of the data-generating process contains variables that affect each other (a two-way arrow), variables that are causes themselves but also act as mediators of other causes, and other complex structures.

For example, the number of contacts that an account has might directly affect the retention, since an account with no contacts offers less value on a social platform (e.g. the news feed has fewer interesting posts, there is less information to gain about your audience). In addition, the number of contacts may also impact the number of actions that the user takes on the platform, since each contact gives them opportunities to take actions, and that number of actions on the platform may also directly affect retention (since a user that’s actively using the platform is less likely to churn).

Therefore, the number of contacts may affect retention both directly and through the mediator of the number of actions taken, since those actions directly affect retention. So, should we handle both of those variables as confounders? If we do, what will we be measuring? Often, we must make choices based on our guess of which causal paths are more important, and we must keep in mind that our measurement covers a certain set of causal paths, which could be more or fewer paths than we wanted to measure (if we want the total effect of adding more contacts, we need to allow it to affect all paths, so we shouldn’t control for the number of actions).
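To make this concrete, here is a minimal sketch (using networkx) of how a DAG lets us classify variables before deciding what to control for, assuming the contacts/actions/retention structure described above plus a hypothetical “account age” confounder. It’s a simple heuristic, not a full d-separation analysis:

```python
# A minimal sketch (using networkx) of the contacts/actions/retention example.
# The "account_age" confounder is a hypothetical addition for illustration.
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("num_contacts", "retention"),    # direct effect
    ("num_contacts", "num_actions"),  # contacts create opportunities to act
    ("num_actions", "retention"),     # activity directly affects retention
    ("account_age", "num_contacts"),  # assumed common cause (confounder)
    ("account_age", "retention"),
])

treatment, outcome = "num_contacts", "retention"

# Descendants of the treatment are effects or mediators: do NOT control for them
# if we want the total effect of the treatment.
mediators_or_effects = nx.descendants(dag, treatment) - {outcome}

# Nodes with a directed path into both treatment and outcome are candidate
# confounders that we DO want to control for (a heuristic, not full d-separation).
confounders = {
    node for node in dag.nodes
    if node not in {treatment, outcome}
    and node not in mediators_or_effects
    and nx.has_path(dag, node, treatment)
    and nx.has_path(dag, node, outcome)
}

print("Don't control for:", mediators_or_effects)  # {'num_actions'}
print("Control for:", confounders)                 # {'account_age'}
```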

Another tricky aspect of determining your DAG is that any subset of individuals with different causal relationships counts as a causal variable. For example, if gender causes some bodily feature which changes the probability of different treatments/outcomes (e.g. breast cancer), then it counts as a cause of that treatment/outcome. Likewise, if gender causes some social-psychological circumstance that affects its frequency in a treatment/outcome group (e.g. there are more women than men signed up for a women’s magazine), then it’s a cause of that treatment/outcome.

In addition, we can use proxy variables for any of the variables in our causal system. Arguably, we are always using proxy variables. We don’t know our users’ age or gender, but we ask them and use their answers as a proxy for the truth. Similarly, we don’t know their feelings and motivations about the product, but we can use their survey answers as proxies for the truth. We can use their engagement as a proxy for their relationship with the product. The list goes on. In a causal DAG, it’s always okay to use a direct effect of a variable as a noisy measurement of it. You can imagine an arrow coming from every variable to its measurable proxy, but there’s no need to complicate our picture — we’re always using proxy measurements.

A Framework for Inferring Causality

Once we’ve established a tentative model of the nonzero causal relationships between the measurable variables of our system, and so determined the tentative set of variables that act as our confounders, we can do the process of Causal Inference!

I’ve come up with this framework for the causal inference process.

It involves 4 steps:

  1. Randomized Experiment
  2. Observation Pruning
  3. Covariate Modeling
  4. Outcome Measuring

The randomized experiment is the A/B test (or Randomized Controlled Trial), which we know to be very useful in estimating the causal effect of a treatment variable. Pruning is the process of removing data that we don’t want to include in our measurement. Modeling is about accounting for the effect of every covariate on the outcome. Finally, the measurement is about looking at the difference between the treatment and control groups, and determining its range, reliability and significance.

Each of the first 3 steps can potentially be skipped. We can do causal inference without a randomized assignment, without removing the worst data, and without modeling the effect of the covariates. However, we should almost certainly do at least 1 of these 3 steps. Ideally, utilizing all of the steps will lead to the best results.

Causal inference often refers specifically to quasi-experiments — the art of inferring causality without the randomized assignment of Step 1 — since the study of A/B testing already covers the projects that do use Step 1. But I’ll highlight here that this framework applies to all causal inference projects, with or without an A/B test.

(Yes, we should consider pruning and modeling even if we do an A/B test! We’ll get into why and how.)


Photo by engin akyurt on Unsplash

Step 1: Randomized Experiments (A/B Testing)

A/B Tests aren’t Perfect

Getting back to the topic of Randomized Controlled Trials (RCTs), or A/B tests as they’re known in the tech sector, we can start to understand why they’re almost magical in the field of causality. With randomized trials, we can ensure that the treatment and control groups are not assigned in a biased way, relative to any covariates that can affect the outcome.

The treatment and control groups will, on average, have the same number of each gender, each predisposition… every single confounder. AND we didn’t wrongly control for any effects, since the assignment to treatment or control happened before the treatment did, so none of the treatment’s effects are involved in the assignment.

But, RCTs aren’t perfect. What does it mean that treatment and control groups have equal amounts of every confounder, on average? This is really the dirty secret of RCTs. What it means is that if you redid this experiment an infinite number of times, the average amount of each confounder in each group would be equal. But it does not mean that the confounders are equal in your particular experiment. Being equal, on average, in this way makes it an unbiased measure.

But being an unbiased measure doesn’t make you an optimal measure.

Blocking is Better

If gender affects our outcome, and your randomly assigned treatment group has a greater proportion of females than your control group, is that going to affect the outcome that you measure? You bet it is! The difference in outcomes that you measure for your treatment and control groups will be partially due to those two groups having different proportions of females! Therefore the causal effect of your treatment in your experiment will be skewed by your experiment’s imbalanced covariates (e.g. gender). If you redid this experiment many times, the average causal effect that you measure will indeed be correct (unlike if you did no RCT, where the average would not be correct), but your experiment’s measure is skewed.

So although an RCT gives us an unbiased measure in the sense of many hypothetical experiments averaging to the truth, we have a fixable source of error in our experiment: our covariate imbalance. One way to fix it is a Stratified/Block Randomized Trial. Here, we ensure that the treatment and control groups have the same amount of each covariate. Instead of taking all people and assigning them evenly to each of our treatment groups, we take all women and assign them evenly, then take all men and assign them evenly, etc.

Then we do have equal covariates between the groups! With many confounding variables to account for, we need to make strata from the combinations of variables. If we are concerned with gender and whether their age is over 40, then we have men over 40 and men under 40, women over 40 and women under 40, etc. Each of those strata must be equally split between each treatment group.
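As a rough illustration, here is a minimal sketch of block randomization with pandas and numpy, assuming hypothetical “gender” and “over_40” columns as the blocking variables:

```python
# A minimal sketch of block (stratified) randomization with pandas/numpy.
# The "gender" and "over_40" columns are assumed blocking variables.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def block_randomize(df, strata_cols):
    """Assign treatment/control roughly 50/50 within each stratum."""
    assignment = pd.Series(index=df.index, dtype="object")
    for _, idx in df.groupby(strata_cols).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        half = len(shuffled) // 2
        assignment.loc[shuffled[:half]] = "treatment"
        assignment.loc[shuffled[half:]] = "control"
    return assignment

users = pd.DataFrame({
    "gender": rng.choice(["f", "m"], size=1000),
    "over_40": rng.choice([True, False], size=1000),
})
users["group"] = block_randomize(users, ["gender", "over_40"])

# Each gender x age stratum is now split (almost) evenly between the two groups.
print(users.groupby(["gender", "over_40", "group"]).size())
```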

In terms of statistics, what will this produce? The measure of the causal effect of the treatment variable was already unbiased, but now it will be more accurate: it will have a smaller variance. That is, if we did this experiment many times, the measures of our treatment effect would still be unbiased, and those measures would have a lower variance than in the simple RCT.

So do a Blocked RCT if you can!

Bin Continuous Variables to Enable Blocking

This is the first of a few times that we will mention the type of each column of data. Think of the nodes in your DAG that you’ve chosen to include in your causal inference, the variables that you know affect the outcome. Each one could be continuous, ordinal or categorical. Ordinal and categorical variables can be treated the same with respect to stratifying, but what about continuous variables?

If we want to block randomize our trial, we will need to bin the continuous variable into ranges, so that we can turn each range into blocks that are evenly split between treatment groups.

Binning a continuous range can be done in many ways. Here we are choosing a representation of a confounding variable, so we want a categorical representation in which the value of the continuous variable doesn’t range too much within each category. Certainly, the more bins we use, the less the variation within each bin can affect the outcome. But too many bins would mean too many strata.

Scikit-learn’s KBinsDiscretizer has a few methods, like sorting the data evenly into k bins (splitting on quantiles), setting the k boundaries evenly apart, or using a 1-dimensional k-means clustering on the data. In reality, a combination of looking at a histogram of a sample of the continuous data and thinking about domain knowledge is probably your best bet for choosing how it should be binned for a block randomized trial:

If the histogram has clear humps, those might make good bins. If it’s a very skewed distribution, you may want to split in a geometric or logarithmic way, like binning each order of magnitude (factor of 10).
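Here is a small sketch of what that looks like with KBinsDiscretizer on a skewed, hypothetical “number of contacts” variable, comparing the built-in strategies to hand-made logarithmic bins:

```python
# Sketch: binning a skewed, hypothetical "number of contacts" variable with
# scikit-learn's KBinsDiscretizer, to create strata for block randomization.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
n_contacts = rng.lognormal(mean=2.0, sigma=1.0, size=(5000, 1))  # skewed toy data

for strategy in ["uniform", "quantile", "kmeans"]:
    binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    bins = binner.fit_transform(n_contacts).astype(int).ravel()
    print(strategy, np.bincount(bins))  # how many observations land in each bin

# For heavy-tailed data, hand-made logarithmic bins (orders of magnitude)
# are often more interpretable than any automatic strategy:
log_bins = np.digitize(n_contacts.ravel(), bins=[1, 10, 100, 1000])
```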

Later on, we can still model the continuous variable so we don’t absolutely need to bin it, but it can be a great idea to block randomize our trial on a binned version of the continuous variable.

Obligatory bit on Power and Sample Size Calculations

I would be remiss if I didn’t say a bit more about running a proper Randomized Experiment.

Firstly, your results will only end up being statistically significant if you have a large enough sample. Well, it is probabilistic, so the real fact is that a larger sample size will make it more likely for you to detect an effect, if it is there. For a particular effect size, you can calculate what sample size you would need in order to be pretty sure that you’d detect an effect, if there is an effect. We call these elements Power, Effect Size, and Sample Size.

Let’s discuss this phenomenon for a two independent sample t-test, since it may need some modifications for more complex models and measurements. From a frequentist standpoint, we think about a null hypothesis where the difference between the treatment groups is zero. If this null hypothesis were true and we measured the difference between groups repeatedly in an unbiased way (did many unbiased experiments), then the measures we would get would average to zero, and form a normal distribution with some variance (i.e. Central Limit Theorem). The variance of that measure will scale inversely with the sample size. We will draw our alpha threshold based on that distribution (e.g. for a two-tailed alpha with 5% False positive rate, we can look for where the normal CDF is 0.975, which is at 1.96 standard deviations).

We then also think about the world in which there is a difference between the treatment groups, such that the measures of the differences over many experiments would be the same normal distribution but with a nonzero mean. That nonzero mean, the difference between the means of these two Gaussians, is the Effect Size. In this world, we will only reject the null hypothesis if our measure exceeds 1.96 standard deviations on the first distribution, so we need that point’s location on the second distribution to be some distance from the center of the second distribution, such that, given the second distribution is true, we will have P chance of correctly rejecting the null. P here is known as the Power, and it’s the area under the second distribution that is to the right of that 1.96 standard deviations on the first distribution (assuming a two-tailed 5% alpha). A typical Power number is 80%, which on a normal CDF happens 0.84 standard deviations from the center. Thus the two normal distributions’ means must be 1.96+0.84 standard deviations apart. Using the two-sample t test equation relating z-score to effect size, variance and the (harmonic mean of the) sample sizes, we find that the needed sample size for this 5% alpha, 80% power standard scenario is about:

N = 16 * variance / effect_size²

TL;DR For a two sample t-test with 80% power and two-tailed 5% alpha, the above formula gives the sample size as a function of the variance of the individual observations within groups, and the effect size between groups. We tend to call this the minimum effect size — we will have 80% Power to detect the minimum effect, and even more power to detect any larger effects.

Some rules of thumb for how this formula plays out, based on the ratio of Standard Deviation to Minimum Detectable Effect:

  • 10x ratio -> Need 1600 samples per group
  • 100x ratio -> Need 160k samples per group
  • 1000x ratio -> Need 16M samples per group.

As a reasonable example, if we had a measure with standard deviation of 50%, such as an average purchase value of $80 with a standard deviation of $40, and we care about detecting an effect of 0.5% (40 cents) between the treatment groups, then our ratio is 100x, so we need 160k samples per group.
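As a sanity check, here is a small sketch comparing that rule of thumb to statsmodels’ power solver, using the purchase-value numbers above:

```python
# Sketch: the 16 * variance / effect_size^2 rule of thumb vs. statsmodels'
# power solver, using the $80 +/- $40 purchase-value example above.
from statsmodels.stats.power import TTestIndPower

sd = 40.0    # standard deviation of individual purchase values
mde = 0.40   # minimum detectable effect: 40 cents

rule_of_thumb = 16 * sd**2 / mde**2
print(f"rule of thumb: {rule_of_thumb:,.0f} per group")  # 160,000

exact = TTestIndPower().solve_power(
    effect_size=mde / sd,  # standardized (Cohen's d) effect size
    alpha=0.05,            # two-sided by default
    power=0.80,
)
print(f"power solver:  {exact:,.0f} per group")          # ~157,000
```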

Sample Size Modifications

We said that when we run a block randomized trial, we get a smaller variance in our measured effects. We will ultimately be combining the measures from each stratum, and indeed get a smaller variance at the end, which can allow us to detect smaller effects with the same number of samples, or rather we can detect the real effect with a smaller number of samples. In order to get a good measure on each stratum, we need enough samples in that stratum.

At the very minimum, we need at least one measure in each stratum for each treatment group. This principle is called positivity — we can’t measure the effect if we have no examples. Realistically, we need enough measurements to estimate the variance of the population of observations reasonably well, and to get the uncertainty of the aggregate experiment outcome small enough to make our effect statistically significant. That means we need enough samples in each stratum, and “enough” for each stratum depends on the population variance within that stratum.

So we need many samples for each stratum, with each stratum needing a different number of samples. If we do a simple randomized controlled trial, then each stratum will only appear as often as it does in the full sample group. If only 1% of the people in the trial are over 60, then we may have to scale up the entire sample to massive proportions just to get enough people over 60 for a good measurement of that segment. However, in a block experiment, we will explicitly take whatever number of people over 60 that we want, and we will separately take however many people in each other group that we want. Therefore we can scale up one stratum without scaling up another, keeping our total sample size down. This can be helpful for some experiments (though perhaps not for others, where we will simply run the trial for a number of weeks and wait to get enough people in each stratum).

In general, we can multiply our estimated necessary sample size by various factors, often called design effects, to account for any such considerations. If we want to go from a one-tailed to a two-tailed test, we need roughly 25–30% more people. If we expect some fraction of missing data, we need that many more people. We will also need larger sample sizes to detect interaction effects, essentially treatment effects that differ by stratum.

And we also need a larger sample size for our last kind of randomized trial, the cluster randomized trial.

Network Effects

We said this was a comprehensive guide, didn’t we? So let’s talk network effects.

Essentially, a randomized trial is supposed to satisfy the Stable Unit Treatment Value Assumption (SUTVA), which says that the treatment isn’t affecting the control individuals. However, in our interconnected world, this assumption can be violated.

People in the control group can know people in the treatment group, which can give them access to the treatment group’s knowledge, experience, beliefs, behaviors, resources, etc. This shrinks the difference between the control and treatment outcomes in the experiment, so the measured effect underestimates the real effect of the treatment relative to a truly separate control.

The other SUTVA violation is the opposite — resources are finite, so if the treatment group uses a different amount of resources, then this impacts the amount of resources available to the control group. This will increase the measured difference between the treatment and control in the experiment, so that the experiment overestimates the treatment effect.

We resolve this by trying to separate the treatment and control groups. Depending on the case, we can separate them geographically (treatment group in one city, control in another), in time (treatment group at one date/time and control in another), or by friend groups (cuts to a social network graph that leave groups that are minimally connected to each other).

Depending on how we analyze the resulting experiment, we may need a somewhat larger, or a much larger, sample to compensate for the experiment structure. Traditional methods have us treat each cluster as one unit, so the clusters are what is counted in the sample size N, and needing N clusters rather than N people is a MUCH bigger sample!

More recent methods keep the individual observations and include their cluster in the modeling stage, to separate the effect of the cluster from the effect of the treatment. Since observations are correlated within their cluster, they are not independent samples in that sense, and we will need a larger sample size to compensate. The Intra-Cluster Correlation (ICC) is a measure of how similar samples are within clusters vs. between clusters, and it dictates how much extra sample size we need. The design effect factor that we multiply our sample size by is 1 + ICC*(m − 1), where m is the average cluster size and the ICC ranges from 0 to 1.
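A tiny sketch of that adjustment, with an assumed ICC and cluster size just for illustration:

```python
# Sketch: inflating the required sample size by the cluster design effect
# 1 + ICC * (m - 1), where m is the average cluster size. Numbers are assumed.
def clustered_sample_size(n_independent, icc, avg_cluster_size):
    design_effect = 1 + icc * (avg_cluster_size - 1)
    return int(round(n_independent * design_effect))

n_needed = 160_000   # per-group sample size from the power calculation above
icc = 0.02           # assumed intra-cluster correlation
m = 50               # assumed average people per cluster

print(clustered_sample_size(n_needed, icc, m))  # 316,800 -- roughly double
```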

Let’s move on to the next stages.

Photo by Georgie Cobbs on Unsplash

Step 2: Pruning

The problem with pruning

Pruning data is when we remove data points before doing our analysis. It’s a contentious topic, and for good reason. The flexibility in pruning criteria means that a researcher can try many pruning heuristics until they get the result that they want (this way to cheat in science is sometimes called p-hacking). To handle that, let’s first make 2 points:

  1. Your analysis should be planned out before you do it. Researchers have recently been publishing their analysis plans on arxiv, so that they are publicly accountable for carrying out an analysis that has not been tweaked to get the desired results.
  2. If you do consider several different strategies, then you have to account for it in the variance of your results. p-value thresholds need to be corrected to keep the true false positive rate to your claimed threshold, since you’re effectively running more than one experiment. We’ll dive into p-value corrections later.

Next, let’s understand pruning more deeply, through a discussion about picking the right data.

When we prune, what are we doing?

Pruning outliers is common practice, but why exactly do we do it? When we see an outlier, we know that it will mess up our measurements, whether it’s a mean or a model. But why isn’t it right to keep the outlier?

Fundamentally, we are saying that we don’t want to model the phenomenon that produced this outlier. If the outlier is due to bad data collection, then we are saying that we don’t want our model to include bad data collection. Or, if the outlier is a real but large number, like a giant house in a house price data set, then we are saying that we don’t want our model to include giant houses.

If our goal is causal inference, then we must be clear on what we are inferring. If data is produced by causes that we do not want to model, then we should prune it. And data doesn’t need to be an outlier to fit this definition. Any data that’s from a different source that we don’t want to model, or about a subset of people that we don’t want to model, should be excluded!

This concept is perfectly aligned with our notion of confounding variables. Indeed, the concept of “generalization” in statistical models that “generalize” to the test population is usually about modeling the same causes as our test population. If our data set / training data differs in some key way from our test data / general population, like coming from a different time, a different place, or a different demographic distribution, those are confounding causes that we have not properly captured in our model. Models failing to generalize is largely an issue of confounding variables. We must use the right data to model the right causes, in order to answer the questions that we want answered.

Let’s prune!

First things first, you’re welcome to engage in your regular pruning practice. Bad data should certainly be fixed or removed, and you’re welcome to say, “I don’t want to model these extreme cases.” If you trim off the top 10% of your data, then just explicitly say that you’re modeling the bottom 90% of the population. If you trim off houses above $5M, then just say your model doesn’t apply to those. Sometimes, having a different model for each regime is a great solution. (e.g. Segmented Regression).

Now let’s extend your pruning practice with these causal inference concepts. In randomized trials, we wanted the treatment and control groups to have the same number of every covariate that affects the outcome. In a simple Randomized Trial, we got close to that, but not exactly. It gets much worse if you don’t have a randomized trial at all. If we’re just taking data from the population, a quasi-experiment, then the covariates may be extremely unbalanced between each treatment group.

So, we can simulate a more controlled experiment by ignoring some measurements. We could just keep subsets of measurements, such that our selected data does have equal covariates between the subsets of people that get each value of the treatment variable.

For example, if we wanted to look at the direct impact of parents’ income on children’s income, we might need to control for variables like the state they live in. We can’t do any kind of randomized trial here, since we can’t set the parents’ income.

To see why this would be an issue, consider that the general population has disproportionately different incomes in different states, so if the state of residence independently affects children’s income, then in the general population the families with high parent income will usually live in a high-earning state, which will bias the kids’ incomes (combining the effects of parents’ incomes and state into our measurement, when we only wanted to measure the direct effect of parents’ incomes).

So instead we can select data such that the parents’ income groups are equally split in each state. The ratio of high-earning to low-earning parents should be the same for each state in our pruned data. That way, we aren’t conflating the effect of parents’ income with the effect of the state of residence.

Let’s get into the details of how to do this.

Matching

Whether we have one or multiple important confounders, there are a couple of good ways to prune the data. Many pruning techniques rely on some form of matching — we match each treatment observation to a control observation with very similar confounders.

With just one confounder, we can match treatment observations to the control observation with the most similar value for the confounder. If we have multiple confounders, then we must consider multiple dimensions of confounders simultaneously when picking the best match.

Since our confounders can be continuous or categorical, we must have solutions for both cases.

If our confounders are continuous, we can just match each treatment observation to the control observation whose confounders are the smallest “distance” apart. This is called Mahalanobis Distance Matching; the Mahalanobis distance rescales the confounders (by their covariance) so that each one matters equally per unit distance. In reality, we should probably scale based on how much each confounder affects the outcome measure.
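Here is a minimal sketch of Mahalanobis distance matching with scipy, on made-up arrays of continuous confounders:

```python
# Sketch: Mahalanobis distance matching of each treated unit to its nearest
# control on two continuous confounders (arrays are made up).
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X_treated = rng.normal(size=(200, 2))  # confounders for treated units
X_control = rng.normal(size=(800, 2))  # confounders for control units

# The inverse covariance rescales the confounders so that no single one
# dominates the distance.
VI = np.linalg.inv(np.cov(np.vstack([X_treated, X_control]).T))
distances = cdist(X_treated, X_control, metric="mahalanobis", VI=VI)

match_idx = distances.argmin(axis=1)   # nearest control for each treated unit
matched_controls = X_control[match_idx]
```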

If we have categorical confounders, or bin our continuous confounders into categories, we can again make strata (each possible combination of the confounders). We can prune data until we have the same number of data points from each treatment group in each stratum. Alternatively, we can remove all data points in strata where any treatment group is not represented (has 0 data points in that stratum), allow the treatment vs. control counts to be imbalanced within each stratum, and then weight the data points to effectively balance them within each stratum. This is called Coarsened Exact Matching.
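And here is a rough sketch of Coarsened Exact Matching with pandas, assuming the confounders have already been coarsened into bins (column names are illustrative). It keeps only strata where every treatment group appears, and weights the rest so each group carries equal weight within a stratum:

```python
# Sketch of Coarsened Exact Matching with pandas. Assumes the confounders have
# already been coarsened into bins; column names are illustrative.
import numpy as np
import pandas as pd

def coarsened_exact_match(df, treatment_col, coarsened_cols):
    n_groups = df[treatment_col].nunique()
    kept = []
    for _, stratum in df.groupby(coarsened_cols, observed=True):
        # Positivity: keep a stratum only if every treatment group appears in it.
        if stratum[treatment_col].nunique() == n_groups:
            stratum = stratum.copy()
            counts = stratum[treatment_col].value_counts()
            # Weight so each treatment group carries equal weight within the stratum.
            stratum["weight"] = stratum[treatment_col].map(len(stratum) / (n_groups * counts))
            kept.append(stratum)
    return pd.concat(kept)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treated": rng.integers(0, 2, 2000),
    "age_bin": rng.integers(0, 5, 2000),
    "gender": rng.choice(["f", "m"], 2000),
})
matched = coarsened_exact_match(df, "treated", ["age_bin", "gender"])
```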

I’ll briefly note that another common method in causal inference is propensity matching. We will discuss propensity scores later, so here I will only say that propensity-based pruning is not recommended. Propensity essentially involves projecting multi-dimensional covariates onto a single dimension; it is based on less information and is therefore strictly worse than pruning based on strata. It is sometimes recommended as a solution to the curse of dimensionality and the need for positivity (it’s hard to have data from each treatment group in each stratum if there are many confounders: with n confounders of k levels each, there are k^n strata). See Gary King’s discussion of why it isn’t worth it: https://www.youtube.com/watch?v=rBv39pK1iEs

Categorical Data Representations for Pruning

Let’s revisit the issue of data type. We can only do coarsened exact matching with categorical variables, so this may be the time for us to bin some of our continuous variables. Using more bins for a given variable will give a finer resolution, but can also force us to delete a lot of data, since we need at least one example from each treatment group within every combination of every bin of this variable and every bin of every other variable. Generally, binning a variable into around 5 bins is the norm, but it depends on your sample size. Just take a look at how much data you’re forced to delete as a function of the number of bins.

In addition, if your confounders are more complicated, binning into categories will eventually be a necessary step. For example, one of your confounders could be text, audio or image data! We might be looking at a social platform and want to know the treatment effect of some aspect of the social communication, but we need to control for the content of the communication itself, i.e. control for the language of the post/message, or its image or video! In that case, we are going to need a meaningful, simple representation of that high dimensional data in order to do causal inference.

To include it in Coarsened Exact Matching, we are going to need a relatively low dimensional representation, to keep the number of blocks reasonable so that we don’t delete too much data. Natural language data could be binned into a small number of categories based on sentiment/emotion and semantic topic.

Next, let’s roll into Step 3, where we explicitly model the effects of every variable.

Photo by STIL on Unsplash

Step 3: Modeling the Confounders

Riffing on our discussion about pruning, we don’t actually need our dataset / train data to have exactly the same balance of covariates as our test data. We just need to know all of the confounders that affect the outcome, and “control” for them in some way.

Even if our train data is 10% children while our test data is 90% children, we can potentially make perfect predictions if we have fully modeled the effect of being a child.

Modeling can be done in a variety of ways. We might think that we should be concerned with the interpretability of our models. After all, discussions of model interpretability are essentially about causality — we want to know Why our model predicted outcome y. Which features play the largest roles in the prediction?

Generally, I’d say these discussions of interpretability hinge on not knowing which variable we should treat as our treatment variable. If we want to consider the causal effect of every variable while simultaneously using all the others as controls, then yes, interpretability is excellent. We can use a simple linear regression and interpret all of the coefficients as the causal effect of each variable (we’ll get into this more rigorously). However, we know that complexities arise with multicollinearity, and we should be concerned about whether “treating all the other variables as confounders” is appropriate given our DAG model of our causal system.

But if we do know which variable we want to analyze as the treatment (and of course this is already set if we did a Randomized experiment), then it’s time to do proper causal inference modeling.

Regarding randomized experiments, we may well still want to include modeling. In a simple randomized experiment, our covariates are not exactly balanced. If we don’t handle this with pruning and weighting, then we can handle it in modeling.

For a block randomized trial, we could potentially measure each stratum without any modeling, since covariates are perfectly matched. However, we may have only blocked some of the covariates to keep the dimensionality down, so here we can model the rest of the covariates, in a single model or a model for each stratum. Or if our blocking is based on a binning of a truly continuous variable, then we may want to model that continuous variable with a model for each stratum.

Finally, for cluster randomized trials, we need to use modeling if we want to use the individual observations, since the clusters are an additional confounder.

Meta-Learners

The Meta-learner framework allows us to use ANY arbitrarily complex ML model in the same simple way to measure the treatment effect. S, T and X learners allow us to translate any Machine Learning model into a Causal Inference machine.

The simplest framework is the S-learner. First, we train any model on our data, using the treatment and the covariates to predict the outcome. It could be a regression, a random forest, kNN, boosting machine, Neural network, you name it.

Then, with our model, we take our data, replace all of the treatment variable measures with the value of one treatment group, and use the model to predict the outcomes. Those act as our observations for that treatment group. Then we repeat for each treatment group. Voilà, it’s like an experiment, with artificially generated data that has identical covariates in each treatment group — literally every observation is repeated in each treatment group, with only the treatment changed. The difference between the average of each group is a measure of our ATE (Average Treatment Effect).

We can also get the CATE (Conditional Average Treatment Effect) by simply only looking at the observations with a given condition. If we want the treatment effect for men, we can just look at the observations that are men when we do the S-learner process.
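Here is a minimal S-learner sketch with scikit-learn on simulated data (the column names and data-generating process are just for illustration):

```python
# Sketch of an S-learner with scikit-learn: one model over treatment + covariates,
# then predict every observation under each treatment value. Data is simulated
# and column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def s_learner_effects(df, treatment_col, covariate_cols, outcome_col):
    model = GradientBoostingRegressor()
    X = df[[treatment_col] + covariate_cols]
    model.fit(X, df[outcome_col])

    X_treated = X.copy()
    X_treated[treatment_col] = 1   # pretend everyone was treated
    X_control = X.copy()
    X_control[treatment_col] = 0   # pretend no one was treated
    return model.predict(X_treated) - model.predict(X_control)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treated": rng.integers(0, 2, 5000),
    "age": rng.normal(35, 10, 5000),
    "is_male": rng.integers(0, 2, 5000),
})
df["retention"] = 0.3 * df["treated"] + 0.01 * df["age"] + rng.normal(0, 1, 5000)

effects = s_learner_effects(df, "treated", ["age", "is_male"], "retention")
print("ATE:", effects.mean())                             # roughly 0.3 here
print("CATE (men):", effects[df["is_male"] == 1].mean())  # condition by subsetting
```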

T- and X-learners riff on this with some more powerful ways of controlling — like training a separate model for each treatment group, and training additional models on the errors between the measurements and the model predictions. These meta-learners are all based on an understanding that we are attempting to measure counterfactuals: what would have happened if a given observation had only its treatment group changed.

With this knowledge, we can jump back to interpreting a linear regression as causal. Indeed if we used a linear regression in the S-learner framework, then the average difference between the treatment groups is just the coefficient in front of the treatment variable, since everything else subtracts to 0.

So if indeed everything else in the regression is a proper confounder, and we’ve measured all confounders, and the linear model captures the full relationship between all of the variables, then this S-learner is a good measure of the causal effect.

But reality isn’t so pretty.

Heterogeneous Treatment Effects and Interactions

Good modeling means estimating our system as accurately as we reasonably can. A simple linear model is only a good model if each variable linearly relates to the outcome, and each variable’s effect on the outcome is completely independent of the other variables. Is that the case for your system?

In terms of causal inference, reality often has heterogeneous treatment effects — the treatment effect is a function of the other variables. That is, who is being treated dictates the effect of the treatment. For example, the treatment may affect people differently depending on their demographics, health, geography, level of knowledge, or goals.

Reality also has interactions between confounders. The effect of a confounder like gender on the outcome may be a function of age — perhaps young women are biased toward better outcomes more than any other age and gender group, while older women are actually biased toward the worst outcomes of any age and gender group. In that case, linearly adding the contributions of age and gender won’t capture this phenomenon.

Generalizing, the effect of each variable on the outcome can be arbitrarily nonlinear and arbitrarily a function of any other variable.

It’s a good thing that we can use complex models for causal inference.

I’ll note here that we can also use many models. There’s no need for a singular trained model to represent everything. It’s often a good idea to train a different model for different ranges of input values. If we have some strata, a different model for each stratum may be very helpful.

There are many upsides to using different models for each stratum: we can allow the general shape, or even the type of model, to be completely different for each stratum. In addition, each model essentially captures the heterogeneous and interaction effects between the categories that define that stratum and the variables being modeled.

For example, if we have a stratum for women under 40, and within that stratum we have a model containing age (as a continuous variable), the treatment, and whether they are a new user, then we are capturing the effect of the treatment solely on women under 40, a heterogeneous effect, and we are also capturing the effect of continuous age and new-user status on women under 40, an interaction effect.

In choosing to use many models, there are also downsides. For one, we will have less data for each model. If we had a single model with all of these interaction terms, it would perhaps be even worse, since the data needed to train a model scales superlinearly with the number of terms. But if we had a single model with a more limited set of interaction terms, we might have more data per term than with the model-per-stratum approach.

Further, with a single model it’s typically recommended to include the main effects (non-interaction terms), and each stratum would contribute to the averaged estimate of those terms. In the model-per-stratum approach, this aspect is missing.

Feature Extraction as Usual

The rules of good Machine Learning still apply. If we keep some of our data continuous, then we need to try to model its relationship with the outcome variable, as with any ML. While a decision-tree-based model (forests, GBMs) will handle the continuous data for us, and a neural network can learn arbitrary functions, simpler models require our help. We can extract new features from our data by making polynomials of continuous variables, taking their ratios, or doing other feature extraction / dimensionality reduction techniques to create new features (PCA, ICA, RBF kernels, etc.).

As usual in ML, we should consider scaling features if necessary. Any approach that uses a distance metric (kNNs, RBF kernels) should have rescaled features, so that distances value each feature equally (or feel free to rescale them as you see fit). We’ll also probably rescale if we use regularization, which we will discuss later on.

Binning continuous variables into categorical, or rather ordinal, variables can be helpful for correctly modeling the domain.

In a linear model on continuous variables, we need to use polynomials and ratios to capture any nonlinear relationship. However, if we bin a continuous variable, our One Hot Encoding lets the model learn a separate constant for each bin. If we map that back to our original continuous variable, that’s like learning a piecewise constant function. With a large enough number of pieces, this can capture much more complex relationships than an explicitly modeled continuous variable.

For example, if we bin a user’s number of social contacts into several bins, then we might be able to measure that there are multiple peaks and troughs along the continuum — perhaps having 0 contacts leads to the highest retention because people forget about their account, while having 1–5 contacts is bad. Then perhaps 6–25 contacts is also good because they use the platform to communicate with class friends, but 25–200 contacts is bad because all of their weak ties create noise in the feed, while having 200+ contacts is good because those users are using the platform for their business.

Binning continuous variables can similarly help us model heterogeneous and interaction relationships. If we bin variable X into 5 bins [X1…X5], then we can create terms from the product of each bin indicator and the treatment (X1*T, X2*T, etc.), or between the X’s and another confounder. This again enables a piecewise model of these interactions, which allows nice flexibility. To do this with continuous variables, we would need to consider the products of ratios and polynomials of X with the other terms, which could be more complicated while still specifying the wrong structure. But it depends on your domain!
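A small sketch of that idea with pandas, using hypothetical bin edges for the number of contacts and crossing the bin indicators with the treatment:

```python
# Sketch: bin a continuous confounder, one-hot encode the bins, and cross them
# with the treatment to get piecewise interaction terms. Bin edges and column
# names are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treated": rng.integers(0, 2, 1000),
    "n_contacts": rng.integers(0, 500, 1000),
})

bins = [-1, 0, 5, 25, 200, np.inf]               # 0, 1-5, 6-25, 26-200, 200+
labels = ["0", "1-5", "6-25", "26-200", "200+"]
df["contacts_bin"] = pd.cut(df["n_contacts"], bins=bins, labels=labels)

dummies = pd.get_dummies(df["contacts_bin"], prefix="contacts", drop_first=True).astype(int)
interactions = dummies.mul(df["treated"], axis=0).add_suffix("_x_treated")

# Feed `design` to a linear model: the interaction columns let the treatment
# effect differ piecewise across the contact-count bins.
design = pd.concat([df[["treated"]], dummies, interactions], axis=1)
```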

How do we choose the right model?

We said that with the meta-learner framework we can use any type of ML model, and we are now also considering a number of interaction/heterogeneous terms and the choice to train different models for different regimes.

So how do we figure out the right modeling approach?

Well, consider that we just want a model that accurately captures our system. It doesn’t seem fundamentally different than the typical Bias-Variance tradeoff. That’s no surprise, since we said that model generalization has a lot to do with correctly modeling the causality of our system.

If we can find a model with high generalization accuracy, then we’ve probably modeled our causal system fairly well — if changes in any measurable variable lead to predictable outcomes, then we understand the forces in our system, at least to the extent that the test data explores.

So, we’re interested in the usual — an ML model with a high test score. And as with usual best practice, we should prefer simpler models if the test scores are indistinguishable. We should also prefer models where repeated fits lead to the same model — similar coefficients in a linear model, importance scores in a forest, and in using these models for causal inference, the same estimated treatment effect each time we fit and measure again.

As with all of our modeling approaches, we should try many ways that make sense with our system and compare the results. Where results are the same, we can feel confident that our model/data is robust to small changes in assumptions / framework. Where results differ, we learn something important about our data — perhaps that a particular relationship is important.

Let’s consider our options.

Regularization

We’ve discussed a simple linear regression. The next step of complexity, or rather, a more complex loss function that reduces model complexity, is to Regularize our linear model. If regularization improves our test score while reducing model complexity, it’s likely that we’re better capturing the causal relationships.

Indeed, papers have explored the relationship between the real treatment effect and the estimated effects when fitting regressions with various degrees of regularization. Findings generally indicate that strong regularization tends to lead to more accurate treatment effects. You can grid search, halving search, randomized search, or use Bayesian Optimization to find the best regularization parameters for an Elastic Net, and see what works.

One paper from Amazon in 2019 analytically determines the correct amount of regularization in order to best estimate the treatment effect, under some assumptions. Generally it aligns with more severe regularization.

Inverse Probability Weighting

As we’ve discussed, the heart of causal inference is in balancing confounders. If your treatment groups have the same values of the confounders, then the difference in the outcomes will be focused on the treatment effects.

We’re already discussing modeling, so we’re doing a good job of controlling for any remaining differences in confounders. But we can do better by accounting for how much data we have.

Consider that our models minimize a loss function that sums the errors from each point in our dataset. If our dataset has many women in treatment 1 but not many women in treatment 2, then the model will care more about optimizing the prediction for women in treatment 1 than for women in treatment 2. We want the model to care equally about all strata, and all heterogeneous treatment effects, but our data doesn’t represent that.

So we weight the data.

We will weight the data so that there are effectively an equal number of data points from each treatment group within any combination of confounders. This approach has been done in a number of ways, and consistently shows improvements in estimating the treatment effect.

In Coarsened Exact Matching, we mentioned that we can weight the data if a stratum was imbalanced between treatment groups. Weighting the data before you model is even better than simply weighting the measurements at the end.

Mathematically for categorical variables, we can just weight the data by the inverse of the frequency of its treatment group. For example, if we have 3 examples of treatment 1 and 5 examples of treatment 2, then we can weight the data points by 1/3 and 1/5, respectively. Then each treatment group has the same total weight of 1. We can also weight by the inverse of the probability of a point being in each treatment group. The probability is 3/8 and 5/8 respectively, so the inverse probabilities are 8/3 and 8/5. Weighting the points with these values will also lead to the same total weight for each treatment group. We call this Inverse Probability Weighting.
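In code, reusing the 3-vs-5 example:

```python
# Sketch: inverse probability weights for the 3-vs-5 example above.
import pandas as pd

treatment = pd.Series(["t1", "t1", "t1", "t2", "t2", "t2", "t2", "t2"])

p_group = treatment.value_counts(normalize=True)  # t1: 3/8, t2: 5/8
weights = treatment.map(1 / p_group)              # t1 rows: 8/3, t2 rows: 8/5

# Each treatment group now carries the same total weight (8 each):
print(weights.groupby(treatment).sum())
```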

But how can we weight the continuous variables in the model?

Propensity

Well, this is as good a time as any to discuss propensity scores. We rejected the notion of propensity before, because pruning based on the full information of the confounders was better than projecting to one dimension. But here, we need a single continuous weight for the frequency/probability of each treatment group, so projecting to one dimension is exactly what we want.

A propensity model can be any ML model. Rather than our “main” ML model which uses the treatment + confounders to predict the outcome, a propensity model instead uses the confounders to predict the treatment. It can be a linear model, or something as nonlinear as a forest, kNN or neural network. We just need to know, for a given set of confounders, what is the likelihood that the observation is in each treatment group.

What we are really trying to do is create a proxy for how many observations are in each treatment group at every value of the confounders.

For example, if our dataset has many observations of men who are roughly 39, but very few women who are roughly 39, then the observations are biased in that range. We can correct for this bias in our main model by using a propensity model to estimate the degree to which we have more male observations than female around this age range, and using those weights on our observation data to better train our main model.

Adding propensity scores as model weights makes our model more robust. Since we are concerned about so many moving parts as we try to estimate the treatment effect, this robustness is welcome.

Furthermore, since we are using the propensity model as a probability, we want it to be well calibrated, and avoid values too close to the edges. Let’s discuss that in some depth:

Consider the case where we have a binary treatment, like 1 treatment vs 1 control, indicated by 0 or 1. The propensity score is the probability that an observation received the treatment. When we train a binary classifier that outputs a “probability” from 0 to 1 (like with a predict_proba() method), we would very much like this probability to be as accurate as possible. With classification, we typically don’t care about the exact probabilities — we just care about the separation of the two classes by some decision boundary (at some probability value); we typically only care up to a monotonic scaling. But here we want the exact probabilities, for propensity scoring to work properly.

Thus, we should subject our propensity score classifier to calibration. Calibration is just a way of using some validation data to get those probabilities exactly right. We can calibrate our predictions through a simple sigmoid curve, or through a nonparametric, arbitrary “isotonic” function if we have enough data to learn it without overfitting. Scikit-learn has calibration functionality.
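Putting the pieces together, here is a sketch of a calibrated propensity model whose scores become sample weights for the main outcome model (the simulated data and model choices are just illustrative):

```python
# Sketch: a calibrated propensity model whose scores become sample weights for
# the main outcome model. The simulated data and model choices are illustrative.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X_conf = rng.normal(size=(5000, 3))                            # confounders
treated = (rng.random(5000) < 1 / (1 + np.exp(-X_conf[:, 0]))).astype(int)
y = 0.5 * treated + X_conf[:, 0] + rng.normal(size=5000)       # outcome

# Propensity model (confounders -> treatment), wrapped in sigmoid calibration.
propensity = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=5)
propensity.fit(X_conf, treated)
e = np.clip(propensity.predict_proba(X_conf)[:, 1], 0.01, 0.99)  # avoid extremes

# Inverse probability of the treatment actually received.
weights = np.where(treated == 1, 1 / e, 1 / (1 - e))

# Main model (treatment + confounders -> outcome), trained with those weights.
main_model = GradientBoostingRegressor()
main_model.fit(np.column_stack([treated, X_conf]), y, sample_weight=weights)
```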

Let’s dig a bit deeper into choosing the right models.

Error Structure of the Output and Generalized Linear Models

While we can blindly use an AutoML kind of approach to Machine Learning, the domain and the math can suggest some good solutions.

It’s always important that we examine the error structure of our outcome variable. Asymmetrical residuals let us know if our model is biased, and the variances at different values of the independent variables should be examined too.

The least squares cost function is really the maximum likelihood estimator for the case when the outcome variable is equal to the output of the model plus a simple normally distributed error, epsilon:

y = linear model + epsilon,

with epsilon ~ N(0, variance)

This implies a homoscedastic reality, i.e. the noise outside of the linear model is not a function of the independent variables (the variance of y is the same all along the x axes)

Of course, this may not be the case.

Certainly it is not the case with a categorical outcome variable, where the distribution is some kind of binomial/multinomial, taking on n values with some probabilities for each value of the independent variables x.

Zooming out, the Generalized Linear Model approach allows us to specify other error structures. We can choose any distribution from the Exponential Dispersion Family, which includes most distributions that you’ve heard of.

If we don’t like our error structure, or don’t want to predict a very skewed outcome variable (since the squared errors will be enormous in the right tail), we can consider transforming our output variable. We just have to specify the right error distribution for the transformed output variable.
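For reference, here is a sketch of specifying a few common error structures with statsmodels GLMs on a toy DataFrame (the columns and family choices are illustrative):

```python
# Sketch: specifying different error structures with statsmodels GLMs on a toy
# DataFrame. Columns and family choices are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treated": rng.integers(0, 2, 2000),
    "age": rng.normal(35, 10, 2000),
    "churned": rng.integers(0, 2, 2000),     # binary outcome
    "n_purchases": rng.poisson(3, 2000),     # count outcome
    "revenue": rng.gamma(2.0, 50.0, 2000),   # skewed positive outcome
})

# Binary outcome: binomial family (logistic regression).
logit_fit = smf.glm("churned ~ treated + age", data=df, family=sm.families.Binomial()).fit()

# Count outcome: Poisson family.
poisson_fit = smf.glm("n_purchases ~ treated + age", data=df, family=sm.families.Poisson()).fit()

# Skewed positive outcome: a Gamma family is often a better error structure
# than transforming the outcome and assuming normal errors.
gamma_fit = smf.glm("revenue ~ treated + age", data=df, family=sm.families.Gamma()).fit()

print(logit_fit.summary())
```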

While this aside is useful in itself, I thought it would be a good introduction to considering the distribution of our learned parameters.

Mixed/Hierarchical Models and Generalized Estimating Equations (GEEs)

Standard practice for categorical variables is to transform them into One Hot variables. That is, for a category with N possible values, we map the value for each observation onto a vector of length N-1 (to avoid perfect multicollinearity), representing N-1 of the categories with indicator variables in the N-1 vector elements, and letting the all-zeros state refer to the dropped “reference” category.

When we do this One Hot Encoding, we allow our model to learn independent parameters for each category. Let’s think about a linear model. The value of one category’s parameter hardly affects those of the other categories (except that they can all be shifted by some amount if that amount is factored into other terms in the model).

But for some problems, it would be better to specify the distribution of these parameters that represent the effect of a categorical confounder. It turns out that we need this in order to model the individual observations in a Cluster Randomized Trial. At least, it’s the recommended best practice.

A mixed/hierarchical model does just that. Applied to modeling a clustered trial, we specify the distribution of coefficients for the effect of the clusters. That way, the cluster that an observation belongs to acts as an additional source of error, with each cluster having its own term, and the set of these terms forming a normal distribution with mean 0 and some variance.

Any “correlated” data can be modeled in this way. Any time we have multiple observations from each individual, this approach can be appropriate, with the individuals acting as the grouped “random effect” with normally distributed parameters. I’ve also used this approach when looking at users’ responses to influencers or campaigns, since we want to include the individual responses in the model, but they are correlated by the campaign or influencer to which they respond.

Another approach to the same type of problem is with Generalized Estimating Equations (GEEs). It’s recommended that the variance structure be specified as exchangeable, but you might as well try all of the structures.

Complicated models like the mixed model, the generalized linear model, and generalized estimating equations can all be fit in Python with the Statsmodels library.
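Here is a minimal sketch of both a mixed model and a GEE on made-up clustered data (the treatment, cluster, and y names are purely illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic clustered data: each cluster gets its own random shift
rng = np.random.default_rng(0)
df = pd.DataFrame({"cluster": np.repeat(np.arange(20), 10)})
df["treatment"] = (df["cluster"] % 2).astype(float)  # treatment assigned by cluster
cluster_effect = rng.normal(scale=1.0, size=20)[df["cluster"]]
df["y"] = 1.0 + 0.5 * df["treatment"] + cluster_effect + rng.normal(size=len(df))

# Mixed model: a random intercept per cluster, drawn from N(0, tau^2)
mixed_fit = smf.mixedlm("y ~ treatment", data=df, groups=df["cluster"]).fit()

# GEE with an exchangeable working correlation: observations within a
# cluster are treated as equally correlated with one another
gee_fit = smf.gee("y ~ treatment", groups="cluster", data=df,
                  cov_struct=sm.cov_struct.Exchangeable(),
                  family=sm.families.Gaussian()).fit()

print(mixed_fit.summary())
print(gee_fit.summary())
```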

Advanced Causal Inference Models

Finally, we can touch on a few other models specifically designed for causal inference.

The Doubly Robust model is a slight extension of our discussion of using Propensity scores alongside our model. Like the Meta-learners, it uses our main outcome model to make predictions, but it also uses the propensity model and combines the two predictions for each observation. The final equation for estimating the ATE (Average Treatment Effect) uses the causal model and the propensity model in such a way that if either model is correctly specified, the estimate is consistent. It follows that small errors in both models, or a large error in just one of them, will not perturb our estimate of the treatment effect too much.
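As a hedged sketch of one common doubly robust form (the AIPW estimator; the function and argument names below are mine, not from any particular library), the combination looks like this:

```python
import numpy as np

def doubly_robust_ate(y, t, mu1_hat, mu0_hat, e_hat):
    """AIPW-style doubly robust ATE estimate.

    y: observed outcomes, t: 0/1 treatment indicator,
    mu1_hat / mu0_hat: outcome-model predictions under treatment / control,
    e_hat: propensity-model predictions of P(T=1 | X).
    """
    # Each arm's estimate is the outcome model's prediction, corrected by the
    # inverse-propensity-weighted residuals of the observations in that arm.
    treated = mu1_hat + t * (y - mu1_hat) / e_hat
    control = mu0_hat + (1 - t) * (y - mu0_hat) / (1 - e_hat)
    return np.mean(treated - control)
```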

Double Machine Learning (DML) and Causal Forests are two other popular methods. DML can use a Causal Forest or other arbitrarily complex ML models to handle nonlinearities, interactions, and heterogeneous effects. It takes a robust approach built on sample splitting (cross-fitting), and, interestingly, models the outcome as a function of the confounders alone, alongside a propensity-like model of the treatment.
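A bare-bones sketch of the partialling-out idea at the heart of DML, on made-up synthetic data and skipping the cross-fitting you would use in practice to avoid overfitting bias (libraries like EconML package the full procedure):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: a confounder matrix W drives both treatment and outcome
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 3))
t = W[:, 0] + rng.normal(size=1000)                       # treatment depends on confounders
y = 2.0 * t + W[:, 0] - W[:, 1] + rng.normal(size=1000)   # true effect of t is 2.0

# Model the outcome and the treatment as functions of the confounders alone
y_res = y - GradientBoostingRegressor().fit(W, y).predict(W)
t_res = t - GradientBoostingRegressor().fit(W, t).predict(W)  # propensity-like model

# Regress residual on residual: the slope estimates the treatment effect
effect = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
print(effect)
```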

Let’s go ahead to Step 4.

Photo by krakenimages on Unsplash

Step 4: Measurement

Stratifying again

If we don’t fit any models, the measurements are the usual statistical tests. That is to say we can take the average of the outcomes for each treatment group, and determine whether these means are different.

It is better, however, to stratify when we do this. Taking the treatment effect for each stratum and then pooling these measures back together (based on their prevalence in the population we seek to model) gets us the average treatment effect while strongly controlling for those confounders.
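As a small sketch of that pooling with pandas (the column names outcome, treated, and stratum are hypothetical):

```python
import pandas as pd

def stratified_ate(df):
    # Treatment effect within each stratum
    per_stratum = df.groupby("stratum").apply(
        lambda g: g.loc[g["treated"] == 1, "outcome"].mean()
        - g.loc[g["treated"] == 0, "outcome"].mean()
    )
    # Weight each stratum's effect by its prevalence in the population
    weights = df["stratum"].value_counts(normalize=True)
    return (per_stratum * weights).sum()

df = pd.DataFrame({
    "stratum": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "treated": [1, 1, 0, 0, 1, 1, 0, 0],
    "outcome": [5.0, 6.0, 3.0, 4.0, 10.0, 11.0, 7.0, 8.0],
})
print(stratified_ate(df))  # 2.0 in stratum a, 3.0 in stratum b, pooled -> 2.5
```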

So if we have a randomized trial with slightly unbalanced confounders, we can use this technique.

And certainly if we simply have observational data we should use this technique, especially if we couldn’t prune very well.

The effect of stratifying here can be very similar to modeling, especially modeling interaction terms of One Hot variables. A linear model could contain k-dimensional interaction terms of k different One Hot confounder variables, and these terms would essentially measure the same thing as just stratifying the data into those k-dimensional strata and taking the average of the data. Combining these techniques strategically is healthy.

So which statistical tests are we using?

Statistical Tests

For continuous outcomes we can use a two-sample t-test, or a multi-sample extension like ANOVA. With categorical outcomes, we can use a chi-squared test.

For continuous outcomes, we should also consider whether having different means is a sufficient answer to our question. With a t-test, we often really intend to ask whether the entire distributions of the control and treatment groups are the same, that is, whether the two groups are truly identical. If so, then you may want to check that the variances are the same as well, with a statistical test for equality of variances.

You may also want to verify that the distributions are normal, or trim/Winsorize them until they are — having the same mean doesn’t make two distributions equal if they have entirely different shapes! Use a QQ plot and a Shapiro-Wilk test to ensure that your distributions are normal, or trim x% of the data and state that you’re testing whether the middle (100-x)% of each distribution is the same, if that’s what it takes to make them normal.

Otherwise, you can use nonparametric statistical tests. Generally these use a ranking of all data points to determine whether the distributions are plausibly the same.
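A quick sketch of these tests with scipy, on made-up control and treatment arrays (the contingency table is also illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=200)
treatment = rng.normal(loc=0.3, scale=1.0, size=200)

# Two-sample t-test on a continuous outcome (Welch's version, which does not
# assume equal variances)
t_stat, p_means = stats.ttest_ind(control, treatment, equal_var=False)

# Check that the variances match, and that each distribution looks normal
lev_stat, p_var = stats.levene(control, treatment)
shapiro_stat, p_normal = stats.shapiro(control)

# Rank-based nonparametric alternative when normality is doubtful
u_stat, p_ranks = stats.mannwhitneyu(control, treatment)

# Chi-squared test for a categorical outcome, given a table of counts
table = np.array([[90, 110], [120, 80]])
chi2, p_cat, dof, expected = stats.chi2_contingency(table)
```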

If we matched data, then we can consider paired testing. Typically, paired testing is used when a single individual is tested more than once in a longitudinal study, but it can apply here too. Paired testing has far more power, since the variance of the differences within individuals is much smaller than the variance across the entire set of individuals. Whether to use a paired test or a more conservative independent test is mathematically tied to the correlation between the potentially paired observations: the variance of the differences factors in that covariance, so the paired test ranges from being equivalent to an independent test when there is no covariance, to the case of perfect covariance, where the variance of the differences is zero and any nonzero difference means your paired test rejects the null.

If we do many statistical tests, then we also need to control the family-wise error rate. If we want the total false positive rate for our whole analysis to be alpha, then we can’t run several tests each at that same alpha. A Bonferroni correction is a very conservative way to handle this issue: it ignores any correlation between the tests and forces every one of them to meet the strictest possible criterion (alpha divided by the number of tests). The Holm-Bonferroni correction also ignores correlation between tests, but at least allows several hypotheses to be rejected as long as one passes the strictest criterion. In this method and in other methods for correlated statistical tests, p-values are ranked, much like how observation values are ranked in nonparametric tests.

In addition, Tukey’s Honest Significant Difference (HSD) test correctly handles the family-wise error rate issue that arises from many two-sample t-tests. Essentially, a t-like score is compared to the studentized range distribution, which accounts for taking the range across all of the group means.
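Both corrections, and Tukey's test, are available off the shelf; a sketch with made-up p-values and groups (recent scipy versions ship tukey_hsd):

```python
import numpy as np
from scipy.stats import tukey_hsd
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.034, 0.21]  # made-up p-values from several tests

# Family-wise error rate control across the whole family of tests
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_holm, p_holm, _, _ = multipletests(p_values, alpha=0.05, method="holm")

# Tukey's HSD for all pairwise comparisons between several groups
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, size=50) for m in (0.0, 0.2, 0.5)]
print(tukey_hsd(*groups))
```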

If we do fit models, then we can measure as we’ve discussed. The Metalearner framework or Doubly Robust framework provide ways to measure our average treatment effect. However, we still need the error in these treatment effects. For that, a general solution is bootstrapping.

Bootstrapping to get Error Measurements

Bootstrapping is another magical technique. I think the surprising and useful nature of it puts it up there with the Central Limit Theorem. In Bootstrapping, we seek the error in our measure from some arbitrarily complex process on sample data — we want to know how much our measure on our sample data differs from what the measure would really be on the whole population. With a similar logic to estimating the population error in a t-test from the variance of the sample data, we will use only our sample data to estimate the error in a measure.

Bootstrapping can be used to estimate the error in any measure. We’re used to getting the error in the mean, but what if your measure is the median? What if it’s the coefficient of a linear regression? Perhaps there are formulas for the errors in those based on the variance of the sample data, but what if our measure is something about a random forest or a neural network — that looks tricky!

The technique for bootstrapping is simple. If we have N data points that we use to obtain our measure (through fitting some arbitrarily complex model to those N data points), we just repeatedly sample N data points from our N data points with replacement. So if our data points are the numbers [1, 2, 3, 4, 5], one bootstrap sampling is [2, 2, 4, 5, 5]: it’s still 5 data points, but it’s a different set of data. We use this new “bootstrapped data” to compute our measure again. We do this repeatedly, perhaps 50 or 100 times, to get a distribution of the measure.

And therein lies the magic: The variance of our measures from bootstrapped data sets is a good estimate of the error between our original measure and the true population measure.

Incredible.
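A minimal sketch of the procedure, where the estimator argument can be any function from a sample to a number (the median, a regression coefficient, something pulled out of a random forest):

```python
import numpy as np

def bootstrap_standard_error(data, estimator, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample n points from the n points, with replacement
        sample = data[rng.integers(0, n, size=n)]
        estimates.append(estimator(sample))
    # The spread of the bootstrapped measures estimates the error
    # between our original measure and the true population value
    return np.std(estimates)

# e.g. the error in the median of a made-up outcome array
print(bootstrap_standard_error([1.0, 2.0, 3.0, 4.0, 5.0, 8.0, 13.0], np.median))
```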

There are a number of other ways to bootstrap as well. One interesting technique is taking the residuals of a model, permuting them between data points and applying them to the outcomes before fitting a new model. It’s best to dig into the pros and cons of different bootstrapping techniques if you’ll be depending on this error measurement!

We are also interested in measuring the heterogeneous causal effects and their errors.

Size and Significance of Heterogeneous Effects

Again, heterogeneous effects are when the effect of the treatment depends on other covariates, like the subset of people being treated. As we mentioned, our sample size may not be large enough to power the measurement of all of these heterogeneous effects if it was chosen just to measure a main effect of a certain size, but we can still see whether anything is statistically significant.

We can look at statistical tests within each stratum to see the effect size and significance of the treatment on that single stratum. We can also take a CATE approach to our models and see if the treatment effect for a subset of individuals (defined by a range of some covariates) is significant, and get its effect size. Use bootstrapping where needed to get error measures.

As always, our results may not be stat sig even if there is a real effect. If the bootstrapped standard error is larger than the effect size that we care about, then we need more data.

Photo by Cherrydeck on Unsplash

Next Things to Learn

As a last section, let’s quickly mention a few other concepts that can come up in causal inference.

For one, we didn’t discuss ordinal data much. In fact, all of our continuous variables binned into categories can keep their ordinal structure. Ordinal modeling approaches can fit a single model with different threshold values (as in proportional odds models), or train n-1 different models on the decision boundaries between n ordinal values and combine them. How that interfaces with causal inference is a topic for another day.

More recently, Judea Pearl has been working with Elias Bareinboim on Data Fusion, which is how we can use multiple very different data sources to make a single causal inference. For example, we may have a randomized controlled trial AND a separate observational data set, each measuring different subsets of variables. How can these be combined?

Another line of work is on personalization. If we can only apply the same treatment to all individuals, then perhaps the average treatment effect is enough. But if we are able to give different treatments to different individuals, then we need to figure out which one is best for each person. Individual level causation requires some additional theory.

We can also analyze different regimes of individuals in terms of their outcomes. Outside of causal inference, quantile regression allows us to create a model for different percentiles of the outcome. We might be very interested in modeling the best outcomes if power users provide most of our revenue, or the worst outcomes if avoiding adverse effects is the highest priority. Depending on the domain, we should always consider from the top down what is valuable to predict, and factor that into what causal measure we are seeking. However, we should always be careful about analyzing based on outcomes: simply splitting users by a post-intervention variable leads to biased measures of treatment effects.
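Statsmodels includes quantile regression; a sketch on made-up data (revenue and usage are illustrative column names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"usage": rng.uniform(0, 10, size=500)})
df["revenue"] = 2.0 * df["usage"] + rng.exponential(scale=5.0, size=500)

# Model the 90th percentile of the outcome (the "best outcomes" regime)
top_fit = smf.quantreg("revenue ~ usage", data=df).fit(q=0.9)

# Model the 10th percentile (the "worst outcomes" regime)
bottom_fit = smf.quantreg("revenue ~ usage", data=df).fit(q=0.1)
print(top_fit.params, bottom_fit.params)
```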

In addition, we didn’t pay specific attention to time series data. Difference-in-Differences (DiD) methods are the baseline approach to examining a treatment effect over time, but they assume that the trend is the same for each treatment group (not a function of the pre-treatment values of each group). Essentially, a DiD approach is just a paired test. We can also apply Segmented Regression along the time axis, as we might with other covariates, to allow the level and slope to change in each time period. However, more complex temporal phenomena like autocorrelation and seasonality are missed by these approaches, and models that consider these elements are necessary to properly model the system and get accurate estimates of standard errors. Google’s CausalImpact uses a state-space approach to modeling time series, namely a Bayesian Structural Time Series (BSTS), which is a very flexible approach that can capture most temporal phenomena.
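A baseline DiD is just an OLS regression with a treated-by-post interaction; a sketch on a made-up two-period panel (treated, post, and y are illustrative names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"treated": rng.integers(0, 2, size=n),
                   "post": rng.integers(0, 2, size=n)})
# Common trend of +1.0 after the intervention, plus a true effect of +2.0
# for the treated group in the post period
df["y"] = (0.5 * df["treated"] + 1.0 * df["post"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(size=n))

# The coefficient on treated:post is the difference-in-differences estimate,
# valid under the parallel-trends assumption described above
did_fit = smf.ols("y ~ treated * post", data=df).fit()
print(did_fit.params["treated:post"])
```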

Regression Discontinuity Designs can be used when a treatment is applied to all individuals for whom one covariate exceeds some threshold. Since the control and treatment groups are completely (or fuzzily) separated along this covariate, analyses typically focus on the local average treatment effect, using individuals near that boundary.

Finally, there is recent work integrating causality with Deep Learning. I look forward to Judea Pearl’s upcoming seminar on Feb 25, 2022 on that topic:
https://www.cs.uci.edu/events/seminar-series/?seminar_id=1094

Adieu

I’ve tried my best here to make this Guide comprehensive — it’s my best shot at organizing my general causal inference knowledge over one weekend. I’ll put out some other articles in the future on more specific topics, or feel free to contact me with any questions. I hope it helps you!

For a short, related discussion, check out my article: Variance Reduction in Causal Inference:
