A Causality Summary Part II

Main Reference: https://arxiv.org/abs/2405.08793

Probabilistic Graphical Models

Probabilistic Graphical Models (PGMs) are powerful tools that represent the joint probability distribution of a set of random variables in terms of their conditional dependencies, which are typically defined by a graph structure. A fundamental task in PGMs is sampling from the joint distribution, and this can be achieved using a method known as ancestral sampling.

Key Idea

The essence of ancestral sampling is to generate samples from the joint distribution by leveraging the graph’s structure. Specifically, we exploit the chain rule of probabilities, which decomposes the joint probability into a product of conditional probabilities based on the graph’s directed edges. For a given variable, its value depends only on the values of its parent nodes, as defined by the graph.

Sampling Procedure

1. Graph Traversal Order:

  • Begin with the variables with no parents (root nodes) in the graph. These variables are sampled from their marginal distributions since they are not conditioned on other variables.
  • Then, proceed to sample the remaining variables in a topological order, ensuring that all parent nodes of a variable are sampled before sampling the variable itself.

    2. Sampling Rule for Each Variable:

    For each variable v, given its parents pa(v), sample its value from its conditional probability distribution: 

        \[  \tilde{v} \sim p_v(v \mid \tilde{pa}(v))\]

    where:

    • \tilde{v} is the sampled value of v,
    • \tilde{pa}(v) is the set of sampled values for the parent nodes of v,
    • p_v(v \mid \tilde{pa}(v)) is the conditional distribution of v given its parent nodes.

    3. Sampling Parent Nodes:

    For any node v' \in pa(v) (the parents of v), its value \tilde{v}' is sampled using its own conditional probability distribution:

        \[\tilde{v}' \sim p_{v'}(v' \mid \tilde{pa}(v'))\]

    This recursive process ensures that all necessary parent values are available before sampling any variable.

    Why It Works

    Ancestral sampling works because it directly follows the chain rule of probabilities:

        \[p_V(V) = \prod_{v \in V} p_v(v \mid pa(v))\]

    where each term p_v(v \mid pa(v)) represents the conditional probability of a variable given its parents. By sampling in topological order and respecting the conditional dependencies, we ensure that the sampled values accurately represent the joint distribution.
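    To make the procedure concrete, here is a minimal Python sketch of ancestral sampling over an arbitrary DAG. It is an illustration under simple assumptions, not a reference implementation: the dictionary-based representation (a parents map plus one conditional sampler per node) and the toy rain/sprinkler probabilities are made up for this example.

        import random

        def topological_order(parents):
            # Return an ordering in which every parent precedes its children.
            remaining = {v: set(ps) for v, ps in parents.items()}
            order = []
            while remaining:
                ready = [v for v, ps in remaining.items() if not ps]
                if not ready:
                    raise ValueError("the graph contains a cycle")
                for v in ready:
                    order.append(v)
                    del remaining[v]
                for ps in remaining.values():
                    ps.difference_update(ready)
            return order

        def ancestral_sample(parents, samplers):
            # Sample each node from its conditional distribution given the
            # already-sampled values of its parents.
            sample = {}
            for v in topological_order(parents):
                sample[v] = samplers[v]({p: sample[p] for p in parents[v]})
            return sample

        # Toy model: rain -> wet <- sprinkler (all probabilities are made up).
        parents = {"rain": [], "sprinkler": [], "wet": ["rain", "sprinkler"]}
        samplers = {
            "rain": lambda pa: random.random() < 0.2,
            "sprinkler": lambda pa: random.random() < 0.5,
            "wet": lambda pa: random.random() < (0.9 if pa["rain"] or pa["sprinkler"] else 0.05),
        }
        print(ancestral_sample(parents, samplers))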

    Advantages

    • Exact Sampling:
      • Ancestral sampling produces exact samples from the joint distribution without any approximation errors or bias.
    • Simple Implementation:
      • The method is straightforward and only requires knowledge of the graph structure and each node’s conditional probability distribution given its parents.

    Example

    Consider a simple Bayesian Network with three variables A, B, and C, where A \to B \to C. The joint distribution is given by:

        \[p(A, B, C) = p(A) \cdot p(B \mid A) \cdot p(C \mid B)\]

    Using ancestral sampling:

    1. Sample A from its marginal distribution p(A).
    2. Given the sampled value of A, sample B from p(B \mid A).
    3. Given the sampled value of B, sample C from p(C \mid B).

    This process ensures that the sampled values \tilde{A}, \tilde{B}, \tilde{C} are consistent with the joint distribution p(A, B, C).
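    As a sanity check, the chain above can be sampled in a few lines of Python. The numbers in the conditional probability tables below are made up purely for illustration.

        import random

        # Hypothetical conditional probability tables for the chain A -> B -> C.
        p_A = 0.6                          # p(A = 1)
        p_B_given_A = {0: 0.2, 1: 0.9}     # p(B = 1 | A)
        p_C_given_B = {0: 0.3, 1: 0.7}     # p(C = 1 | B)

        def sample_chain():
            # Ancestral sampling: A first, then B given A, then C given B.
            a = int(random.random() < p_A)
            b = int(random.random() < p_B_given_A[a])
            c = int(random.random() < p_C_given_B[b])
            return a, b, c

        print([sample_chain() for _ in range(5)])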

    A Bayesian network (BN) is a probabilistic graphical model for representing knowledge about an uncertain domain: each node corresponds to a random variable, each directed edge represents a direct dependency, and every node is associated with the conditional probability distribution of its variable given its parents.

    Representing Probability Distribution

    The distribution of a single variable can be represented as a table with two columns:

    1. The values that the variable can take.
    2. The probabilities (or densities) associated with those values.

    For a discrete variable v, the probability table satisfies the normalization condition:

        \[\sum_{v \in V} p_v(v) = 1\]

    For continuous variables, this is expressed as an integral:

        \[\int_{v \in V} p_v(v) dv = 1\]
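    For instance, a binary variable v taking values in \{0, 1\} could have the (hypothetical) table

    • v = 0 with p_v(0) = 0.3,
    • v = 1 with p_v(1) = 0.7,

    which satisfies the normalization condition since 0.3 + 0.7 = 1.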

    Learning from Samples

    If we have a set of samples S = \{\tilde{v}_1, \tilde{v}_2, \dots, \tilde{v}_N\} drawn from an unknown distribution over a discrete variable v, we can approximate the original distribution using a frequency table. Each row in the table is computed as:

        \[\left(v, \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}(\tilde{v}_n = v) \right)\]

    where \mathbf{1}(\tilde{v}_n = v) is an indicator function that equals 1 if \tilde{v}_n = v and 0 otherwise.

    Maximum Likelihood Learning

    This frequency table is exactly the maximum likelihood estimate (MLE) of the distribution.

    In MLE, the probability of each value v is estimated as the fraction of samples that take that value:

        \[q_v(v) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}(\tilde{v}_n = v)\]

    While effective in large-sample scenarios, this approach has limitations:

    • If q_v(v) = 0 for some v, this implies zero probability for values that were not observed, even if they might occur with a small but non-zero probability.
    • This limitation can be mitigated using regularization, for example additive (Laplace) smoothing, as sketched below.
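    The sketch below illustrates both points in Python: the frequency-table (MLE) estimate assigns probability zero to a value that never appears in the samples, while additive (Laplace) smoothing, one common form of regularization, keeps every value at a small non-zero probability. The data, function names, and pseudo-count are made up for illustration.

        from collections import Counter

        def mle_table(samples, support):
            # Maximum-likelihood (frequency) estimate q_v(v).
            counts = Counter(samples)
            n = len(samples)
            return {v: counts[v] / n for v in support}

        def smoothed_table(samples, support, alpha=1.0):
            # Additive (Laplace) smoothing: add a pseudo-count alpha to every
            # value so that unobserved values keep non-zero probability.
            counts = Counter(samples)
            n, k = len(samples), len(support)
            return {v: (counts[v] + alpha) / (n + alpha * k) for v in support}

        samples = ["a", "b", "a", "a", "b"]       # "c" is never observed
        support = ["a", "b", "c"]
        print(mle_table(samples, support))        # {'a': 0.6, 'b': 0.4, 'c': 0.0}
        print(smoothed_table(samples, support))   # every value gets non-zero mass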


    Conditional Probability and Regularization

    The conditional probability p_{v'}(v' \mid v) is defined as:

        \[p_{v'}(v' \mid v) = \frac{p_{v',v}(v', v)}{p_v(v)}\]

    If p_v(v) = 0, the conditional probability is not well-defined. Regularization techniques can help address this by smoothing the probabilities and ensuring robustness when samples are sparse or certain values are missing.
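    A minimal sketch of this, assuming additive smoothing as the regularizer: the conditional table is estimated from observed (v, v') pairs, and the pseudo-counts keep q(v' \mid v) well-defined even for values of v that never occur in the data. All names and numbers are illustrative.

        from collections import Counter

        def conditional_table(pairs, v_support, vp_support, alpha=1.0):
            # Estimate q(v' | v) from (v, v') pairs with additive smoothing.
            joint = Counter(pairs)
            marginal = Counter(v for v, _ in pairs)
            k = len(vp_support)
            return {
                v: {vp: (joint[(v, vp)] + alpha) / (marginal[v] + alpha * k)
                    for vp in vp_support}
                for v in v_support
            }

        pairs = [(0, 1), (0, 0), (1, 1), (1, 1)]   # hypothetical observations
        table = conditional_table(pairs, v_support=[0, 1, 2], vp_support=[0, 1])
        print(table[2])   # a valid distribution even though v = 2 was never seen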


    Structural Causal Models

    The directed edges typically encode conditional dependencies between variables in probabilistic graphical models. However, the direction of an edge in a graph does not necessarily reflect the causal relationship between variables.

    From a purely probabilistic standpoint, flipping the direction of an edge between two variables leaves the joint distribution unchanged, since the joint can be factorized in either direction. This is formalized using Bayes’ rule:

        \[p_v(v \mid v') = \frac{p_{v'}(v' \mid v)p_v(v)}{p_{v'}(v')}\]

    This equation implies that we can “flip” the arrow of an edge between variables v and v' without altering the joint or conditional probabilities. While this property simplifies probabilistic inference, it raises challenges when using graphical models to infer causal relationships.
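    As a small numeric illustration (with made-up numbers), let p_v(v = 1) = 0.4, p_{v'}(v' = 1 \mid v = 1) = 0.5 and p_{v'}(v' = 1 \mid v = 0) = 0.25. Then p_{v'}(v' = 1) = 0.4 \cdot 0.5 + 0.6 \cdot 0.25 = 0.35 and, by Bayes’ rule, p_v(v = 1 \mid v' = 1) = 0.5 \cdot 0.4 / 0.35 \approx 0.571. Both factorizations assign the same probability to the joint event:

        \[p_v(v = 1)\, p_{v'}(v' = 1 \mid v = 1) = 0.2 = p_{v'}(v' = 1)\, p_v(v = 1 \mid v' = 1),\]

    so the edge between v and v' can point in either direction without changing the distribution.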

    Addressing the Confusion in Causal Inference

    We adopt a different representation of the generative process underlying the probabilistic graphical model to clarify causal relationships.

    Instead of focusing purely on the probabilistic distribution, we describe the process of generating a value for each variable v as a deterministic function of:

    • Its parent nodes pa(v),
    • An additional external (exogenous) noise variable \epsilon_v.

    This is expressed as:

        \[v \leftarrow f_v(pa(v), \epsilon_v).\]

    Here:

    • f_v is a deterministic function specific to v,
    • pa(v) are the parent variables of v,
    • \epsilon_v is an independent noise variable that captures randomness.

    This formulation explicitly separates deterministic relationships (via f_v) and randomness (via \epsilon_v), providing a clearer framework for causal reasoning.


    Why Structural Causal Models Are More Informative

    The deterministic function f_v makes the underlying generative process explicit:

    • Perturbation Insight:
      • Changing pa(v) or \epsilon_v affects v, enabling us to reason how v would change under different conditions.
    • Causal Interventions:
      • Intervening on pa(v) allows us to study cause-and-effect relationships systematically.

    Example: Pushing a Book

    Imagine pushing a book on your desk with a force v'.

    The resulting position of the book, v, depends deterministically on:

    1. The applied force v', representing pa(v).
    2. Unknown noise factors \epsilon_v, such as desk friction.

    By modifying v' (the force applied), we can predict the new position v while accounting for the stochastic nature of \epsilon_v. This ability to isolate and analyze specific causal influences is central to Structural Causal Models (SCMs).
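    A minimal sketch of this example as a structural assignment, with a made-up functional form and noise scale (nothing here comes from the reference):

        import random

        def f_position(force, noise):
            # Hypothetical mechanism: displacement grows with the applied force,
            # reduced by unobserved factors such as friction.
            return 2.0 * force - noise

        def push_book(force):
            eps = abs(random.gauss(0.0, 0.5))   # exogenous noise (friction, etc.)
            return f_position(force, eps)

        # Intervening on the force (the parent of the position) changes the outcome,
        # while the noise keeps varying independently of the force.
        print(push_book(force=1.0))
        print(push_book(force=3.0))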


    Structural Causal Models as Triplets

    A Structural Causal Model (SCM) is defined by the triplet:

        \[(V, F, U)\]

    where:

    • V is the set of variables,
    • F is the set of deterministic functions \{f_v\},
    • U is the set of noise variables \{\epsilon_v\}.

    This contrasts with probabilistic graphical models, which emphasize joint distributions. SCMs focus on the mechanisms generating those distributions.

    Converting SCMs to Probabilistic Graphical Models

    SCMs can be converted into probabilistic graphical models using a change of variables approach:

    1. Assume a known prior distribution over each noise variable and draw a sample \epsilon_v from it,
    2. Transform the sampled noise into a value for v using the function f_v:

        \[v \leftarrow f_v(pa(v), \epsilon_v)\]

    As discussed earlier, this allows us to sample from the joint distribution via ancestral sampling.
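    A two-variable sketch of this conversion, with hypothetical mechanisms and Gaussian noise priors: drawing the noise variables and pushing them through the functions in topological order is exactly ancestral sampling from the induced joint distribution.

        import random

        # Hypothetical SCM:  v' <- eps_v'   and   v <- 0.5 * v' + eps_v.
        def sample_scm():
            eps_vp = random.gauss(0.0, 1.0)   # exogenous noise for v'
            eps_v = random.gauss(0.0, 0.1)    # exogenous noise for v
            vp = eps_vp                       # v' <- f_{v'}(eps_{v'})
            v = 0.5 * vp + eps_v              # v  <- f_v(v', eps_v)
            return vp, v

        print([sample_scm() for _ in range(3)])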


    Once we have an SCM, we can use it for various tasks, including inference and reasoning. For example, we can answer probabilistic queries by sampling from the joint distribution that the SCM induces.


    Counterfactual Reasoning

    One of the key strengths of SCMs is their ability to support counterfactual reasoning:

    • Counterfactuals allow us to ask:
      • “What would have happened if some variables were set differently?”
    • For example, if a target variable v had a different value, what noise factors (\epsilon_v) would explain that change?


    Posterior Over Noise Variables

    To answer counterfactual questions, we compute the posterior distribution over the noise variables \epsilon_v, conditioned on observed configurations of v and its parents. The posterior distribution is denoted as:

        \[q(U)\]


    where q(U) places its probability mass on the noise values \epsilon_v that are consistent with the observed configuration of V.


    Practical Implication

    By using SCMs, we can:

    1. Fix certain variables in V to specific values (interventions),
    2. Allow the external noise variables U to vary according to q(U),
    3. Study how these interventions propagate through the system, enabling us to answer “what if” questions about alternative scenarios.
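    The three steps above can be sketched in a few lines for a made-up additive-noise SCM. Because the noise enters additively, the posterior q(U) collapses to a single point, so computing it reduces to solving for the noise value consistent with the observation; none of the numbers or names below come from the reference.

        import random

        # Hypothetical SCM:  v' <- eps_v'   and   v <- 2 * v' + eps_v.
        def f_v(vp, eps_v):
            return 2.0 * vp + eps_v

        # Observed (factual) configuration generated by the SCM.
        vp_obs = 1.0
        v_obs = f_v(vp_obs, random.gauss(0.0, 0.3))

        # 1. Posterior over the noise: the single value consistent with the observation.
        eps_v_inferred = v_obs - 2.0 * vp_obs

        # 2. Intervention: fix the parent v' to a different value.
        vp_new = 3.0

        # 3. Propagate the intervention with the *same* noise to get the counterfactual.
        v_counterfactual = f_v(vp_new, eps_v_inferred)

        print(v_obs, v_counterfactual)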

    Summary

    • Probabilistic graphical models encode dependencies but are agnostic to causal relationships.
    • Structural causal models introduce deterministic functions f_v and noise variables \epsilon_v, enabling explicit modeling of causal processes.
    • SCMs are particularly powerful for counterfactual reasoning, systematically answering “what would have happened if” scenarios.

    The Pink Ipê tree, or “Ipê Rosa,” is celebrated for its stunning clusters of pink, trumpet-shaped flowers that bloom in late winter to early spring. This tree, native to Brazil, is a striking symbol of renewal, with its vibrant blossoms often appearing before its leaves.

    Original image by @ota_cardoso