Structure Learning

Structure learning in PGMs is a fundamental task that involves determining the optimal graphical structure \(\mathbb{G}\) that is well matched the observed data. This structure \(\mathbb{G}\), whether directed as in BNs or undirected as in MRFs, captures the dependencies among the variables and is crucial for both understanding the underlying probabilistic relationships. Meanwhile, structure learning also facilitates subsequent tasks such as inference, prediction, and decision-making. The learned structure serves as a foundation for building more complex models, particularly in domains such as genetics, epidemiology, and natural language processing, where understanding the dependencies between variables is critical.

The challenge in structure learning lies in the combinatorial nature of the problem. Given a set of \(n\) variables, the number of possible graph structures grows super-exponentially with \(n\), making exhaustive search infeasible for all but the smallest sets of variables.

There are two common approaches for structure learning from data: score-based approaches and constraint-based approaches.

Score-based Methods

The score-based approaches use a scoring function to measure the fitness of the structures (i.e., graphs) to the data and find the highest score out of all the possible graphs. Therefore, these methods make the number of possible graphs super-exponential to the number of dimensions (i.e., \(n\)) of the learning problems.

To find the best or the top-k best structure(s) of the PGM, we need some metrics that enable us to compare the qualities of different structures. The metrics are known as scoring functions, whose inputs are a possible structure and the training dataset, and then measure the ability of the PGM to represent the distribution of the data. Some commonly used scoring functions include Bayesian Dirichlet (BD), K2, BDe, BDeu, log-likelihood, Minimum Description Length (MDL), Bayesian Information Criterion (BIC), Akaike Information Criterion (AIC), and so on.

Constraint-based Methods

The constraint-based approaches perform a number of conditional independence (CI) tests to identify the independence relations among the random variables and use these relations as constraints to construct PGMs. This category of methods often runs in a polynomial time, and is commonly used in high-dimensional problems.

Here we introduce the important concept of CI tests (based on BNs). Consider some random variables \(V_i\), \(V_j\) and \(V_k\), a CI test assertion of the form \(I(V_i, V_j | \{V_k\})\) means \(V_i\) and \(V_j\) are independent given \(V_k\). Let \(\mathcal{D} = \{c_0, c_1, ..., c_{m-1}\}\) denote a data set of \(m\) complete samples, a CI test \(I(V_i, V_j | \{V_k\})\) determines whether the corresponding hypothesis \(I(V_i, V_j | \{V_k\})\) holds or not, based on statistics of \(\mathcal{D}\).

For discrete variables, the most common statistic for testing \(I(V_i, V_j | \{V_k\})\) is the \(G^2\) test statistic, which is defined as

\[G^2 = 2 \sum_{x, y, z} N_{xyz} log \frac{N_{xyz}}{E_{xyz}},\]

where \(N_{xyz}\) is the number of samples in \(\mathcal{D}\) that satisfy \(V_i = x\), \(V_j = y\) and \(V_k = z\). The value of \(N_{xyz}\) can be obtained from the contingency table that shows the frequencies for all configurations of values. \(G^2\) follows an asymptotic \(\chi^2\) distribution with \((|V_i|-1)(|V_j|-1)\), where \(|\cdot|\) denotes the number of possible values of the variable. The p value of \(\chi^2\) distribution can be calculated according to the \(G^2\) statistic and the final decision is made by comparing p value with the significance level \(\alpha\). If p value is greater than \(\alpha\), the independent hypothesis \(I(V_i, V_j | \{V_k\})\) is accepted; otherwise, the hypothesis is rejected. \(E_{xyz}\) is the expected frequency which is defined as

\[E_{xyz} = \frac{N_{x+z} N_{+yz}}{N_{++z}},\]

where \(N_{x+z} = \sum_{y} N_{xyz}$, $N_{+yz} = \sum_{x} N_{xyz}\), and \(N_{++z} = \sum_{xy} N_{xyz}\), which represent the marginal frequencies.

PC-stable

A fundamental constraint-based algorithm is the PC (named after its authors Peter and Clark) algorithm, and PC-stable solves the order-dependent issue in the original PC algorithm and produces less error. PC-stable is important since most constraint-based methods are improved versions of the PC-stable algorithm or proceed along similar lines of the PC-stable algorithm.