
So far, we have only been discussing machine learning and deep learning techniques on tabular data, language, and images. However, much of the data in the real world, such as chemical structures, transportation networks, and social networks, is inherently represented as graphs consisting of nodes or vertices and edges connecting them. (In fact, languages and images can be represented as linear graphs and 2D grid graphs, which are special types of graphs with constraints.) Hence, there is a tremendous incentive to research machine learning algorithms to perform various tasks on graphs. In this article, we will dive into the challenges and requirements of graph learning.
Note: The content of this article is largely inspired by the video Theoretical Foundations of Graph Neural Networks by Velickovic, P. (2021). I highly recommend checking it out.
Permutation Invariance & Equivariance
Though graphs are an interesting subject to study, there are difficulties in working with them. One of the difficulties comes from the lack of ordering of vertices. To understand this, let's imagine a graph with no edges and vertices labeled 1, 2, 3, and 4. Here, we can notice that we can express them in orders such as 1→2→3→4, 2→4→1→3, or 4→3→1→2 in a feature matrix $\mathbf{X}$ (one row per vertex), and there are $4! = 24$ permutations expressing the same set of vertices. We can conveniently express such a reordering using a permutation matrix $\mathbf{P}$ that only transforms the order of the vertices (the rows of $\mathbf{X}$). Each row and column of the permutation matrix contains exactly one 1 and 0s elsewhere.
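As a concrete illustration, here is a minimal NumPy sketch (the array names are my own, not from the talk) that builds a permutation matrix from an ordering and applies it to a node feature matrix:

```python
import numpy as np

# Feature matrix X: one row per vertex (4 vertices, 3 features each).
X = np.arange(12).reshape(4, 3)

# The ordering 2 -> 4 -> 1 -> 3 (1-indexed) as a permutation of row indices.
order = [1, 3, 0, 2]

# Permutation matrix P: exactly one 1 in each row and each column.
P = np.eye(4)[order]

# P @ X reorders the rows of X according to `order`.
print(P @ X)
print(np.array_equal(P @ X, X[order]))  # True
```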
For all permutation matrices $\mathbf{P}$, the model needs to produce the same output. One way of achieving that is to use a function $f$ that does not depend on the order, or that is permutation invariant, which can be expressed as follows:

$$f(\mathbf{P}\mathbf{X}) = f(\mathbf{X})$$
Examples of permutation-invariant functions are aggregator functions like sums, averages, max, and min, which produce the same aggregated output regardless of the order in which the vertices are presented. However, there are cases where we want node-level outputs rather than a single aggregated one, while still respecting the reordering of the vertices. Such functions are called permutation equivariant, which can be expressed as follows:

$$f(\mathbf{P}\mathbf{X}) = \mathbf{P}f(\mathbf{X})$$
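The following NumPy sketch (my own illustration, not from the talk) checks both properties on random data: a sum over vertices is permutation invariant, while a function applied to each row in isolation is permutation equivariant:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 vertices, 3 features each
P = np.eye(4)[[1, 3, 0, 2]]          # an arbitrary permutation matrix

f_inv = lambda X: X.sum(axis=0)      # aggregates over vertices -> invariant
f_equi = lambda X: np.tanh(X)        # applied to each row in isolation -> equivariant

print(np.allclose(f_inv(P @ X), f_inv(X)))        # True: f(PX) == f(X)
print(np.allclose(f_equi(P @ X), P @ f_equi(X)))  # True: f(PX) == P f(X)
```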
Permutation equivariant functions can be built by applying a shared function to every vertex in isolation. For processing graphs with no assumed order, the functions need to be either permutation invariant, permutation equivariant, or combinations of permutation invariant and equivariant operations. When combining the two, permutation equivariant operations need to be performed first, as permutation invariant operations are aggregators that discard the ordering, resulting in the following expression:

$$f(\mathbf{X}) = \phi\left(\bigoplus_{i \in V} \psi(\mathbf{x}_i)\right)$$
The function $\psi$ is an arbitrary function applied to individual vertices, $\bigoplus$ is the permutation-invariant aggregator, and $\phi$ is another arbitrary function applied to the aggregated output. The above can produce the same output regardless of the choice of the parameter functions and the ordering of the vertices in $\mathbf{X}$.
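Below is a minimal NumPy sketch of this composition (a Deep Sets-style model; the weight matrices and their shapes are illustrative assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))      # 4 vertices, 3 features each
W_psi = rng.normal(size=(3, 8))  # parameters of psi (per-vertex function)
W_phi = rng.normal(size=(8, 2))  # parameters of phi (applied after aggregation)

def f(X):
    psi = np.maximum(X @ W_psi, 0)   # psi applied to every vertex in isolation
    agg = psi.sum(axis=0)            # permutation-invariant aggregator (sum)
    return agg @ W_phi               # phi applied to the aggregated output

P = np.eye(4)[[2, 0, 3, 1]]
print(np.allclose(f(P @ X), f(X)))   # True: the output ignores vertex ordering
```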
Locality on Graphs
The graphs with no edges were just sets. Now that we understand permutation invariance and equivariance in sets, let’s move on to discussing graphs by introducing edges. The edges can be represented by the adjacency matrix $\mathbf{A}$, whose rows and columns represent the vertices. (If you are not sure about adjacency matrices, I recommend checking out Road to C++ Programmer #16 - Graphs.) This means that we need to account for the permutation of both rows and columns of $\mathbf{A}$, or $\mathbf{P}\mathbf{A}\mathbf{P}^\top$, and update the invariant and equivariant functions of graphs as follows.

$$f(\mathbf{P}\mathbf{X}, \mathbf{P}\mathbf{A}\mathbf{P}^\top) = f(\mathbf{X}, \mathbf{A})$$

$$f(\mathbf{P}\mathbf{X}, \mathbf{P}\mathbf{A}\mathbf{P}^\top) = \mathbf{P}f(\mathbf{X}, \mathbf{A})$$
The permutation invariant functions are still aggregators like sums. We can still set the permutation equivariant functions to be the same as for sets, which are functions applied to every vertex in isolation. However, we can achieve more expressive permutation equivariant operations using the added edges or 1-hop neighbors, since the 1-hop neighbors remain the same regardless of the order of vertices used to represent $\mathbf{A}$. To be more precise, we can define the neighborhood of a vertex $i$ as the vertices with connecting edges and collect their features as follows:

$$\mathcal{N}_i = \{ j : (i, j) \in E \lor (j, i) \in E \}, \qquad \mathbf{X}_{\mathcal{N}_i} = \{\!\{ \mathbf{x}_j : j \in \mathcal{N}_i \}\!\}$$
If there is a local function $g(\mathbf{x}_i, \mathbf{X}_{\mathcal{N}_i})$ that takes a vertex together with all of its neighbors, and we apply it to every row, we can define a permutation equivariant function $F$:

$$F(\mathbf{X}, \mathbf{A}) = \begin{bmatrix} g(\mathbf{x}_1, \mathbf{X}_{\mathcal{N}_1}) \\ \vdots \\ g(\mathbf{x}_n, \mathbf{X}_{\mathcal{N}_n}) \end{bmatrix}$$
Since the neighbors can appear in different orders within $\mathbf{X}_{\mathcal{N}_i}$, the local function $g$ needs to be permutation invariant for $F$ to be permutation equivariant. Hence, graph neural networks learn $g$, which uses the 1-hop neighbors and applies a permutation invariant function at every vertex to arrive at latent representations of a graph that preserve the ordering of the vertices. These representations can then be used in any downstream task.
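Here is a minimal NumPy sketch of such a local function (a permutation-invariant sum over 1-hop neighbors followed by shared linear maps; the weights and shapes are illustrative assumptions, not from the talk), together with a check that stacking it over all vertices is permutation equivariant:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 3
X = rng.normal(size=(n, d))                  # node features
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)                       # symmetric adjacency matrix
np.fill_diagonal(A, 0)

W_self, W_nbr = rng.normal(size=(d, 4)), rng.normal(size=(d, 4))

def F(X, A):
    # g(x_i, X_N_i): combine a vertex with a permutation-invariant sum of its neighbors,
    # applied to every vertex at once (A @ X sums each vertex's neighbor features).
    return X @ W_self + (A @ X) @ W_nbr

P = np.eye(n)[rng.permutation(n)]
print(np.allclose(F(P @ X, P @ A @ P.T), P @ F(X, A)))  # True: F is permutation equivariant
```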
Downstream Tasks
There are mainly three tasks involving graphs: node classification, graph classification, and edge prediction. Depending on the task, the latent representations are used differently.

Node classification extracts the latent representation of the node to classify and passes it to an arbitrary classifier function. This is only possible due to the permutation equivariance of the previous steps, which preserves the ordering of the vertices. Edge prediction similarly extracts the latent representations of the relevant pair of vertices, and possibly the edge representation, to perform predictions. Graph classification aggregates all latent representations with a permutation invariant operation so that it arrives at the same classification result regardless of the ordering.
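As an illustration (the classifier weights and names below are hypothetical, not from the talk), the three readouts over a latent node matrix $\mathbf{H}$ could look like this:

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.normal(size=(5, 4))            # latent node representations from the equivariant layers
W_node = rng.normal(size=(4, 3))       # hypothetical node classifier (3 classes)
W_edge = rng.normal(size=(8, 1))       # hypothetical edge scorer
W_graph = rng.normal(size=(4, 2))      # hypothetical graph classifier (2 classes)

node_logits = H[2] @ W_node                          # node classification: pick one node's latent
edge_score = np.concatenate([H[0], H[3]]) @ W_edge   # edge prediction: combine the two endpoints
graph_logits = H.sum(axis=0) @ W_graph               # graph classification: invariant aggregation first

print(node_logits.shape, edge_score.shape, graph_logits.shape)
```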
Conclusion
This article covered the important concepts of permutation invariance and equivariance, which are necessary to process graphs that do not assume an order, and the local function $g$ learned by graph neural networks. This function applies a permutation invariant operation to every vertex and its neighbors to produce latent representations of a graph that preserve the ordering of the vertices for any downstream task. In the next article, we will cover some models and the parameters they learn.
Resources
- Velickovic, P. (2021). Theoretical Foundations of Graph Neural Networks. YouTube.