Let’s define the setup more formally:

- KGs are directed, multi-relational graphs with arbitrary sets of nodes and relation types
- Graphs arrive **without features**, that is, we don't assume the existence of textual descriptions (or pre-computed feature vectors) for entities and relations.
- Given a query (head, relation, ?), we want to rank all nodes in the underlying graph (inference graph) and maximize the probability of returning a true tail.
- *Transductive* setup: the sets of entities and relations are the same at training and inference time.
- *Inductive (entity)* setup: the set of relations is fixed at training time, but nodes can be different at inference.
- *Inductive (entity and relation)* setup: both new unseen entities and relations are allowed at inference.

What do neural networks learn to be able to generalize to new data? The primary reference — the book on Geometric Deep Learning by Bronstein, Bruna, Cohen, and Veličković — posits that it is a question of *symmetries and invariances*.

What are the learnable invariances in foundation models? LLMs are trained on a fixed vocabulary of tokens (sub-word units, bytes, or even randomly initialized vectors as in Lexinvariant LLMs), vision models learn functions to project image patches, audio models learn to project audio patches.

What are the learnable invariances for multi-relational graphs?

First, we will introduce the invariances (equivariances) in standard **homogeneous** graphs.

*Standard (single) permutation equivariant graph models:* A great leap in graph ML came when early GNN work (Scarselli et al. 2008, Xu et al. 2018, Morris et al. 2018) showed that inductive tasks on graphs benefit enormously from assuming that vertex IDs are arbitrary, such that the predictions of a graph model should not change if we reassigned vertex IDs. This is known as *permutation equivariance* of the neural network on node IDs. This realization created great excitement and a profusion of novel graph representation methods: as long as the neural network is equivariant to node ID permutations, we can call it a graph model.

The permutation equivariance on node IDs allows GNNs to inductively (zero-shot) transfer the patterns learned from a training graph to another (different) test graph. This is a consequence of the equivariance: since the neural network cannot use node IDs to produce embeddings, it must rely on the graph structure. This creates what we know as *structural representations* in graphs (see Srinivasan & Ribeiro (ICLR 2020)).

Now edges in the graphs might have different relation types — is there any GNN theory for such graphs?

1️⃣ In our previous work, Weisfeiler and Leman Go Relational (with Pablo Barceló, Christopher Morris, and Miguel Romero Orth, LoG 2022), we derived Relational WL — a WL expressiveness hierarchy for multi-relational graphs, focusing mostly on node-level tasks. The great follow-up work by Huang et al. (NeurIPS 2023) extended the theory to link prediction, formalized *conditional message passing*, and characterized logical expressiveness using Relational WL. ✍️ Let's remember **conditional message passing** — we'll need it later — it provably improves link prediction performance.
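To give a feel for what conditional message passing means in practice, here is a toy numpy sketch: node states are computed *conditioned on* the query head via an indicator initialization (the labeling-trick idea), rather than read off a global, query-independent embedding table. All names and the aggregation rule are illustrative, not taken from the papers above.

```python
import numpy as np

def conditional_states(adj, head, dim=4, steps=3, seed=0):
    """Toy conditional message passing: states relative to a query head."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # shared transform
    x = np.zeros((adj.shape[0], dim))
    x[head] = 1.0  # indicator of the query head — the "condition"
    for _ in range(steps):
        # aggregate neighbor states, transform, keep a residual connection
        x = np.tanh(x + adj @ x @ W)
    return x  # row i = representation of node i relative to `head`

# a tiny path graph 0 - 1 - 2
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
states_from_0 = conditional_states(adj, head=0)
```

The same node gets a different representation for every query head, which is exactly what query-independent node embeddings cannot express.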

The proposed addition of a global readout vector induced by incoming/outgoing edge direction resembles the recent work of Emanuele Rossi et al. on directionality in homogeneous MPNNs (read the blog post on Medium for more details). Still, those works do not envision the case where even relations at test time are unseen.

*2️⃣ Double permutation equivariant (multi-relational) graph models:* Recently, Gao et al. 2023 proposed the concept of **double equivariance** for multi-relational graphs. Double equivariance forces the neural network to be equivariant to the joint permutations of both node IDs and relation IDs. This ensures the neural network learns structural patterns between nodes and relations, which allows it to inductively (zero-shot) transfer the learned patterns to another graph with new nodes and new relations.

➡️ In our work, we find *the invariance of relation interactions*, that is, even if relation identities are different, their fundamental interactions remain the same, and those fundamental interactions can be captured by a **graph of relations**. In the graph of relations, each node is a relation type from the original graph. Two nodes in this graph will be connected if edges with those relation types in the original graph are incident (that is, they share a head or tail node). Depending on the incidence, we distinguish **4 edge types** in the graph of relations:

- *Head-to-head (h2h)* — two relations can start from the same head entity;
- *Tail-to-head (t2h)* — the tail entity of one relation can be the head of another relation;
- *Head-to-tail (h2t)* — the head entity of one relation can be the tail of another relation;
- *Tail-to-tail (t2t)* — two relations can have the same tail entity.

A few nice properties of the relation graph:

- It can be built from absolutely any multi-relational graph (with simple sparse matrix multiplications)
- The 4 fundamental interactions never change because they just encode the basic topology — in directed graphs there will always be head and tail nodes, and relations will always exhibit those incidence patterns
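The "simple sparse matrix multiplications" can be sketched as follows: build relation-by-node head and tail incidence matrices, and each of the four interaction types is one sparse product. This is a minimal illustration, not the actual ULTRA code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def relation_graph(triples, num_nodes, num_rels):
    """Build the 4 adjacencies of the graph of relations from (h, r, t) triples."""
    def incidence(pairs):  # pairs of (relation, node)
        rows, cols = zip(*pairs)
        data = np.ones(len(pairs))
        return csr_matrix((data, (rows, cols)), shape=(num_rels, num_nodes))

    H = incidence([(r, h) for h, r, t in triples])  # relation -> its head nodes
    T = incidence([(r, t) for h, r, t in triples])  # relation -> its tail nodes
    # Each product is a |R| x |R| matrix; a nonzero (i, j) entry means
    # relations i and j exhibit that incidence pattern somewhere in the graph.
    return {
        "h2h": (H @ H.T) > 0,  # share a head entity
        "t2h": (T @ H.T) > 0,  # tail of i is a head of j
        "h2t": (H @ T.T) > 0,  # head of i is a tail of j
        "t2t": (T @ T.T) > 0,  # share a tail entity
    }

# toy KG: relations 0, 1, 2 over nodes 0, 1, 2
triples = [(0, 0, 1), (1, 1, 2), (0, 2, 2)]  # (head, relation, tail)
adj = relation_graph(triples, num_nodes=3, num_rels=3)
```

Nothing here depends on relation *identities* — only on how edges touch each other — which is exactly why the construction works for any multi-relational graph.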

Essentially, representations learned over the graph of relations can transfer to any multi-relational graph! This is the learnable invariance.

In fact, it can be shown (we are already working on the formal proofs, which will be available in an upcoming work 😉) that representing relations via their interactions in a graph of relations yields a double equivariant model! This means that the learned relational representations do not depend on identities, but rather on the joint interactions between relations and nodes.

With all the theoretical foundations backing us up, we are now ready to introduce ULTRA.

ULTRA is a method for unified, learnable, and transferable graph representations. ULTRA leverages the invariances (and equivariances) of the **graph of relations** with its fundamental interactions and applies **conditional message passing** to get relative relational representations. Perhaps the coolest fact is that

a single pre-trained ULTRA model can run 0-shot inference on any possible multi-relational graph and be fine-tuned on any graph.

In other words, ULTRA is pretty much a foundation model that can run inference on any graph input (with already good performance) and be fine-tuned on any target graph of interest.

The crucial component of ULTRA is the *relative* relation representations constructed from the graph of relations. Given a query `(Michael Jackson, genre, ?)`, we first initialize the `genre` node in the graph of relations with the all-ones vector (all other nodes are initialized with zeros). After running a GNN, the resulting node embeddings of the relation graph are conditioned on the `genre` node — this means that each query relation gets its own matrix of relational features, which is very helpful from many theoretical and practical aspects!
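In code, the conditioning boils down to the choice of the initial feature matrix. Below is a deliberately simplified numpy sketch (plain mean-style propagation instead of ULTRA's actual GNN layers) showing that two different query relations yield two different feature matrices over the same relation graph:

```python
import numpy as np

def conditional_init(num_rels, dim, query_rel):
    """All-ones vector on the query relation, zeros everywhere else."""
    x = np.zeros((num_rels, dim))
    x[query_rel] = 1.0
    return x

def propagate(x, rel_adj, steps=2):
    """Simple residual mean aggregation over the relation graph."""
    deg = rel_adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(steps):
        x = x + (rel_adj @ x) / deg
    return x

# toy relation graph: relation 0 interacts with relations 1 and 2
rel_adj = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
feats_genre = propagate(conditional_init(3, dim=4, query_rel=0), rel_adj)
feats_other = propagate(conditional_init(3, dim=4, query_rel=1), rel_adj)
```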

Practically, given an input KG and a (h, r, ?) query, ULTRA executes the following actions:

- Construct the graph of relations;
- Obtain relation features from the conditional message passing GNN on the graph of relations (conditioned on the initialized query relation r);
- Use the obtained relational representations in the inductive link predictor GNN conditioned on the initialized head node h.

Steps 2 and 3 are implemented via slightly different modifications of the Neural Bellman-Ford net (NBFNet). ULTRA only learns embeddings of the 4 fundamental interactions (h2t, t2t, t2h, h2h) and GNN weights — pretty small overall. The main model we experimented with has only 177k parameters.
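To make the data flow of the three steps concrete, here is a tiny runnable toy. Everything is a heavily simplified stand-in for ULTRA's NBFNet-based components: the four interaction types are collapsed into one adjacency, and the aggregations are naive sums.

```python
import numpy as np

def toy_ultra(triples, num_nodes, num_rels, head, rel, dim=4, steps=2):
    # Step 1: relation graph — connect relations sharing any endpoint
    # (all 4 interaction types merged into one adjacency for brevity)
    R = np.zeros((num_rels, num_rels))
    for h1, r1, t1 in triples:
        for h2, r2, t2 in triples:
            if r1 != r2 and {h1, t1} & {h2, t2}:
                R[r1, r2] = 1.0
    # Step 2: relation features conditioned on the query relation
    rfeat = np.zeros((num_rels, dim))
    rfeat[rel] = 1.0
    for _ in range(steps):
        rfeat = rfeat + R @ rfeat
    # Step 3: entity-level pass conditioned on the query head,
    # with messages modulated by the relation features from step 2
    x = np.zeros((num_nodes, dim))
    x[head] = 1.0
    for _ in range(steps):
        new = x.copy()
        for h, r, t in triples:
            new[t] = new[t] + x[h] * rfeat[r]
        x = new
    return x.sum(axis=1)  # toy score per candidate tail node

triples = [(0, 0, 1), (1, 1, 2), (0, 2, 2)]  # (head, relation, tail)
scores = toy_ultra(triples, num_nodes=3, num_rels=3, head=0, rel=0)
```

Note that no per-entity or per-relation embedding table appears anywhere: the only learnable parameters in the real model are the interaction-type embeddings and GNN weights, which is why the parameter count stays tiny and graph-independent.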

We pre-trained ULTRA on 3 standard KGs based on Freebase, Wikidata, and WordNet, and ran 0-shot link prediction on 50+ other KGs of various sizes, from 1k to 120k nodes and from 2k to 1.1M edges.

Averaged across the datasets with known SOTA, a single pre-trained ULTRA model is **better in the 0-shot inference mode** than existing SOTA models trained specifically on each graph 🚀 Fine-tuning improves the performance by a further 10%. It's particularly amazing that a single trained ULTRA model can scale to graphs of such different sizes (a 100x difference in the number of nodes and 500x in the number of edges), whereas GNNs are known to suffer from size generalization issues (see the prominent works by Yehudai et al., ICML 2021 and Zhou et al., NeurIPS 2022).

🙃 In fact, with 57 tested graphs, we almost ran out of KGs to test ULTRA on. So if you have a fresh new benchmark hidden somewhere — let us know!

We can bump the zero-shot performance even more by adding more graphs to the pre-training mixture, although we do observe certain performance saturation after training on 4+ graphs.

The church of Scaling Laws predicts even better performance with bigger models trained on more high-quality data, so it's definitely on our agenda.

So foundation models for KG reasoning are finally here, we are past that 2018 threshold! A single pre-trained ULTRA model can perform link prediction on any KG (multi-relational graph) from any domain. You really just need a graph with more than 1 edge type to get going.

📈 Practically, ULTRA demonstrates very promising performance on a variety of KG benchmarks already in the 0-shot mode, but you can bump the performance even further with a short fine-tuning.

We make all the code, training data, and pre-trained model checkpoints available on GitHub so you can start running ULTRA on your data right away!

📜 preprint: arxiv

🛠️ Code, data: GitHub repo

🍪 Checkpoints: 2 checkpoints (2 MB each) in the GitHub repo

🌎 Project website: here

As a closing remark, KG reasoning just represents a fraction of the many interesting problems in the reasoning domain, and the majority still don’t have a generic solution. We believe the success of KG reasoning will bring more breakthroughs in other reasoning domains (for example, we recently found that LLMs can actually learn and employ textual rules). Let’s stay optimistic about the future of reasoning!