What’s new for KGs in NLP?

Knowledge Graphs in NLP @ EMNLP 2020

Your guide to the KG-related research in NLP, November edition.

Michael Galkin
22 min read · Nov 19, 2020


Knowledge Graphs continue to drive NLP forward 🏎! You can’t miss an event as major as EMNLP 2020, so let’s dive in and see what’s new in our ocean 🌊.

I had the luck to attend EMNLP in its online form, which was well organized: Zoom Q&A sessions, poster sessions and socials in gather.town, and paper-specific and general channels in Rocket.Chat. This year, the EMNLP program consists of 754 main conference papers, and another 520 Findings of EMNLP papers are accessible in the proceedings. For this review, I did not distinguish between the main and Findings papers and tried to select ~30 remarkable works that might establish new trends for the next 2–3 years 📈.

Today on our agenda:

  1. KG-Augmented Language Models: Empower your Transformer
    1.1 Autoencoders
    1.2 Autoregressive
  2. Natural Language Generation: New Folks in Datasetlandia
  3. Entity Linking: Massive and Multilingual
  4. Relation Extraction: OpenIE 6 and Neural Extractors
  5. KG Representation Learning: Temporal KGC and Successor to FB15K-237
  6. ConvAI + KGs: On the Shoulders of OpenDialKG
  7. Wrapping Up

KG-Augmented LMs: Empower your Transformer

We first noted a boom 🚀 in LMs augmented with structured knowledge last year at EMNLP 2019. Dozens of models enriched with entities from Wikipedia or Wikidata appeared in 2019 and 2020 (even here at EMNLP’20), but the conceptual problem is still there:

How do we measure “the knowledge” encoded in LM parameters?

First attempts like the LAMA benchmark framed the problem as matching single-token cloze-style blanks of facts extracted from Wikidata, e.g., “iPhone is designed by _” (Apple, of course). LMs showed some notion of factual knowledge but, frankly, neither much nor deep. Still, LAMA is 1) single-token; 2) English-only. Can we cover more complex tasks and diverse environments? Yes! 🤩

In line with recent successes in multilingual benchmarks like XTREME, Jiang et al study whether multilingual models exhibit some factual knowledge and propose X-FACTR, a multilingual benchmark of cloze-style questions in 23 languages with multi-token blanks (up to 5–10 tokens, actually), to measure it. The authors probed M-BERT, XLM, and XLM-R against X-FACTR. The key findings leave A LOT of room for designing and training knowledgeable language models:
🤔 Multilingual models barely reach 15% accuracy in high-resource languages and about 5% in low-resource ones 📉
🤔 M-BERT seems to contain more factual knowledge than the much bigger XLM and XLM-R. Surprise!
🤔 Multi-token prediction is much harder than single-token, and you need non-trivial decoding strategies for such entities
🤔 There is almost no agreement on fact validity across languages, e.g., “Switzerland was named after _” (EN) and “Наименование Швейцарии восходит к _” (RU) produce totally different answers.

It would be quite thrilling to see the probing results of the recent mT5 (multilingual T5) and M2M-100 on X-FACTR 👀. Besides, leave your prediction in the comments section below: how soon will the benchmark get saturated? 😉

Overall, multilingual LMs can recover barely 15% of facts in EN. The rest is below 10%. Source: Jiang et al

Entity Representations in LMs

This time we have four 🍀 new methods! For each, I call out its specific pre-training objectives.

Yamada et al propose LUKE (Language Understanding with Knowledge-based Embeddings), a transformer model with two pre-training tasks: MLM + predicting masked entities within a document (see the illustration 👇). Maintaining an entity embedding matrix (500K distinct entities), the authors add entity-aware self-attention which is, essentially, three more query matrices selected by the types of the interacting tokens (word-entity, entity-entity, entity-word). A simple augmentation enables new downstream tasks and slightly improves over RoBERTa and recent KG-augmented baselines.

Source: Yamada et al

Next up, Févry et al introduce Entities as Experts (EaE), a 12-layer Transformer where the first 4 layers work as usual, then token embeddings of annotated mentions query the top-100 entities from the entity memory, and then the summed-up embedding is passed through the remaining 8 transformer layers. 🔑 Key differences: EaE has three pre-training tasks: mention detection + entity linking + vanilla MLM. Yes, EaE only needs annotated entity mentions and does the linking internally ⚙️. Starting from the BERT-base 110M setup, adding 1M x 256d entities yields about 367M params overall (comparable to BERT-Large). In the LAMA probe, EaE boosts performance on T-REx tasks while being competitive with the mighty T5-11B on TriviaQA and WebQuestions 💪
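A quick back-of-the-envelope check of that parameter count, using only the numbers quoted above. The variable names are mine, and the small remaining gap to ~367M presumably comes from extra projection layers, which is my assumption rather than something stated here.

```python
# Rough parameter count for the EaE setup described in the text.
BERT_BASE_PARAMS = 110_000_000  # BERT-base figure quoted above
NUM_ENTITIES = 1_000_000        # size of the entity memory
ENTITY_DIM = 256                # dimension of each entity embedding

entity_memory_params = NUM_ENTITIES * ENTITY_DIM
total_params = BERT_BASE_PARAMS + entity_memory_params

print(f"entity memory: {entity_memory_params / 1e6:.0f}M params")
print(f"total (approx.): {total_params / 1e6:.0f}M params")
```

So the entity memory alone is more than twice the size of the underlying Transformer.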

On the other ✋, Shen et al employ a background KG a bit differently: in their GLM (Graph-guided Masked Language Model), the graph supplies a vocabulary of named entities with their connectivity patterns (entities reachable within k hops). This info is leveraged in the two pre-training tasks: masked entity prediction + entity ranking in the presence of distractors, i.e., negative samples. The KG helps to mask informative entities 🍏 and select hard negative samples 🍎 🍎 for robust training. GLM was predominantly designed for commonsense KGs like ConceptNet and ATOMIC and commonsense-related tasks, although ontological KGs can be attached as well.

Finally, Poerner et al make use of Wikipedia2Vec in their E-BERT. Here is the idea 👉: vanilla BERT trains only wordpiece embeddings, while Wikipedia2Vec trains both word and entity embeddings (2.7M entities). So, we first learn W, a linear transformation between BERT wordpieces and Wikipedia2Vec words, and then use the fitted W to project Wikipedia2Vec entities. Finally, an entity is concatenated with wordpieces, e.g., “The native language of Jean Mara ##is is [MASK]” becomes “The native language of Jean_Marais / Jean Mara ##is is [MASK]”. Pre-training tasks: none, as further training is not required. Interestingly, E-BERT-small shows better results on the LAMA probe than E-BERT-large 🤔.
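To make the "fit W, then project entities" step concrete, here is a minimal sketch with NumPy on random toy vectors. The sizes, the variable names, and the choice of ordinary least squares are my assumptions for illustration; I also assume the map goes from the Wikipedia2Vec space into the BERT space. It is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical sizes): 50 words shared by both vocabularies,
# Wikipedia2Vec dimension 6, BERT wordpiece dimension 4.
wiki2vec_words = rng.normal(size=(50, 6))
true_map = rng.normal(size=(6, 4))
bert_wordpieces = wiki2vec_words @ true_map  # noise-free targets for the demo

# Fit W so that wiki2vec_words @ W is as close as possible to bert_wordpieces.
W, *_ = np.linalg.lstsq(wiki2vec_words, bert_wordpieces, rcond=None)

# Entities only exist in the Wikipedia2Vec space; W moves them into BERT space.
wiki2vec_entities = rng.normal(size=(3, 6))
entities_in_bert_space = wiki2vec_entities @ W

print(W.shape, entities_in_bert_space.shape)
```

Since W is fitted once on the shared word vocabulary, BERT itself stays frozen, which matches the "no further training" point above.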

Autoregressive KG-augmented LMs

In this subsection, the generation process of LMs is conditioned on or enriched with structured knowledge like small subgraphs!

Chen et al make a major contribution with KGPT (knowledge-grounded pre-training), a generative model for data-to-text tasks, and a huge novel dataset, KGText! 1️⃣ The authors propose a generalized format for encoding various data-to-text tasks like WebNLG, E2E NLG, and WikiBio as a uniform input for a language model. 2️⃣ KGPT is probed with two encoders: a Graph Attention Net-based one (looks a bit overcomplicated to me, just take a multi-relational CompGCN and you’re good to go 😎) and a BERT-style one with additional positional embedding-esque inputs (check the illustration 👇). Essentially, you linearize a graph into a sequence with pointers to where the entities, relations, and full triples are. The decoder is a standard GPT-2-like one with a copy mechanism. 3️⃣ KGText is a new pre-training corpus where EN sentences from Wikipedia are aligned with subgraphs from Wikidata, overall about 1.8M (subgraph, text) pairs 💪. The authors made sure that each subgraph and its paired sentence describe pretty much the same facts. This is a substantial contribution indeed, as previous graph-to-text datasets are rather small and assume a supervised setting.
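The linearization idea can be sketched in a few lines. This is only my toy illustration of "a sequence plus pointers": the function name, the id scheme, and the role tags are hypothetical and not the exact KGPT input format.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def linearize(triples: List[Triple]):
    """Flatten triples into tokens plus parallel 'pointer' id lists."""
    tokens, entity_ids, triple_ids, roles = [], [], [], []
    entity_index = {}  # subject -> entity id, assigned in order of appearance
    for t_id, (subj, pred, obj) in enumerate(triples, start=1):
        e_id = entity_index.setdefault(subj, len(entity_index) + 1)
        for role, text in (("subj", subj), ("pred", pred), ("obj", obj)):
            for word in text.split():
                tokens.append(word)
                entity_ids.append(e_id)   # which entity this token describes
                triple_ids.append(t_id)   # which triple this token belongs to
                roles.append(role)        # position of the token in the triple
    return tokens, entity_ids, triple_ids, roles


tokens, entity_ids, triple_ids, roles = linearize(
    [("Moon", "orbits", "Earth"), ("Moon", "diameter", "3474 km")]
)
print(list(zip(tokens, entity_ids, triple_ids, roles)))
```

Each id list can then be embedded and added to the token embeddings, much like positional embeddings in a vanilla Transformer.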

Here, KGPT shows quite impressive 👀 results in few-shot and zero-shot scenarios after pre-training on KGText, leaving GPT-2 far behind. That is, just 5% of training data on WebNLG (an RDF-to-text task) can already yield 40+ BLEU points in a few-shot setup and 20+ in a completely zero-shot one. My two cents:
🤔 KGPT still lacks explicit entities (each entity embedding is an average of its subword units), and there is no differentiation between entities and literals when encoding a given subgraph.
🤔 8 days on 8 Titan RTX GPUs to pre-train. Well, a bit better than 30 days on 2048 TPUs ;)

KGPT encoder. Source: Chen et al

Ji et al take the opposite way and rather extend the decoder with a graph reasoning module, keeping the GPT-2 encoder intact (see the 🖼 below), in their GRF (Generation with Multi-Hop Reasoning Flow). Working with commonsense-related tasks and KGs like ATOMIC and ConceptNet, the authors first extract a k-hop subgraph induced by 1-grams from the input text. The text is encoded via the GPT-2 encoder while the KG subgraph is encoded via CompGCN (smart choice 😉). The reasoning module (essentially looks like message passing) propagates information through the subgraph and creates a softmax distribution over entities to select relevant ones. Finally, a copy gate decides whether to output that entity or select a word from the vocabulary.

👩‍🔬 Experiments on Story Ending Generation, Abductive NLG, and Explanation Generation demonstrate gains over various GPT-2 baselines in automatic metrics as well as in human evaluation of generated texts.

GRF intuition. Source: Ji et al

Our weightlifting 🏋 champion for today is MEGATRON-CNTRL (8.3B parameters) created by Xu et al from NVIDIA. By controlled generation, we mean conditioning the LM generator not just on the input context but also on some keywords that can drive a story in a certain direction.

Source: Xu et al

Here, the authors employ ConceptNet and its 600K triples as a commonsense KG and external knowledge source.

First, the keywords are matched with triples, and the matched ones are passed through the Universal Sentence Encoder (USE). On the other hand, the input context is also passed through the USE. Finally, the top-K vectors with the maximum inner product are selected. The retriever is trained with negative sampling.

The decoder is a fantabulously large transformer (8.3B parameters), while the keyword generator is just 2.5B params. Training takes only 160 Tesla V100s 😊. The experiments show that such large models do indeed benefit from the background knowledge and tend to be preferred by humans in AMT experiments.

NLG (Data to text): New Folks in Datasetlandia

EMNLP’20 chairs explicitly stated that

datasets are not second-class citizens in the NLP research

and this year we see a lot of new, large, well-designed, and complex tasks / datasets that will be fueling NLG for the next few years at least ⛽️

ENT-DESC task. Source: Cheng et al

Cheng et al introduce ENT-DESC, a triple-to-text dataset based on Wikidata (yes! 🤩) where, given a 2-hop subgraph around the main entity, the task is to generate its textual description. The dataset stands out from WebNLG in several ways: 1️⃣ ENT-DESC is much larger: 110K graph-text pairs, over 11M triples, about 700K distinct entities, and 1K distinct relations; 2️⃣ The triples per entity ratio is higher, but not all triples contribute to the generated text, i.e., some of them are distractors and models should be robust enough to dismiss them; 3️⃣ Expected descriptions are longer than those of WebNLG.

The proposed baseline model MGCN looks somewhat over-engineered to me: an input multi-relational graph is split into 6 single-relational graphs with their embeddings aggregated. Why not take a multi-relational GNN encoder like R-GCN or CompGCN? 🤔 Probed on WebNLG, MGCN yields about 46.5 BLEU points (yes, yes, I see you frowning upon BLEU 🤨, so do I) while KGPT from the previous section yields 65+ points. Still, MGCN presents a strong baseline on ENT-DESC, so I’d encourage everybody to flex their graph-to-sequence muscles on a new dataset! 💪

Logic2Text. Source: Chen et al

Next up, Chen et al propose a new dataset, Logic2Text, that challenges NLG systems with generating text from logical forms. It’s important to notice that it is not just a table-to-text task, but a more complex one with 7 logic types 🌈: count, comparative, superlative, aggregation, majority, unique, and ordinal.

The dataset consists of about 5K tables and 10.7K (logical form, text) pairs. The forms are complex enough, e.g., 9 nodes and 3 functions in each form on average. The authors comprehensively described the construction and annotation processes 👈 consider this when introducing your dataset.

Several generative baselines were tested, and fine-tuned GPT-2 performs best (what a surprise! 😉). Interestingly, the quality drops by >30% when table captions are discarded. Moreover, a few-shot setup is possible too, so I’d hypothesize even larger transformers could perform a zero-shot transfer. Finally, you can always flip the task and use the dataset for training a semantic parser 😎.

In the table-to-text world, Parikh et al introduce ToTTo, a large dataset of 120K examples. The task is to generate a plausible text given a table and several highlighted 🖋 cells, i.e., it’s not as easy as row-to-text or column-to-text. Turns out that complex table structures (like the one illustrated below) with merged rows/columns and non-trivial cell highlighting do make the task more difficult and make models hallucinate a lot. The dataset construction process is very well described, and the authors employ PARENT and BLEURT metrics in addition to plain BLEU. Round of applause, everybody :)

Source: Parikh et al

As we started to talk about metrics, let’s throw one more stone into the BLEU garden! ☄️ Gekhman et al join our endeavour and propose KoBE (Knowledge-based Evaluation).

Still using BLEU?

The idea is pretty simple: 1) let’s link ⛓ entity mentions to some multilingual KG; 2) measure the recall of found entities in candidates vs the source. Fin!
Multilinguality is a common attribute of large KGs, so why not leverage it as a metric?

The authors employ the Google Knowledge Graph Search API for entity linking. Probing KoBE on WMT19 tasks, they find that the metric correlates with human judgments better than BLEU! What other argument do you need to finally retire BLEU?
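The recall step is easy to sketch once the linking is done. Below, I assume the entity linker has already returned KG ids for the source sentence and for the candidate translation; the function name, the multiset matching, and the Wikidata-style ids are my illustrative choices, not necessarily what KoBE does in detail.

```python
from collections import Counter


def entity_recall(source_entities, candidate_entities):
    """Share of source-side entity mentions that also show up in the candidate."""
    src = Counter(source_entities)
    cand = Counter(candidate_entities)
    if not src:
        return 1.0  # nothing to recover (a convention chosen for this sketch)
    matched = sum(min(count, cand[entity]) for entity, count in src.items())
    return matched / sum(src.values())


# Hypothetical linker output: language-independent ids (Wikidata-style here).
source = ["Q64", "Q183", "Q183"]   # Berlin, Germany, Germany
candidate = ["Q64", "Q183"]        # the translation mentions Germany only once
print(round(entity_recall(source, candidate), 3))
```

Because the ids are language-independent, no reference translation is needed: the source sentence itself serves as the ground truth.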

Finishing with the datasets, it is worth mentioning the work of Schmitt et al, who frame Visual Genome (VG) as a scene graph-to-text task and propose VGball, a subset of VG (still 200x larger than WebNLG, though). Yes, it is also possible to flip the task and train a model to extract triples and build KGs directly from images!

Entity Linking: Massive and Multilingual

In the Entity Linking world, Google and Facebook join the party and spin up their TPUs and HPC clusters to solve massive multilingual entity linking!

Botha, Shan, and Gillick present a study of Entity Linking in 100 Languages. First, hats off 🎓 for moving away from EN-only scenarios: this is a wonderful effort of the NLP community!
Large KGs like Wikidata are by design language-agnostic, so why don’t we leverage all non-EN data? (In fact, for some entities EN labels and descriptions might not even exist.) The authors first mine a HUGE dataset of 684M mentions of 20M Wikidata entities in 104 languages, and design Mewsli-9, a lightweight test-only dataset of 300K mentions of 82K entities in 9 languages to evaluate the entity linking performance.

Model-wise, the authors resort to a dual encoder 🥂, where one transformer (typically, BERT) encodes mentions, and the second transformer encodes entity descriptions, with cosine similarity computed as the final operation. Initialized with mBERT checkpoints, the models are trained on TPU v3 for several days (TPUs go brrr). 🧪 Turns out the strategy is quite effective: on Mewsli-9, the best model (powered with smart training enhancements) reaches micro-avg 90% Recall@1 and 98% Recall@10. Additionally, check out the illustration below 👇 for language-specific numbers on a held-out set.

Entity Linking in 100 Languages. Source: Botha, Shan, and Gillick

A bit differently, Wu et al consider EN Wikipedia as a background KB and entity vocabulary in their new BLINK entity linker tailored for zero-shot 👌 setups. BLINK employs a bi-encoder 🥂 paradigm as well, but this time all entity descriptions are pre-computed and stored in the FAISS index (those are [CLS] embeddings). An entity mention in a context (needs to be annotated beforehand) is passed through another transformer, and the resulting mention embedding retrieves 🔎 top-K nearest neighbors in the index via FAISS. Finally, top-K options are ranked via the cross-encoder transformer (the 🖼 is quite informative).
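The retrieve-then-rerank pattern looks roughly like this in plain Python. The vectors and cross-encoder scores below are hand-made toy values, and a linear scan stands in for the FAISS index; all names are hypothetical and nothing here is BLINK's actual API.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(mention_vec, entity_index, k):
    """Stage 1 (bi-encoder): rank pre-computed entity vectors by inner product."""
    ranked = sorted(
        entity_index.items(),
        key=lambda item: dot(mention_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


def rerank(candidates, cross_score):
    """Stage 2 (cross-encoder): re-score only the few retrieved candidates."""
    return sorted(candidates, key=cross_score, reverse=True)


# Toy pre-computed entity embeddings (computed offline, once per entity).
entity_index = {
    "Paris (city)": [0.9, 0.1, 0.0],
    "Paris Hilton": [0.7, 0.6, 0.1],
    "Berlin": [0.1, 0.1, 0.9],
}
mention_vec = [1.0, 0.2, 0.0]  # embedding of the mention in its context

candidates = retrieve_top_k(mention_vec, entity_index, k=2)

# Toy stand-in for a cross-encoder that reads mention and entity text jointly.
toy_cross_scores = {"Paris (city)": 0.3, "Paris Hilton": 0.8}
final_ranking = rerank(candidates, toy_cross_scores.get)
print(candidates, final_ranking)
```

The expensive cross-encoder only ever sees K candidates, which is what keeps the whole pipeline fast.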

Experimental evidence: 1) The FAISS-based retriever is fast (~2ms/query) and accurate (Recall@10 > 90%), much better than TF-IDF and BM25; 2) In the zero-shot scenario, BLINK leaves all baselines far behind! 🏃‍♀️ 3) Inference is fast even on 1 CPU, so you can plug the model into your applications, too! 🎉

BLINK intuition. Source: Wu et al
ELQ. Source: Li et al

However, BLINK does require annotated entity mentions. This issue is resolved by a sibling paper by Li et al in their ELQ (Entity Linking for Questions). In fact, ELQ resides in the same repo as BLINK as they perfectly complement each other 🤗.

Although the architecture is similar to BLINK (bi-encoder 🥂 + FAISS), ELQ jointly learns mention detection and disambiguation. That is, no input annotations required!
Furthermore, ELQ excels in practical applications 🏅: ELQ outperforms TAGME and BLINK on QA datasets such as WebQSP and GraphQuestions, and it increases accuracy on big QA datasets like Natural Questions and TriviaQA.

COMETA example. Source: Basaldella, Liu et al

The last (but not least) meal 🥗 on our Entity Linking plate is COMETA, a corpus for medical entity linking 🚑 by Basaldella, Liu et al. The target medical KG is SNOMED CT, which was around long before RDF, OWL, Description Logics, and even before some ontology engineers were born (yes, that old 🧙‍♂️).

COMETA consists of 20K carefully annotated entity mentions (extracted from Reddit) covering ~8K unique SNOMED CT general and specific concepts. The authors probed 20 EL baselines, both rule-based and BERT-based, and conclude that the medical EL task is still hard, especially in the zero-shot setup. A shout-out to transformers aficionados 😎: let’s help the folks! (there is an unsaturated benchmark, jump on the bandwagon 🚄)

Relation Extraction: OpenIE 6 and Neural Extractors

OpenIE is a cornerstone framework of modern NLP applications that extract triples from text with an open schema (no background ontology). A sheer variety of *CL papers employ OpenIE one way or another.

At EMNLP 2020, Kolluru, Adlakha et al introduce OpenIE 6, the next major version of their IE approach 🤩. What’s new? First, OpenIE 6 frames triple extraction as a 2-D (num_words x num_extractions) grid labeling task, so that each word in each extraction gets one of the subject/predicate/object/none labels. Still, the real 👹 is in the detail. The authors propose an Iterative Grid Labeling (IGL) system based on BERT which helps in the 2-D grid labeling task. Namely, it helps to resolve coordinated conjunctions (IGL-CA in the picture 👇), and it also applies soft constraints during the triple extraction process (CIGL-OIE). The soft constraints add more signal to the final loss function from POS tags, head verb coverage & exclusivity, and extraction counts attached to head verbs.
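To see what a labeled grid looks like once it is filled in, here is a toy decoder that turns it back into triples. The single-letter label set, the helper name, and the example sentence are mine; for convenience the grid is stored extraction-by-extraction (i.e., transposed with respect to the num_words x num_extractions view), and the iterative labeling itself is not shown.

```python
def decode_grid(words, grid):
    """grid[i][j] labels word j in extraction i with 'S', 'P', 'O', or 'N' (none)."""
    triples = []
    for row in grid:
        parts = {"S": [], "P": [], "O": []}
        for word, label in zip(words, row):
            if label in parts:
                parts[label].append(word)
        if parts["S"] and parts["P"]:  # skip rows without a usable extraction
            triples.append(tuple(" ".join(parts[key]) for key in ("S", "P", "O")))
    return triples


words = "Alice founded Acme and leads the board".split()
grid = [
    ["S", "P", "O", "N", "N", "N", "N"],  # extraction 1
    ["S", "N", "N", "N", "P", "O", "O"],  # extraction 2 reuses the subject
]
print(decode_grid(words, grid))
```

Note how the same word ("Alice") can carry a label in several extractions, which is exactly what a single flat sequence-tagging pass cannot express.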

๐Ÿ‘ฉโ€๐Ÿ”ฌ Experiments show that OpenIE 6 is 10x faster than OpenIE 5 with a significant and consistent performance boost (around 4 F1 points) on several benchmarks. You can also trade those 4 points for even more speed and get OpenIE 5-level of performance but ~50x faster ๐Ÿš€. I hope you already clicked the Github repo link? ๐Ÿ˜‰

Further on the topic, Hohenecker, Mtumbuka et al conduct a systematic study of neural architectures for OpenIE.

Source: Hohenecker, Mtumbuka et al

The authors break down a typical neural OpenIE architecture into 3 essential blocks: embedding, encoding, and prediction. 🧪 Probing various combinations, the authors find that LM contextual embeddings + Transformer encoder + LSTM predictor yields massive (200%) improvements on the OpenIE16 benchmark. Even more, the authors show that a vanilla NLL loss might favor shallow predictions 📉 and should be properly adjusted depending on the object position in a sentence. Takeaway message: although transformers improve the numbers, you need to design an appropriate training regime that takes into account the essence of the IE task ☝️.

It took me a while to position the paper by Dognin, Melnyk, Padhi et al as it touches upon NLG, link prediction on KGs, and triple extraction simultaneously 👀. The authors propose DualTKB, which aims at learning graphs from texts and texts from graphs in a cyclic manner 🌀. To achieve that, the model can generate both text (e.g., translations) and triples (one-hop paths) from a unified encoder. Specifically, the encoder takes as input either some text (option A) 🍏 or a linearized triple (option B) 🍊, and then the two decoders produce either text (A) 🍎 or another triple (B) 🍋. That is, you can have several routes like A-B 🍏-🍋 (extracting a triple from text), or B-B 🍊-🍋 (link prediction), and others. Repeating the procedure, you could iteratively extract more triples from the text or condition the model on back-translation (that is actually what the authors do for training). DualTKB shows promising results on commonsense datasets for both KG completion and text generation (although somehow GRU works better than BERT 🤨) and can be easily tried on WebNLG or other relation extraction datasets with parallel annotations. Besides, that GIF visualization is awesome 😍

Animated DualTKB intuition! Source: Dognin, Melnyk, Padhi et al , GitHub repo

Knowledge Graph Representation Learning: Temporal KGC and Successor to FB15K-237

This year at EMNLP 2020 we have about 20 (!) papers dedicated solely to KG representation learning 👀. Among them is our paper “Message Passing for Hyper-Relational Knowledge Graphs”, which I will not discuss here as we published a standalone post on Medium covering all the details, so I invite you to check it out as well 😊.

A considerable amount of work is put into Temporal KGs, i.e., those with timestamps stating that a certain fact was valid within a certain timeframe. For instance, (Obama, president of, USA, 2009, 2017). The task is to predict either a subject or an object given the rest of the quad-/quintuple. Several notable works:

Source: Wu et al

In this setting, Wu et al propose TeMP (Temporal Message Passing framework), where a structural GNN encoder (R-GCN is used, although any multi-relational one like CompGCN will do) is paired with a temporal encoder ⏰. The authors experiment with two temporal encoders: GRU and self-attention. That is, each of the 𝛕 timesteps is encoded with a GNN, and the GNN outputs are fed into the temporal encoder. An additional gating mechanism takes into account frequencies of occurring entities within a certain timeframe (e.g., there are few mentions of Obama in 1900–1950, but many more in 2000–2020). The final entity embeddings are computed after the gating and are fed into a decoder (here it is ComplEx, although I’d presume any scoring function from the KG embeddings family would work). A similar approach, R-GCN + RNN, is used in RE-NET by Jin et al (but tackling the temporal component with a decoder differently). Our conclusion: multi-relational GNNs can have a sense of time!

** MATH ALERT 🤯 ** We know that hyperbolic embeddings enjoy smaller embedding dimensions (e.g., 32d or 64d) and yield competitive results. So far, such models have been explored in the classical static KG completion setup. Hyperbolic + time = ? 🤔 Are you into some differential geometry? 😉

DyERNIE intuition. Source: Han et al

Han et al employ some advanced math to model the temporal aspect of KGs in DyERNIE. The temporal interactions of an entity are modeled as movements on a manifold with a certain velocity. DyERNIE leverages a product of Riemannian manifolds with different curvatures and defines a new scoring function applied to a quadruple (s, p, o, t). 🧪 Experiments show that 20d/40d/100d models indeed outperform baselines, and the learned velocities indeed capture temporal aspects ⌛️. However, you might find in the appendix that training a 100d model on a standard dataset might take up to 350 hours 😲. ** END OF MATH ALERT **

Finally, Jain, Rathi et al come up with a valuable methodological 📚 contribution: most of the Temporal KG completion tasks measure queries (s, r, ?, t) or (?, r, o, t) while predicting an actual time interval (s, r, o, ?) is still underexplored. Moreover, existing metrics 📏 for this task either under- or overestimate the system performance. The authors propose a new metric for time interval prediction: affinity enhanced Intersection over Union (aeIOU), inspired by GIoU, which is often applied in Computer Vision.

Source: Jain, Rathi et al

That fancy union ⋓ symbol is the smallest hull (contiguous interval) containing both gold and predicted intervals. The authors demonstrate that aeIOU better captures the complexity of the task, and show its benefits with a new model, TimePlex, that adds time-specific inductive biases (e.g., that personBornYear should precede personDiedYear). Overall, the paper is well-structured and easy to follow, great work!
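Here is how I read the metric, as a small sketch. I assume aeIOU = max(1, vol(gold ∩ pred)) / vol(gold ⋓ pred), with vol counting the discrete time points (years) inside an interval; the function names are mine, and you should double-check the exact definition in the paper before relying on this.

```python
def vol(start, end):
    """Number of discrete time points (e.g., years) in [start, end]; 0 if empty."""
    return max(0, end - start + 1)


def ae_iou(gold, pred):
    """aeIOU under the assumed definition: max(1, overlap) / hull."""
    (g_start, g_end), (p_start, p_end) = gold, pred
    overlap = vol(max(g_start, p_start), min(g_end, p_end))
    hull = vol(min(g_start, p_start), max(g_end, p_end))  # the ⋓ interval
    return max(1, overlap) / hull


gold = (2009, 2017)
for pred in [(2009, 2017), (2013, 2020), (2000, 2002), (1990, 1992)]:
    print(pred, round(ae_iou(gold, pred), 3))
```

The last two predictions do not overlap with the gold interval at all, yet the closer one still scores higher: a plain IoU would give both a flat zero.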

Returning to the classical link prediction, Safavi and Koutra thoroughly study drawbacks of FB15K-237 and other KGE benchmarks, concluding that their biases and design choices taken 7+ years ago are not that suitable for the field in 2021.

Time to come up with more diverse and complex benchmarks. Meme image adjusted by me.

Given ~50 new KG embedding papers a year 👀, models do tend to overfit to the dataset, so that it’s hard for them to demonstrate their expressive capabilities, simply because the benchmarking datasets do not benefit from such expressiveness. Not stonks 📉. Instead, the authors propose CoDEx, KG completion datasets extracted from Wikidata (😍) and Wikipedia. What’s inside: 1️⃣ Small/Medium/Large subgraphs; 2️⃣ two tasks: link prediction and triple classification; 3️⃣ entity and type descriptions in 6 languages, none of which covers all entities entirely; 4️⃣ crowdsourced hard negatives; 5️⃣ removed test leakage sources and most of FB15K-237 biases. I’d be glad to see CoDEx getting more traction in the community!

Continuing with biases, Fisher et al study how to mitigate biases in KGs learned with KG embedding models. For instance, in Wikidata, most people typed as bankers are male, but we do not want the gender to affect profession predictions for all people in Wikidata. A fast ‘dark side of the Force 👹’ solution might be to drop all the ‘bad’ triples, but then we’d identify that there are no female US presidents, so the quality of the model will be impaired.
👉 Instead, the authors propose another procedure (the 🖼 is very informative): essentially, create a mask of possibly biased relations and assign a KL loss to model predictions to push the probabilities to equilibrium. The experiments show that it’s indeed possible to reduce bias for some predicates without sacrificing a ton of the model’s predictive power ⚖️.

Source: Fisher et al
The red node is an out-of-sample entity. Source: Albooyeh, Goel, and Kazemi

One more interesting study by Albooyeh, Goel, and Kazemi concentrates on the out-of-sample setup, i.e., when at test time a new unseen 🤷‍♀️ node arrives as a subject or an object. Some might call this setup inductive, and it’s not clear why the authors decided to go for out-of-sample 🤔. So far in the literature there are 2 types of tasks people call inductive: (1) a triple with an unseen entity is attached to the seen, trained graph (this paper); (2) the test set contains a whole new graph and we need to predict links in this unseen graph (this is a recent ICML’20 paper by Teru et al.). Still, in standard inductive tasks for GNNs, nodes often have features, but in this setup, the authors specifically outline that features are not available (and a simple node degree heuristic is not very helpful). How do we infer an embedding of the arrived unseen entity then? The authors propose to aggregate embeddings of the seen entities & relations and offer two strategies for that: 1️⃣ simple averaging in the 1-hop neighbourhood, and 2️⃣ solving the least squares problem (with our beloved inverse matrices 🤗 in O(n³) time). The authors also design subsets of WN18RR and FB15K-237 for this task and find that both aggregation strategies are able to cope with the task. The only thing missing for me is the training times for the least squares option 😃.
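The averaging strategy is about as simple as it sounds. The sketch below is a deliberately stripped-down version: it averages only the embeddings of seen neighbour entities and ignores relation embeddings, which the paper's aggregation also takes into account. The names and toy vectors are mine.

```python
def mean_embedding(neighbor_vectors):
    """Average the embeddings of the 1-hop neighbours of an unseen entity."""
    dim = len(neighbor_vectors[0])
    count = len(neighbor_vectors)
    return [sum(vec[i] for vec in neighbor_vectors) / count for i in range(dim)]


# Toy embeddings of entities seen during training (hypothetical values).
seen = {
    "Berlin": [1.0, 0.0],
    "Europe": [0.0, 1.0],
    "Rhine": [0.5, 0.5],
}

# At test time "Germany" arrives, connected to the three seen entities above.
germany = mean_embedding([seen[name] for name in ("Berlin", "Europe", "Rhine")])
print(germany)
```

No gradient step is involved, so a new entity gets its embedding essentially for free, unlike the least squares option.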

ConvAI + KGs: On the Shoulders of OpenDialKG

OpenDialKG was one of the spotlights of ACL 2019: a large-scale conversational dataset with a rich underlying KG and quite complex tasks 🔥. The baseline model left a lot of space for improvements, and, finally, here at EMNLP’20, we spot considerable progress in KG-based ConvAI systems influenced by or using OpenDialKG.

Source: Jung et al

One of my conference favourites, a work by Jung et al, applies the idea of attention flow for multi-hop traversal. Their approach, AttnIO, models incoming ➡️ and outgoing ⬅️ flows.
The ➡️ incoming flow is essentially a GNN-based neighbourhood aggregation (GAT with relation types) operating over a sampled subgraph. The dialog context (and entity names) is encoded via ALBERT.
The ⬅️ outgoing flow is conditioned on attention scores of outgoing edges. The decoder iterates for T steps (getting T-long paths, respectively).

🧪 Quantitatively, the experiments show a great performance boost over the original OpenDialKG baseline, especially in terms of top-1 and top-3 predictions 💪. Qualitatively, case studies demonstrate that AttnIO generates explainable reasoning paths understandable by human evaluators. Scaling the work to large KGs like Wikidata with 100M nodes and 1.1B edges might be an exciting endeavour, drop me a line if you plan to 😉.

Source: Madotto et al

Madotto et al take a different way to incorporate KBs and KGs: as we discussed in the first section of this article 👆, huge transformer LMs tend to exhibit some factual knowledge. So why don’t we put all the knowledge into LM params? The proposed model, KE (Knowledge Embedder), builds upon this very idea. The goal is to generate all plausible combinations 🍎🍊🥝 of KG facts in a dialogue and condition any LM on this corpus. Here is the proposed strategy: (1) The contents of relational DBs or KGs are queried with SQL or Cypher. The queries are then transformed into dialogue templates (check the 🖼). (2) The templates are populated with the result set of queries. (3) We feed those templated dialogues into the LM, hypothesizing it would memorize the KB facts in its parameters.
The authors attached KE to GPT-2 and probed the model on a variety of ConvAI datasets (including OpenDialKG). 📊 Indeed, GPT-2 benefits greatly from the KE module (yields +20 F1 points on certain datasets) and is on par with explicit retrieval-based models. Some drawbacks 📉: the original OpenDialKG graph is too big to generate all dialogue templates with the current strategy, so the numbers are far away from AttnIO (for example) but leave a lot of space for future improvements.
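Steps (1) and (2) of that strategy boil down to string templating over a result set. In this sketch the rows stand in for the output of a SQL or Cypher query, and the template wording and field names are made up by me for illustration.

```python
# Stand-in for the result set of a SQL/Cypher query over a restaurant KB.
rows = [
    {"name": "Blue Spice", "food": "Italian", "area": "centre"},
    {"name": "Golden Wok", "food": "Chinese", "area": "north"},
]

# Hypothetical dialogue template derived from the query.
USER_TEMPLATE = "Can you find me a place serving {food} food in the {area}?"
SYSTEM_TEMPLATE = "{name} serves {food} food in the {area}."


def build_dialogues(result_rows):
    """Populate the (user, system) template once per KB row."""
    return [
        (USER_TEMPLATE.format(**row), SYSTEM_TEMPLATE.format(**row))
        for row in result_rows
    ]


dialogues = build_dialogues(rows)
for user_turn, system_turn in dialogues:
    print("USER:  ", user_turn)
    print("SYSTEM:", system_turn)
```

The resulting synthetic dialogues are what the LM is then fine-tuned on in step (3), and their number grows with the size of the KB, which hints at why a graph as large as OpenDialKG is hard to cover.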

๐Ÿ‘ I would also like to mention several papers that demonstrate the benefits of using KGs in your dialogue system: Yang et al in their GraphDialog focus on SMD and MultiWOZ datasets. Transforming originally tabular data into a KG and properly encoding the graph, they managed to greatly improve the entity retrieval F1 score! In the medical domain, Khosla et al develop MedFilter, a system for doctor-patient conversations. They plug in UMLS, a huge medical ontology, as a part of the utterance encoding (together with the discourse information). MedFilter better extracts and classifies symptoms, complaints, and medications. Itโ€™s great to see more practical applications of dialogue systems with knowledge graphs ๐Ÿ‘!

Wrapping Up

This year at EMNLP’20 we welcome more complex benchmarks, thoroughly designed tasks, and probing methodologies. As models grow in size (and hopefully in expressiveness), and GPUs get more RAM, it’s important to invest the computation power wisely 🤔

1900 NeurIPS 2020 papers awaiting

KG-augmented language models are probably the future of LMs: once we run out of new text on the whole Internet, it’s time to inject more structured inductive biases.

Thanks for reading and stay tuned! I’ll go get myself a double ☕️ before looking into 1900 NeurIPS 2020 papers 😨



Michael Galkin

AI Research Scientist @ Intel Labs. Working on Graph ML, Geometric DL, and Knowledge Graphs