What’s new for KGs in NLP?
Knowledge Graphs in NLP @ EMNLP 2020
Your guide to the KG-related research in NLP, November edition.
Knowledge Graphs continue to drive NLP forward 🏎! You can’t miss an event as major as EMNLP 2020, so let’s dive in and see what’s new in our ocean 🌊.
I had the luck to attend EMNLP in its online form, which was well organized: Zoom Q&A sessions, poster sessions and socials in gather.town, paper-specific and general channels in Rocket.Chat. This year, the program consists of 754 main-conference papers, plus 520 Findings of EMNLP papers that are also accessible in the proceedings. For this review, I did not distinguish between the main and Findings papers and tried to select ~30 remarkable works that might establish new trends for the next 2–3 years 📈.
Today in our agenda:
- KG-Augmented Language Models: Empower your Transformer
- Natural Language Generation: New Folks in Datasetlandia
- Entity Linking: Massive and Multilingual
- Relation Extraction: OpenIE 6 and Neural Extractors
- KG Representation Learning: Temporal KGC and Successor to FB15K-237
- ConvAI + KGs: On the Shoulders of OpenDialKG
- Wrapping Up
KG-Augmented LMs: Empower your Transformer
We first noted a boom 🚀 in LMs augmented with structured knowledge last year in EMNLP 2019. Dozens of models enriched with entities from Wikipedia or Wikidata appeared in 2019 and 2020 (even here at EMNLP’20) but the conceptual problem is still there:
How do we measure “the knowledge” encoded in LM parameters?
First attempts like the LAMA benchmark framed the problem as matching single-token cloze-style blanks of facts extracted from Wikidata, e.g., “iPhone is designed by _” (Apple, of course). LMs showed some notion of factual knowledge but, frankly, not very much of it and not very deep. Still, LAMA is 1) single-token; 2) English-only. Can we cover more complex tasks and diverse environments? Yes! 🤩
In line with recent successes in multilingual benchmarks like XTREME, Jiang et al study whether multilingual models exhibit some factual knowledge and propose X-FACTR, a multilingual benchmark of cloze-style questions in 23 languages with multi-token blanks (up to 5–10 tokens, actually), to measure it. The authors probed M-BERT, XLM, and XLM-R against X-FACTR. The key findings leave A LOT of room for designing and training knowledgeable language models:
🤔 Multi-lingual models barely reach 15% accuracy in high-resource languages and about 5% in low-resource ones 📉
🤔 M-BERT seems to contain more factual knowledge than much bigger XLM and XLM-R. Surprise!
🤔 Multi-token prediction is much harder than single-token, and you need non-trivial decoding strategies for such entities
🤔 There is almost no agreement on fact validity in multiple languages, i.e., “Switzerland was named after _” (EN) and “Наименование Швейцарии восходит к _” (RU) produce totally different answers.
It would be quite thrilling to see the probing results of the recent mT5 (multilingual T5) and M2M-100 on X-FACTR 👀. Besides, leave your prediction in the comments on how soon the benchmark will get saturated 😉
Entity Representations in LMs
This time we have four 🍀 new methods! I put in bold their specific pre-training objectives.
Yamada et al propose LUKE (Language Understanding with Knowledge-based Embeddings), a transformer model with two pre-training tasks: MLM + predicting masked entities within a document (see the illustration 👇). Maintaining an entity embedding matrix (500K distinct entities), the authors add entity-aware self-attention which is, essentially, three more query matrices selected by the types of the attending and attended tokens (word-entity, entity-entity, entity-word). 🔐 A simple augmentation enables new downstream tasks and slightly improves over RoBERTa and recent KG-augmented baselines 👏
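To make the entity-aware self-attention idea more tangible, here is a minimal NumPy sketch. It is not the LUKE implementation: it assumes a single attention head, shared key and value projections, and one query matrix per (attending type, attended type) pair. All names (`entity_aware_attention`, `Q`, `is_entity`) are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entity_aware_attention(X, is_entity, Q, K, V):
    """X: (n, d) hidden states, is_entity: (n,) bools,
    Q: dict mapping (from_type, to_type) -> (d, d) query matrix (assumed layout)."""
    n, d = X.shape
    keys, values = X @ K, X @ V
    types = ["e" if flag else "w" for flag in is_entity]
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # The query projection depends on who attends (i) and who is attended to (j).
            q = X[i] @ Q[(types[i], types[j])]
            scores[i, j] = q @ keys[j] / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                      # 3 word tokens + 2 entity tokens
is_entity = np.array([False, False, False, True, True])
pairs = [("w", "w"), ("w", "e"), ("e", "w"), ("e", "e")]
Q = {pair: rng.normal(size=(d, d)) for pair in pairs}
K, V = rng.normal(size=(d, d)), rng.normal(size=(d, d))
out = entity_aware_attention(X, is_entity, Q, K, V)
```

The `("w", "w")` matrix plays the role of the usual query projection, and the other three are the extra ones. The double loop is for readability only; a real model would batch this.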
Next up, Févry et al introduce Entities as Experts (EaE), a 12-layer Transformer where the first 4 layers work as usual, then token embeddings of annotated mentions query top-100 entities from the entity memory, and then the summed up embedding is passed through 8 transformer layers. 🔑 Key differences: EaE has three pre-training tasks: mention detection + entity linking + vanilla MLM. Yes, EaE only needs annotated entity mentions and does the linking internally ⚙️. Using the BERT-base 110M initial setup, adding 1M x 256d entities yields overall about 367M params (comparable to BERT-Large). In the LAMA probe, EaE boosts performance in T-REx tasks while being competitive with the mighty T5–11B in TriviaQA and WebQuestions 💪
On the other ✋, Shen et al employ a background KG a bit differently: in their GLM (Graph-guided Masked Language Model) the graph supplies a vocabulary of named entities with their connectivity patterns (reachable entities in k-hops). This info is leveraged in the two pre-training tasks: masked entity prediction + entity ranking in the presence of distractors, i.e., negative samples. KG helps to mask informative entities 🍏 and select hard negative samples 🍎 🍎 for robust training. GLM was predominantly designed for commonsense KGs like ConceptNet and ATOMIC and commonsense-related tasks, although ontological KGs can be attached as well.
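A rough sketch of the entity memory lookup described above, under my own assumptions: a single vector represents the mention, it is projected into the entity space, the top-k entities are softmax-weighted and summed, and the result is projected back and added to the token representation. The projection matrices `W_in` and `W_out` and the function name are illustrative, not taken from the paper's code.

```python
import numpy as np

def entity_memory_lookup(h_mention, memory, W_in, W_out, k=100):
    """Return the updated mention vector and the indices of the top-k entities."""
    query = h_mention @ W_in                      # (d_ent,) query in entity space
    scores = memory @ query                       # (num_entities,) inner products
    top = np.argpartition(-scores, k - 1)[:k]     # indices of the k highest scores
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                      # softmax restricted to the top-k
    picked = weights @ memory[top]                # weighted sum of entity embeddings
    return h_mention + picked @ W_out, top

rng = np.random.default_rng(0)
d_model, d_ent, num_entities = 32, 16, 1000       # toy sizes, far below 1M x 256d
memory = rng.normal(size=(num_entities, d_ent))
W_in = rng.normal(size=(d_model, d_ent))
W_out = rng.normal(size=(d_ent, d_model))
h = rng.normal(size=(d_model,))
updated, top = entity_memory_lookup(h, memory, W_in, W_out, k=100)
```

Restricting the softmax to the top-k entries is what keeps the lookup cheap when the memory holds a million entities.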
Finally, Poerner et al make use of Wikipedia2Vec in their E-BERT. Here is the idea 👉 : vanilla BERT trains only wordpiece embeddings, while Wikipedia2Vec trains both words and entity embeddings (2.7M entities). So, we first learn W, a linear transformation between BERT wordpieces and Wikipedia2Vec words, and then use the fitted parameter W to project Wikipedia2Vec entities. Finally, an entity is concatenated with wordpieces, e.g.: “The native language of Jean Mara ##is is [MASK]” becomes “The native language of Jean_Marais / Jean Mara ##is is [MASK]”. Pre-training tasks: none, as further training is not required. Interestingly, E-BERT-small shows better results on the LAMA probe than E-BERT-large 🤔.
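The alignment step can be sketched with ordinary least squares. This is a toy illustration on synthetic vectors, assuming the map goes from the Wikipedia2Vec space into the BERT wordpiece space and is fitted on the vocabulary shared by both; dimensions and variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w2v, d_bert, n_shared = 10, 6, 200              # toy sizes (assumption)

# Synthetic stand-ins: vectors of words that exist in both vocabularies.
W_true = rng.normal(size=(d_w2v, d_bert))
w2v_words = rng.normal(size=(n_shared, d_w2v))
bert_wordpieces = w2v_words @ W_true + 0.01 * rng.normal(size=(n_shared, d_bert))

# Fit W by minimizing ||w2v_words @ W - bert_wordpieces||^2.
W, *_ = np.linalg.lstsq(w2v_words, bert_wordpieces, rcond=None)

# Project an entity vector (e.g., the one for Jean_Marais) into BERT's input space.
w2v_entity = rng.normal(size=(d_w2v,))
entity_in_bert_space = w2v_entity @ W
```

Since only `W` is fitted and BERT itself stays frozen, this matches the "no further training" remark above.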
Autoregressive KG-augmented LMs
In this subsection, the generation process of LMs is conditioned by or enriched with structured knowledge like small subgraphs!
Chen et al make a major contribution with KGPT (knowledge-grounded pre-training), a generative model for data-to-text tasks, and a huge novel dataset KGText! 1️⃣ The authors propose a generalized format of encoding various data-to-text tasks like WebNLG, E2E NLG, and WikiBio to be a uniform input for a language model. 2️⃣ KGPT is probed with two encoders: Graph Attention Net-based (looks a bit overcomplicated to me, just take a multi-relational CompGCN and you’re good to go 😎) and the BERT-style one with additional positional embedding-esque inputs (check the illustration 👇). Essentially, you linearize 📏 a graph into a sequence with pointers marking where the entities, relations, and full triples are. The decoder is a standard GPT-2-like one with a copy mechanism. 3️⃣ KGText is a new pre-training corpus where EN sentences from Wikipedia are aligned with subgraphs from Wikidata, overall about 1.8M (subgraph, text) pairs 💪. The authors made sure that each subgraph and its paired sentence describe pretty much the same facts. This is a substantial contribution indeed, as previous graph-to-text datasets are rather small and assume a supervised setting.
Here, KGPT shows quite impressive 👀 results in few-shot and zero-shot scenarios after pre-training on KGText leaving GPT-2 far behind. That is, just 5% of training data on WebNLG (RDF to text task) can already yield 40+ BLEU points in a few-shot setup and 20+ in completely zero-shot. My two cents:
🤔 KGPT still lacks explicit entities (each entity embedding is an average of its subword units), and there is no differentiation between entities and literals when encoding a given subgraph.
🤔 8 days on 8 Titan RTX GPUs to pre-train. Well, a bit better than 30 days on 2048 TPUs ;)
Ji et al take the opposite way and rather extend a decoder with the graph reasoning module keeping a GPT-2 encoder intact (see the 🖼 below) in their GRF (Generation with Multi-Hop Reasoning Flow). Working with commonsense-related tasks and KGs like ATOMIC and ConceptNet, the authors first extract a k-hop subgraph induced by 1-grams from the input text. The text is encoded via the GPT encoder while the KG subgraph is encoded via CompGCN (smart choice 😉). The reasoning module (essentially looks like message passing) propagates information through the subgraph and creates a softmax distribution over entities to select relevant ones. Finally, a copy gate decides whether to put that entity or select a word from a vocabulary.
👩‍🔬 Experiments on Story Ending Generation, Abductive NLG, and Explanation Generation demonstrate gains over various GPT-2 baselines in automatic metrics as well as human evaluation of generated texts.
Our weightlifting 🏋 champion for today is MEGATRON-CTRL (8.3B parameters) created by Xu et al from NVIDIA. By controlled generation, we understand conditioning the LM generator not just by the input context but also with some keywords that can drive a story in a certain direction.
Here, the authors employ ConceptNet and its 600K triples as a commonsense KG and external knowledge source.
First, the keywords are matched with triples, and the matched ones are passed through the Universal Sentence Encoder (USE). On the other hand, the input context is also passed through the USE. Finally, top-K max inner product vectors are selected ✏️. The retriever is trained with negative sampling.
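The ranking step boils down to a maximum inner product search between an encoded context and encoded triples. The sketch below shows only that step (no keyword matching, no trained retriever) and swaps the Universal Sentence Encoder for a toy normalized bag-of-words encoder so that it runs without any model download. The triples and helper names are invented for illustration.

```python
import numpy as np

def build_vocab(texts):
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def encode(text, vocab):
    """Toy stand-in for a sentence encoder: L2-normalized bag of words."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def top_k_triples(context, triples, k=2):
    texts = [" ".join(triple) for triple in triples]
    vocab = build_vocab(texts + [context])
    triple_matrix = np.stack([encode(text, vocab) for text in texts])
    scores = triple_matrix @ encode(context, vocab)   # inner products
    order = np.argsort(-scores)[:k]
    return [(triples[i], float(scores[i])) for i in order]

triples = [
    ("dog", "is a", "pet"),
    ("dog", "capable of", "bark"),
    ("car", "used for", "driving"),
    ("pet", "desires", "food"),
]
results = top_k_triples("the dog started to bark at the car", triples, k=2)
```

With a real sentence encoder the vectors would be dense, and at ConceptNet scale the exhaustive matrix product would typically be replaced by an approximate nearest-neighbour index.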
The decoder is a fantabulously large transformer (8.3B parameters), the keyword generator is just 2.5B params. Training takes only 160 Tesla V100s 😊. In the experiments, it is shown that such large models do indeed benefit from the background knowledge and tend to be preferred by humans in AMT experiments.
NLG (Data to text): New Folks in Datasetlandia
EMNLP’20 chairs explicitly stated that
datasets are not second-class citizens in the NLP research
and this year we see a lot of new, large, well-designed, and complex tasks / datasets that will be fueling NLG for the next few years at least ⛽️
Cheng et al introduce ENT-DESC, a triple-to-text dataset based on Wikidata (yes! 🤩) where, given a 2-hop subgraph around the main entity, the task is to generate its textual description. The dataset stands out from WebNLG in several ways: 1️⃣ ENT-DESC is much larger: 110K graph-text pairs, over 11M triples, about 700K distinct entities, and 1K distinct relations; 2️⃣ The triples per entity ratio is higher, but not all triples contribute to the generated text, i.e., some of them are distractors and models should be robust enough to dismiss them; 3️⃣ Expected descriptions are longer than those of WebNLG.
The proposed baseline model MGCN looks somewhat over-engineered to me: an input multi-relational graph is split into 6 single-relational graphs with their embeddings aggregated. Why not take a multi-relational GNN encoder like R-GCN or CompGCN? 🤔 Probed on WebNLG, MGCN yields about 46.5 BLEU points (yes, yes, I see you frowning upon BLEU 🤨, I also do) while KGPT from the previous section yields 65+ points. Still, MGCN presents a strong baseline on ENT-DESC, so I’d encourage everybody to flex their graph-to-sequence muscles on a new dataset! 💪
Next up, Chen et al propose a new dataset, Logic2Text, that challenges NLG systems with generating text from logical forms. It’s important to notice that it is not just a table-to-text task, but a more complex one with 7 logic types 🌈 including count, comparative, superlative, aggregation, majority, unique, and ordinal.
The dataset consists of about 5K tables and 10.7K pairs (logical form, text). The forms are complex enough, e.g., 9 nodes and 3 functions in each form on average. The authors comprehensively described the construction and annotation processes 👈 consider this when introducing your dataset.
Several generative baselines were tested, and fine-tuned GPT-2 performs best (what a surprise! 😉). Interestingly, the quality drops >30% when table captions are discarded. Moreover, a few-shot setup is possible too, so I’d hypothesize even larger transformers could perform a zero-shot transfer. Finally, you can always flip the task and use the dataset for training a semantic parser 😎.
In the table-to-text world, Parikh et al introduce ToTTo, a large dataset of 120K examples. The task is to generate a plausible text given a table and several highlighted 🖋 cells, i.e., it’s not as easy as row-to-text or column-to-text. Turns out that complex table structures (like the one illustrated below) with merged rows/columns, and non-trivial cell highlighting do make the task more difficult and make models hallucinate a lot. The dataset construction process is very well described 👏, and the authors employ PARENT and BLEURT 👏 metrics in addition to plain BLEU. Round of applause, everybody :)
As we started to talk about metrics, let’s throw one more stone into the BLEU garden! ☄️ Gekhman et al join our endeavour and propose KoBE (Knowledge-based Evaluation).
The idea is pretty simple: 1) let’s link ⛓ entity mentions to some multilingual KG; 2) measure the recall of found entities in candidates vs the source. Fin!
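Step 2 can be sketched in a few lines once the entity linker has done its job. This is my simplified reading, not the official KoBE scoring code: entities are compared as multisets of KG identifiers, and a source without any linked entity scores 1.0 by convention (an assumption). The identifiers below are placeholders.

```python
from collections import Counter

def entity_recall(source_entities, candidate_entities):
    """Share of source-side entity mentions that are also found in the candidate."""
    src = Counter(source_entities)
    cand = Counter(candidate_entities)
    if not src:
        return 1.0  # assumption: nothing to recall means a perfect score
    matched = sum(min(count, cand[entity]) for entity, count in src.items())
    return matched / sum(src.values())

# Placeholder KG ids as they might come out of an entity linker.
source_ids = ["E1", "E2", "E1"]       # entities linked in the source sentence
candidate_ids = ["E1", "E3"]          # entities linked in the MT output
score = entity_recall(source_ids, candidate_ids)
```

Because the identifiers are language-independent, no reference translation is needed: the source sentence itself serves as the reference.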
Multilinguality is a common attribute of large KGs, so why not leverage it in a metric? 😏
The authors employ Google Knowledge Graph Search API for entity linking. Probing KoBE on WMT19 tasks, they find that the metric correlates with human judgments better than BLEU! What other argument do you need to finally retire BLEU?
Finishing with the datasets, it is worth mentioning the work of Schmitt et al, who frame Visual Genome (VG) as a scene graph-to-text task and propose VGball, a subset of VG (still 200x larger than WebNLG, though). Yes, it is also possible to flip the task and train a model to extract triples and build KGs directly from images!
Entity Linking: Massive and Multilingual
In the Entity Linking world, Google and Facebook join the party and spin up 🥏 their TPUs and HPC clusters to solve massive multilingual entity linking!
Botha, Shan, and Gillick present a study of Entity Linking in 100 Languages. First, hats off 🎓 for moving away from EN-only scenarios: this is a wonderful effort of the NLP community 👏!
Large KGs like Wikidata are by design language-agnostic, so why don’t we leverage all non-EN data? (In fact, for some entities EN labels and descriptions might not even exist.) The authors first mine a HUGE dataset of 684M mentions of 20M Wikidata entities in 104 languages, and design Mewsli-9, a lightweight test-only dataset of 300K mentions of 82K entities in 9 languages, to evaluate the entity linking performance.
Model-wise, the authors resort to a dual encoder 🥂, where one transformer (typically, BERT) encodes mentions, and the second transformer encodes entity descriptions, computing a cosine similarity as the final operation. Initialized with mBERT checkpoints, the models are trained on TPU v3 for several days (TPUs go brrr). 🧪 Turns out the strategy is quite efficient: on Mewsli-9, the best model (powered with smart training enhancements) reaches micro-avg 90% Recall@1 and 98% Recall@10. Additionally, check out the illustration below 👇 for language-specific numbers on a heldout set.
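For readers new to dual encoders and Recall@k, here is a small sketch of the scoring and evaluation part. The two transformers are replaced by random vectors (an assumption made purely to keep the example runnable), so only the cosine scoring and the metric are illustrated.

```python
import numpy as np

def cosine_scores(mentions, entities):
    """Cosine similarity between every mention vector and every entity vector."""
    m = mentions / np.linalg.norm(mentions, axis=1, keepdims=True)
    e = entities / np.linalg.norm(entities, axis=1, keepdims=True)
    return m @ e.T

def recall_at_k(scores, gold, k):
    """Fraction of mentions whose gold entity is among the k best-scored ones."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.asarray(gold)[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
entities = rng.normal(size=(50, 16))              # stand-in for encoded descriptions
gold = [3, 17, 42, 8]
# Stand-in for encoded mentions: noisy copies of their gold entity vectors.
mentions = entities[gold] + 0.1 * rng.normal(size=(len(gold), 16))

scores = cosine_scores(mentions, entities)
r1 = recall_at_k(scores, gold, k=1)
r10 = recall_at_k(scores, gold, k=10)
```

Since the entity side does not depend on the mention, entity vectors can be encoded once and reused for every query.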
A bit differently, Wu et al consider EN Wikipedia as a background KB and entity vocabulary in their new BLINK entity linker tailored for zero-shot 👌setups. BLINK employs a bi-encoder 🥂 paradigm as well, but this time all entity descriptions are pre-computed and stored in the FAISS index (those are [CLS] embeddings). An entity mention in a context (needs to be annotated beforehand) is passed through another transformer, and the resulting mention embedding retrieves 🔎 top-K nearest neighbors in the index via FAISS. Finally, top-K options are ranked via the cross-encoder transformer (the 🖼 is quite informative).
Experimental evidence: 1) FAISS-based retriever is fast (~2ms/query) and accurate (Recall@10 > 90%), much better than TF-IDF and BM25; 2) In the zero-shot scenario, BLINK leaves all baselines far behind! 🏃‍♀️ 3) Inference is fast even on 1 CPU, so you can plug the model into your applications, too! 🎉
However, BLINK does require annotated entity mentions. This issue is resolved by a sibling paper by Li et al in their ELQ (Entity Linking for Questions). In fact, ELQ resides in the same repo as BLINK as they perfectly complement each other 🤗.
Although the architecture is similar to BLINK (bi-encoder 🥂+ FAISS), ELQ jointly learns mention detection and disambiguation. That is, no input annotations required!
Furthermore, ELQ excels in practical applications 🏅: ELQ outperforms TAGME and BLINK on QA datasets such as WebQSP and GraphQuestions, as well as increases accuracy on big QA datasets like Natural Questions and TriviaQA 👏.
The last (but not least) meal 🥗 on our Entity Linking plate is COMETA, a corpus for medical entity linking 🚑 by Basaldella, Liu et al. The target medical KG is SNOMED CT, which was around long, long before RDF, OWL, Description Logics, and even before some ontology engineers were born (yes, that old 🧙‍♂️).
COMETA consists of 20K carefully annotated entity mentions (extracted from Reddit) covering ~8K unique SNOMED CT general and specific concepts. The authors probed 20 EL baselines, both rule-based and BERT-based, and conclude that the medical EL task is still hard, especially in the zero-shot setup. A shout-out to transformers aficionados 😎 : let’s help the folks! (there is an unsaturated benchmark, jump on the bandwagon 🚄)
Relation Extraction: OpenIE 6 and Neural Extractors
OpenIE is a cornerstone framework of modern NLP applications that extract triples from text with an open schema (no background ontology). A sheer variety of *CL papers employ OpenIE one way or another.
At EMNLP 2020, Kolluru, Adlakha et al introduce OpenIE 6, the next major version of their IE approach 🤩. What’s new? First, OpenIE 6 frames triple extraction as a 2-D (num_words x num_extraction) grid labeling task, so that each word at each extraction can belong to subject/predicate/object/none labels. Still, the real 👹 is in the detail. The authors propose an Iterative Grid Labeling (IGL) system based on BERT which helps in the 2-D grid labeling task. Namely, it helps to resolve coordinated conjunctions (IGL-CA in the picture 👇), as well as applies soft constraints during the triple extraction process (CIGL-OIE). The soft constraints add more signal to the final loss function, coming from POS tags, head verb coverage & exclusivity, and extraction counts attached to head verbs.
👩‍🔬 Experiments show that OpenIE 6 is 10x faster than OpenIE 5 with a significant and consistent performance boost (around 4 F1 points) on several benchmarks. You can also trade those 4 points for even more speed and get OpenIE 5-level of performance but ~50x faster 🚀. I hope you already clicked the Github repo link? 😉
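The grid formulation is easy to picture with a toy example. The sketch below only decodes an already labeled grid into triples; it says nothing about how OpenIE 6 predicts the labels. I store one row per extraction and use the label names S/P/O/N, both of which are my own conventions.

```python
def decode_grid(words, grid):
    """Turn a labeled grid (one row per extraction, one column per word) into triples."""
    triples = []
    for row in grid:
        parts = {"S": [], "P": [], "O": []}
        for word, label in zip(words, row):
            if label in parts:            # "N" (none) words are skipped
                parts[label].append(word)
        triples.append(tuple(" ".join(parts[key]) for key in ("S", "P", "O")))
    return triples

words = ["Obama", "was", "born", "in", "Hawaii", "and", "studied", "law"]
grid = [
    ["S", "P", "P", "P", "O", "N", "N", "N"],   # extraction 1
    ["S", "N", "N", "N", "N", "N", "P", "O"],   # extraction 2
]
triples = decode_grid(words, grid)
# triples == [("Obama", "was born in", "Hawaii"), ("Obama", "studied", "law")]
```

Note how the same word ("Obama") takes part in both extractions, which is exactly what a flat one-label-per-word scheme cannot express.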
Further on the topic, Hohenecker, Mtumbuka et al conduct a systematic study of neural architectures for OpenIE.
The authors break down a typical neural OpenIE architecture into 3 essential blocks: embedding, encoding, and prediction. 🧪 Probing various combinations, the authors find that LM contextual embeddings + Transformer encoder + LSTM predictor yields massive (200%) improvements on the OpenIE16 benchmark. Even more, the authors show that a vanilla NLL loss might favor shallow predictions 📉 and should be properly adjusted depending on the object position in a sentence. Takeaway messages: although transformers improve the numbers, you need to design an appropriate training regime taking into account the essence of the IE task ☝️.
It took me a while to position the paper by Dognin, Melnyk, Padhi et al as it touches upon NLG, link prediction on KGs, and triple extraction simultaneously 👀. The authors propose DualTKB which aims at learning graphs from texts and texts from graphs in a cyclic manner 🌀. To achieve that, the model can generate both text (e.g., translations) and triples (one-hop paths) from a unified encoder. Specifically, the encoder takes as input either some text (option A) 🍏 or a linearized triple (option B) 🍊, and then the two decoders produce either text (A) 🍎 or another triple (B) 🍋. That is, you can have several routes like A-B 🍏-🍋 (extracting a triple from text), or B-B 🍊-🍋 (link prediction), and others. Repeating the procedure, you could iteratively extract more triples from the text or condition the model on back-translation (that is actually what the authors do for training). DualTKB shows promising results on commonsense datasets for both KG completion and text generation (although somehow GRU works better than BERT 🤨) and can be easily tried on WebNLG or other relation extraction datasets with parallel annotations. Besides, that GIF visualization is awesome 😍
Knowledge Graph Representation Learning: Temporal KGC and Successor to FB15K-237
This year at EMNLP 2020 we have about 20 (!) papers dedicated solely to KG representation learning 👀. Among them is our paper “Message Passing for Hyper-Relational Knowledge Graphs” which I will not discuss here as we published a standalone post here on Medium covering all the details, so I invite you to check it out as well 😊.
A considerable amount of work is put into Temporal KGs, i.e., those with timestamps indicating that a certain fact was valid within a certain timeframe, for instance, (Obama, president of, USA, 2009, 2017). The task is to predict either a subject or an object given the rest of the quad-/quintuple. Several notable works:
In this setting, Wu et al propose TeMP (Temporal Message Passing framework), where a structural GNN encoder (R-GCN is used, although any multi-relational one like CompGCN will do) is paired with a temporal encoder ⏰. The authors experiment with two temporal encoders: GRU and self-attention. That is, each of 𝛕 timesteps is encoded with a GNN, and the GNN outputs are fed into the temporal encoder. An additional gating mechanism takes into account frequencies of occurring entities within a certain timeframe (e.g., there are few mentions of Obama in 1900–1950, but many more in 2000–2020). The final entity embeddings are computed after the gating and are fed into a decoder (here it is ComplEx, although I’d presume any scoring function from the KG embeddings family would work). A similar R-GCN + RNN approach is used in RE-NET by Jin et al (but tackling the temporal component with a decoder differently). Our conclusion: multi-relational GNNs can have a sense of time!
** MATH ALERT 🤯 ** We know that hyperbolic embeddings enjoy smaller embedding dimensions (e.g., 32d or 64d) and yield competitive results. So far, such models have been explored in the classical static KG completion setup. Hyperbolic + time = ? 🤔 Are you into some differential geometry? 😉
Han et al employ some advanced math to model the temporal aspect of KGs in DyERNIE. The temporal interactions of an entity are modeled as movements on a manifold with a certain velocity. DyERNIE leverages a product of Riemannian manifolds for different curvatures and defines a new scoring function applied to a quadruple (s, p, o, t). 🧪 Experiments show that 20/40/100-dimensional models indeed outperform baselines, and learned velocities indeed capture temporal aspects ⌛️. However, you might find in the appendix that training a 100d model on a standard dataset might take up to 350 hours 😲. ** END OF MATH ALERT **
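As a reminder of what the ComplEx decoder computes, here is the standard ComplEx scoring function in NumPy. In TeMP the subject and object vectors would be the time-aware embeddings coming out of the encoders; in this sketch they are just hand-picked complex vectors.

```python
import numpy as np

def complex_score(s, r, o):
    """ComplEx score: real part of the trilinear product <s, r, conj(o)>."""
    return float(np.real(np.sum(s * r * np.conj(o))))

s = np.array([1 + 1j, 2 + 0j])    # subject embedding (toy values)
r = np.array([0 + 1j, 1 + 0j])    # relation embedding
o = np.array([1 + 0j, 1 - 1j])    # object embedding

forward = complex_score(s, r, o)   # score of (s, r, o) -> 1.0
backward = complex_score(o, r, s)  # score of (o, r, s) -> 3.0
```

The two scores differ, which illustrates why ComplEx can model asymmetric relations such as "president of".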
Finally, Jain, Rathi et al come up with a valuable methodological 📚 contribution: most of the Temporal KG completion tasks measure queries (s, r, ?, t) or (?, r, o, t) while predicting an actual time interval (s, r, o, ?) is still underexplored. Moreover, existing metrics 📏 for this task either under- or overestimate the system performance. The authors propose a new metric for time interval prediction: affinity enhanced Intersection over Union (aeIOU) inspired by gIOU often applied in Computer Vision.
That fancy union ⋓ symbol is the smallest hull (contiguous interval) containing both gold and predicted intervals. The authors demonstrate that aeIOU better captures the complexity of the task, and show its benefits with a new model, TimePlex, that adds time-specific inductive biases (e.g., that personBornYear should precede personDiedYear). Overall, the paper is well-structured and easy to follow, great work! 👏
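A small sketch of the metric, assuming it is defined as max(1, |gold ∩ predicted|) divided by the size of the hull |gold ⋓ predicted|, with intervals given as inclusive (start year, end year) pairs and sizes counted in years. Treat the exact form as my reading of the paper rather than reference code.

```python
def ae_iou(gold, pred):
    """aeIOU for inclusive integer intervals given as (start, end) tuples."""
    gold_start, gold_end = gold
    pred_start, pred_end = pred
    # Number of shared time points (0 if the intervals do not overlap).
    overlap = max(0, min(gold_end, pred_end) - max(gold_start, pred_start) + 1)
    # Smallest contiguous interval covering both gold and prediction.
    hull = max(gold_end, pred_end) - min(gold_start, pred_start) + 1
    return max(1, overlap) / hull

gold = (2009, 2017)
exact = ae_iou(gold, (2009, 2017))     # 9 / 9
partial = ae_iou(gold, (2013, 2020))   # 5 / 12
near_miss = ae_iou(gold, (2019, 2020)) # 1 / 12
far_miss = ae_iou(gold, (1990, 1991))  # 1 / 28
```

Unlike plain IOU, two non-overlapping predictions are not both scored zero: the one closer to the gold interval has a smaller hull and therefore a higher score.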
Returning to classical link prediction, Safavi and Koutra thoroughly study the drawbacks of FB15K-237 and other KGE benchmarks, concluding that their biases and design choices made 7+ years ago are not that suitable for the field in 2021.
Given ~50 new KG embedding papers a year 👀, models do tend to overfit to the dataset, so that it’s hard for models to demonstrate their expressive capabilities, simply because the benchmarking datasets do not benefit from such expressiveness. Not stonks 📉. Instead, the authors propose CODEX, KG completion datasets extracted from Wikidata (😍) and Wikipedia. What’s inside: 1️⃣ Small/Medium/Large subgraphs; 2️⃣ two tasks: link prediction and triple classification; 3️⃣ entity and type descriptions in 6 languages, none of which covers all entities entirely; 4️⃣ crowdsourced hard negatives; 5️⃣ removed test leakage sources and most of FB15K-237 biases. I’d be glad to see CODEX getting more traction in the community!
Continuing with biases, Fisher et al study how to mitigate biases in KGs learned with KG embedding models. For instance, in Wikidata, most people typed as bankers are male, but we do not want the gender to affect profession predictions for all people in Wikidata. A fast ‘dark side of the Force 👹’ solution might be to drop all the ‘bad’ triples, but then we’d identify that there are no female US presidents, so the quality of the model will be impaired.
👉 Instead, the authors propose another procedure (🖼 is very informative): essentially, create a mask of possibly biased relations and assign a KL loss to model predictions to push the probabilities to equilibrium. The experiments show that it’s indeed possible to reduce bias for some predicates and not sacrifice a ton of model’s predictive power ⚖️.
One more interesting study by Albooyeh, Goel, and Kazemi concentrates on the out-of-sample setup, i.e., when at test time a new unseen 🤷‍♀️ node arrives as a subject or an object. Some might call this setup inductive, but it’s not clear why the authors decided to go for out-of-sample 🤔. So far in the literature there are 2 types of tasks people call inductive: (1) a triple with an unseen entity is attached to the seen, trained graph (this paper); (2) the test set contains a whole new graph and we need to predict links in this unseen graph (this is a recent ICML’20 paper by Teru et al.). Still, in standard inductive tasks for GNNs, nodes often have features, but in this setup, the authors specifically outline that features are not available (and a simple node degree heuristic is not very helpful). How do we infer an embedding of the arrived unseen entity then? The authors propose to aggregate embeddings of the seen entities & relations and propose two strategies for that: 1️⃣ simple averaging in the 1-hop neighbourhood, and 2️⃣ solving the least squares problem (with our beloved inverse matrices 🤗 in O(n³) time). The authors also design subsets of WN18RR and FB15K-237 for this task and find that both aggregation strategies are able to cope with the task. The only thing missing for me is the training times for the least squares option 😃.
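The averaging strategy (1️⃣) can be sketched as follows. I assume a DistMult-style scorer here, so that each observed triple contributes the element-wise product of its relation and neighbour embeddings; the paper's exact aggregation may differ, and all names and numbers are illustrative.

```python
import numpy as np

def embed_unseen_entity(triples, entity_emb, relation_emb, new_entity):
    """Average relation * neighbour messages over the 1-hop neighbourhood."""
    messages = []
    for s, r, o in triples:
        # The neighbour is whichever end of the triple is NOT the new entity.
        neighbour = o if s == new_entity else s
        messages.append(relation_emb[r] * entity_emb[neighbour])
    return np.mean(messages, axis=0)

# Toy embeddings of seen entities and relations (already trained, by assumption).
entity_emb = {"e1": np.array([1.0, 0.0]), "e2": np.array([0.0, 2.0])}
relation_emb = {"r1": np.array([1.0, 1.0]), "r2": np.array([2.0, 1.0])}

# Triples that arrive together with the unseen entity "new".
triples = [("new", "r1", "e1"), ("e2", "r2", "new")]
new_vec = embed_unseen_entity(triples, entity_emb, relation_emb, "new")
# new_vec is the mean of [1, 0] and [0, 2], i.e. [0.5, 1.0]
```

No gradient step is involved, so the new entity gets its vector in a single pass over its neighbours.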
ConvAI + KGs: On the Shoulders of OpenDialKG
OpenDialKG was one of the spotlights of ACL 2019: a large-scale conversational dataset with a rich underlying KG and quite complex tasks 🔥. The baseline model left a lot of space for improvements, and, finally, here at EMNLP’20, we spot considerable progress in KG-based ConvAI systems influenced by or using OpenDialKG.
One of my conference favourites, a work by Jung et al applies the idea of attention flow for multi-hop traversal. Their approach, AttnIO, models incoming ➡️ and outgoing ⬅️ flows.
The ➡️ incoming flow is essentially a GNN-based neighbourhood aggregation (GAT with relation types) operating over a sampled subgraph. The dialog context (and entity names) is encoded via ALBERT.
The ⬅️ outgoing flow is conditioned by attention scores of outgoing edges. The decoder iterates for T steps (getting T-long paths, respectively).
🧪 Quantitatively, the experiments show a great performance boost over the original OpenDialKG baseline, especially in terms of top-1 and top-3 predictions 💪. Qualitatively, case studies demonstrate that AttnIO generates explainable reasoning paths understandable by human evaluators. Scaling the work to large KGs like Wikidata with 100M nodes and 1.1B edges might be an exciting endeavour, drop me a line if you plan to 😉.
Madotto et al take a different way to incorporate KBs and KGs: as we discussed in the first section of this article 👆, huge transformer LMs tend to exhibit some factual knowledge. So why don’t we put all the knowledge into LM params? The proposed model, KE (Knowledge Embedder), builds upon this very idea. Our goal is to generate all plausible combinations 🍎🍊🥝 of KG facts in a dialogue and condition any LM on this corpus. Here is the proposed strategy: (1) The contents of relational DBs or KGs are queried with SQL or Cypher. The queries are then transformed into dialogue templates (check the 🖼). (2) The templates are populated with the result set of queries. (3) We feed those templated dialogues into the LM hypothesizing it would memorize the KB facts in its parameters.
The authors attached KE to GPT2 and probed the model on a variety of ConvAI datasets (including OpenDialKG). 📊 Indeed, GPT2 benefits greatly from the KE module (yielding +20 F1 points on certain datasets) and is on par with explicit retrieval-based models. Some drawbacks 📉 : the original OpenDialKG graph is too big to generate all dialogue templates with the current strategy, so the numbers are far, far away from AttnIO (for example) but leave a lot of space for future improvements.
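Steps (1) and (2) can be mimicked with an in-memory SQLite table and a string template. The schema, the rows, and the template wording are all invented for this sketch, and step (3), fine-tuning the LM on the generated dialogues, is left out.

```python
import sqlite3

# (1) A toy relational KB, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, cuisine TEXT, area TEXT)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [("Luigi's", "italian", "north"), ("Sakura", "japanese", "centre")],
)
rows = conn.execute(
    "SELECT name, cuisine, area FROM restaurants ORDER BY name"
).fetchall()

# A hypothetical dialogue template with slots matching the query columns.
TEMPLATE = (
    "USER: I am looking for {cuisine} food in the {area}.\n"
    "SYSTEM: {name} serves {cuisine} food in the {area}."
)

# (2) Populate the template with every row of the result set.
dialogues = [
    TEMPLATE.format(name=name, cuisine=cuisine, area=area)
    for name, cuisine, area in rows
]
conn.close()
```

The number of generated dialogues grows with the size of the result sets, which hints at why a graph as large as OpenDialKG is hard to cover exhaustively.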
👏 I would also like to mention several papers that demonstrate the benefits of using KGs in your dialogue system: Yang et al in their GraphDialog focus on SMD and MultiWOZ datasets. Transforming originally tabular data into a KG and properly encoding the graph, they managed to greatly improve the entity retrieval F1 score! In the medical domain, Khosla et al develop MedFilter, a system for doctor-patient conversations. They plug in UMLS, a huge medical ontology, as a part of the utterance encoding (together with the discourse information). MedFilter better extracts and classifies symptoms, complaints, and medications. It’s great to see more practical applications of dialogue systems with knowledge graphs 👏!
Wrapping Up
This year at EMNLP’20 we welcome more complex benchmarks, thoroughly designed tasks, and probing methodologies. As models grow in size (and hopefully in expressiveness), and GPUs get more RAM, it’s important to invest the computation power wisely 🤔
KG-augmented language models are probably the future of LMs: once we run out of new text on the whole Internet, it’s time to inject more structured inductive biases.
Thanks for reading and stay tuned! I’ll go get myself a double ☕️ before looking into 1900 NeurIPS 2020 papers 😨