Knowledge Graphs @ EMNLP 2021
Your regular digest of KG research, November edition
I didn’t make it to Punta Cana this year 😢 but I’m happy (remotely) for the folks who managed to get there in spite of all the travel restrictions! Premium content inside 🏖
The autumn got very busy and I’d like to try a shorter format: each big topic has one “spotlight” 🛋 work in the citation block which I find particularly interesting, and several related works with somewhat shorter descriptions.
The plan for today:
KG-augmented Language Models: Categorization
Relational World Knowledge Representation in Contextual Language Models: A Review by Tara Safavi and Danai Koutra
If you are an experienced reader of such digests (or previous posts) then you know pretty well the abundance of KG-augmented LMs published at every conference and uploaded to arxiv weekly. If you feel lost 😨, I can assure you that you’re not the only one.
This year, we finally have a sound framework and taxonomy of various KG+LM approaches! The authors define 3 big families: 1️⃣ no KG supervision, probing knowledge encoded in LM params with cloze-style prompts; 2️⃣ KG supervision with entities and IDs; 3️⃣ KG supervision with relation templates and surface forms.
Each family has a few branches 🌳 For instance, let’s have a look at 4 entity-aware models illustrated below. Varying from “less symbolic” to “more symbolic”, some LMs perform mention-span masking, or contrastive learning, or fusion of entity embeddings from a known vocabulary. The authors did a great job classifying dozens of existing architectures according to the framework and it looks so much better organized now. Much needed work! 👏
A few short papers focus on enriching LMs with biomedical KGs, part of a long-running effort to teach LMs domain-specific biomedical jargon.
🔖 Meng et al propose Mixture-of-Partitions (MoP), an LM based on the AdapterFusion technique which alleviates the need to pre-train LMs from scratch. MoP was trained with common biomedical vocabularies and ontologies UMLS and SNOMED CT.
🔖 Sung et al ask “Can Language Models be Biomedical Knowledge Bases?” referring to the famous EMNLP’19 paper by Petroni et al. The answer is largely NO. The authors design BioLAMA, a benchmark for probing biomedical knowledge built from UMLS, CTD, and Wikidata. They find that modern LMs get <10% accuracy on those probes, so the community definitely needs something more reliable 🤔.
Conversational AI: Stop Hallucinating, Bro
Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding by Nouha Dziri, Andrea Madotto, Osmar Zaiane, Avishek Joey Bose
Generating responses with a ConvAI system with a KG background is tricky. In pipeline systems with many components, you rigorously use surface forms (entity names) and you mostly resort to templates, and templates are boring and hardly maintainable 🙄. On the other hand, e2e generative models like GPT-2 and GPT-3 produce far more unique replies but often hallucinate 🥴, that is, insert wrong entity names when you don’t expect it.
The authors of this work embarked on a hunt 🏹 to reduce hallucinations with KG supervision, proposing Neural Path Hunter (NPH). First, they study several kinds of hallucinations, where they come from (mostly from top-k sampling), and how to quantify them.
The NPH itself consists of two modules: 1️⃣ a critic (a non-autoregressive LM) that performs binary classification over tokens; 2️⃣ an entity retriever for fixing entity errors: this is essentially an entity memory where entity embeddings come from GPT and are updated with CompGCN using the graph structure. The most plausible candidates come from applying the DistMult scoring function. Voila! 🪄
NPH can be paired with any pre-trained LM. Experiments on the OpenDialKG benchmark with GPT2-KG, GPT2-KE, and AdapterBot demonstrate a significant reduction 📉 of hallucinations and an increase 📈 in faithfulness. A user study reports that human-measured hallucination is reduced ~2x in NPH models 👏
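To make the candidate-ranking step a bit more concrete, here is a minimal sketch of DistMult scoring over a handful of candidate entities. The 3-dimensional vectors and entity names are made up for illustration; in NPH the embeddings come from GPT and CompGCN, and how exactly the query is assembled is not covered here, so treat the `(head, relation)` query as an assumption.

```python
def distmult_score(head, relation, tail):
    """DistMult score: sum over dimensions of head_i * relation_i * tail_i."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def rank_candidates(head, relation, candidates):
    """Return (name, score) pairs sorted by DistMult score, best first."""
    scored = [
        (name, distmult_score(head, relation, vector))
        for name, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings, invented for this example only.
head = [1.0, 0.0, 1.0]
relation = [0.5, 1.0, 2.0]
candidates = {
    "entity_a": [1.0, 1.0, 0.0],
    "entity_b": [0.0, 1.0, 1.0],
    "entity_c": [1.0, 0.0, 1.0],
}

ranking = rank_candidates(head, relation, candidates)
best_name, best_score = ranking[0]
print(best_name, best_score)
```

The top-ranked candidate would then replace the entity that the critic flagged as hallucinated.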
🔖 Another relevant work in this context: Honovich et al study the same problem in dialogue systems but w/o a background KG and propose a new benchmark, Q², to measure factual consistency using Question generation and Question answering (that’s where both Q’s come from, if you ask).
🔖 If you’re into ConvAI and commonsense KGs, be sure to check out CLUE (Conversational Multi-Hop Reasoner) by Arabshahi, Lee, et al, which incorporates if-(state), then-(action), because-(goal) patterns, logical rules, and symbolic reasoning.
Entity Linking: In the Shadow of the Colossus
Robustness Evaluation of Entity Disambiguation Using Prior Probes: the Case of Entity Overshadowing by Vera Provatorova, Svitlana Vakulenko, Samarth Bhargav, Evangelos Kanoulas
When you plug in real-world KGs for language tasks you’ll inevitably encounter different entities that have exactly the same name 👨🦰👨🦰. Unfortunately, humanity does not use unique hashes for all entities in the world, so entity disambiguation remains an important step of Entity Linking.
For instance, Wikidata has at least 18 entities named “Michael Jordan”. Often, EL systems rely on basic stats and popularity scores, such that the most popular “Michael Jordan the basketball player” would overshadow less prominent (in pop culture, at least) folks.
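A toy illustration of why popularity priors cause overshadowing: if a linker simply picks the entity most often linked from a given surface form, the context never gets a say. The counts and the "other" entity labels below are invented for the example, and real EL systems combine the prior with other signals.

```python
from collections import Counter


def prior_disambiguate(mention, link_counts):
    """Pick the entity most frequently linked from this mention (context is ignored)."""
    counts = link_counts[mention]
    total = sum(counts.values())
    entity, count = counts.most_common(1)[0]
    return entity, count / total


# Hypothetical anchor-text statistics, not real data.
link_counts = {
    "Michael Jordan": Counter({
        "Michael Jordan (basketball player)": 950,
        "Michael Jordan (other entity #1)": 40,
        "Michael Jordan (other entity #2)": 10,
    })
}

entity, prior = prior_disambiguate("Michael Jordan", link_counts)
print(entity, prior)
```

Whatever the surrounding sentence says, this baseline returns the basketball player, which is exactly the failure mode the spotlight paper probes.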
💾 The authors tackle this problem and introduce a new dataset, ShadowLink, to measure the degree of confusion of modern EL systems. Turns out the highest F1 score barely reaches 0.35 (recent generative GENRE yields 0.26) on the hardest part. All systems saturate their scores on long-tail rare entities and cope with more common entities as well. The main challenge is formulated as “what makes the task challenging is the combination of ambiguity and uncommonness”. I’d recommend the authors upload the dataset to HuggingFace Datasets to increase the visibility of their cool project 😉.
🔖 Arora et al approach the entity linking problem from another direction. The main idea is that true named entities in a document (processed jointly, not one-by-one) span a low-rank subspace 🔮 in the space of all entities including candidates (check a visual example below). The Eigenthemes approach is unsupervised if you have pre-trained entity embeddings: the authors use DeepWalk over the English subset of Wikidata (alternatively, they try word embeddings, but it doesn’t work that well).
🔖 A conceptually similar problem of entity-based conflicts is studied by Longpre et al, namely, knowledge substitution: if you flip a true entity in a paragraph to a random (or contradicting) one, would the model change the answer? In other words, would QA models rely on reading the context or on memorized knowledge? 🤔 Turns out, when training QA models with such substitutions, you can increase the OOD generalization by a good margin!
🔖 Finally, have a look at the survey of Tedeschi et al on “NER for Entity Linking: What Works and What’s Next”. The authors identify key challenges of EL and try to address NER-relevant ones in NER4EL, aiming at reducing the performance gap between large pre-trained LMs and smaller models, which is especially relevant in low-resource scenarios 👏.
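Here is a rough sketch of what a knowledge-substitution example could look like in code. The QA example, the candidate names, and the helper `make_substituted_example` are all hypothetical; in particular, setting the gold answer to the substituted entity is my assumption about how such training data would be built.

```python
import random


def make_substituted_example(example, candidate_entities, rng):
    """Swap the answer entity in the context for a different one.

    Assumption: the new gold answer is the substituted entity, so a model
    is rewarded for reading the context rather than recalling memorized facts.
    """
    pool = [e for e in candidate_entities if e != example["answer"]]
    new_answer = rng.choice(pool)
    return {
        "question": example["question"],
        "context": example["context"].replace(example["answer"], new_answer),
        "answer": new_answer,
    }


# Made-up example, not taken from any dataset.
original = {
    "question": "Who wrote the report?",
    "context": "The report was written by Alice Smith in 2020.",
    "answer": "Alice Smith",
}
candidates = ["Alice Smith", "Bob Jones", "Carol White"]

substituted = make_substituted_example(original, candidates, random.Random(0))
print(substituted["context"])
```

If a model still answers with the original entity on the substituted context, it is leaning on memorized knowledge.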
I didn’t manage to come up with a catchy line here :/ If you are into OpenIE and KG Construction, the following papers might be relevant.
🔖 Dognin et al propose ReGen, an approach for fine-tuning LMs to perform both Text2Graph and Graph2Text tasks (or fine-tune specialized models). The key ingredient 🥦 is adding an RL loss (Self-Critical Sequence Training) in addition to standard cross-entropy (CE). It can be easily added to any pre-trained LM — the authors try it with T5-Large (770M params) and T5-base (220M params). 🧪Experimentally, ReGen significantly improves over Text2Graph WebNLG baselines (3–10 abs. points depending on the metric) and works on the much larger TekGen dataset (6M training pairs).
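For readers who haven’t met Self-Critical Sequence Training before, a minimal sketch of the loss for a single example: the reward of a sampled output is compared with the reward of the greedy output, and the difference scales the log-likelihood of the sample. The numbers are invented, and how ReGen weighs this term against cross-entropy is not shown because the summary above doesn’t specify it.

```python
def scst_loss(sample_token_logprobs, sample_reward, greedy_reward):
    """Self-critical loss: -(r_sample - r_greedy) * log p(sample).

    The greedy decode acts as the baseline, so no learned critic is needed.
    """
    advantage = sample_reward - greedy_reward
    sequence_logprob = sum(sample_token_logprobs)
    return -advantage * sequence_logprob


# Toy values: per-token log-probabilities of a sampled sequence and two rewards
# (the reward could be any sequence-level metric; these numbers are made up).
token_logprobs = [-0.5, -1.0, -0.5]

better_than_greedy = scst_loss(token_logprobs, sample_reward=0.8, greedy_reward=0.6)
worse_than_greedy = scst_loss(token_logprobs, sample_reward=0.4, greedy_reward=0.6)
print(better_than_greedy, worse_than_greedy)
```

Minimizing the loss raises the probability of samples that beat the greedy baseline and lowers it for samples that fall short.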
🔖 Dash et al study the canonicalization problem in OpenIE, where entities with different surface forms like (NYC, New York City) refer to the same prototype. In an unsupervised manner, we want IE systems to automatically cluster those mentions together. The method, CUVA, resorts to Variational Autoencoders (VAEs) to identify the clusters (entities and relations are parameterized by Gaussians). In addition to the standard VAE reconstruction loss, CUVA employs an additional link prediction loss based on the HolE scoring function. 🍯 Moreover, the authors introduce a novel CanonicNELL dataset!
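As a quick refresher on the HolE scoring function mentioned above: the head and tail embeddings are combined with circular correlation and the result is matched against the relation embedding. The sketch below uses toy 3-dimensional vectors and returns the raw score (a sigmoid is often applied on top); it is not CUVA’s actual implementation.

```python
def circular_correlation(a, b):
    """[a * b]_k = sum_i a_i * b_((i + k) mod d)."""
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]


def hole_score(head, relation, tail):
    """HolE score: dot product of the relation with corr(head, tail)."""
    correlated = circular_correlation(head, tail)
    return sum(r * c for r, c in zip(relation, correlated))


# Toy embeddings, invented for illustration.
head = [1.0, 0.0, 0.0]
tail = [0.0, 1.0, 0.0]
relation = [0.0, 2.0, 0.0]

print(circular_correlation(head, tail))
print(hole_score(head, relation, tail))
```

Unlike DistMult, circular correlation is not symmetric in head and tail, so the score can differ when the two are swapped.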
KG Question Answering: Add some ✨ SPARQL ✨
SPARQLing Database Queries from Intermediate Question Decompositions by Irina Saparina and Anton Osokin
There are not so many applications of SPARQL in the *CL domain, unfortunately. I think it deserves much wider adoption in NLP. When it’s supported by a cool application — I’m in 👀.
The majority of structured QA datasets or those employing semantic parsing target SQL as the main output format. Is there life beyond SQL pipelines? 🤨
Saparina and Osokin take a fresh look at that problem by 1️⃣ first using the Question Decomposition Meaning Representation (QDMR) framework that translates a question into a syntax-independent logical form; 2️⃣ this form can be translated to any structured format, and here the authors resort to SPARQL, showing it is a lot easier to query databases in the graph format. It does require transforming an input table to RDF, but for datasets of the Spider scale it can be done very easily.
The trainable modules include a RAT transformer encoder with an LSTM decoder which produces QDMR tokens. QDMR -> SPARQL is a straightforward transpilation based on a few rules.
✅ On-par-with-SOTA results;
✅ code is available;
✅ SPARQL works better than SQL;
what else do you need for a good paper? 😉
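On the table-to-RDF step mentioned above: a minimal sketch of how rows of a relational table could be flattened into subject-predicate-object triples. The table, the column names, and the identifier scheme (`table/key` for subjects, `table#column` for predicates) are hypothetical choices of mine, not the conversion used in the paper.

```python
def table_to_triples(table_name, rows, key):
    """Turn each non-key cell of a table into a (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        subject = f"{table_name}/{row[key]}"
        for column, value in row.items():
            if column != key:
                triples.append((subject, f"{table_name}#{column}", value))
    return triples


# Made-up table in the spirit of a small relational schema.
singer_rows = [
    {"id": 1, "name": "Ann", "country": "France"},
    {"id": 2, "name": "Bo", "country": "Peru"},
]

triples = table_to_triples("singer", singer_rows, key="id")
for triple in triples:
    print(triple)
```

Once the data is in this shape, a question about a column value becomes a graph pattern over predicates rather than a join over tables.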
✨ Another exciting work, “Case-Based Reasoning for Natural Language Queries over Knowledge Bases” by Das et al, combines SPARQL with case-based reasoning (CBR). CBR has deep roots in the expert systems of the ’80s but was recently revived with the power of representation learning. TLDR explanation of CBR in 2021: it is conceptually close to compositional generalization, i.e., having seen some basic examples you can construct a more complex query about previously unseen entities.
Have a look at the example below. We have an input query “Who is Gimli’s father’s sibling in the Hobbit?”. In the training data we might not have anything about Gimli or Hobbit, but we might have “relatively similar” cases over the relations we could find useful for our query, e.g., “Who is Charlie Sheen’s dad?” with Freebase relation people.person_parents and “Who are Rihanna’s siblings?” with relation people.person.sibling_s. Composing them for our question, we construct a SPARQL query to the database.
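To illustrate the composition step, here is a small sketch that chains relations borrowed from retrieved cases into a single SPARQL path query. The helper `compose_path_query` and the entity ID `m.placeholder` are hypothetical, the relation names are copied as written above, and the real CBR-KBQA model generates the query with a transformer rather than a template.

```python
FREEBASE_PREFIX = "PREFIX ns: <http://rdf.freebase.com/ns/>"


def compose_path_query(entity_id, relations):
    """Chain relations into a SPARQL query: entity -r1-> ?v0 -r2-> ... -> ?x."""
    lines = [FREEBASE_PREFIX, "SELECT DISTINCT ?x WHERE {"]
    subject = f"ns:{entity_id}"
    for index, relation in enumerate(relations):
        is_last = index == len(relations) - 1
        obj = "?x" if is_last else f"?v{index}"
        lines.append(f"  {subject} ns:{relation} {obj} .")
        subject = obj
    lines.append("}")
    return "\n".join(lines)


# Relations taken from the two retrieved cases; the entity ID is a placeholder.
query = compose_path_query(
    "m.placeholder",
    ["people.person_parents", "people.person.sibling_s"],
)
print(query)
```

The "father" hop comes from the Charlie Sheen case and the "sibling" hop from the Rihanna case, while the linked entity comes from the new question.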
The proposed CBR-KBQA approach combines 1️⃣ a trainable neural retriever in the DPR-style (supervision is based on overlapping relations), 2️⃣ a linear transformer (they use BigBird) as concatenated relevant questions and queries are quite long, 3️⃣ several re-ranking mechanisms to clean up the predictions. They use off-the-shelf NER and Entity Linking modules, and also employ pre-trained TransE relation embeddings for re-ranking. CBR-KBQA demonstrates impressive performance on several KBQA datasets including CFQ. A small note: I’m a bit suspicious that the best available SOTA model (67.3 MCD-Mean) is outperformed by such a margin (78.1) while the result has not been submitted to the benchmark and the code is not yet available 🤨.
🔖 Shi et al study multi-hop QA and propose to integrate both entity/relation IDs (label form) and their natural language descriptions (text form) into their message propagation framework TransferNet. Evaluation is done on standard MetaQA, WebQuestionsSP, and Complex Web Questions datasets.
🔖 In the same task (same datasets as in the previous work), Oliya et al noticed that most SOTA QA models require textual spans already linked to KG entities and try to circumvent this requirement with dynamic entity re-ranking using features of node neighborhood of KG entities and features of text spans.
That’s All Folks
Let me know if you like this shorter “premium” 🏖 format better than looong walls of text as in previous reviews! Thanks for investing your time here, hope you took home something useful 🙂