Knowledge Graphs & NLP @ EMNLP 2019 Part I

Michael Galkin
12 min read · Nov 6, 2019


Hey there! 👋 The review post of the papers from ACL 2019 on knowledge graphs (KGs) in NLP was well received, so I thought it might be beneficial for the community to look through the proceedings of EMNLP 2019 as well for the latest state of the art in applying knowledge graphs in NLP. Let’s start!

Source: https://github.com/roomylee/EMNLP-2019-Papers

EMNLP 2019 enjoys an ever-growing number of submissions and accepted papers. This year we’ve seen about 3,000 submissions, with 450+ accepted full papers and 200+ accepted short papers.

The keyword stats show that the community is getting more and more interested in integrating knowledge in its various forms (such as graphs, yeah!).

And it turns out KGs can be applied to a wide variety of tasks, including the top mentions in the distribution: NLG, question answering, dialogue systems, and so on.

I’d like to structure this post into several smaller sections so you can navigate straight to your favorite part 😉. Due to the overall size, the post comes in two parts.

Part I (👈 you are here)

  1. Augmented Language Models
  2. Dialogue Systems and Conversational AI
  3. Building Knowledge Graphs from Text (Open KGs)
  4. Knowledge Graph Embeddings
  5. Conclusions

Part II (already live)

  1. Question Answering over Knowledge Graphs
  2. Natural Language Generation from KGs
  3. Commonsense Reasoning with KGs
  4. Named Entity Recognition and Relation Linking

Augmented Language Models

Language models (LMs) are the hottest topic in NLP research right now. The most prominent examples are BERT and GPT-2, but new LMs trained on humongous volumes of text are published every month. So the question here is:

Are LMs capable of encoding knowledge in a way similar to knowledge graphs?

Source: Petroni et al

Petroni et al study this problem by comparing language models with knowledge graphs on question answering and NLG tasks where factual knowledge is required, e.g., a question is posed by inserting a [MASK] token in place of the answer. It turns out LMs demonstrate performance similar to KGs on very simple questions such as “Adolphe Adam died in [Paris]” 🎆. Even though more complex questions are not studied here, I’d expect pure LMs to struggle with them and fall far behind KG-based systems, especially when you need some multi-hop reasoning without a reference text paragraph at hand.
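Just to make the probing setup concrete, here is a minimal sketch of querying an off-the-shelf masked LM with a cloze-style fact, using the Hugging Face fill-mask pipeline. This is only an illustration of the idea, not the authors’ LAMA probe:

```python
from transformers import pipeline

# Query a vanilla masked LM with a cloze-style factual statement.
unmasker = pipeline("fill-mask", model="bert-base-cased")

for pred in unmasker("Adolphe Adam died in [MASK]."):
    # Each prediction carries the filled-in token and the model's confidence.
    print(f'{pred["token_str"]:>12}  {pred["score"]:.3f}')
```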

Peters et al propose KnowBERT, namely, BERT infused with structured knowledge. The authors introduce the Knowledge Attention and Recontextualization (KAR) component, which optimizes the selection of the correct entity for a given span based on a candidate list, thus producing knowledge-enriched embeddings. As seed KG embeddings, the authors employ the TuckER algorithm, actually published around the same time and also accepted at EMNLP 2019, that was fast! 🚀 KnowBERT is evaluated on multiple tasks: relation extraction, entity linking, and knowledge extraction in the form of replacing answers with [MASK] tokens. Predicting facts from Wikidata, KnowBERT (which uses BERT as the underlying LM) achieves 0.31 MRR, whereas BERT-Large hits only 0.11 MRR 👀. On other traditional tasks such as relation extraction and entity typing, the model performs on par with SOTA. The results are pretty cool actually, and clearly demonstrate the potential of incorporating large-scale KGs into LMs for a win-win situation 😊

The KAR component of KnowBERT. Source: Peters et al

🏥 Sharma et al apply KG embeddings (DistMult) to a subset of UMLS (a medical KG) and combine them with ELMo to tackle domain-specific medical NLI tasks. The fact that cutting-edge NLP technologies are applied to biomedical research is hands down commendable 👏, and the state of the art is moved forward! My concern here is about reliability and the cost of errors. The authors admit that the model still makes mistakes, e.g., the statements “She was speaking normally at that time” and “The patient has no known normal time where she was speaking normally” are classified as entailment. In such sensitive domains as healthcare, ML models have to be not just very accurate but also traceable and explainable, since we’d like to give doctors all the arguments behind a certain decision (one that can affect lives).
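For reference, DistMult scores a triple with a simple trilinear product of the subject, relation, and object embeddings. A minimal NumPy sketch with random toy vectors; the UMLS-style names in the comments are purely hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy stand-ins for pretrained UMLS embeddings (hypothetical concepts).
e_subj = rng.normal(size=dim)   # e.g., "aspirin"
w_rel  = rng.normal(size=dim)   # e.g., "may_treat"
e_obj  = rng.normal(size=dim)   # e.g., "headache"

# DistMult score: <e_s, w_r, e_o> = sum_i e_s[i] * w_r[i] * e_o[i]
score = float(np.sum(e_subj * w_rel * e_obj))
print(f"DistMult plausibility score: {score:.3f}")
```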

Dialogue Systems and Conversational AI

Transformers and seq2seq models have been demonstrated as effective methods for supporting a coherent chit-chat conversation 👉 (try the agent built by 🤗 for ConvAI 2) 👈. However, when you need to perform a certain action or achieve a specific goal (like booking a table in a restaurant), you still need to operate with slots and values. In recent years the community has increasingly come to appreciate organizing world knowledge into knowledge graphs (can’t agree more 😍). That is, most knowledge-grounded dialogue systems are now based either on relational databases or on graphs, and EMNLP 2019 adds a bunch of new datasets and frameworks!

Joey: am I in a KG now?

Tuan et al propose a new task, dynamic knowledge-grounded conversation generation, and two new datasets (DyKgChat) based on the Chinese TV drama Hou Gong Zhen Huan Zhuan (HGZHZ) and Friends (yes, exactly THAT TV show). The Friends dataset consists of more than 3K dialogues with an average of 18 turns per conversation; the KG has 281 distinct entities and 7 relations. Facts in the background KG might change over time, for instance, Ross was dating different girlfriends (Emily, Charlie, Carol, Janice, Elizabeth, Bonnie, Julie, did I miss somebody? name all of them in the comments section below 😉) in different seasons, so you need to generate a new response to the question “Who’s Ross dating now?” depending on the context and the changed KG triples, e.g., (Ross, lover, Emily) -> (Ross, lover, Rachel). The authors also propose new metrics and a baseline model, QAdapt. A great new resource 🔥
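The gist of the task, as I read it, is that the very same question should be answered differently once the underlying triples change. A toy sketch with a hypothetical lookup helper, nothing to do with the actual QAdapt model:

```python
# A tiny "dynamic" KG: responses must adapt when a triple is replaced.
kg = {("Ross", "lover"): "Emily"}

def answer(subject: str, relation: str) -> str:
    obj = kg.get((subject, relation), "nobody")
    return f"{subject} is dating {obj} right now."

print(answer("Ross", "lover"))       # -> "... dating Emily ..."
kg[("Ross", "lover")] = "Rachel"     # the KG changes between episodes
print(answer("Ross", "lover"))       # -> "... dating Rachel ..."
```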

Source: Chen et al

Chen et al leverage KGs in a movie recommender system, KBRD (Knowledge-Based Recommender Dialog). The dataset (ReDial) is based on DBpedia and covers more than 50K movies with their directors and genres; the dialog corpus comprises 10K conversations with 18 utterances per dialog on average. The proposed model incorporates KGs via Relational Graph Convolutional Network (R-GCN) embeddings, and Transformer encoders/decoders are conditioned via attention over entity candidates as well as previously mentioned entities. KBRD improves SOTA on both the recommendation and utterance generation tasks. My only concern is the shallow entity linking that relies on direct string matching; hence, one would need to maintain an index of all existing strings referring to entities in the graph.
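As a quick reminder of what an R-GCN layer actually computes over a KG, here is a minimal NumPy sketch of relation-specific message passing. The toy graph, relation names, and random weights are mine, not the KBRD implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
num_entities, num_relations, dim = 4, 2, 8

# Toy KG as (head, relation, tail) index triples,
# e.g. relation 0 = "directed_by", relation 1 = "has_genre" (hypothetical).
triples = [(0, 0, 1), (0, 1, 2), (3, 1, 2)]

H = rng.normal(size=(num_entities, dim))            # input entity features
W_rel = rng.normal(size=(num_relations, dim, dim))  # one weight matrix per relation
W_self = rng.normal(size=(dim, dim))                # self-loop transform

def rgcn_layer(H: np.ndarray) -> np.ndarray:
    out = H @ W_self                                # self-loop term
    for h, r, t in triples:
        # normalize by the number of incoming edges with the same relation
        c = sum(1 for h2, r2, t2 in triples if r2 == r and t2 == t)
        out[t] += (H[h] @ W_rel[r]) / c             # relation-specific message
    return np.maximum(out, 0.0)                     # ReLU

print(rgcn_layer(H).shape)  # (4, 8)
```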

Yu et al propose CoSQL, a large-scale cross-domain conversational text-to-SQL dataset. CoSQL covers more than 200 databases in 3K+ dialogues with a whopping ~4K slots and >1M values 👀 (compare that to MultiWOZ, which has only 25 slots and ~4K values).

The dataset implies that a system would need to tackle several challenges: intent/action classification (like clarification or fetching data from RDBs), dialogue state tracking to work with the dialogue history and context, constructing a query from natural language (text-to-SQL), and natural language generation (NLG) from the results of an SQL query. As you can see in the illustrating figure, queries can be quite complex.

It would be very cool to have a similar large-scale high-quality dataset based on graph databases with corresponding queries (say, SPARQL, Cypher or GraphQL) 😌
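For flavour, here is what such a graph-database query could look like: an illustrative SPARQL query against the public Wikidata endpoint (roughly, “German cities with more than a million inhabitants”), sent with plain requests:

```python
import requests

# Illustrative only: a SPARQL analogue of a CoSQL-style question, run against Wikidata.
query = """
SELECT ?city ?cityLabel ?population WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 ;   # instance of (a subclass of) city
        wdt:P17 wd:Q183 ;             # country: Germany
        wdt:P1082 ?population .       # population
  FILTER(?population > 1000000)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "kg-emnlp-demo/0.1"},  # be polite to the endpoint
)
for row in resp.json()["results"]["bindings"]:
    print(row["cityLabel"]["value"], row["population"]["value"])
```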

Actually, the whole stack of SQL-based datasets by Yale and Salesforce includes Spider (text to SQL queries), SParC (a sequence of text2SQL queries in context), and CoSQL (a sequence of text2SQL queries in context + response generation). So SQL and relational DB aficionados have a very nice playground now 😉. EMNLP 2019 offers a variety of papers that improve SOTA numbers on those tasks, so I’d encourage you to check out the respective websites and the new approaches in the leaderboards 🔍.

Building Knowledge Graphs from Text

Extracting facts from some raw text has always been a complicated task. Schema-rigid approaches try to align extracted entities and relations to a background schema (ontology), for instance, “Berlin is the capital of Germany” in Wikidata terms can look like (wd:Q64, wd:P1376, wd:Q183) whereas open extraction frameworks like OpenIE try to organize noun and verb phrases into triples (Berlin, capital of, Germany). Even though building an Open KG might sound a bit easier, applying those graphs might be more difficult due to challenges with reasoning, deduplication and canonicalization (to say that “US” is the same as “United States”).

Source: Fan et al

🔥🔥 Fan et al explain how they build so-called local knowledge graphs, that is, small graphs specific to the current set of documents/sentences, and scale them across numerous documents. Here’s a TL;DR of the workflow: 1️⃣ First, they use OpenIE and a coreference resolver to extract triples and build a small weighted graph where weights correspond to mentions. 2️⃣ Then, the graph is linearized in BFS manner as in the figure. 3️⃣ The linearized graph, word, and positional embeddings are fed together into a Transformer encoder with Memory Compressed Attention (MCA). 4️⃣ Processing the Transformer-encoded question, the model outputs answer sentences.
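To make step 2️⃣ a bit more concrete, here is a tiny sketch of BFS linearization over a local triple graph, my own simplified reading of the idea rather than the authors’ code:

```python
from collections import defaultdict, deque

# A tiny "local KG" built from OpenIE-style triples.
triples = [
    ("Berlin", "capital of", "Germany"),
    ("Germany", "member of", "European Union"),
    ("Berlin", "located on", "river Spree"),
]

adj = defaultdict(list)
for h, r, t in triples:
    adj[h].append((r, t))

def bfs_linearize(root: str) -> str:
    """Breadth-first walk emitting 'head relation tail' chunks as one flat sequence."""
    chunks, queue, seen = [], deque([root]), {root}
    while queue:
        node = queue.popleft()
        for rel, neigh in adj[node]:
            chunks.append(f"{node} {rel} {neigh}")
            if neigh not in seen:
                seen.add(neigh)
                queue.append(neigh)
    return " <s> ".join(chunks)  # "<s>" as a made-up separator token

print(bfs_linearize("Berlin"))
```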

The model was evaluated on the Explain Like I’m Five (ELI5) dataset (e.g., generating an explanatory paragraph for a question like “Can you explain the theory of relativity?”) and WikiSum (generating summaries of Wikipedia articles), both of which eventually aim at text summarization. Incorporating KGs shows great benefits! 😍 I find the approach presented in the paper very promising and extendable to other domains, even though there is room for improvement: the local graphs are relatively small, and applying a graph embedding technique (e.g., GNNs) looks like low-hanging fruit 🍊

When a new dataset is based on Wikidata instead of Freebase

🔥 Mesquita et al and folks from Diffbot came up with KnowledgeNet, a new dataset for extracting facts from natural language text, and here are two cool things: 1) they use Wikidata (😍) as the background graph; 2) they even partly consider qualifiers (😍x2), which you need when saying that, e.g., Barack Obama was the president of the US from 2009 to 2017! The dataset contains more than 9K annotated sentences taken from 5K documents, comprising about 13K facts expressed in 15 Wikidata relations. The final goal of the project/dataset is to reach 100K facts over 100 properties. In addition, the authors define five baseline approaches for you to flex your IR muscles 💪

Source: Gupta et al

Gupta et al develop a Canonicalization-infused Representations (CaRe) model for learning embeddings of open KGs. In a nutshell, canonicalization is intended to denote similar entities or relations with one symbol. For instance, in the illustrative figure “Barack” and “Barack Obama” are essentially the same entity and can be canonicalized to one symbol. Similarly, “took birth in” and “was born in” correspond to the same relation and can be denoted as one distinct relation.

The authors propose an entity canonicalization mechanism that can be used with any KG embedding technique like TransE or ConvE 👍, and the experiments show the infused models indeed perform better link prediction over open KGs compared to vanilla 🍦 models. It would be interesting to check whether the model can encode numbers, which is a common issue for KG embedding algorithms.
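To illustrate what canonicalization boils down to, here is a naive string-similarity baseline, nothing like CaRe’s learned canonicalization, just the intuition of grouping surface forms:

```python
from difflib import SequenceMatcher

mentions = ["Barack", "Barack Obama", "Obama", "was born in", "took birth in"]

def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy clustering: each mention joins the first cluster whose seed it resembles.
clusters = []
for m in mentions:
    for cluster in clusters:
        if similar(m, cluster[0]):
            cluster.append(m)
            break
    else:
        clusters.append([m])

print(clusters)
# [['Barack', 'Barack Obama', 'Obama'], ['was born in', 'took birth in']]
```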

Having seen methods to build and embed Open KGs, what about reasoning?

Source: Fu et al

Fu et al study the open knowledge graph reasoning problem and propose CPL (Collaborative Policy Learning), a reinforcement learning-based model for graph-based reasoning that consists of two agents: a graph reasoner and a fact extractor. Moreover, the authors publish two new datasets: FB60K-NYT10, where free-text news articles from the NY Times are annotated with Freebase entities (well, I’d go with Wikidata, as Freebase has been dead since 2014 and all its content is in Wikidata already), and UMLS-PubMed, where text articles from PubMed are annotated with UMLS facts. The experiments are interesting: in the more complex setup with FB60K, CPL is inferior to SOTA KG embedding models like RotatE, but better than other joint text+graph embedding models or RL agents like MINERVA. That is, on the one hand you have to deal with graph sparsity and a weak signal if you want an efficient RL agent; on the other hand you get an explainable result. Choose wisely 🤔

Knowledge Graph Embeddings

We’ve already mentioned KG embeddings a couple of times, so let’s see what EMNLP 2019 brings us in this domain.

Tucker decomposition strikes back! Traditionally, most KG embedding algorithms learn two matrices: one for entities E and one for relations R. In TuckER, proposed by Balažević et al, there are three components to learn:

TuckER KG factorization. Source: Balažević et al

Here, W is the core tensor, envisioned to encode some prototypical semantics of both entities and relations in a graph. The authors show that linear models like DistMult, ComplEx, and SimplE are special cases of the Tucker decomposition under certain constraints. And yes, TuckER performs very well compared to other models! 🔝 I’d be very interested in further analysis of what is actually happening inside that core tensor 😉
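In code, the TuckER scoring function is just a three-way tensor contraction. A minimal NumPy sketch for a single triple, with toy dimensions and random parameters rather than trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, d_r = 30, 10                       # entity / relation embedding dims

E = rng.normal(size=(1000, d_e))        # entity embedding matrix
R = rng.normal(size=(50, d_r))          # relation embedding matrix
W = rng.normal(size=(d_e, d_r, d_e))    # shared core tensor

def tucker_score(s: int, r: int, o: int) -> float:
    """phi(e_s, w_r, e_o) = W x_1 e_s x_2 w_r x_3 e_o, a single scalar."""
    return float(np.einsum("irj,i,r,j->", W, E[s], R[r], E[o]))

print(tucker_score(s=0, r=3, o=42))     # followed by a sigmoid during training
```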

There is a general observation though: benchmark results depend on small but important details like graph preprocessing or training routines. For instance, TuckER and some other models require manual materialisation of reciprocal triples, e.g., given (Berlin, capitalOf, Germany), one more triple is added: (Germany, capitalOf^(-1), Berlin). Other peculiarities concern negative sampling strategies, as in RotatE (another SOTA algorithm).

Balkir et al propose JoBi, a set of enhancements for training bilinear KG embeddings that considers pairwise co-occurrences of entities and relations. First, they add a new loss component for (e, r) pairs on top of the traditional (e, r, o) triples. Second, the pairwise distribution is used for biased negative sampling to generate more contrastive negatives, which is shown to yield better performance with smaller batch sizes and negative/positive ratios. Sometimes papers mention only briefly, in a single line (baad, baaad 👺, don’t do that), how many negatives per positive triple a model had to generate during training; it can reach hundreds of negative triples per positive one! It turns out that if you generate negatives wisely, this number can be reduced without losing performance. In short: use this augmentation if you want to squeeze out one or two more percentage points on the benchmarks 😉
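To spell out the preprocessing detail, here is what reciprocal-triple materialisation and plain uniform negative sampling typically look like. This is a generic sketch; JoBi’s biased sampler would replace the uniform choice with one driven by pairwise co-occurrence statistics:

```python
import random

triples = [("Berlin", "capitalOf", "Germany"),
           ("Paris", "capitalOf", "France")]
entities = sorted({e for h, _, t in triples for e in (h, t)})

# 1) Reciprocal materialisation: for every (h, r, t) also add (t, r^-1, h).
augmented = triples + [(t, r + "^-1", h) for h, r, t in triples]

# 2) Uniform negative sampling: corrupt the tail with a random *different* entity.
def negatives(triple, k=5):
    h, r, t = triple
    candidates = [e for e in entities if e != t]
    return [(h, r, random.choice(candidates)) for _ in range(k)]

print(augmented)
print(negatives(triples[0], k=3))
```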

🤔 Admittedly, it is hard nowadays to sell a KG embedding model only by showing “my MRR / H@10 value is higher than the previous SOTA”. You’d need to formally demonstrate that your approach is able to encode implicit semantics or maybe infer some logical rules. A prominent direction is to encode relation paths in knowledge graphs and show whether incorporating some sort of multi-hop reasoning improves numbers on other tasks.

Source: Zhu et al

For instance, Hayashi et al propose BlockHolE, a non-bilinear model based on block circulant matrices that can encode paths in a graph and answer path queries. Then, Zhu et al propose OPTransE, an extension of the well-known TransE that takes relation paths between the subject and the object into account. Interestingly, in the evaluation the authors report that paths longer than 2 steps do not increase link prediction performance but incur high computational costs. Is a 2-hop neighbourhood enough? 🤔
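As a reminder of the general idea behind translation-based path models (not OPTransE’s exact ordered-path mechanism), a 2-hop path is often scored by composing the relation vectors along the path, e.g., additively in TransE style:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 40

# Toy embeddings; TransE's modelling assumption is h + r ≈ t.
h  = rng.normal(size=dim)   # e.g., "Berlin"
r1 = rng.normal(size=dim)   # e.g., "locatedIn"
r2 = rng.normal(size=dim)   # e.g., "memberOf"
t  = rng.normal(size=dim)   # e.g., "European Union"

# The 2-hop path (r1, r2) is composed additively and scored like a single relation.
path_score = -np.linalg.norm(h + (r1 + r2) - t)
print(f"path plausibility: {path_score:.3f}")
```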

Chen et al, Lv et al, and Wang et al study meta-learning in KG embeddings, which proves effective in few-shot scenarios, say, when you have only a few examples of a particular relation or entity. Moreover, Wang et al develop a technique to enrich existing graphs with new triples extracted from the text descriptions/abstracts available in Wikidata and DBpedia. Note that these approaches can be paired with traditional KG embedding algorithms, hence, one more argument towards selling something bigger than just a new embedding model.

Conclusions

You made it to the end! Congratulations! Part II is already live 👍

Let me know in the comments whether it’d be better to split such a huge post into several smaller parts, or whether I forgot to mention something important. Of course, there were numerous generally cool papers not about KGs, but I gladly leave that space to other folks willing to share their conference experience 😉

This year’s EMNLP was significantly bigger than previous EMNLPs; in fact, 2019 alone has more papers than the 2016 and 2017 venues combined. It gets harder for the community to catch up with the relevant papers if you didn’t attend the conference, so I’m very grateful to the folks who compile their thoughts and notes from conferences in their domains and bring them to a wider audience 👏. Note that images were taken from their respective sources.
