Knowledge Graphs @ AAAI 2020

Feb 11, 2020 · 10 min read

The first major AI event of 2020 is already here! Hope you had a nice holiday break 🎄, or happy New Year if your scientific calendar starts with a conference (which means NY comes from NYC). AAAI 2020 brought us a new line-up of Knowledge Graph-related papers, in other words, AAA-class papers from AAAI 😉 Okay, enough feeble jokes, let’s get started!

This year AAAI accepted 1591 papers, of which about 140 are graph-related 👀. Additionally, there was a strong Workshop and Tutorial presence, including events at the intersection with NLP.

The slides and accepted papers of those events are already online so I encourage you to check them out, too. In this post, we’ll examine several 🔥 topics that leverage graphs and KGs in particular. Here is the outline:

KG-Augmented Language Models: in different flavours
Entity Matching in Heterogeneous KGs: finally no manual mappings
KG Completion and Link Prediction: neuro-symbolic and temporal KGs
KG-based Conversational AI and Question Answering: going big
Conclusion

KG-Augmented Language Models In Different Flavours

We first noted the trend of infusing structured knowledge into LMs in the review of EMNLP 2019, and 2020 is officially (declared by me 😊) the year of KG-Augmented LMs: more large-scale training corpora appear together with pre-trained models (and multilingual ones, too! 🌍).

💎 Hayashi et al define Latent Relation Language Models (LRLMs) for natural language generation (NLG) tasks conditioned on a knowledge graph. KGs contribute with relations, surface forms (synonyms) of entities and relations, and are infused into probabilistic distributions when generating tokens. That is, at each time step the model takes either a word from a vocabulary or resorts to a known relation. The eventual task is to generate a coherent and correct text given a topic entity. LRLMs leverage KG embeddings over the underlying graph to obtain entity and relation representations, as well as Fasttext for embedding surface forms. Finally, for parameterizing the process you’d need a sequence model. The authors try LSTM and Transformer-XL in order to evaluate LRLM on WikiFacts linked with Freebase and WikiText that employs Wikidata annotations. 🎆Experiments report that even large-scale Transformers benefit from conditioning on KGs, perplexity values do decrease, and the generated text can be much more coherent compared to other approaches. Great work that can be used for generating texts from a set of triples! 👍

Liu et al propose K-BERT, which expects each sentence to be annotated (where possible) with named entities and relevant (predicate, object) pairs from some KG. The enriched sentence tree 🌴 (as in the figure) is then linearized using a new position-like embedding and masked with a visibility matrix that controls which parts of the input can be seen and attended to during training. In fact, the authors explicitly mention that knowledge infusion happens only during the fine-tuning stage, whereas pre-training happens exactly as in vanilla BERT. The authors infuse open-domain and medical KGs and observe a consistent 1–2% boost across all evaluation tasks.
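To make the visibility matrix idea a bit more tangible, here is a minimal sketch of how an enriched sentence could be flattened. It rests on my own assumptions: the function name `build_kbert_inputs`, the toy sentence, and the exact visibility rule (trunk tokens see each other, injected tokens see only their own branch and the entity they hang off) are illustrative and not taken from the authors' code.

```python
def build_kbert_inputs(tokens, branches):
    """Flatten a sentence tree into tokens, soft positions and a visibility matrix.

    tokens:   list of sentence (trunk) tokens
    branches: dict {trunk index: [injected KG tokens]} (hypothetical input format)
    """
    flat, soft_pos, group = [], [], []
    # group[i] is None for trunk tokens, otherwise the flat index of the
    # entity token that the injected branch is attached to.
    for pos, tok in enumerate(tokens):
        flat.append(tok)
        soft_pos.append(pos)
        group.append(None)
        anchor = len(flat) - 1
        for offset, branch_tok in enumerate(branches.get(pos, []), start=1):
            flat.append(branch_tok)
            soft_pos.append(pos + offset)  # branch continues from its entity
            group.append(anchor)

    n = len(flat)
    visible = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if group[i] is None and group[j] is None:
                visible[i][j] = True                  # trunk sees trunk
            elif group[i] is not None and group[j] is not None:
                visible[i][j] = group[i] == group[j]  # same branch only
            else:
                branch, trunk = (i, j) if group[i] is not None else (j, i)
                visible[i][j] = group[branch] == trunk  # branch sees its entity
    return flat, soft_pos, visible


flat, soft_pos, visible = build_kbert_inputs(
    ["Cook", "visits", "Beijing"],
    {0: ["CEO", "Apple"], 2: ["capital", "China"]},
)
```

In this toy example "Apple" can attend to "Cook" but not to "visits" or "China", so the injected facts do not leak into unrelated parts of the sentence.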

In a similar fashion, Sun et al introduce ERNIE 2.0, an extended approach for incorporating external knowledge and capturing even more lexical, syntactic and semantic information compared to the original ERNIE 1.0. Scarlini et al employ BERT together with the semantic networks BabelNet and NASARI in their model SensEmBERT for word sense disambiguation and for representing senses in multiple languages. The authors explicitly point out that SensEmBERT better supports rare words and outperforms tailored supervised approaches in WSD tasks. The model is openly available. Bouraoui et al further evaluate the relational knowledge of BERT, namely, whether it predicts a correct relation given a pair of entities like (Paris, France). The authors find that BERT is generally good at factual and commonsense tasks, not bad at lexical tasks, and pretty meh at morphological tasks. This is actually a well-presented motivation to augment LMs with KGs! 👏

Entity Matching in Heterogeneous KGs

Different KGs have their own schemas for modelling their entities, i.e., different sets of properties that might only partially overlap, or totally different URIs. For instance, the city of Berlin in Wikidata has the URI https://www.wikidata.org/entity/Q64 while in DBpedia it is http://dbpedia.org/resource/Berlin . If you have a KG comprised of such heterogeneous URIs (that in fact describe one real-world Berlin) you either treat those entities as independent or need to write/find custom mappings that explicitly pair those URIs as the same (e.g., with the owl:sameAs predicate often used in open-domain KGs). Maintaining mappings for large-scale evolving graphs is quite a cumbersome task 🤯 . Previously, ontology-based alignment tools relied only on such mappings to identify similar entities. Today, we have GNNs to learn such mappings automatically with just a small training set!
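For a feeling of what "explicit mappings" mean in practice, here is a small sketch that collapses URIs linked by owl:sameAs-style pairs into one canonical identifier with a union-find structure. The function name `canonicalize` and the rule for picking the canonical URI (the lexicographically smallest one) are my own assumptions for illustration.

```python
def canonicalize(same_as_pairs):
    """Map every URI to one representative of its sameAs cluster."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Arbitrary but deterministic choice: smaller string becomes the root.
            parent[max(root_left, root_right)] = min(root_left, root_right)

    return {uri: find(uri) for uri in parent}


mapping = canonicalize([
    ("https://www.wikidata.org/entity/Q64", "http://dbpedia.org/resource/Berlin"),
])
```

Both Berlin URIs now resolve to the same representative, which is exactly the kind of pairing that has to be maintained by hand when no alignment model is available.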

🔥 Sun et al propose AliNet, an end-to-end GNN-based architecture able to aggregate distant multi-hop neighbourhoods for entity alignment. Due to the schema heterogeneity, the task gets more complex as neighbourhoods of similar entities in different KGs are not isomorphic. To compensate for that, the authors suggest paying attention to the n-hop surroundings of a node as well as TransE-style relation modeling with a specific loss function. Eventually, the gate function controls how much information a node gets from 1-, 2-, 3-hop neighbours. AliNet is evaluated on multi-lingual versions of DBpedia (ZH-EN, JA-EN, FR-EN), on DBpedia-Wikidata, and on DBpedia-YAGO datasets. As you know, DBpedia, Wikidata, and YAGO have totally different schemas, so it’s even more surprising to see 90+% Hits@10 prediction accuracy 👀 No more manually created mappings!
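The gating idea can be sketched in a few lines. This is one common gating form under my own simplifying assumptions (only two hop levels, a per-dimension scalar gate instead of a learned matrix, hypothetical names), so please don't read it as AliNet's exact formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(h_one_hop, h_two_hop, gate_weights, gate_bias):
    """Blend 1-hop and 2-hop aggregations of a node, dimension by dimension.

    A gate value close to 1 keeps the 1-hop signal, close to 0 keeps the
    2-hop signal. Weights and biases would be learned in a real model.
    """
    combined = []
    for h1, h2, w, b in zip(h_one_hop, h_two_hop, gate_weights, gate_bias):
        gate = sigmoid(w * h2 + b)
        combined.append(gate * h1 + (1.0 - gate) * h2)
    return combined


node_repr = gated_combine([0.2, -1.0], [0.9, 0.4], [0.5, 0.5], [0.0, 0.0])
```

The point of the gate is that distant neighbours are noisy: the model learns per node how much of that distant context is worth mixing in.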

Xu et al study the alignment problem in multi-lingual KGs (DBpedia in this case) where GNN-based approaches can fall into the “many-to-one” case and generate several candidate source entities for a given target entity. The authors examine how to make the GNN encoder outputs more confident in their predictions (and hence increase the Hits@1 score 📈) and offer two strategies. The first is Easy-to-Hard decoding: essentially a two-pass procedure where the first step invokes an alignment model, and at the second step at least K candidates with a probability higher than a threshold 𝛕 are added to the ground truth and the alignment model is executed again. The second strategy employs the Hungarian algorithm to find the best assignment of candidate pairs. Since the algorithm has O(N⁴) time complexity, the authors apply a threshold 𝛕 over probabilities, which dramatically reduces the candidate sub-spaces and allows the algorithm to run in reasonable time. Results: depending on the underlying GNN model and task, you could get 3–5% higher H@1.
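Here is a toy illustration of why a global one-to-one assignment helps with the "many-to-one" case. Note the swap: instead of the Hungarian algorithm itself I use a brute-force search over permutations, which finds the same optimum on tiny inputs but is hopeless on real data. Treating the threshold as a hard filter on pairs and the name `best_one_to_one` are my assumptions.

```python
from itertools import permutations


def best_one_to_one(prob, tau):
    """Best one-to-one alignment for a square probability matrix.

    prob[i][j] is the predicted probability that source i matches target j.
    Pairs with probability below tau are not allowed. Returns a list where
    position i holds the target index for source i, or None if no valid
    assignment exists.
    """
    n = len(prob)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        if any(prob[i][perm[i]] < tau for i in range(n)):
            continue  # uses a pruned candidate pair
        score = sum(prob[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return None if best_perm is None else list(best_perm)


prob = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]
alignment = best_one_to_one(prob, tau=0.25)
```

A greedy argmax would send both source 0 and source 1 to target 0. The global assignment instead gives target 0 to source 1 and moves source 0 to its second-best option, target 1.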

Knowledge Graph Completion and Link Prediction

AAAI’20 marks and outlines two growing trends: neuro-symbolic computation is back and shiny; temporal KGs are getting more traction.

As for the neuro-symbolic paradigm, Minervini et al extend Neural Theorem Provers (NTPs) to Greedy NTPs (GNTPs). NTPs are end-to-end differentiable systems that learn rules and try to prove facts given the rest of the KG. Although NTPs bring explainability, their complexity grows extremely fast with the KG size, and NTPs could not be evaluated on reasonably large datasets (not to mention whole KGs like Wikidata). To alleviate the problem, the authors derive a greedy kNN strategy to select the facts that can maximize proof scores, thus making GNTPs far more scalable. 🎉 Moreover, in addition to (s,p,o) triples, the authors also allow rules to be expressed in natural language as mentions, e.g., “London is located in the UK”. Experimenting on link prediction tasks, the authors find that text mentions, even when encoded with a simple bag-of-embeddings model, yield significantly better results.
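The greedy selection step boils down to a nearest-neighbour lookup in embedding space. Below is a rough sketch under my own assumptions: facts are represented by a single vector each, closeness is squared Euclidean distance, and the names `top_k_facts` and the toy vectors are made up for the example.

```python
import heapq


def top_k_facts(goal_vec, fact_vecs, k):
    """Return the names of the k facts whose embeddings are closest to the goal."""
    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return heapq.nsmallest(
        k, fact_vecs, key=lambda name: squared_distance(goal_vec, fact_vecs[name])
    )


facts = {
    "locatedIn(London, UK)": [0.9, 0.1],
    "capitalOf(Paris, France)": [0.0, 1.0],
    "locatedIn(Berlin, Germany)": [0.8, 0.0],
}
candidates = top_k_facts([1.0, 0.0], facts, k=2)
```

Only the selected candidates are then passed to the expensive proving step, which is where the savings come from.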

⏳ KGs often contain facts that are valid only within a certain period of time and then get updated with new values, e.g., (Albert Einstein, spouse, Mileva Marić, from: 1903, to: 1919) and then (Albert Einstein, spouse, Elsa Einstein, from: 1919, to: 1936). That is, depending on the point in time some links in a KG are correct and some are not; the time dimension is especially crucial in Enterprise KGs. Working with Temporal KGs requires models to weight links differently given a certain time window, unlike traditional KG embedding approaches that consider only static graphs. Goel et al propose DE-SimplE, an extension of the SimplE model that supports the temporal dimension in KGs via diachronic entity embeddings, where 𝛾d of the d entity embedding dimensions capture temporal features and the remaining (1-𝛾)d capture static KG features. Surely, we’ll see more models for dynamic KGs changing over time ⏱
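The split into temporal and static dimensions is easy to sketch. I assume here that the temporal part uses a sine activation with per-dimension amplitude, frequency and phase, which is my reading of the diachronic embedding idea; the function name, parameter names, and all numbers are hypothetical and would be learned in a real model.

```python
import math


def diachronic_embedding(amplitude, frequency, phase, timestamp, gamma):
    """Entity embedding at a given time.

    The first gamma * d dimensions vary with time, the rest stay static.
    frequency and phase only need entries for the temporal dimensions.
    """
    d = len(amplitude)
    n_temporal = int(gamma * d)
    embedding = []
    for i in range(d):
        if i < n_temporal:
            embedding.append(
                amplitude[i] * math.sin(frequency[i] * timestamp + phase[i])
            )
        else:
            embedding.append(amplitude[i])  # static feature, same at any time
    return embedding


einstein_1903 = diachronic_embedding(
    [0.5, -0.3, 0.8, 0.1], [0.1, 0.2], [0.0, 0.0], timestamp=1903, gamma=0.5
)
einstein_1919 = diachronic_embedding(
    [0.5, -0.3, 0.8, 0.1], [0.1, 0.2], [0.0, 0.0], timestamp=1919, gamma=0.5
)
```

The last two dimensions are identical for both years while the first two differ, so a scoring function on top can rank different spouses at different times.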

👩‍⚖ A fresh look at fact classification is offered by Hildebrandt et al: they propose to adopt debate dynamics in R2D2, where two agents compete trying to prove or refute a given triple, and a judge (disguised as a binary classifier) decides whether a triple is true or false. The system is trained with Reinforcement Learning and can also be tweaked to perform link prediction tasks in addition to triple classification. The authors also conduct a survey 📝 to find out how human evaluators assess arguments generated by prosecutors and advocates. As scalability is a general problem of rule mining systems, it would be interesting to see a scalability study of R2D2 🤖.

☄ Commonsense KGs like ConceptNet and ATOMIC are now used in many NLP tasks but so far have not received a thorough study of their link prediction and completion characteristics.

Malaviya et al show that traditional KG embedding algorithms like DistMult or ConvE yield pretty low results due to KG sparsity (that is, the density of FB15K-237 is about 1.2e-3 with an average in-degree of 17, whereas ATOMIC’s density is 9e-6 with an average in-degree of 2.25). The authors argue that we need to consider structural and semantic context as well, so in the proposed model R-GCN is used to aggregate neighbourhood information and BERT to encode phrases and text. Additionally, the authors experiment with synthetically induced edges (whose probability scores exceed a certain threshold) and subgraph sampling for R-GCN. The paper is a blast 🎆 to read: well-structured, well-explained concepts, thorough experiments and analysis -> strong work 💪
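If you want to check how sparse your own graph is, the two numbers are quick to compute. The sketch assumes the standard directed-graph definition of density (edges divided by N·(N-1)) and counts distinct (head, tail) pairs regardless of relation type; whether the paper counts edges exactly this way is my assumption, and the toy triples are invented.

```python
def graph_stats(triples):
    """Return (density, average in-degree) for a list of (head, relation, tail)."""
    nodes, edges = set(), set()
    for head, _relation, tail in triples:
        nodes.add(head)
        nodes.add(tail)
        edges.add((head, tail))
    n = len(nodes)
    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
    avg_in_degree = len(edges) / n if n else 0.0
    return density, avg_in_degree


density, avg_in_degree = graph_stats([
    ("a", "r", "b"),
    ("b", "r", "c"),
    ("a", "r", "c"),
])
```

With densities several orders of magnitude apart, as between FB15K-237 and ATOMIC, most nodes in the sparser graph have almost no neighbourhood to learn from, which is the motivation for adding textual context.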

Visualization of embeddings from WN18RR. Source: Zhang et al

Shorter descriptions of papers whose algorithms you might find useful: Vashishth et al propose InteractE, an improved version of ConvE with various reshaping strategies and circular convolutions, which consistently outperforms ConvE in the benchmarks. Zhang et al extend RotatE in their model HAKE in order to better model hierarchical relationships, so that narrower term embeddings are likely to be found within embeddings of broader terms ◎

KG-based Conversational AI and Question Answering

AAAI’20 hosted the Dialog System Technology Challenge workshop (DSTC8). The event brought together experts in Conversational AI, including folks from Google Assistant, Amazon Alexa, and DeepPavlov 🎅.

Within DSTC the community solves even more complex tasks while getting bigger and richer datasets. I’d highlight the Schema-Guided Dialogue (SGD, like Stochastic Gradient Descent 😉) dataset presented by Rastogi et al. To date, it’s the largest available multi-domain dataset, with more than 16K dialogues in 16 different domains. SGD does not have one unified schema; rather, each of the 16 domains has its own schema description obtained from Freebase. In fact, it is so large that one cannot even enumerate all possible slot values or guarantee that the train set covers all possible service/slot/value combinations. Hence, you definitely want to have some zero-shot state tracking component. Here is the report on the results achieved by 25 competing teams: they were able to drastically push the baseline scores of 20–60% up to 86–99% 🎉.

Even more exciting KG applications:

A model for Visual Storytelling. Source: Hsu et al

Wang et al propose TransDG for response generation in chit-chat dialogue systems where the main trick is to transfer factual knowledge from KGQA systems trained on SimpleQuestions / Freebase and ConceptNet. Hsu et al define KG-Story, a framework for visual storytelling which generates a coherent textual description based on a sequence of photos. Moreover, the framework is enriched with KGs which help in image scene understanding as well as text generation.

In the KGQA domain, Sun et al propose SPARQA: skeleton-based semantic parsing for answering complex questions over KGs. By skeleton the authors mean a span of minimal semantic units (e.g., VP, NP, PP) and some attachment relations that build a prototype of a query tree, which is then instantiated and sent to a KG query engine. The approach is evaluated on Freebase-based datasets, GraphQuestions and ComplexWebQuestions, though there are Wikidata versions of those 😉. For CommonsenseQA, Lv et al apply both ConceptNet and IR from Wikipedia to mine some evidence knowledge and pass it through a GCN. On the other hand, the inference module based on XLNet encodes questions, choices and the found evidence to obtain a joint representation. The approach yields 75.3% accuracy, which is the current top-2!

Conclusion

We had a brief look at KGs applied in mostly NLP-related tasks. Surely, there are applications in other domains, like structuring scene graphs in computer vision, or bioinformatics, where graphs, for instance, help to study molecules. I hope your backlog has now grown just a little bit 😉.

The growing Graph ML community already provides enough materials to dive into the area, and I’d encourage you to have a look there from time to time, e.g., the Graph-based Deep Learning Literature repo on GitHub that aggregates most graph-related papers from top conferences, or the Graph ML channel in Telegram. Sergei Ivanov has already compiled a great post on Graph ML trends from ICLR 2020 👏

Luckily, ICLR’20 is big enough to study even more fresh KG-related papers. Thanks for reading and stay tuned! 🚀