Building a Transformer-Based Entity Linking System

History and a step-by-step guide.

Entity Linking System with bi-encoder

[Updated on 18th April, 2021] You can run the experiments in your own Colab Pro environment. For further details, see here.

History

Entity Linking (EL) is the task of mapping a span of text in a document, called a mention, to an entity in a knowledge graph, and it is considered one of the most important tasks in natural language understanding and its applications.

For example, when a user says, “I saw Independence Day today, and it was very exciting,” in a chatbot dialogue, no one would interpret Independence Day as the holiday; it clearly refers to the movie.

This is just one example, but entity linking is a very important task for downstream NLP problems.

In this article, we will create two simple entity linking systems based on a bi-encoder: the former uses surface-based candidate generation (CG), and the latter uses approximate nearest neighbor search (ANN search).

  • Entity Linking system with bi-encoder, based on surface-based CG
  • Entity Linking system with bi-encoder, based on ANN-search CG

Until 2019, entity linking models were typically trained by feeding a neural net with string similarity between mentions and entities, co-occurrence statistics from Wikipedia (features called priors in the early papers), the context in which a mention appears, and so on.

As it requires high computational costs to compare all entities in a knowledge base with each mention, previous work used such statistics, or alias tables, for candidate generation, i.e., to filter the entities to consider for each mention (Yamada et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018; Cao et al., 2018).

For an alias table to be effective, it must cover a large number of the mentions occurring in the target documents. This is not an issue for entity linking in generic documents, owing to Wikipedia and its rich annotation of anchor texts available for building an alias table. By contrast, building an effective alias table is difficult and costly in specialized domains, as Wikipedia’s coverage is often inadequate for these domains, and a large domain-specific corpus of anchor texts can hardly be found.

If the target mention/document is in a general domain, it is possible to construct a large alias table statistically from a large number of mention-entity pairs taken from Wikipedia and web crawls.

When an alias table is not available or has limited coverage, candidates can be generated by surface similarity between mentions and entities (Murty et al., 2018; Zhu et al., 2019). However, capturing all possible entities for a mention by surface similarity alone is extremely difficult, if not impossible, especially in the biomedical literature, where entities often exhibit diverse surface forms (Tsuruoka et al., 2008). For example, the gold entity for the mention rs2306235, which occurs in the ST21pv dataset (Mohan and Li, 2019), is the PLEKHO1 gene. However, this entity is likely to be overlooked by surface-based candidate generation, as it does not share a single character with rs2306235.

Bi-encoder-based Retrieval

To mitigate the dependence on alias tables, Gillick et al. (2019) proposed using approximate nearest neighbor (ANN) search. Although their method successfully abstains from using alias tables, it instead requires a large corpus of contextualized mention-entity pairs for training, which is rare in specialized domains. Wu et al. (2019) showed that a bi-encoder is still useful for entity candidate generation in similar settings, such as the zero-shot setting proposed by Logeswaran et al. (2019), but the accuracy is still far below that of the general domain, and further research is expected in this setting.

Cited from Wu et al., ‘19.

In this article, we will implement entity linking using a bi-encoder, with two types of candidate generation: one based on surface forms and the other using approximate nearest neighbor search.

Source Code

The source code is available here:
https://github.com/izuna385/Entity-Linking-Tutorial

Data Preprocessing

The preprocessing code is included in the repository above. In this article, we use the BC5CDR dataset (Li et al., 2016) for training and evaluating the model.

BC5CDR is a dataset created for the BioCreative V Chemical and Disease Mention Recognition task. It comprises 1,500 articles, containing 15,935 chemical and 12,852 disease mentions. The reference knowledge base is MeSH, and almost all mentions have a gold entity in the reference knowledge base.

In this article, we preprocessed PubTator-format documents with scispaCy using multiprocessing. Here are the tools used for this preprocessing:
https://github.com/izuna385/PubTator-Multiprocess-Parser
https://github.com/izuna385/ScispaCy-Candidate-Generator

Model and Scoring

Transformers are used for encoding mentions and entities. Cited from Humeau et al., ‘20.

The mention encoder and the candidate entity encoder have a very simple structure. The mention embedding is encoded from the link-target span and its surrounding word pieces, and the entity embedding is generated from the entity name and its description.
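
The following is a minimal sketch of this bi-encoder in PyTorch with Hugging Face transformers. The base model (bert-base-uncased), the [CLS] pooling, and the [M] / [/M] mention markers are illustrative assumptions, not necessarily what the repository uses.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class BiEncoder(nn.Module):
    """Minimal bi-encoder: one transformer encodes mentions, another encodes entities."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.mention_encoder = AutoModel.from_pretrained(model_name)
        self.entity_encoder = AutoModel.from_pretrained(model_name)

    @staticmethod
    def _encode(encoder, inputs):
        # Use the [CLS] token embedding as the fixed-size representation.
        return encoder(**inputs).last_hidden_state[:, 0]

    def forward(self, mention_inputs, entity_inputs):
        mention_emb = self._encode(self.mention_encoder, mention_inputs)  # (B, H)
        entity_emb = self._encode(self.entity_encoder, entity_inputs)    # (B, H)
        # Score every mention in the batch against every entity with an inner product.
        return mention_emb @ entity_emb.T                                 # (B, B)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Mention side: the link-target span ([M] ... [/M]) plus its surrounding word pieces.
mentions = tokenizer(["... patients developed [M] hypertension [/M] after treatment ..."],
                     return_tensors="pt", padding=True, truncation=True)
# Entity side: the entity name followed by its description.
entities = tokenizer(["Hypertension : abnormally high blood pressure ..."],
                     return_tensors="pt", padding=True, truncation=True)
scores = BiEncoder()(mentions, entities)
```

Because the two encoders produce embeddings of the same dimensionality, a whole batch of mentions can be scored against a whole batch of entities with a single matrix multiplication.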

Candidate Generation

For surface-based CG, we use scispaCy for two reasons: first, the dataset we are working on, BC5CDR, belongs to the biomedical field, and second, scispaCy itself provides surface-form-based candidate generation. For ANN-based CG, we will use faiss, as sketched below.
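
For the ANN-based side, a minimal faiss sketch looks like the following; the index type, dimensionality, and k are illustrative choices, and the entity embeddings would in practice come from the trained entity encoder rather than being random.

```python
import numpy as np
import faiss

dim = 768                 # hidden size of the entity encoder
num_entities = 100_000    # illustrative knowledge-base size

# Pre-computed entity embeddings (random here; in practice from the entity encoder).
entity_embs = np.random.rand(num_entities, dim).astype("float32")

# Exact inner-product index; an IVF or HNSW index can be swapped in for larger KBs.
index = faiss.IndexFlatIP(dim)
index.add(entity_embs)

# Retrieve the top-k candidate entities for a batch of mention embeddings.
mention_embs = np.random.rand(32, dim).astype("float32")
scores, candidate_ids = index.search(mention_embs, 10)
print(candidate_ids.shape)  # (32, 10)
```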

Training

To reduce the cost of preparing negative examples when training with BERT, we use in-batch negative sampling here.

With this training method, the encoder can be trained with more negatives by increasing the batch size, but this also increases the GPU resources required. In this experiment, we confirmed that training is possible even with a batch size of 16 🙂.
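
A minimal sketch of this loss, assuming the mention and entity embedding matrices produced by the bi-encoder above: the gold entity for mention i is entity i in the batch, and the remaining entities in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F


def in_batch_negative_loss(mention_embs: torch.Tensor,
                           entity_embs: torch.Tensor) -> torch.Tensor:
    """mention_embs, entity_embs: (batch_size, hidden); row i is a gold mention-entity pair."""
    # (B, B) score matrix: entry (i, j) is the score of mention i against entity j.
    scores = mention_embs @ entity_embs.T
    # The gold entity for mention i is entity i, so the target is the diagonal.
    targets = torch.arange(scores.size(0), device=scores.device)
    # Every other entity in the batch acts as a negative via softmax cross-entropy.
    return F.cross_entropy(scores, targets)


# Example with batch size 16, as used in this experiment.
mention_embs = torch.randn(16, 768, requires_grad=True)
entity_embs = torch.randn(16, 768, requires_grad=True)
loss = in_batch_negative_loss(mention_embs, entity_embs)
loss.backward()
```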

Evaluation and Results

We use two methods for candidate generation, surface forms and approximate nearest neighbor search, but for both, the final prediction uses the inner product of the mention and entity embeddings (a rough evaluation sketch follows the list below).

  • Surface-based CG
  • ANN-based CG
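
Roughly, the evaluation for both settings can be sketched as follows (names and shapes are illustrative, not the repository's actual evaluation code): candidate generation is judged by whether the gold entity appears among the top-K candidates, and the final prediction by whether the candidate with the highest inner product is the gold entity.

```python
import numpy as np


def evaluate(mention_embs, entity_embs, candidate_ids, gold_ids):
    """
    mention_embs : (N, H) mention embeddings
    entity_embs  : (num_entities, H) embeddings of all KB entities
    candidate_ids: (N, K) candidate entity ids from surface-based or ANN-based CG
    gold_ids     : (N,)  gold entity id for each mention
    """
    recall_hits, correct = 0, 0
    for m, cands, gold in zip(mention_embs, candidate_ids, gold_ids):
        # Candidate generation quality: is the gold entity among the top-K candidates?
        if gold in cands:
            recall_hits += 1
        # Final prediction: inner product between the mention and each candidate entity.
        cand_scores = entity_embs[cands] @ m
        if cands[int(np.argmax(cand_scores))] == gold:
            correct += 1
    n = len(gold_ids)
    return recall_hits / n, correct / n


recall_at_k, accuracy = evaluate(
    np.random.rand(100, 768),
    np.random.rand(1000, 768),
    np.random.randint(0, 1000, size=(100, 10)),
    np.random.randint(0, 1000, size=100),
)
```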

Discussions and Further Directions

The final accuracy was about 50% with surface-form-based candidate generation. Here we used the inner product for the final prediction, but surface similarity is in fact another important feature for prediction. In addition to scoring between embeddings, it is conceivable to prepare a scoring function that also takes character similarity into account.
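
As a rough sketch of that idea (the interpolation weight and the use of difflib are illustrative choices, not part of the repository), the embedding score can be interpolated with a character-level similarity between the mention's surface form and the candidate entity name:

```python
from difflib import SequenceMatcher

import numpy as np


def combined_score(mention_emb: np.ndarray, entity_emb: np.ndarray,
                   mention_text: str, entity_name: str,
                   alpha: float = 0.5) -> float:
    """Interpolate the embedding inner product with a character-level similarity."""
    embedding_score = float(mention_emb @ entity_emb)
    # Character-level similarity in [0, 1] between the two surface forms.
    surface_score = SequenceMatcher(None, mention_text.lower(),
                                    entity_name.lower()).ratio()
    return alpha * embedding_score + (1.0 - alpha) * surface_score


score = combined_score(np.random.rand(768), np.random.rand(768),
                       "oestrogens", "Estrogens")
```

In practice the two scores live on different scales, so the embeddings would need to be normalized, or the weight alpha tuned, before interpolating.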

In addition, the dataset used in this study contains only about 10,000 examples, which is only 1/10000th of the amount of data used by Gillick et al. Given that the model still reaches reasonable accuracy with so little data, we can infer that it successfully encodes the context and entity information into the embeddings.

GitHub Repository

Feel free to contact me or create an issue!
