Entity Linking (EL) is the task of detecting entity mentions in a text and linking them to the corresponding entries of a Knowledge Base (KB). EL is traditionally composed of three major parts:
- Mention detection (spotting),
- Candidate generation, and
- Candidate disambiguation.
The performance of an EL system is highly dependent on the accuracy of each individual part.
Because the ExPoSe project requires annotating parliamentary conversations by linking the entities in them to external resources such as Wikipedia, we developed several ideas on this task and participated in the SIGIR ERD Challenge 2014. A performance analysis of our system on the challenge's datasets shows that the proposed approaches successfully improve the accuracy of the baseline system.
In general, we focus on these three main building blocks of EL systems and try to improve on the results of an open source EL system, namely DBpedia Spotlight.
We use DBpedia Spotlight as a baseline and try to resolve some of its problems and improve its accuracy. One of the main reasons for choosing DBpedia Spotlight is that it is a highly configurable system, which makes it a good choice as a baseline EL system. In DBpedia Spotlight, spotting can be done by employing NLP techniques such as Named Entity Recognition, multi-word entity detection, and finding sequences of capitalized words, which are the main approaches used in most EL systems.
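The capitalized-sequence heuristic mentioned above can be sketched with a simple regular expression; this is an illustrative simplification, not DBpedia Spotlight's actual implementation, which combines several spotting strategies.

```python
import re

def spot_capitalized_sequences(text):
    """Return maximal runs of capitalized words as candidate mentions.

    A toy sketch of one spotting heuristic; real spotters also use
    NER models and multi-word entity dictionaries.
    """
    pattern = r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b"
    return [m.group(0) for m in re.finditer(pattern, text)]

print(spot_capitalized_sequences("The European Union met Angela Merkel in Berlin."))
```

Such a heuristic over-generates (e.g. sentence-initial words), which is one reason a NIL detection step is needed downstream.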
Alternatively, a language-independent method based on a surface form dictionary can be used for spotting. This step is one of the most important parts of an EL system, and most EL systems try to produce all possible surface forms for a given text. However, not all detected spots necessarily refer to an entity in the target KB. Therefore, one important post-processing step for improving the accuracy of DBpedia Spotlight is determining whether a spotted text actually refers to an entity in the KB. This decision is called “NIL detection”.

After spotting the given text, all possible candidate entities for the spotted strings should be generated. Since the formatting of a spotted string may differ from the formatting of the entries in the target KB, some pre-processing is needed to match spotted strings with KB entities. Therefore, another important step in improving the performance of an EL system is pre-processing the input text and matching its entities with the entities of the KB.

Another dominant factor in the performance of an EL system is determining, given the context, which of the candidates of a surface form is most likely to be the one mentioned in this instance. In some cases, a surface form may even have more than one correct entity; the EL system should then rank the candidates according to their correctness. Therefore, after candidate generation, post-processing and disambiguating the candidates can further improve the accuracy of the EL system.
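The string-matching step between spotted text and KB entries can be illustrated as follows. The normalization choices (accent stripping, underscore handling, case folding) and the sample KB titles are hypothetical, a minimal sketch of the idea rather than the system's actual matching code.

```python
import unicodedata

def normalize_surface_form(s):
    """Normalize a string so spotted text can be matched against KB titles:
    strip accents, replace underscores, collapse whitespace, lowercase."""
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    return " ".join(s.replace("_", " ").split()).lower()

# Hypothetical KB titles, indexed by their normalized form
kb_index = {normalize_surface_form(t): t
            for t in ["Barack_Obama", "New York City", "Café_de_Flore"]}

def match(spot):
    """Return the KB entry for a spotted string, or None (a NIL candidate)."""
    return kb_index.get(normalize_surface_form(spot))

print(match("barack  obama"))   # formatting differences are tolerated
print(match("Unknown Thing"))   # no KB entry: a candidate for NIL
```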
We participated in the SIGIR ERD Challenge 2014 and submitted the system proposed in this paper, with good results. We focus on three major techniques that could improve the performance of DBpedia Spotlight:
- pre-processing,
- NIL detection, and
- candidate disambiguation.
As pre-processing, we first tune the parameters of DBpedia Spotlight to find its best configuration. We also normalize the character encoding and convert all documents, which arrive in different formats, into a single uniform format. Additionally, since DBpedia Spotlight is case sensitive, in order to find all possible surface forms we submit both the original text and a capitalized version of it to DBpedia Spotlight.

For the NIL detection part of our system, we use two different approaches: filtering candidates that are not included in the target KB, and classifying entities as “NIL” or “Not NIL” instances. Since the target KB of DBpedia Spotlight differs from our target KB (the ERD 2014 target KB), entities detected by DBpedia Spotlight may not exist in the target KB. Therefore, we filter out the surface forms for which none of the candidates exist in the KB and consider them as NIL. In the classification approach, we use texts whose entities are annotated to learn a classifier that labels candidates as “NIL” or “Not NIL”. We extract different types of features from these annotated texts and their entities, and train a classifier. If the score that the classifier assigns to a candidate is lower than a predefined threshold, we classify it as “NIL”.

Finally, in order to disambiguate the candidates, we use the scores generated by the classifier to find the most probable candidate for each surface form. The defined features exploit the context within the text to estimate the correctness of an entity. The main idea behind these features is that the entities mentioned in a given text are related to each other. Therefore, we can use the other mentioned entities in a text to disambiguate an entity. We use this intuition to define several features.
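The combination of threshold-based NIL detection and score-based disambiguation described above can be sketched as follows. The scores here are stand-ins for classifier outputs, and the threshold value is hypothetical; the actual features and classifier are described in the referenced paper.

```python
def link_surface_form(scored_candidates, threshold=0.5):
    """Pick the best candidate for one surface form, or None (NIL).

    scored_candidates: list of (entity, score) pairs, where the score
    would come from the trained classifier.  Candidates below the
    threshold are treated as NIL; among the rest, the highest-scoring
    candidate is returned as the disambiguated entity.
    """
    confident = [(e, s) for e, s in scored_candidates if s >= threshold]
    if not confident:
        return None  # NIL: no candidate is confident enough
    return max(confident, key=lambda pair: pair[1])[0]

print(link_surface_form([("Paris", 0.8), ("Paris_Hilton", 0.3)]))  # Paris
print(link_surface_form([("Foo", 0.1)]))                           # None (NIL)
```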
For example, we traverse the category structure of Wikipedia to estimate a relatedness score for each entity, based on the closeness of its categories to the categories of the other entities in the text.
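One simple way to realize this idea is to walk upward through the category hierarchy and score two entities by the overlap of their ancestor categories. The toy category data and the Jaccard-style score below are hypothetical, a sketch of the intuition rather than the exact measure used in the paper.

```python
from collections import deque

# Toy stand-ins for Wikipedia's entity-category and category-parent links
ENTITY_CATS = {"Amsterdam": {"Cities_in_the_Netherlands"},
               "Netherlands": {"Countries_in_Europe"}}
CAT_PARENTS = {"Cities_in_the_Netherlands": {"Geography_of_the_Netherlands"},
               "Geography_of_the_Netherlands": {"Geography_of_Europe"},
               "Countries_in_Europe": {"Geography_of_Europe"}}

def ancestor_cats(cats, max_depth=3):
    """Collect all categories reachable upward within max_depth hops (BFS)."""
    seen = set(cats)
    frontier = deque((c, 0) for c in cats)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for parent in CAT_PARENTS.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, depth + 1))
    return seen

def relatedness(e1, e2):
    """Jaccard overlap of the two entities' ancestor-category sets."""
    a, b = ancestor_cats(ENTITY_CATS[e1]), ancestor_cats(ENTITY_CATS[e2])
    return len(a & b) / len(a | b)

print(relatedness("Amsterdam", "Netherlands"))  # > 0 via Geography_of_Europe
```

A candidate whose categories lie close to those of the other entities in the document would receive a higher score from such a feature.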
For more details on our proposed approach and its results, please refer to this paper:
- A. Olieman, H. Azarbonyad, M. Dehghani, J. Kamps, and M. Marx, "Entity linking by focusing DBpedia candidate entities", In Proceedings of the First International Workshop on Entity Recognition & Disambiguation (ERD '14), 2014, pp. 13-24.