[STANBOL-1037] Entity Disambiguation for Stanbol - ASF JIRA

Details

Type: Story
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Enhancer, Entityhub
Labels:
- gsoc2013
- mentoring

Description

Entity Disambiguation in Stanbol mainly refers to the process of modifying the fise:confidence values of the EntityAnnotations produced by any Linking Engine within Stanbol (EntityLinkingEngine or NamedEntityLinking). These confidence values are modified so that, after disambiguation, the possible candidates (entities) to link with are ranked for each TextAnnotation. Each candidate is an Entity within EntityHub or any other Knowledge Base configured in Stanbol.

Disambiguation
==============

Entity Linking is not a trivial task because of the name ambiguity problem: the same name may refer to different entities in different contexts, and the same entity can usually be mentioned using several different names. For instance, the name Michael Jordan can refer to more than 20 entities in Wikipedia, some of which are shown below:

- Michael Jordan (NBA Player)
- Michael I. Jordan (Berkeley Professor)
- Michael B. Jordan (American Actor)

This situation arises not only with well-known semantic knowledge bases such as DBpedia or Freebase, but is also important for any enterprise semantic dataset or custom vocabulary. A simple example is resolving ambiguous names within a database of employees.

Formally, Entity Disambiguation for Stanbol should work as follows: after an enhancement process of a ContentItem using an enhancement chain that includes a Linking Engine, we get a set of TextAnnotations TA = {T1, T2, ..., Tn}. Each TextAnnotation in TA contains a name mention, which is characterized by its name, its local surrounding context (fise:selection-context) and the ContentItem containing it.

For each TextAnnotation Ti in TA, and as a result of the Linking Engine, we get a set of EntityAnnotations EAi = {E1i, E2i, ..., ENi}, where i corresponds to TextAnnotation i in TA. We should rely on the linking engines to provide all possible entity annotations (candidates within all sites in the EntityHub) for each TextAnnotation. Each EntityAnnotation is characterized by its Knowledge Base (entityhub:site) and its entry in that knowledge base (fise:entity-reference). The objective of the disambiguation process is to rank each EntityAnnotation set EAi by modifying the confidence values of its EntityAnnotations, so that the entity with the highest confidence value is the referent entity for the TextAnnotation associated with EAi.
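The re-ranking objective above can be illustrated with a minimal Python sketch. It is not Stanbol code: the `EntityAnnotation` class, the `rerank` function and the score dictionary are hypothetical names introduced here, and normalizing the scores so that they sum to 1 is only one possible way to write new confidence values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EntityAnnotation:
    """Hypothetical stand-in for one candidate of a TextAnnotation."""
    entity_reference: str  # corresponds to fise:entity-reference
    site: str              # corresponds to entityhub:site
    confidence: float      # corresponds to fise:confidence


def rerank(candidates, scores):
    """Rewrite confidences from disambiguation scores and sort descending.

    `scores` maps entity_reference -> non-negative disambiguation score.
    Scores are normalized to sum to 1 (an assumption of this sketch).
    """
    clipped = [max(scores.get(c.entity_reference, 0.0), 0.0) for c in candidates]
    total = sum(clipped)
    if total == 0:
        # No disambiguation evidence: keep the linking engine's confidences.
        updated = list(candidates)
    else:
        updated = [replace(c, confidence=s / total)
                   for c, s in zip(candidates, clipped)]
    return sorted(updated, key=lambda c: c.confidence, reverse=True)
```

The first element of the returned list is then the proposed referent entity for the TextAnnotation.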

Algorithms
==========

- Local Approaches

(From [1]) Conventional entity linking approaches have focused on making independent Entity Linking decisions using the local mention-to-entity compatibility for each isolated mention. The essential idea is to extract discriminative features from the description of a specific entity and then link each name mention in a document by comparing its context with each of its candidate referent entities. Such an approach is followed by the Disambiguation-MLT engine in STANBOL-723.
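A minimal sketch of the local idea follows. It uses a plain bag-of-words cosine similarity as a stand-in; it is not the Solr MLT query used by the Disambiguation-MLT engine, and the function names and the shape of the `descriptions` dictionary are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; a deliberately naive tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_locally(mention_context, descriptions):
    """Rank candidates for ONE mention, independently of other mentions.

    `descriptions` maps a candidate id to its textual description
    (hypothetical input, e.g. an abstract stored in the knowledge base).
    """
    context = bag_of_words(mention_context)
    scored = {entity: cosine(context, bag_of_words(text))
              for entity, text in descriptions.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)
```

Each mention is scored in isolation here, which is exactly the limitation the global approaches try to address.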

- Global Approaches (Collective Entity Linking)

The main drawback of the local approaches is that they do not take into consideration the interdependence between different Entity Linking decisions. Specifically, the entities in a topically coherent document are usually semantically related to each other. In such cases, figuring out the referent entity of one name mention may in turn give us useful information for linking the other name mentions in the same document. This suggests that disambiguation performance could be improved by resolving all mentions at the same time.

This approach only makes sense in a scenario with highly connected knowledge bases, where the entities are semantically related in some way.
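As a rough illustration of resolving all mentions at the same time, the sketch below tries every joint assignment of candidates and keeps the one with the most related entity pairs. The exhaustive search, the function name and the `related` set of entity pairs are assumptions of this sketch; the search is exponential in the number of mentions, so it only serves to show the principle.

```python
from itertools import product


def link_collectively(candidates, related):
    """Pick one entity per mention so that the chosen set is most coherent.

    `candidates` maps mention -> list of candidate entity ids.
    `related` is a set of frozenset({a, b}) pairs that are connected in the
    knowledge base (hypothetical input for this sketch).
    """
    mentions = [m for m in candidates if candidates[m]]
    best, best_score = (), -1
    for assignment in product(*(candidates[m] for m in mentions)):
        # Coherence = number of related pairs among the chosen entities.
        score = sum(
            1
            for i in range(len(assignment))
            for j in range(i + 1, len(assignment))
            if frozenset((assignment[i], assignment[j])) in related
        )
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(mentions, best))
```

With no relations between candidates every assignment scores 0, which reflects why the approach needs a well-connected knowledge base.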

- Graph Based Approaches

In these approaches, both the Knowledge Base and the interdependence between possible Entity Linking decisions are modeled as graphs, and inference algorithms are used to resolve all the mentions within a document.
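The issue does not name a specific inference algorithm. As one possible example, the sketch below runs a random walk with restart (a PageRank-style iteration) over a graph whose nodes are the candidate entities and whose edges are knowledge base relations, then picks the best-scoring candidate per mention. All function names, the damping factor and the uniform restart distribution are assumptions of this sketch.

```python
def random_walk_scores(nodes, edges, damping=0.85, iterations=50):
    """PageRank-style scores on an undirected candidate graph."""
    nodes = list(dict.fromkeys(nodes))  # de-duplicate, keep order
    if not nodes:
        return {}
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        # Ignore edges that leave the candidate graph, and self-loops.
        if a in neighbors and b in neighbors and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    restart = 1.0 / len(nodes)
    scores = {n: restart for n in nodes}
    for _ in range(iterations):
        scores = {
            n: (1 - damping) * restart
            + damping * sum(scores[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }
    return scores


def disambiguate_by_graph(candidates, edges):
    """Choose, for each mention, its candidate with the highest walk score."""
    nodes = [e for cands in candidates.values() for e in cands]
    scores = random_walk_scores(nodes, edges)
    return {m: max(cands, key=scores.__getitem__)
            for m, cands in candidates.items() if cands}
```

Isolated candidates only receive the restart mass, so candidates that are connected to the candidates of other mentions rise in the ranking.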

Knowledge Bases
===============

As described in STANBOL-223, disambiguation requires some data to use as disambiguation features. The nature of this data depends on the particularities of the knowledge base. In general, it will be necessary to generate a semantic context for each candidate and process it in the disambiguation algorithm. The Disambiguation Context could be a fixed data structure for each kind of disambiguation engine in Stanbol, and developers would be responsible for providing the mechanisms that create those contexts for their custom vocabularies or knowledge bases.

For instance, with Local Approaches, developers should be able to configure Disambiguation-MLT, or any other local disambiguation engine, to obtain a disambiguation context from EntityHub and compute its similarity with the contexts of the mentions within the Content Item.

This can be as easy as selecting an Entity's disambiguation fields or as complex as calling methods that build disambiguation contexts on the fly. Normally, the first option involves generating the disambiguation fields at EntityHub index creation time. For instance, as described in STANBOL-223, for DBpedia it is possible to extract sentences containing mentions of entities from Wikipedia using https://github.com/ogrisel/pignlproc. These sentences can be included in the DBpedia EntityHub index as disambiguation fields. Entities' abstracts can also be used for disambiguation. All these fields should be configurable (boost) for disambiguation purposes.
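Configurable field boosts could look like the following sketch, which sums a per-field token overlap (Jaccard similarity) weighted by a boost. The field names `abstract` and `mention_sentences` are hypothetical and are not taken from the EntityHub schema; the similarity measure is likewise only an assumption.

```python
import re


def tokens(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boosted_score(mention_context, entity_fields, boosts):
    """Sum of per-field Jaccard overlaps, each weighted by its boost.

    `entity_fields` maps field name -> text of that disambiguation field.
    `boosts` maps field name -> weight; fields without a boost are ignored.
    """
    context = tokens(mention_context)
    score = 0.0
    for field, boost in boosts.items():
        field_tokens = tokens(entity_fields.get(field, ""))
        union = context | field_tokens
        if union:
            score += boost * len(context & field_tokens) / len(union)
    return score


# Hypothetical configuration: sentences mentioning the entity count double.
EXAMPLE_BOOSTS = {"abstract": 1.0, "mention_sentences": 2.0}
```

Setting a boost to 0, or leaving the field out of the configuration, removes that field from the disambiguation context.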

General Architecture and Workflow
=================================

A typical Disambiguation system architecture would include three steps:

- Candidate Generation: from a surface form (name mention) in the Content Item, generate a set of possible entities within EntityHub to link with. A typical source of entity names is the entities' labels, but other fields can be used. In this step, it is necessary to decide how to search those name sources: Exact Matching, Overlapping, Fuzzy Search, Full-Text Search, Case-sensitive, Coreference Resolution, etc.

- Candidate Ranking: rank all candidates by their probability of being the referent entity. Basically, this step involves executing the specific disambiguation engine as an enhancement post-processing phase.

- Detect and Cluster Missing Entities: those mentions that actually shouldn't be linked to any Entity should be extracted and grouped in clusters (one cluster for each unknown entity). These entities can be suggested to the user in order to include them in the knowledge base (Automatic Knowledge Base Population).
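The three steps above can be sketched end to end as follows. This is an illustrative toy, not the Stanbol workflow: candidate generation is reduced to case-insensitive exact label matching, ranking uses a caller-supplied scoring function, and missing entities are detected with a fixed confidence threshold and clustered by normalized surface form. All names and the threshold value are assumptions.

```python
from collections import defaultdict


def generate_candidates(mention, labels):
    """Step 1: case-insensitive exact match of the mention against labels.

    `labels` maps entity id -> list of labels (hypothetical input).
    """
    key = mention.casefold()
    return [entity for entity, names in labels.items()
            if any(name.casefold() == key for name in names)]


def rank_candidates(candidates, score):
    """Step 2: sort candidates by a disambiguation score, best first."""
    return sorted(((e, score(e)) for e in candidates),
                  key=lambda pair: pair[1], reverse=True)


def run_pipeline(mentions, labels, score, threshold=0.5):
    """Run all three steps; returns (links, clusters of missing entities)."""
    links, missing = {}, defaultdict(list)
    for index, mention in enumerate(mentions):
        ranked = rank_candidates(generate_candidates(mention, labels), score)
        if ranked and ranked[0][1] >= threshold:
            links[index] = ranked[0][0]
        else:
            # Step 3: unlinked mentions are grouped by normalized surface form.
            missing[mention.casefold()].append(index)
    return links, dict(missing)
```

Each cluster in the second return value is a candidate for a new knowledge base entry that could be suggested to the user.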

Attachments

stanbol-enhancement-workflow.001.png
16/Apr/13 07:54
108 kB
Rupert Westenthaler

Issue Links

- incorporates: STANBOL-1183 Stanbol Disambiguation API (Open)
- is related to: STANBOL-1156 Freebase Entity Disambiguation (Closed)
- relates to: STANBOL-1053 Add disambiguation context fields to the default Solr schema of the Entityhub SolrYard (Resolved)
- relates to: STANBOL-723 Enhancement Engine for Disambiguation based on Solr MLT (Reopened)
- supercedes: STANBOL-223 Entity Disambiguation (Resolved)
