[STANBOL-1262] Change/Improve processing of Chunks by EntityLinking - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.12.0
Fix Version/s: 0.12.0
Component/s: None
Labels: None

Description

The first step of EntityLinking (this applies to all EntityLinkingEngines, including the Lucene FST Linking Engine) is to classify Tokens as "linkable", "matchable" or "other". In addition, it determines the "processible" chunks that contain those Tokens.

This issue is about changing how "processible" chunks are determined when the AnalyzedText contains multiple overlapping chunks.

A typical case where this happens is a Chain that contains both Noun Phrase detection and Named Entity Recognition (NER). The chunks selected for Named Entities are typically smaller than the corresponding Noun Phrases. There are even situations where the Named Entity does not include all nouns contained in the Noun Phrase.

Here is an example taken from [1]:

After a disappointing start against an Everton side who led through Kevin Mirallas's first-half goal ...

While "Everton" is detected as an Organization by NER, the Noun Phrase "an Everton side" also includes 'side' as a second noun. Therefore 'Everton' is not considered for linking, as it matches only 1 of the 2 matchable tokens within the 'processible phrase'.

This is because EntityLinking currently merges overlapping processible phrases together, a semantic that is no longer optimal for EntityLinking.
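The effect of merging on the Everton example can be sketched in a few lines of Java. This is only an illustration, not the Stanbol implementation: the class `ChunkMatchRatio`, the `Token` record, the `matchRatio` helper and the `MIN_RATIO` value are all hypothetical names and numbers chosen here. The only thing taken from the description is that matching 1 of 2 matchable tokens in a phrase is not enough for linking.

```java
import java.util.List;
import java.util.Set;

final class ChunkMatchRatio {

    /** Token categories as named in this issue. */
    enum Category { LINKABLE, MATCHABLE, OTHER }

    /** Hypothetical token model: surface text plus its category. */
    record Token(String text, Category category) {
        // Assumption: linkable tokens also count as matchable.
        boolean countsAsMatchable() {
            return category != Category.OTHER;
        }
    }

    /** Hypothetical threshold; any value above 0.5 reproduces the described behaviour. */
    static final double MIN_RATIO = 0.51;

    /** Fraction of the chunk's matchable tokens that occur in the entity label. */
    static double matchRatio(List<Token> chunk, Set<String> labelTokens) {
        int matchable = 0;
        int matched = 0;
        for (Token t : chunk) {
            if (!t.countsAsMatchable()) {
                continue; // "other" tokens are ignored
            }
            matchable++;
            if (labelTokens.contains(t.text())) {
                matched++;
            }
        }
        return matchable == 0 ? 0.0 : (double) matched / matchable;
    }

    public static void main(String[] args) {
        Set<String> label = Set.of("Everton");

        // Merged chunk: the Noun Phrase swallows the NER chunk.
        List<Token> merged = List.of(
                new Token("an", Category.OTHER),
                new Token("Everton", Category.LINKABLE),
                new Token("side", Category.MATCHABLE));
        // NER chunk on its own.
        List<Token> nerOnly = List.of(new Token("Everton", Category.LINKABLE));

        System.out.println(matchRatio(merged, label));  // prints 0.5
        System.out.println(matchRatio(nerOnly, label)); // prints 1.0
        System.out.println(matchRatio(merged, label) >= MIN_RATIO);  // prints false
        System.out.println(matchRatio(nerOnly, label) >= MIN_RATIO); // prints true
    }
}
```

With the merged phrase the label Everton covers only half of the matchable tokens and falls below the assumed threshold, while the NER chunk alone is fully covered.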

To avoid recall problems like the one described, the last Chunk emitted by the AnalyzedText should be used instead. For the above example this would result in:

an [other]: an Everton side
Everton [linkable]: Everton
side [matchable]: an Everton side

So 'Everton' would be correctly linked to an Entity with the label Everton, but 'side' would not be linked to an Entity with the label Side, as it is in a phrase with another linkable/matchable token.
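The difference between the two chunk semantics can be sketched as follows. This is a minimal sketch under stated assumptions, not the Stanbol code: `ChunkSelection`, `Span`, `mergeOverlapping` and `lastCovering` are hypothetical names, spans are plain character offsets, and chunks are assumed to be emitted ordered by start offset with enclosing chunks first. "Last Chunk emitted" is read here as "the last chunk in that order that covers the token".

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSelection {

    /** Hypothetical character span [start, end) over the analysed text. */
    record Span(int start, int end) {
        boolean covers(Span other) {
            return start <= other.start && other.end <= end;
        }
    }

    /** Merging semantic: overlapping chunks (sorted by start) collapse into one span. */
    static List<Span> mergeOverlapping(List<Span> chunks) {
        List<Span> merged = new ArrayList<>();
        for (Span c : chunks) {
            int last = merged.size() - 1;
            if (last >= 0 && c.start() < merged.get(last).end()) {
                Span m = merged.get(last);
                merged.set(last, new Span(m.start(), Math.max(m.end(), c.end())));
            } else {
                merged.add(c);
            }
        }
        return merged;
    }

    /** Proposed semantic: the last emitted chunk that covers the token wins (null if none). */
    static Span lastCovering(List<Span> chunksInEmissionOrder, Span token) {
        Span selected = null;
        for (Span c : chunksInEmissionOrder) {
            if (c.covers(token)) {
                selected = c;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String text = "an Everton side";
        // Noun Phrase "an Everton side" followed by the NER chunk "Everton".
        List<Span> chunks = List.of(new Span(0, 15), new Span(3, 10));
        List<Span> tokens = List.of(new Span(0, 2), new Span(3, 10), new Span(11, 15));

        System.out.println(mergeOverlapping(chunks).size()); // prints 1

        for (Span t : tokens) {
            Span c = lastCovering(chunks, t);
            System.out.println(text.substring(t.start(), t.end())
                    + " -> " + text.substring(c.start(), c.end()));
        }
        // prints:
        // an -> an Everton side
        // Everton -> Everton
        // side -> an Everton side
    }
}
```

Under merging, all three tokens end up in the single phrase 'an Everton side'; under the proposed selection only 'Everton' is moved to the smaller NER chunk.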

Another example would be ' ... the University of Munich is ... ', where one could expect Noun Phrases for 'the University' and 'Munich' (if single-token noun phrases are emitted by the chunker component). In addition, as a result of the NER engine one can expect a chunk for 'University of Munich'. This would result in:

the [other]: the University
University [matchable]: University of Munich
of [other]: University of Munich
Munich [linkable]: Munich

The resulting linking rules are that 'University' is only linked to Entities whose label also matches Munich, while Munich is also linked to Entities whose label just includes Munich. This is a small difference from the current implementation, where Munich alone would not get linked because all the chunks would be merged into one big chunk covering 'the University of Munich'.

[1] http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report

Issue Links

is depended upon by: STANBOL-1266 EntityLinking engines should consider Chunks with NER annotations (Resolved)

People

Assignee: Rupert Westenthaler
Reporter: Rupert Westenthaler
Votes: 0
Watchers: 1

Dates

Created: 21/Jan/14 09:23
Updated: 22/Jan/14 09:05
Resolved: 22/Jan/14 09:05