  1. Stanbol (Retired)
  2. STANBOL-1262

Change/Improve processing of Chunks by EntityLinking



    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.12.0
    • 0.12.0
    • None
    • None


      The first step of EntityLinking (this applies to all EntityLinkingEngines, including the Lucene FST Linking Engine) is to classify Tokens as "linkable", "matchable" or "others". In addition, it determines the "processible" chunks that Tokens are contained in.

      This issue is about changing how "processible" chunks are determined if the AnalyzedText contains multiple overlapping chunks.

      A typical case where this can happen is when both Noun Phrase detection and Named Entity Recognition are contained in the Chain. The chunks selected for Named Entities will typically be smaller than the corresponding Noun Phrase. There are even situations where the Named Entity does not include all Nouns contained in a Noun Phrase.

      Here is an example taken from [1]:

      After a disappointing start against an Everton side who led through Kevin Mirallas's first-half goal ...

      While "Everton" is detected as an Organization by NER, the Noun Phrase "an Everton side" also includes 'side' as a second noun. Therefore 'Everton' is not considered for linking, as it matches only 1 of the 2 matchable tokens within a 'processible phrase'.

      This is because EntityLinking currently merges overlapping processible phrases together, a semantic that is no longer optimal for EntityLinking.
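
      To illustrate the merging behaviour described above, here is a minimal sketch in Python. It is not the Stanbol implementation; the function name merge_overlapping and the use of plain (start, end) character offsets are assumptions made for illustration only.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_overlapping(chunks: List[Span]) -> List[Span]:
    """Merge chunks that overlap into single, larger chunks."""
    merged: List[Span] = []
    for start, end in sorted(chunks):
        if merged and start < merged[-1][1]:
            # Overlaps the previous merged chunk: extend that chunk.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


text = "an Everton side"
noun_phrase = (0, 15)   # "an Everton side", as a chunker might emit it
named_entity = (3, 10)  # "Everton", as NER might emit it

merged = merge_overlapping([noun_phrase, named_entity])
print([text[s:e] for s, e in merged])  # ['an Everton side']
```

      After merging, only the larger phrase remains, so 'Everton' is evaluated together with 'side'.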

      To avoid recall problems like the one described, the last Chunk emitted by the AnalyzedText should be used instead. For the above example this would result in:

      • an [other]: an Everton side
      • Everton [linkable]: Everton
      • side [matchable]: an Everton side

      So 'Everton' would get correctly linked to an Entity with the label Everton, but 'side' would not get linked to an Entity with the label Side, as it is in a Phrase with another linkable/matchable token.
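
      The proposed "use the last Chunk" rule can be sketched as follows. This is only an illustration in Python, not the engine's code: the names emission_order and assign_chunks are hypothetical, and the sketch assumes that chunks are emitted sorted by start offset, with longer chunks first when starts are equal.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def emission_order(chunks: List[Span]) -> List[Span]:
    # Assumption: chunks are emitted by start offset, longer chunks first on ties.
    return sorted(chunks, key=lambda c: (c[0], -c[1]))


def assign_chunks(tokens: List[Span], chunks: List[Span]) -> List[Optional[Span]]:
    """For each token, pick the last-emitted chunk that covers it."""
    ordered = emission_order(chunks)
    result: List[Optional[Span]] = []
    for t_start, t_end in tokens:
        chosen: Optional[Span] = None
        for c_start, c_end in ordered:
            if c_start <= t_start and t_end <= c_end:
                chosen = (c_start, c_end)  # later chunks override earlier ones
        result.append(chosen)
    return result


text = "an Everton side"
tokens = [(0, 2), (3, 10), (11, 15)]  # "an", "Everton", "side"
chunks = [(0, 15), (3, 10)]           # noun phrase, named entity

for (ts, te), chunk in zip(tokens, assign_chunks(tokens, chunks)):
    label = text[chunk[0]:chunk[1]] if chunk else None
    print(f"{text[ts:te]}: {label}")
```

      Under these assumptions, 'Everton' is assigned to its own chunk, while 'an' and 'side' stay with the enclosing noun phrase, matching the list above.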

      Another example would be ' ... the University of Munich is ... ', where one could expect Noun Phrases for 'the University' and 'Munich' (if single-token noun phrases are emitted by the chunker component). In addition, as a result of the NER engine, one can expect a chunk for 'University of Munich'. With the proposed change, the tokens would be assigned to chunks as follows:

      • the [other]: the University
      • University [matchable]: University of Munich
      • of [other]: University of Munich
      • Munich [linkable]: Munich

      This would result in linking rules where 'University' is only linked to Entities whose Label also matches 'Munich', while 'Munich' would also be linked to Entities whose Label just includes 'Munich'. This is a small difference from the current implementation, where 'Munich' alone would not get linked because all the chunks would get merged into one big chunk covering 'the University of Munich'.

      [1] http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report





              rwesten Rupert Westenthaler
              rwesten Rupert Westenthaler