Details
-
Improvement
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
None
-
None
Description
The KeywordLinkingEngine can make use of POS tags to decide if a Token (word) needs to be processed or can be skipped. If no POS tags are available or the POS tag probability is too low (currently the default is 0.8), then the minimum token length (default is 3) is used as fall-back.
Analysis of POS tagging results has shown that tokens with non-noun tags often had probabilities below the 0.8 limit. For those the fall-back was used, and in most cases this resulted in the KeywordLinkingEngine processing those tokens.
However, it can also be observed that while some of those POS tags were not correct, the incorrect tags were usually confusions between two tags that were both non-noun tags. Because of that, decreasing the minimum probability for accepting a non-noun POS tag can improve both results and processing time.
Therefore the algorithm will be adjusted as follows:
Introduce two tag probabilities:
1. "minPosTypeProb" for accepting POS tags that represent nouns, and
2. "minPosTypeProb/2" for rejecting POS tags that are not nouns
Assuming <code>minPosTypeProb=0.667</code> (so the rejection threshold is 0.3335), a
- noun with probability 0.8 would result in returning <code>true</code>
- noun with probability 0.5 would return <code>null</code>
- verb with probability 0.4 would return <code>false</code>
- verb with probability 0.3 would return <code>null</code>
NOTE: <code>null</code> indicates that no POS tag is available or that the POS tag has a low probability.
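The decision rule above could be sketched roughly as follows. This is only an illustration, not the actual Stanbol code: the class name <code>PosTagDecision</code>, both method names, and their parameters are hypothetical, and the use of inclusive comparisons (<code>>=</code>) for both the probability thresholds and the minimum token length is an assumption that the description does not settle.

```java
/**
 * Hypothetical sketch of the proposed two-threshold POS check.
 * Not the real OpenNlpAnalysedContentFactory / EntityLinker code.
 */
final class PosTagDecision {

    private PosTagDecision() {}

    /**
     * @param isNoun         TRUE if the POS tag is a noun, FALSE if it is not,
     *                       null if no POS tag is available
     * @param prob           probability reported for the POS tag
     * @param minPosTypeProb minimum probability for accepting a noun tag;
     *                       half of it is used for rejecting a non-noun tag
     * @return TRUE (process), FALSE (skip), or null (tag missing or too uncertain)
     */
    static Boolean classify(Boolean isNoun, double prob, double minPosTypeProb) {
        if (isNoun == null) {
            return null; // no POS tag available
        }
        if (isNoun) {
            // accept nouns only at or above the full threshold (assumed inclusive)
            return prob >= minPosTypeProb ? Boolean.TRUE : null;
        }
        // reject non-nouns already at half the threshold (assumed inclusive)
        return prob >= minPosTypeProb / 2 ? Boolean.FALSE : null;
    }

    /**
     * Applies the POS decision and falls back to the minimum token length
     * when the POS tag gives no usable answer (classify returned null).
     */
    static boolean isProcessable(String token, Boolean isNoun, double prob,
                                 double minPosTypeProb, int minTokenLength) {
        Boolean decision = classify(isNoun, prob, minPosTypeProb);
        if (decision != null) {
            return decision;
        }
        return token.length() >= minTokenLength; // fall-back, assumed inclusive
    }
}
```

With <code>minPosTypeProb=0.667</code> this reproduces the four cases listed above. A <code>null</code> result still goes through the token-length fall-back, so the existing behaviour is kept for uncertain tags.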
These changes will need to be applied to the "OpenNlpAnalysedContentFactory#processPOS(..)" and "EntityLinker#isProcessableToken(..)" methods.