Details
-
Bug
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
8.2
-
None
-
New
Description
Equivalent synonyms are showing up with different token types and ordering depending on whether the Solr format or the Wordnet format is used. A synonym set like "woods, wood, forest" in Solr format leads to the following token stream (term/type) when analyzing the term "forest":
"forest"/word, "woods"/SYNONYM, "wood"/SYNONYM
The following set in Wordnet format should give the same output, since all terms are in the same synset. However, all tokens are of type SYNONYM here, and the original input token "forest" is no longer the first one:
synonyms.txt:
s(100000001,1,'woods',n,1,0)
s(100000001,2,'wood',n,1,0)
s(100000001,3,'forest',n,1,0)
Token stream (term/type) when analyzing the term "forest":
"woods"/SYNONYM, "wood"/SYNONYM, "forest"/SYNONYM
I don't think this is intentional, and it is confusing (especially because the "original" input token type gets lost). I saw that the way the synsets are added to the SynonymMap differs between the respective parsers, and I have a PR that changes this.
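To illustrate the suspected cause, here is a minimal Python model of the two behaviours. It is not Lucene code. The helper names `build_map` and `analyze` are hypothetical, and the two flags are assumptions about how the parsers differ: the Solr-style path is assumed to skip the self-mapping and keep the original token, while the Wordnet-style path is assumed to map every term to the whole synset (itself included) and drop the original.

```python
from collections import defaultdict


def build_map(synset, skip_self, keep_orig):
    """Pair every synset member with the others; return term -> (outputs, keep_orig)."""
    table = defaultdict(list)
    for source in synset:
        for target in synset:
            if skip_self and source == target:
                continue  # do not map a term onto itself
            table[source].append(target)
    return {term: (outputs, keep_orig) for term, outputs in table.items()}


def analyze(term, synonym_map):
    """Return the (term, type) stream produced for a single input token."""
    outputs, keep_orig = synonym_map.get(term, ([], True))
    # The original token keeps its "word" type only if keep_orig is set.
    stream = [(term, "word")] if keep_orig else []
    stream += [(out, "SYNONYM") for out in outputs]
    return stream


synset = ["woods", "wood", "forest"]

# Assumed Solr-style behaviour: no self-mapping, original token kept.
solr_like = build_map(synset, skip_self=True, keep_orig=True)
# Assumed Wordnet-style behaviour: self-mapping included, original dropped.
wordnet_like = build_map(synset, skip_self=False, keep_orig=False)

print(analyze("forest", solr_like))
print(analyze("forest", wordnet_like))
```

Under these assumptions the first call yields the stream seen with the Solr format and the second the stream seen with the Wordnet format, which suggests that making both parsers add synsets the same way would align the output.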