  1. Lucene - Core
  2. LUCENE-9030

Solr- and WordnetSynonymParser behaviour differs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 8.2
    • Fix Version/s: master (9.0), 8.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Equivalent synonyms show up with different token types and ordering depending on whether the Solr format or the Wordnet format is used. A synonym set like "woods, wood, forest" in Solr format leads to the following token stream (term/type) when analyzing the term "forest":

      "forest"/word, "woods"/SYNONYM, "wood"/SYNONYM

       

      The following set in Wordnet format should give the same output (all terms are in the same synset). However, all tokens are of type SYNONYM here, and the original input token "forest" is not the first one:

      synonyms.txt:

      s(100000001,1,'woods',n,1,0)
      s(100000001,2,'wood',n,1,0)
      s(100000001,3,'forest',n,1,0)

      Token stream (term/type) when analyzing the term "forest":

      "woods"/SYNONYM, "wood"/SYNONYM, "forest"/SYNONYM

      I don't think this is intentional, and it is confusing (especially because the "original" input token type gets lost). I saw that the way the synsets are added to the SynonymMap differs between the respective parsers, and I have a PR that changes this.
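
To make the difference concrete, here is a toy model of the two behaviours described above. It is not Lucene's parser or SynonymMap code: the class name `SynsetExpansionSketch` and both method names are hypothetical, and the sketch only reproduces the two token streams reported in this issue, assuming a single synset and a single-term input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT Lucene's SynonymMap or parser code.
// Class and method names are hypothetical.
class SynsetExpansionSketch {

    // Behaviour reported for the Solr format: the input term stays first
    // with type "word"; the other synset members follow as SYNONYM.
    static List<String> keepOriginal(List<String> synset, String input) {
        List<String> out = new ArrayList<>();
        out.add(input + "/word");
        if (synset.contains(input)) {
            for (String term : synset) {
                if (!term.equals(input)) {
                    out.add(term + "/SYNONYM");
                }
            }
        }
        return out;
    }

    // Behaviour reported for the Wordnet format: every synset member,
    // including the input itself, is emitted as SYNONYM in synset order.
    static List<String> replaceOriginal(List<String> synset, String input) {
        List<String> out = new ArrayList<>();
        if (!synset.contains(input)) {
            out.add(input + "/word"); // no synonyms: token passes through
            return out;
        }
        for (String term : synset) {
            out.add(term + "/SYNONYM");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> synset = Arrays.asList("woods", "wood", "forest");
        // prints [forest/word, woods/SYNONYM, wood/SYNONYM]
        System.out.println(keepOriginal(synset, "forest"));
        // prints [woods/SYNONYM, wood/SYNONYM, forest/SYNONYM]
        System.out.println(replaceOriginal(synset, "forest"));
    }
}
```

In this model the only difference between the two methods is whether the input term is kept as a separate "word" token and skipped in the expansion, which is the distinction the report points at.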

              People

              • Assignee: Alan Woodward (romseygeek)
              • Reporter: Christoph Büscher (cbuescher)
              • Votes: 0
              • Watchers: 4

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 40m