  1. Lucene - Core
  2. LUCENE-9030

Solr- and WordnetSynonymParser behaviour differs

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 8.2
    • Fix Version/s: 9.0, 8.4
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      Equivalent synonyms show up with different token types and in a different order depending on whether the Solr format or the Wordnet format is used. A synonym set like

      "woods, wood, forest" in Solr format leads to the following token stream (term/type) when analyzing the term "forest":

      "forest"/word, "woods"/SYNONYM, "wood"/SYNONYM

      The following set in Wordnet format should give the same output, since all terms are in the same synset. However, all tokens are of type SYNONYM here, and the original input token "forest" isn't the first one:

      synonyms.txt:

      s(100000001,1,'woods',n,1,0)
      s(100000001,2,'wood',n,1,0)
      s(100000001,3,'forest',n,1,0)

      Token stream (term/type) when analyzing the term "forest":

      "woods"/SYNONYM, "wood"/SYNONYM, "forest"/SYNONYM

      I don't think this is intentional, and it is confusing (especially because the type of the "original" input token gets lost). I saw that the way the synsets are added to the SynonymMap differs between the respective parsers, and I have a PR that changes this.
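
The two token streams above can be reproduced with a small toy model. This is only a sketch and does not use Lucene: the class `SynonymExpansionSketch`, the method `expand`, and the flags `keepOrig` and `skipSelf` are hypothetical names introduced for illustration. It assumes the difference between the parsers comes down to two choices made when a synset is added to the map: whether the original token is kept, and whether a term is also mapped to itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT the Lucene implementation, just an illustration of
// how two ways of registering a synset could yield the reported streams.
public class SynonymExpansionSketch {

    /**
     * Expands one input term against a synset and returns "term/type" strings.
     *
     * @param keepOrig if true, emit the input first with type "word"
     * @param skipSelf if true, do not emit the input again as its own synonym
     */
    static List<String> expand(List<String> synset, String input,
                               boolean keepOrig, boolean skipSelf) {
        List<String> tokens = new ArrayList<>();
        if (!synset.contains(input)) {
            // No synonyms registered: the token passes through unchanged.
            tokens.add(input + "/word");
            return tokens;
        }
        if (keepOrig) {
            tokens.add(input + "/word");
        }
        for (String term : synset) {
            if (skipSelf && term.equals(input)) {
                continue;
            }
            tokens.add(term + "/SYNONYM");
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> synset = Arrays.asList("woods", "wood", "forest");

        // Models the stream reported for the Solr format.
        System.out.println(expand(synset, "forest", true, true));
        // prints [forest/word, woods/SYNONYM, wood/SYNONYM]

        // Models the stream reported for the Wordnet format.
        System.out.println(expand(synset, "forest", false, false));
        // prints [woods/SYNONYM, wood/SYNONYM, forest/SYNONYM]
    }
}
```

Under these assumptions, making the Wordnet case use the same two choices as the Solr case yields identical streams for both formats, which is the behaviour the issue asks for.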


            People

              Assignee: romseygeek Alan Woodward
              Reporter: cbuescher Christoph Büscher

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 40m