Lucene - Core / LUCENE-6400

SynonymParser should encode 'expand' correctly.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Today SolrSynonymParser encodes something like A, B, C with 'expand=true' like this:
      A -> A, B, C (includeOrig=false)
      B -> B, A, C (includeOrig=false)
      C -> C, A, B (includeOrig=false)

      This gives buggy output (SynonymFilter sees every rule as a replacement and gives all the terms type SYNONYM, positionLength isn't supported, etc.), and it wastes space in the FST, since includeOrig is just one bit.
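
      To make the difference concrete, here is a minimal sketch in plain Java that models the two encodings of an expand=true group. It does not use Lucene's classes; ExpandEncoding, Rule, current and alternative are hypothetical names for illustration. The alternative encoding (map each term to the other terms only and set includeOrig=true) is an assumption drawn from the remark that includeOrig is just one bit, not a copy of the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only; not Lucene's SynonymMap or SolrSynonymParser. */
public class ExpandEncoding {

    /** One rule: the output terms plus the single includeOrig flag. */
    public static final class Rule {
        public final List<String> outputs;
        public final boolean includeOrig;

        public Rule(List<String> outputs, boolean includeOrig) {
            this.outputs = List.copyOf(outputs);
            this.includeOrig = includeOrig;
        }

        @Override
        public String toString() {
            return String.join(", ", outputs) + " (includeOrig=" + includeOrig + ")";
        }
    }

    /** Encoding described above: every term maps to the whole group, includeOrig=false. */
    public static Map<String, Rule> current(List<String> group) {
        Map<String, Rule> rules = new LinkedHashMap<>();
        for (String term : group) {
            List<String> outputs = new ArrayList<>();
            outputs.add(term); // the original term is re-emitted as an ordinary output
            for (String other : group) {
                if (!other.equals(term)) {
                    outputs.add(other);
                }
            }
            rules.put(term, new Rule(outputs, false));
        }
        return rules;
    }

    /** Assumed alternative: map each term to the other terms only, includeOrig=true. */
    public static Map<String, Rule> alternative(List<String> group) {
        Map<String, Rule> rules = new LinkedHashMap<>();
        for (String term : group) {
            List<String> outputs = new ArrayList<>();
            for (String other : group) {
                if (!other.equals(term)) {
                    outputs.add(other);
                }
            }
            rules.put(term, new Rule(outputs, true));
        }
        return rules;
    }

    /** Number of output terms stored across all rules, as a rough proxy for space. */
    public static int totalOutputs(Map<String, Rule> rules) {
        int total = 0;
        for (Rule rule : rules.values()) {
            total += rule.outputs.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> group = List.of("A", "B", "C");
        current(group).forEach((term, rule) -> System.out.println(term + " -> " + rule));
        alternative(group).forEach((term, rule) -> System.out.println(term + " -> " + rule));
        System.out.println(totalOutputs(current(group)) + " vs " + totalOutputs(alternative(group)));
    }
}
```

      For the group A, B, C the first three printed lines reproduce the rules listed above. In the alternative encoding each rule stores one output term fewer, and the original token can pass through unchanged instead of being deleted and re-created.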

      Example with the rule "spiderman, spider man" and analysis of 'spider man':

      Trunk:
      term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,type=SYNONYM
      term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,positionLength=1,type=SYNONYM
      term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,type=SYNONYM

      You can see this is confusing: all the words have type SYNONYM, because spider and man got deleted and totally replaced by new terms (which happen to have the same text).

      Patch:
      term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,type=word
      term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,positionLength=2,type=SYNONYM
      term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,type=word
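
      One way to read the two outputs is to turn positionIncrement and positionLength into the span of positions each token covers. The sketch below is plain Java with hypothetical names (TokenGraphSpans, Token, Span); it assumes the usual convention that the position starts at -1, each token advances it by its positionIncrement, and the token ends positionLength positions later.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper only; it does not use Lucene's TokenStream API. */
public class TokenGraphSpans {

    /** The subset of token attributes needed to compute positions. */
    public static final class Token {
        public final String term;
        public final int positionIncrement;
        public final int positionLength;

        public Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** A token's start position (from) and end position (to) in the token graph. */
    public static final class Span {
        public final String term;
        public final int from;
        public final int to;

        public Span(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return term + " [" + from + " -> " + to + "]";
        }
    }

    /** Accumulates position increments and adds each token's position length. */
    public static List<Span> spans(List<Token> tokens) {
        List<Span> result = new ArrayList<>();
        int position = -1; // assumed starting point before the first token
        for (Token token : tokens) {
            position += token.positionIncrement;
            result.add(new Span(token.term, position, position + token.positionLength));
        }
        return result;
    }

    public static void main(String[] args) {
        // Attribute values copied from the Trunk and Patch listings above.
        List<Token> trunk = List.of(
            new Token("spider", 1, 1), new Token("spiderman", 0, 1), new Token("man", 1, 1));
        List<Token> patch = List.of(
            new Token("spider", 1, 1), new Token("spiderman", 0, 2), new Token("man", 1, 1));
        System.out.println("trunk: " + spans(trunk));
        System.out.println("patch: " + spans(patch));
    }
}
```

      Under that convention, spiderman in the Trunk output ends at position 1, alongside spider, while in the Patch output it ends at position 2, the same place man ends, so it spans both original words.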

      Attachments

        1. LUCENE-6400.patch
          11 kB
          Michael McCandless
        2. LUCENE-6400.patch
          10 kB
          Michael McCandless
        3. LUCENE-6400.patch
          10 kB
          Robert Muir
        4. unittests-expand-and-parse.patch
          8 kB
          Ian Ribas
        5. PositionLenghtAndType-unittests.patch
          7 kB
          Ian Ribas
        6. LUCENE-6400.patch
          2 kB
          Robert Muir

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)