  1. Lucene - Core
  2. LUCENE-8137

GraphTokenStreamFiniteStrings does not handle position inc > 1 in multi-word synonyms

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.2.1, 8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The automaton built for graph queries that contain multiple multi-word synonyms does not handle gaps that appear in the middle of a multi-word synonym. In that case, the token next to the gap is treated as part of the multi-word synonym.

      Stop words that appear before or after a multi-word synonym are handled correctly in the current version. However, the synonym rule "part of speech, pos", for instance, does not produce the expected query if "of" is removed by a filter placed after the synonym_graph filter. One solution would be to reuse TokenStreamToAutomaton, with minor changes so that it can create token transitions rather than char transitions; it preserves gaps (as transitions) in the automaton it produces.
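
      The gap-preserving idea can be illustrated with a small standalone sketch. It does not use Lucene; the Token and Arc classes, the "<gap>" label, and the example token stream (including its position increments and lengths) are assumptions made for illustration, not the actual Lucene API or the proposed patch. Each token becomes an arc between position states, and a position increment greater than 1 produces an explicit gap arc instead of being dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSketch {
    // Hypothetical label for a gap transition (not a Lucene constant).
    static final String GAP = "<gap>";

    // Minimal stand-in for a token with position increment and position length.
    static final class Token {
        final String term; final int posInc; final int posLen;
        Token(String term, int posInc, int posLen) {
            this.term = term; this.posInc = posInc; this.posLen = posLen;
        }
    }

    // A labeled transition between two position states.
    static final class Arc {
        final int from; final int to; final String label;
        Arc(int from, int to, String label) {
            this.from = from; this.to = to; this.label = label;
        }
    }

    // Assumed stream for "part of speech" with rule "part of speech, pos"
    // after a later filter removed "of": "speech" then carries posInc = 2.
    static List<Token> example() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("pos", 1, 3));
        tokens.add(new Token("part", 0, 1));
        tokens.add(new Token("speech", 2, 1));
        return tokens;
    }

    // Turn tokens into arcs, adding one gap arc per skipped position.
    static List<Arc> toArcs(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            int prev = pos;
            pos += t.posInc;
            for (int hole = prev + 1; hole < pos; hole++) {
                arcs.add(new Arc(hole, hole + 1, GAP));
            }
            arcs.add(new Arc(pos, pos + t.posLen, t.term));
        }
        return arcs;
    }

    // Enumerate every label sequence from state 0 to the last state.
    static List<String> paths(List<Arc> arcs) {
        int end = 0;
        for (Arc a : arcs) end = Math.max(end, a.to);
        List<String> out = new ArrayList<>();
        walk(arcs, 0, end, "", out);
        return out;
    }

    private static void walk(List<Arc> arcs, int state, int end,
                             String prefix, List<String> out) {
        if (state == end) { out.add(prefix.trim()); return; }
        for (Arc a : arcs) {
            if (a.from == state) walk(arcs, a.to, end, prefix + " " + a.label, out);
        }
    }

    public static void main(String[] args) {
        // prints [pos, part <gap> speech]
        System.out.println(paths(toArcs(example())));
    }
}
```

      In this sketch the gap stays inside the "part ... speech" side path, so that path and the single-token alternative "pos" both end at the same state. Without the gap arc, the "part" arc would end at a state with no outgoing transition and "speech" would start from an unreachable one.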

        Attachments

        1. SGF_SF_interaction.patch
          4 kB
          Jan Høydahl

          Issue Links

            Activity

              People

              • Assignee:
                jim.ferenczi Jim Ferenczi
                Reporter:
                jim.ferenczi Jim Ferenczi
              • Votes:
                2
                Watchers:
                9

                Dates

                • Created:
                  Updated: