  1. Lucene - Core
  2. LUCENE-8137

GraphTokenStreamFiniteStrings does not handle position inc > 1 in multi-word synonyms

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.2.1, 8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      The automaton built for graph queries that contain multiple multi-word synonyms does not handle gaps (position increments greater than 1) that appear in the middle of a multi-word synonym. In that case, the token that follows the gap is treated as part of the multi-word synonym.

      Stop words that appear before or after multi-word synonyms are handled correctly in the current version. However, the synonym rule "part of speech, pos", for instance, does not produce the expected query if "of" is removed by a filter placed after the synonym_graph filter. One solution would be to reuse TokenStreamToAutomaton (with minor changes so that it can create token transitions rather than character transitions), which preserves gaps as transitions in the produced automaton.
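
      The effect can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not Lucene's code: the Token and Arc classes, the "_" gap label, and the preserveGaps flag are all hypothetical names. The token values assume the query "part of speech tagging" with "of" removed after the synonym_graph filter, so that "speech" carries a position increment of 2.

```java
import java.util.ArrayList;
import java.util.List;

final class GapSketch {
    /** Hypothetical stand-in for a term plus its position increment and position length. */
    static final class Token {
        final String term; final int posInc; final int posLen;
        Token(String term, int posInc, int posLen) {
            this.term = term; this.posInc = posInc; this.posLen = posLen;
        }
    }

    /** One labelled transition between two position states. */
    static final class Arc {
        final int from; final int to; final String label;
        Arc(int from, int to, String label) { this.from = from; this.to = to; this.label = label; }
    }

    static final String GAP = "_"; // assumed label for a gap transition

    /** Assumed output for "part of speech tagging" once the stop filter has dropped "of". */
    static List<Token> example() {
        return List.of(
            new Token("pos", 1, 3),      // synonym spanning three positions
            new Token("part", 0, 1),     // same start position as "pos"
            new Token("speech", 2, 1),   // increment of 2: "of" was removed
            new Token("tagging", 1, 1));
    }

    /** Builds arcs; with preserveGaps=false every increment > 1 is collapsed to 1. */
    static List<Arc> buildArcs(List<Token> tokens, boolean preserveGaps) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            int inc = preserveGaps ? t.posInc : Math.min(t.posInc, 1);
            // Emit one gap arc per skipped position so the states stay connected.
            for (int skipped = pos + 1; skipped < pos + inc; skipped++) {
                arcs.add(new Arc(skipped, skipped + 1, GAP));
            }
            pos += inc;
            arcs.add(new Arc(pos, pos + t.posLen, t.term));
        }
        return arcs;
    }

    /** Enumerates every path from state 0 to the highest end state. */
    static List<String> paths(List<Arc> arcs) {
        int end = 0;
        for (Arc a : arcs) end = Math.max(end, a.to);
        List<String> out = new ArrayList<>();
        walk(arcs, 0, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Arc> arcs, int state, int end,
                             List<String> prefix, List<String> out) {
        if (state == end) { out.add(String.join(" ", prefix)); return; }
        for (Arc a : arcs) {
            if (a.from == state) {
                prefix.add(a.label);
                walk(arcs, a.to, end, prefix, out);
                prefix.remove(prefix.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(paths(buildArcs(example(), false))); // gaps ignored
        System.out.println(paths(buildArcs(example(), true)));  // gaps preserved
    }
}
```

      In this model, ignoring the gap yields the paths "pos" and "part speech tagging", so "tagging" is absorbed into the multi-word side of the synonym. Keeping the gap as its own transition yields "pos tagging" and "part _ speech tagging", where "tagging" follows both alternatives.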

      Attachments

        1. SGF_SF_interaction.patch
          4 kB
          Jan Høydahl


            People

              Assignee: Jim Ferenczi (jim.ferenczi)
              Reporter: Jim Ferenczi (jim.ferenczi)
              Votes: 1
              Watchers: 8
