Lucene - Core / LUCENE-4065

FilteringTokenFilter should never corrupt the tokenstream graph

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently, token removers such as StopFilter have a boolean option to enable position increments.

      If it's true: it both inserts gaps where necessary AND propagates gaps down the stream.
      If it's false: it does neither, which can totally mess up the tokenstream graph (e.g. move synonyms to another word).
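
      As a rough illustration of the two settings, here is a minimal sketch in plain Python. It is a simplified model, not Lucene code: tokens are (term, posinc) pairs, and `remove_tokens` and its `enable_position_increments` parameter are hypothetical names chosen for this example. It also leaves out the LUCENE-3848 handling of a leading posinc=0.

      ```python
      from typing import Iterable, List, Set, Tuple

      Token = Tuple[str, int]  # (term, position increment)


      def remove_tokens(tokens: Iterable[Token], stopwords: Set[str],
                        enable_position_increments: bool) -> List[Token]:
          """Drop stopwords; optionally fold their increments into the next kept token."""
          out: List[Token] = []
          skipped = 0  # increments of removed tokens not yet handed on
          for term, posinc in tokens:
              if term in stopwords:
                  if enable_position_increments:
                      skipped += posinc
                  continue
              if enable_position_increments:
                  posinc += skipped
                  skipped = 0
              out.append((term, posinc))
          return out


      stream = [("fox", 1), ("of", 1), ("the", 1), ("woods", 1)]
      stop = {"of", "the"}

      with_gaps = remove_tokens(stream, stop, True)
      without_gaps = remove_tokens(stream, stop, False)

      print(with_gaps)     # [('fox', 1), ('woods', 3)]
      print(without_gaps)  # [('fox', 1), ('woods', 1)]
      ```

      With the option off, the increments of kept tokens are passed through unchanged, so a kept token with posinc=0 ends up stacked on whatever token now precedes it.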

      There are totally valid, natural use cases for false, where you don't want gaps because you want phrase queries to act as if the word was never actually there.

      But 'not inserting gaps' is separate from properly propagating existing gaps.

      So I think we should provide an option (either fix 'false' or make it an enum) where you still get a legit tokenstream and don't totally screw it up, but you simply omit gaps.

      See LUCENE-3848 for more information (where we at least fixed this case so that the tokenstream does not begin with posinc=0).

          Activity

          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1430944

          LUCENE-4065: shitlist these broken ctors so they dont cause false fails

          Commit Tag Bot added a comment -

          [trunk commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1430939

          LUCENE-4065: shitlist these broken ctors so they dont cause false fails

          Robert Muir made changes -
          Link This issue is related to LUCENE-4641 [ LUCENE-4641 ]
          Robert Muir added a comment -

          Another way to see it:
          imagine I have 'my test case'
          and I have a synonym set with a single mapping: test=example

          So SynonymFilter makes: 'my test/example case'. 'example' has posinc=0.

          If we have a StopFilter with posinc=false that has a single stopword (test),
          then we end up with 'my/example case'.

          But in my opinion this should be 'my example case': i.e. we should propagate
          the posinc=1 of 'test' to 'example'. We aren't introducing a gap though, just preventing
          insane graph corruption and restacking of synonyms.
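
          The example above can be played through with a small Python simulation. This is only a sketch of one possible reading of the proposal, not Lucene code and not a settled design: tokens are (term, posinc) pairs, and `drop_without_gaps`, `drop_propagating` and `render` are hypothetical helpers made up for this illustration.

          ```python
          from typing import Iterable, List, Set, Tuple

          Token = Tuple[str, int]  # (term, position increment)


          def drop_without_gaps(tokens: Iterable[Token], stopwords: Set[str]) -> List[Token]:
              """Model of posinc=false today: removed tokens vanish, increments are untouched."""
              return [(term, posinc) for term, posinc in tokens if term not in stopwords]


          def drop_propagating(tokens: Iterable[Token], stopwords: Set[str]) -> List[Token]:
              """Assumed fix: a token stacked on a removed token inherits its increment.

              No gap is inserted: if the next kept token starts its own position,
              the removed token's increment is simply discarded.
              """
              out: List[Token] = []
              pending = 0  # increment of a removed token whose position has no survivor yet
              for term, posinc in tokens:
                  if term in stopwords:
                      if posinc > 0:
                          pending = posinc  # a new position began, but its first token is gone
                      continue
                  if posinc == 0 and pending > 0:
                      posinc = pending  # first survivor at that position takes it over
                  pending = 0
                  out.append((term, posinc))
              return out


          def render(tokens: Iterable[Token]) -> str:
              """Show stacked tokens as 'a/b'; the size of any gap is not shown."""
              positions: List[List[str]] = []
              for term, posinc in tokens:
                  if posinc == 0 and positions:
                      positions[-1].append(term)
                  else:
                      positions.append([term])
              return " ".join("/".join(p) for p in positions)


          stream = [("my", 1), ("test", 1), ("example", 0), ("case", 1)]
          legacy = drop_without_gaps(stream, {"test"})
          fixed = drop_propagating(stream, {"test"})

          print(render(stream))  # my test/example case
          print(render(legacy))  # my/example case
          print(render(fixed))   # my example case
          ```

          In this model, removing a token that does not share its position with anything still leaves no gap, so phrase queries behave as if the word was never there.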

          Robert Muir made changes -
          Field Original Value New Value
          Attachment LUCENE-4065_test.patch [ 12527833 ]
          Robert Muir added a comment -

          Test case (boiled down from TestRandomChains).

          A much simpler one could be made.

          Robert Muir created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              1
              Watchers:
              0
