Lucene - Core
  1. Lucene - Core
  2. LUCENE-1903

Incorrect ShingleFilter behavior when outputUnigrams == false

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      ShingleFilter isn't working as expected when outputUnigrams == false. In particular, it is outputting unigrams at least some of the time when outputUnigrams==false.

      I'll attach a patch to ShingleFilterTest.java that adds some test cases that demonstrate the problem.

      I haven't checked this, but I hypothesize that the behavior for outputUnigrams == false got changed when the class was upgraded to the new TokenStream API?

      1. LUCENE-1903.patch
        8 kB
        Chris Harris
      2. LUCENE-1903.patch
        7 kB
        Uwe Schindler
      3. LUCENE-1903_testcases.patch
        5 kB
        Chris Harris
      4. LUCENE-1903_testcases_lucene2_4_1_version.patch
        5 kB
        Chris Harris
      5. TEST-org.apache.lucene.analysis.shingle.ShingleFilterTest.xml
        15 kB
        Chris Harris

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment - - edited

          Committed revision: 812779 & 812782
          I also added Chris' advanced tests. More tests, more better. The tests verify that my changes also work with empty or very short token streams.

          Thanks Chris Harris!

          Show
          Uwe Schindler added a comment - - edited Committed revision: 812779 & 812782 I also added Chris' advanced tests. More tests, more better. The tests verify that my changes also work with empty or very short token streams. Thanks Chris Harris!
          Hide
          Chris Harris added a comment -

          Chris: Could you test this patch

          Looks ok so far, but unfortunately I doubt I'll be able to test it in a real-world setting before you guys want to freeze the code and/or release. (I usually use not just ShingleFilter but ShingleFilter combined with the LUCENE-1370 patch. It was in looking into to upgrading LUCENE-1370 to work with the trunk that I discovered this bug.)

          Show
          Chris Harris added a comment - Chris: Could you test this patch Looks ok so far, but unfortunately I doubt I'll be able to test it in a real-world setting before you guys want to freeze the code and/or release. (I usually use not just ShingleFilter but ShingleFilter combined with the LUCENE-1370 patch. It was in looking into to upgrading LUCENE-1370 to work with the trunk that I discovered this bug.)
          Hide
          Chris Harris added a comment -

          This is the same as Uwe's patch, but with three additional tests:

          testBiGramFilterWithHolesWithoutUnigrams
          testBiGramFilterWithEmptyTokenStream
          testBiGramFilterWithEmptyTokenStreamWithoutUnigrams

          Not sure if it's worth adding them, but they do pass.

          Show
          Chris Harris added a comment - This is the same as Uwe's patch, but with three additional tests: testBiGramFilterWithHolesWithoutUnigrams testBiGramFilterWithEmptyTokenStream testBiGramFilterWithEmptyTokenStreamWithoutUnigrams Not sure if it's worth adding them, but they do pass.
          Hide
          Uwe Schindler added a comment -

          You are right about the external contributor. I think, we should also add a changes entry for LUCENE-1901.

          Show
          Uwe Schindler added a comment - You are right about the external contributor. I think, we should also add a changes entry for LUCENE-1901 .
          Hide
          Robert Muir added a comment - - edited

          here is what i think:
          this is what Michael Busch said in LUCENE-1775

          ShingleFilter and ShingleFilterTest are converted to the new API.

          ShingleFilter is much more efficient now, it clones much less often and computes the tokens mostly on the fly now.

          the fact it went to the new API appears to have made it to CHANGES, but not the fact it is more efficient.
          so maybe it could be mentioned in CHANGES not only that it went to the new API,
          but that it is more efficient and that Chris & Uwe added additional tests and fixed bugs/ensured correctness?

          by the way, you can take my name off existing CHANGE if you want, I did nothing

          Show
          Robert Muir added a comment - - edited here is what i think: this is what Michael Busch said in LUCENE-1775 ShingleFilter and ShingleFilterTest are converted to the new API. ShingleFilter is much more efficient now, it clones much less often and computes the tokens mostly on the fly now. the fact it went to the new API appears to have made it to CHANGES, but not the fact it is more efficient. so maybe it could be mentioned in CHANGES not only that it went to the new API, but that it is more efficient and that Chris & Uwe added additional tests and fixed bugs/ensured correctness? by the way, you can take my name off existing CHANGE if you want, I did nothing
          Hide
          Mark Miller added a comment -

          Okay - I'm not sure if I'll short for tonight or the morning yet (not so late here).

          A CHANGES.txt entry is not needed in my opinion, as this is not a new feature or a bug from 2.4.1.

          This could prob be debated from a lot of angles - in the end, it appears pretty much up to each committer what they do -

          My rule of thumb has been - when it comes to me, be modest, skip where it makes sense. But when it comes to someone else having reported
          the issue - I always add a credit somehow. Outside contributions (in whatever form) deserves credit for reporting things and suppling things (especially tests), and I think it encourages that behavior to a small
          degree. Proper credit to outside contributers is very important I think - thats why we put someones name right next to ours, even when the user
          may have just reported the issue and we did all the work - or even if two people did 98% of the work and some guy drops a patch that changes
          2%. Just my two cents on the subject though - take it for what its worth.

          Show
          Mark Miller added a comment - Okay - I'm not sure if I'll short for tonight or the morning yet (not so late here). A CHANGES.txt entry is not needed in my opinion, as this is not a new feature or a bug from 2.4.1. This could prob be debated from a lot of angles - in the end, it appears pretty much up to each committer what they do - My rule of thumb has been - when it comes to me, be modest, skip where it makes sense. But when it comes to someone else having reported the issue - I always add a credit somehow. Outside contributions (in whatever form) deserves credit for reporting things and suppling things (especially tests), and I think it encourages that behavior to a small degree. Proper credit to outside contributers is very important I think - thats why we put someones name right next to ours, even when the user may have just reported the issue and we did all the work - or even if two people did 98% of the work and some guy drops a patch that changes 2%. Just my two cents on the subject though - take it for what its worth.
          Hide
          Uwe Schindler added a comment -

          Chris: Could you test this patch, if it works as exspected also for you. Maybe I found a fix only valid for your testcase but not other cases. In my opinion, the code works now identical to the 2.4.1 one (without the output buffer). Unigrams are simply detected by shingleBufferPosition==0. The position increments are also tested by your code and the implementation also looks right. In principle, there was only missing the increment of the shingleBufferPosition if no unigrams are provided.

          Mark: If you want to build RC3 soon, just assign yourself and commit this fix. I will go to bed now. I will commit this tomorrow, if you hadn't. A CHANGES.txt entry is not needed in my opinion, as this is not a new feature or a bug from 2.4.1.

          Show
          Uwe Schindler added a comment - Chris: Could you test this patch, if it works as exspected also for you. Maybe I found a fix only valid for your testcase but not other cases. In my opinion, the code works now identical to the 2.4.1 one (without the output buffer). Unigrams are simply detected by shingleBufferPosition==0. The position increments are also tested by your code and the implementation also looks right. In principle, there was only missing the increment of the shingleBufferPosition if no unigrams are provided. Mark: If you want to build RC3 soon, just assign yourself and commit this fix. I will go to bed now. I will commit this tomorrow, if you hadn't. A CHANGES.txt entry is not needed in my opinion, as this is not a new feature or a bug from 2.4.1.
          Hide
          Uwe Schindler added a comment -

          ...hm, a beer is ok until now. A beer and a meal is needed if I also understand the whole filter (I am still in the dark) - this filter is complicated so late in the evening.
          As far as I understand, this should be the correct fix. The problem was in the change from the output buffer to the on-the-fly iterator.

          Show
          Uwe Schindler added a comment - ...hm, a beer is ok until now. A beer and a meal is needed if I also understand the whole filter (I am still in the dark) - this filter is complicated so late in the evening. As far as I understand, this should be the correct fix. The problem was in the change from the output buffer to the on-the-fly iterator.
          Hide
          Uwe Schindler added a comment -

          This patch seems to fix this issue. All tests pass.

          This patch also removes an unnecessary isEmpty() check (because fillShingleBuffer already returned false, if empty). I also replace remove(0) by removeFirst(), which is faster for linked lists.

          Show
          Uwe Schindler added a comment - This patch seems to fix this issue. All tests pass. This patch also removes an unnecessary isEmpty() check (because fillShingleBuffer already returned false, if empty). I also replace remove(0) by removeFirst(), which is faster for linked lists.
          Hide
          Mark Miller added a comment -

          Hmm - I think Busch might be on Vacation.

          I'd hate to release with a known bug ...

          I already owe Uwe beer, so not sure what else I can offer ...

          Show
          Mark Miller added a comment - Hmm - I think Busch might be on Vacation. I'd hate to release with a known bug ... I already owe Uwe beer, so not sure what else I can offer ...
          Hide
          Hoss Man added a comment -

          since this is code that worked in 2.4.1, it should probably block 2.9 (at least untill the scope is understood – a concious choice could be made to release with it as a known bug)

          Show
          Hoss Man added a comment - since this is code that worked in 2.4.1, it should probably block 2.9 (at least untill the scope is understood – a concious choice could be made to release with it as a known bug)
          Hide
          Chris Harris added a comment -

          Mark 2.9 as the affected version, according to Uwe's instructions.

          Show
          Chris Harris added a comment - Mark 2.9 as the affected version, according to Uwe's instructions.
          Hide
          Chris Harris added a comment -

          Oops. Wrong license option first time.

          Show
          Chris Harris added a comment - Oops. Wrong license option first time.
          Hide
          Chris Harris added a comment -

          Here is the equivalent patch for lucene 2.4.1. The six tests all pass.

          Show
          Chris Harris added a comment - Here is the equivalent patch for lucene 2.4.1. The six tests all pass.
          Hide
          Uwe Schindler added a comment -

          Did the test pass with 2.4.1?

          If yes, it's caused by the rewrite and could you please assign 2.9 as the affected version? I think Michael Busch did this TokenFilter, maybe he understands the shingle code.

          Show
          Uwe Schindler added a comment - Did the test pass with 2.4.1? If yes, it's caused by the rewrite and could you please assign 2.9 as the affected version? I think Michael Busch did this TokenFilter, maybe he understands the shingle code.
          Hide
          Chris Harris added a comment -

          Attaching a patch that adds three new ShingleFilter tests. Two of the new tests do not pass; these are testBiGramFilterWithoutUnigrams and testBiGramFilterWithSingleTokenWithoutUnigrams.

          Also attaching the XML output from my junit run.

          Show
          Chris Harris added a comment - Attaching a patch that adds three new ShingleFilter tests. Two of the new tests do not pass; these are testBiGramFilterWithoutUnigrams and testBiGramFilterWithSingleTokenWithoutUnigrams. Also attaching the XML output from my junit run.

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Chris Harris
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development