Lucene - Core
  1. Lucene - Core
  2. LUCENE-3971

MappingCharFilter rarely has wrong correctOffset (for finalOffset)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 3.6.1
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Found this bug over on LUCENE-3969, but I'm currently tracking a ton of bugs, so
      I figure I would open an issue and see if this one is obvious to anyone:

      Consider this input string: "gzw f quaxot" (length = 12) with a WhitespaceTokenizer.
      If i have mapping rules like this, then it works!:

      "t" => ""
      

      But if I have mapping rules like this:

      "t" => ""
      "tmakdbl" => "c"
      

      Then it will compute final offset wrong:

          [junit] junit.framework.AssertionFailedError: finalOffset  expected:<12> but was:<11>
      

      Looks like some logic/recursion bug in the correctOffset method? The second rule is not even "used" for this string,
      it just happens to also start with 't'

      1. LUCENE-3971_test.patch
        3 kB
        Robert Muir
      2. LUCENE-3971.patch
        3 kB
        Dawid Weiss

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          Attached is two tests:

          1. testFinalOffsetSpecialCase: test case for the example in the description
          2. testRandomMaps: better random test for this filter. Occasionally it fails due to this finalOffset bug: but thats the only problem it ever finds. Otherwise the filter seems fine!
          Show
          Robert Muir added a comment - Attached is two tests: testFinalOffsetSpecialCase: test case for the example in the description testRandomMaps: better random test for this filter. Occasionally it fails due to this finalOffset bug: but thats the only problem it ever finds. Otherwise the filter seems fine!
          Hide
          Robert Muir added a comment -

          I agree Dawid: what do you think about the difficulty of LUCENE-3830?

          I feel like with an FST, the logic would probably be easier, and
          the filter would probably be faster (and we have pretty good tests,
          in general this thing works, this is just a corner case).

          On the other hand if there is a simple way we can fix the bug in the
          existing code it could be nice, e.g. for a future 3.6.1 or something
          like that.

          But I'll take any solutions anyone has

          Show
          Robert Muir added a comment - I agree Dawid: what do you think about the difficulty of LUCENE-3830 ? I feel like with an FST, the logic would probably be easier, and the filter would probably be faster (and we have pretty good tests, in general this thing works, this is just a corner case). On the other hand if there is a simple way we can fix the bug in the existing code it could be nice, e.g. for a future 3.6.1 or something like that. But I'll take any solutions anyone has
          Hide
          Dawid Weiss added a comment -

          I think this bug is similar (if not identical) to what I fixed a while ago in PatternReplaceCharFilter – I remember it suffered off by one as well and looking at the code it may be a similar in structure (linked list and all).

          There is also a question how this filter should work – should it be greedy or reluctant (match the first pattern or the longest pattern)?

          Show
          Dawid Weiss added a comment - I think this bug is similar (if not identical) to what I fixed a while ago in PatternReplaceCharFilter – I remember it suffered off by one as well and looking at the code it may be a similar in structure (linked list and all). There is also a question how this filter should work – should it be greedy or reluctant (match the first pattern or the longest pattern)?
          Hide
          Dawid Weiss added a comment -

          This code is one hairy bastard... the LinkedList<Character> is scary and so is the logic of counting position offset updates...

          This patch fixes the failing tests but I wouldn't guarantee it fixes all the problems.

          Definitely a candidate for clean rewrite.

          Show
          Dawid Weiss added a comment - This code is one hairy bastard... the LinkedList<Character> is scary and so is the logic of counting position offset updates... This patch fixes the failing tests but I wouldn't guarantee it fixes all the problems. Definitely a candidate for clean rewrite.
          Hide
          Robert Muir added a comment -

          Thanks Dawid!

          Well the random test isnt totally efficient (it passes often and sometimes doesnt find the corner case).
          But I'll happily test the patch with it (I think if i just run it 100 times and it passes I'm sold).

          Show
          Robert Muir added a comment - Thanks Dawid! Well the random test isnt totally efficient (it passes often and sometimes doesnt find the corner case). But I'll happily test the patch with it (I think if i just run it 100 times and it passes I'm sold).
          Hide
          Dawid Weiss added a comment -

          Passes for me with multiple runs. I'll commit it in.

          Show
          Dawid Weiss added a comment - Passes for me with multiple runs. I'll commit it in.
          Hide
          Robert Muir added a comment -

          Thanks again for taking care of this: I think this could be a good candidate for 3.6.1?

          If there is no objections ill take care of the merging (I need to backport some fixes from
          LUCENE-3969 too)...

          Show
          Robert Muir added a comment - Thanks again for taking care of this: I think this could be a good candidate for 3.6.1? If there is no objections ill take care of the merging (I need to backport some fixes from LUCENE-3969 too)...
          Hide
          Koji Sekiguchi added a comment - - edited

          Thanks again for taking care of this: I think this could be a good candidate for 3.6.1?

          +1

          Show
          Koji Sekiguchi added a comment - - edited Thanks again for taking care of this: I think this could be a good candidate for 3.6.1? +1
          Hide
          Dawid Weiss added a comment -

          This code is one hairy bastard

          Btw, nothing personal, Koji. It's just me being slow on understanding how those indexes are calculated.

          Show
          Dawid Weiss added a comment - This code is one hairy bastard Btw, nothing personal, Koji. It's just me being slow on understanding how those indexes are calculated.
          Hide
          Koji Sekiguchi added a comment -

          Sorry about that, Dawid. I thought I needed it in the buffers, similar way in PatterReplaceCharFilter...

          Show
          Koji Sekiguchi added a comment - Sorry about that, Dawid. I thought I needed it in the buffers, similar way in PatterReplaceCharFilter...
          Hide
          Uwe Schindler added a comment -

          Bulk close for 3.6.1

          Show
          Uwe Schindler added a comment - Bulk close for 3.6.1

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development