Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      See Dawid's email: http://find.searchhub.org/document/64b0a28c53faf39

      Reader.java is fine, it has lots of methods like read(), read(char[]), read(CharBuffer), skip(), but these all have default implementations delegating to read(char[], int, int).

      Unfortunately FilterReader delegates too many unnecessary things such as read() and skip() in a broken way. It should have just left these alone.

      This can cause traps for someone upgrading because they have to override multiple methods, when read(char[], int, int) should be enough, and all Reader methods will then work correctly.

      1. LUCENE-4321.patch
        14 kB
        Robert Muir
      2. LUCENE-4321.patch
        14 kB
        Robert Muir
      3. LUCENE-4321.patch
        14 kB
        Robert Muir
      4. LUCENE-4321.patch
        10 kB
        Robert Muir
      5. LUCENE-4321.patch
        4 kB
        Robert Muir
      6. LUCENE-4321.patch
        2 kB
        Robert Muir
      7. NoRandomReadMockTokenizer.java
        2 kB
        Artem Lukanin

        Activity

        Hide
        Robert Muir added a comment -

        Here's a patch.

        Show
        Robert Muir added a comment - Here's a patch.
        Hide
        Dawid Weiss added a comment -

        Yeah, this looks good to me! I'd even make "in" final, but maybe I'm paranoid and we shouldn't limit the possibility of replacing it with something else, don't know.

        // make this final?
        +  protected Reader in;
        

        I like it much better than subclassing FilterReader. I'm sure it will save many people debugging time if they already have custom char filters.

        +1 from me.

        Show
        Dawid Weiss added a comment - Yeah, this looks good to me! I'd even make "in" final, but maybe I'm paranoid and we shouldn't limit the possibility of replacing it with something else, don't know. // make this final ? + protected Reader in; I like it much better than subclassing FilterReader. I'm sure it will save many people debugging time if they already have custom char filters. +1 from me.
        Hide
        Robert Muir added a comment -

        Updated patch: made 'in' final as Dawid suggested.

        Additionally I kept read(char[], int, int) abstract. I think this makes the class a lot easier to use, because it ensures subclasses implement the correct methods they need at a minimum: read(char[], int, int) and correct(int).

        I also added javadocs.

        Show
        Robert Muir added a comment - Updated patch: made 'in' final as Dawid suggested. Additionally I kept read(char[], int, int) abstract. I think this makes the class a lot easier to use, because it ensures subclasses implement the correct methods they need at a minimum: read(char[], int, int) and correct(int). I also added javadocs.
        Hide
        Robert Muir added a comment -

        Also added tests.

        In the patch, MockTokenizer switches up its read() method. Since BaseTokenStreamTestCase consumes multiple times and compares the results, this means if a charfilters read() methods are inconsistent it will fail.

        I added a test for one that was affected by the FilterReader brokenness, and also gave it an optimized impl for read() anyway.

        Show
        Robert Muir added a comment - Also added tests. In the patch, MockTokenizer switches up its read() method. Since BaseTokenStreamTestCase consumes multiple times and compares the results, this means if a charfilters read() methods are inconsistent it will fail. I added a test for one that was affected by the FilterReader brokenness, and also gave it an optimized impl for read() anyway.
        Hide
        Uwe Schindler added a comment -

        In 3.x the delegate was named "input", this changed in BETA because it extended FilterReader. We should now make it "input" again, this also conforms to TokenFilter's design.

        Show
        Uwe Schindler added a comment - In 3.x the delegate was named "input", this changed in BETA because it extended FilterReader. We should now make it "input" again, this also conforms to TokenFilter's design.
        Hide
        Robert Muir added a comment -

        +1. I'll update the patch.

        Show
        Robert Muir added a comment - +1. I'll update the patch.
        Hide
        Robert Muir added a comment -

        Updated patch with the in->input rename. I think this is ready to go.

        Show
        Robert Muir added a comment - Updated patch with the in->input rename. I think this is ready to go.
        Hide
        Uwe Schindler added a comment - - edited

        I am just thinking about markSupported(), mark(), reset(): Lucene does not use those methods at all and implementing this shit correctly with offsets is to heavy. I would return false for markSupported, and let mark() and reset() be unimplented (no-op for mark()) or throw IOEx (for reset()). ready() can be delegated, thats fine. - and make final?

        Show
        Uwe Schindler added a comment - - edited I am just thinking about markSupported(), mark(), reset(): Lucene does not use those methods at all and implementing this shit correctly with offsets is to heavy. I would return false for markSupported, and let mark() and reset() be unimplented (no-op for mark()) or throw IOEx (for reset()). ready() can be delegated, thats fine. - and make final?
        Hide
        Robert Muir added a comment -

        We don't need to make these final: we should just not support by default: this is easy, its the Reader default, so just remove delegation.

        If someone (like MappingCharFilter) wants to support reset() and has tests for it (like MappingCharFilter), then thats fine,
        same for mark() or ready() or others.

        I'll update the patch

        Show
        Robert Muir added a comment - We don't need to make these final: we should just not support by default: this is easy, its the Reader default, so just remove delegation. If someone (like MappingCharFilter) wants to support reset() and has tests for it (like MappingCharFilter), then thats fine, same for mark() or ready() or others. I'll update the patch
        Hide
        Uwe Schindler added a comment -

        +1. Does MappingCharFilter really support mark() - interesting...

        Show
        Uwe Schindler added a comment - +1. Does MappingCharFilter really support mark() - interesting...
        Hide
        Robert Muir added a comment -

        Updated patch removing delegation for those methods so we have reader defaults (e.g. unsupported). Override if you want to implement, but by default these are all correct

        Show
        Robert Muir added a comment - Updated patch removing delegation for those methods so we have reader defaults (e.g. unsupported). Override if you want to implement, but by default these are all correct
        Hide
        Robert Muir added a comment -

        +1. Does MappingCharFilter really support mark() - interesting...

        No. it supports reset() but not mark(), (only if also its input supports reset()) which is allowed.

        Show
        Robert Muir added a comment - +1. Does MappingCharFilter really support mark() - interesting... No. it supports reset() but not mark(), (only if also its input supports reset()) which is allowed.
        Hide
        Dawid Weiss added a comment -

        So there was a bug hidden in that filterreader class hierarchy change? Nice

        Show
        Dawid Weiss added a comment - So there was a bug hidden in that filterreader class hierarchy change? Nice
        Hide
        Michael McCandless added a comment -

        +1

        Sneaky trap.

        Show
        Michael McCandless added a comment - +1 Sneaky trap.
        Hide
        Uwe Schindler added a comment -

        +1, I am fine now. CharFilter now looks elegant.

        Show
        Uwe Schindler added a comment - +1, I am fine now. CharFilter now looks elegant.
        Hide
        Uwe Schindler added a comment -

        Ah one thing:
        FilteredReader in Java passes "input" to the parent ctor, because all synchronization should be done on the most-inner source reader. The original FilterRaeder uses Reader's protected ctor to pass in another lock, and that is the input reader.
        We should do the same - I am not sure what the effect is, but otherwise it might slowdown as every raeder is using another lock?

        Show
        Uwe Schindler added a comment - Ah one thing: FilteredReader in Java passes "input" to the parent ctor, because all synchronization should be done on the most-inner source reader. The original FilterRaeder uses Reader's protected ctor to pass in another lock, and that is the input reader. We should do the same - I am not sure what the effect is, but otherwise it might slowdown as every raeder is using another lock?
        Hide
        Robert Muir added a comment -

        Good idea Uwe.

        Show
        Robert Muir added a comment - Good idea Uwe.
        Hide
        Artem Lukanin added a comment -

        I had to extend MockTokenizer, because I read the buffer completely to decide, what to do with the input (convert or not to something else).

        When you use different reading methods randomly, my tests don't pass. If you used the same method (may be different) for the complete input string, they would pass, but now the output string is messed up, becase some parts of the input are converted and some are not.

        Show
        Artem Lukanin added a comment - I had to extend MockTokenizer, because I read the buffer completely to decide, what to do with the input (convert or not to something else). When you use different reading methods randomly, my tests don't pass. If you used the same method (may be different) for the complete input string, they would pass, but now the output string is messed up, becase some parts of the input are converted and some are not.
        Hide
        Robert Muir added a comment -

        Your charfilter is broken.

        Show
        Robert Muir added a comment - Your charfilter is broken.
        Hide
        Artem Lukanin added a comment -

        Oh, indeed!

        Show
        Artem Lukanin added a comment - Oh, indeed!
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development