Lucene - Core
  1. Lucene - Core
  2. LUCENE-5952

Give Version parsing exceptions more descriptive error messages

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 4.10
    • Fix Version/s: 4.10.1, 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As discussed on the dev list, it's spooky how Version.java tries to fully parse the incoming version string ... and then throw exceptions that lack details about what invalid value it received, which file contained the invalid value, etc.

      It also seems too low level to be checking versions (e.g. is not future proof for when 4.10 is passed a 5.x index by accident), and seems redundant with the codec headers we already have for checking versions?

      Should we just go back to lenient parsing?

      1. LUCENE-5952.patch
        31 kB
        Michael McCandless
      2. LUCENE-5952.patch
        30 kB
        Michael McCandless
      3. LUCENE-5952.patch
        28 kB
        Michael McCandless
      4. LUCENE-5952.patch
        17 kB
        Michael McCandless
      5. LUCENE-5952.patch
        21 kB
        Michael McCandless
      6. LUCENE-5952.patch
        16 kB
        Michael McCandless

        Activity

        Hide
        Ryan Ernst added a comment -

        The reason I added this check was to satisfy an existing test that parse would fail if passed a version like 1.0. So it was not completely lenient before, but I have no problem with allowing Version.java to take any version, and letting the code consuming version to decide whether the version is acceptable.

        Show
        Ryan Ernst added a comment - The reason I added this check was to satisfy an existing test that parse would fail if passed a version like 1.0. So it was not completely lenient before, but I have no problem with allowing Version.java to take any version, and letting the code consuming version to decide whether the version is acceptable.
        Hide
        Michael McCandless added a comment -

        In fact it wasn't lenient before at all ... it used Version.valueOf (and Version was an enum) so that would also hit IAE for any invalid versions (i.e. that did not have an exact matching enum constant)...

        I think for this issue I'll just try to improve the error reporting when there is a problem...

        Show
        Michael McCandless added a comment - In fact it wasn't lenient before at all ... it used Version.valueOf (and Version was an enum) so that would also hit IAE for any invalid versions (i.e. that did not have an exact matching enum constant)... I think for this issue I'll just try to improve the error reporting when there is a problem...
        Hide
        Michael McCandless added a comment -

        Patch, applies to 4.10.x.

        I improved the error handling in Version.java to more clearly say what
        offending value had been passed in, and various places that were
        calling Version.parse to say which resource/fileName they had read the
        version from and to throw CorruptIndexException.

        I also removed the two deprecated SegmentInfo ctors that took String
        version and parsed it (we are allowed to just change this API: it's
        experimental).

        I also fixed Version.java to not pass judgement on the major version,
        so we remain future proof.

        Show
        Michael McCandless added a comment - Patch, applies to 4.10.x. I improved the error handling in Version.java to more clearly say what offending value had been passed in, and various places that were calling Version.parse to say which resource/fileName they had read the version from and to throw CorruptIndexException. I also removed the two deprecated SegmentInfo ctors that took String version and parsed it (we are allowed to just change this API: it's experimental). I also fixed Version.java to not pass judgement on the major version, so we remain future proof.
        Hide
        Uwe Schindler added a comment -

        I think we still have to restrict major to be in the valid range, otherwise the encodedValue may overflow. So major should be between 0 and 255, right?

        Otherwise looks fine!

        Show
        Uwe Schindler added a comment - I think we still have to restrict major to be in the valid range, otherwise the encodedValue may overflow. So major should be between 0 and 255, right? Otherwise looks fine!
        Hide
        Uwe Schindler added a comment -

        Just a suggestion: Can we add a fake index with a version number of "6.1.0" to see if you correctly get IndexTooNewException and not an IAE? Could be added to TestBackwardsCompatibility! Just a ZIP file, but hack the write code temporarily to pass a fake string.

        Show
        Uwe Schindler added a comment - Just a suggestion: Can we add a fake index with a version number of "6.1.0" to see if you correctly get IndexTooNewException and not an IAE? Could be added to TestBackwardsCompatibility! Just a ZIP file, but hack the write code temporarily to pass a fake string.
        Hide
        Robert Muir added a comment -

        Wait, it was lenient before. It was just a string. Unrelated to version.java in any way.

        Show
        Robert Muir added a comment - Wait, it was lenient before. It was just a string. Unrelated to version.java in any way.
        Hide
        Uwe Schindler added a comment - - edited

        Wait, it was lenient before. It was just a string. Unrelated to version.java in any way.

        Version.java was not lenient at all. In the past Version.java was very strict (only allowed the constants). For parsing index version numbers we used another comparator, which was lenient and accepted any version number to compare (like the Maven version comparator). Now its both in one class and therefore we have to relax Version.java more, to be future proof.

        Before it was enum, now its a simple class with some additional bounds on major version. We just have to remove the major version constraint.

        In my opinion, we should not save index version as string at all and instead save the "encoded value" as an (v)int.

        Show
        Uwe Schindler added a comment - - edited Wait, it was lenient before. It was just a string. Unrelated to version.java in any way. Version.java was not lenient at all. In the past Version.java was very strict (only allowed the constants). For parsing index version numbers we used another comparator, which was lenient and accepted any version number to compare (like the Maven version comparator). Now its both in one class and therefore we have to relax Version.java more, to be future proof. Before it was enum, now its a simple class with some additional bounds on major version. We just have to remove the major version constraint. In my opinion, we should not save index version as string at all and instead save the "encoded value" as an (v)int.
        Hide
        Michael McCandless added a comment -

        I think we still have to restrict major to be in the valid range, otherwise the encodedValue may overflow. So major should be between 0 and 255, right?

        Ahh good, I'll do that.

        Just a suggestion: Can we add a fake index with a version number of "6.1.0" to see if you correctly get IndexTooNewException and not an IAE?

        So I built this 6.1.0 index (by hacking up a trunk checkout) and CheckIndex (also on trunk) happily checked the index without complaints! I agree we should try to somehow test forwards compatibility ... but I'd rather explore that on a separate issue? I'll open one.

        In my opinion, we should not save index version as string at all and instead save the "encoded value" as an (v)int.

        I agree ... I'll fix Lucene46SegmentInfoWriter/Reader to write as int ... I think I'll use separate vInts: I don't like tying this "encoded format" (stuffing 4 ints that are actually bytes into 1 int) to the index format.

        Show
        Michael McCandless added a comment - I think we still have to restrict major to be in the valid range, otherwise the encodedValue may overflow. So major should be between 0 and 255, right? Ahh good, I'll do that. Just a suggestion: Can we add a fake index with a version number of "6.1.0" to see if you correctly get IndexTooNewException and not an IAE? So I built this 6.1.0 index (by hacking up a trunk checkout) and CheckIndex (also on trunk) happily checked the index without complaints! I agree we should try to somehow test forwards compatibility ... but I'd rather explore that on a separate issue? I'll open one. In my opinion, we should not save index version as string at all and instead save the "encoded value" as an (v)int. I agree ... I'll fix Lucene46SegmentInfoWriter/Reader to write as int ... I think I'll use separate vInts: I don't like tying this "encoded format" (stuffing 4 ints that are actually bytes into 1 int) to the index format.
        Hide
        Michael McCandless added a comment -

        Wait, it was lenient before.

        Right, SegmentInfo.getVersion was lenient before, i.e. we read a String and left it as a String in SegmentInfo. But then we needed to "compare leniently" which was also scary ...

        Show
        Michael McCandless added a comment - Wait, it was lenient before. Right, SegmentInfo.getVersion was lenient before, i.e. we read a String and left it as a String in SegmentInfo. But then we needed to "compare leniently" which was also scary ...
        Hide
        Michael McCandless added a comment -

        Another patch folding in feedback ... I cutover to using 4 ints to write Version into .si.

        Show
        Michael McCandless added a comment - Another patch folding in feedback ... I cutover to using 4 ints to write Version into .si.
        Hide
        Uwe Schindler added a comment - - edited

        I agree ... I'll fix Lucene46SegmentInfoWriter/Reader to write as int ... I think I'll use separate vInts: I don't like tying this "encoded format" (stuffing 4 ints that are actually bytes into 1 int) to the index format.

        Yeah, we should maybe add a method to Version that writes itsself to a DataOutput: Version.writeVersion(DataOutput) and Version.readVersion(DataInput).

        In any case, this is a separate issue, because it involves a index format change, that we should not do in 4.10.1. I would suggest to change this for the coming 5.0.

        Show
        Uwe Schindler added a comment - - edited I agree ... I'll fix Lucene46SegmentInfoWriter/Reader to write as int ... I think I'll use separate vInts: I don't like tying this "encoded format" (stuffing 4 ints that are actually bytes into 1 int) to the index format. Yeah, we should maybe add a method to Version that writes itsself to a DataOutput: Version.writeVersion(DataOutput) and Version.readVersion(DataInput). In any case, this is a separate issue, because it involves a index format change, that we should not do in 4.10.1. I would suggest to change this for the coming 5.0.
        Hide
        Michael McCandless added a comment -

        Yeah, we should maybe add a method to Version that writes itsself to a DataOutput: Version.writeVersion(DataOutput) and Version.readVersion(DataInput).

        I didn't like that option either, because this (serialization formats) is not Version's burden to bear.

        E.g. what if we want to later change that format it wrote? We already put this burden on the codec formats, so I think leaving the burden in Lucene46SegmentInfoFormat is the right place...

        In any case, this is a separate issue, because it involves a index format change, that we should not do in 4.10.1. I would suggest to change this for the coming 5.0.

        I don't think this should wait for 5.0? I think it's dangerous/fragile relying on string parsing to rebuild the Version when we open IndexReaders...

        Show
        Michael McCandless added a comment - Yeah, we should maybe add a method to Version that writes itsself to a DataOutput: Version.writeVersion(DataOutput) and Version.readVersion(DataInput). I didn't like that option either, because this (serialization formats) is not Version's burden to bear. E.g. what if we want to later change that format it wrote? We already put this burden on the codec formats, so I think leaving the burden in Lucene46SegmentInfoFormat is the right place... In any case, this is a separate issue, because it involves a index format change, that we should not do in 4.10.1. I would suggest to change this for the coming 5.0. I don't think this should wait for 5.0? I think it's dangerous/fragile relying on string parsing to rebuild the Version when we open IndexReaders...
        Hide
        Uwe Schindler added a comment -

        I don't think this should wait for 5.0

        Just not in 4.10.1, we just have a bugfix release and the whole thing did not change since 4.0. We can change that for 4.11, but as Robert suggested to release 4.11 as 5.0 (after copy branch and merge some more stuff in), I wrote "5.0".

        what if we want to later change that format it wrote? We already put this burden on the codec formats, so I think leaving the burden in Lucene46SegmentInfoFormat is the right place...

        I am fine with that too. To me writing a version is like writing a String in DataOutput, which is also a method outside of codecs.

        Show
        Uwe Schindler added a comment - I don't think this should wait for 5.0 Just not in 4.10.1, we just have a bugfix release and the whole thing did not change since 4.0. We can change that for 4.11, but as Robert suggested to release 4.11 as 5.0 (after copy branch and merge some more stuff in), I wrote "5.0". what if we want to later change that format it wrote? We already put this burden on the codec formats, so I think leaving the burden in Lucene46SegmentInfoFormat is the right place... I am fine with that too. To me writing a version is like writing a String in DataOutput, which is also a method outside of codecs.
        Hide
        Robert Muir added a comment -

        I don't think its useful for Version to be picky here in any way.

        this stuff only even executes if we successfully loaded the codec (Codec.forName) and then invoked its .si reader. If Codec.forName succeeded then we should be supporting the index.

        Also, if a user has a 4.3 index and they go to 4.3.1 and the format hasnt changed, why prevent them from rolling backwards to 4.3? There is no need for that.

        -1 to it throwing IndexFormatTooOld/New for this reasons from here. It makes no sense.

        Show
        Robert Muir added a comment - I don't think its useful for Version to be picky here in any way. this stuff only even executes if we successfully loaded the codec (Codec.forName) and then invoked its .si reader. If Codec.forName succeeded then we should be supporting the index. Also, if a user has a 4.3 index and they go to 4.3.1 and the format hasnt changed, why prevent them from rolling backwards to 4.3? There is no need for that. -1 to it throwing IndexFormatTooOld/New for this reasons from here. It makes no sense.
        Hide
        Uwe Schindler added a comment -

        -1 to it throwing IndexFormatTooOld/New for this reasons from here. It makes no sense.

        Same here.

        Show
        Uwe Schindler added a comment - -1 to it throwing IndexFormatTooOld/New for this reasons from here. It makes no sense. Same here.
        Hide
        Michael McCandless added a comment -

        I don't think this should wait for 5.0

        Just not in 4.10.1, we just have a bugfix release and the whole thing did not change since 4.0. We can change that for 4.11, but as Robert suggested to release 4.11 as 5.0 (after copy branch and merge some more stuff in), I wrote "5.0".

        OK I see...

        I'll revert that change here (for 4.10.1) but put it back when I port to 4.x / 5.0.

        -1 to it throwing IndexFormatTooOld/New for this reasons from here. It makes no sense.

        I agree ...

        Show
        Michael McCandless added a comment - I don't think this should wait for 5.0 Just not in 4.10.1, we just have a bugfix release and the whole thing did not change since 4.0. We can change that for 4.11, but as Robert suggested to release 4.11 as 5.0 (after copy branch and merge some more stuff in), I wrote "5.0". OK I see... I'll revert that change here (for 4.10.1) but put it back when I port to 4.x / 5.0. -1 to it throwing IndexFormatTooOld/New for this reasons from here. It makes no sense. I agree ...
        Hide
        Michael McCandless added a comment -

        New patch, reverting the index format change.

        So this change amounts to improving the error messaging when there is a problem in parsing the version, and to fixing Version not to pass judgement on major version numbers.

        Show
        Michael McCandless added a comment - New patch, reverting the index format change. So this change amounts to improving the error messaging when there is a problem in parsing the version, and to fixing Version not to pass judgement on major version numbers.
        Hide
        Robert Muir added a comment -

        We can change that for 4.11, but as Robert suggested to release 4.11 as 5.0 (after copy branch and merge some more stuff in), I wrote "5.0".

        Which i thought would be a nice compromise: giving us a java 8 trunk, but delaying it till 6.x, clearing out trunk from all the crazy back compat, FINALLY dropping 3.x support in the stable branch (5.x), and giving users a 5.0 release with lots of features, but keeping controversial stuff in 6.x for more time to bake.

        Too bad not even one person replied, so i have no idea what we are doing currently.

        Show
        Robert Muir added a comment - We can change that for 4.11, but as Robert suggested to release 4.11 as 5.0 (after copy branch and merge some more stuff in), I wrote "5.0". Which i thought would be a nice compromise: giving us a java 8 trunk, but delaying it till 6.x, clearing out trunk from all the crazy back compat, FINALLY dropping 3.x support in the stable branch (5.x), and giving users a 5.0 release with lots of features, but keeping controversial stuff in 6.x for more time to bake. Too bad not even one person replied, so i have no idea what we are doing currently.
        Hide
        Robert Muir added a comment -

        is it possible to change this parse to throw a checked exception (ParseException would be good) instead of IllegalArgumentException?

        This would enforce that callers are doing the right thing.

        Show
        Robert Muir added a comment - is it possible to change this parse to throw a checked exception (ParseException would be good) instead of IllegalArgumentException? This would enforce that callers are doing the right thing.
        Hide
        Uwe Schindler added a comment -

        java.text.ParseException?

        Show
        Uwe Schindler added a comment - java.text.ParseException?
        Hide
        Robert Muir added a comment -

        Yes, its the correct one here.

        Show
        Robert Muir added a comment - Yes, its the correct one here.
        Hide
        Michael McCandless added a comment -

        Good idea, I'll cutover to java.text.ParseException.

        Show
        Michael McCandless added a comment - Good idea, I'll cutover to java.text.ParseException.
        Hide
        Michael McCandless added a comment -

        New patch, cutting over to ParseException from the parse methods ... I
        kept IllegalArgumentException in the [private] ctors.

        Also, I cutover to StringTokenizer from String.split ... the later
        makes me a bit nervous, and there are spooky changes in Java 8, e.g.
        http://stackoverflow.com/questions/22718744/why-does-split-in-java-8-sometimes-remove-empty-strings-at-start-of-result-array

        Finally I marked these parse methods as @lucene.internal.

        Show
        Michael McCandless added a comment - New patch, cutting over to ParseException from the parse methods ... I kept IllegalArgumentException in the [private] ctors. Also, I cutover to StringTokenizer from String.split ... the later makes me a bit nervous, and there are spooky changes in Java 8, e.g. http://stackoverflow.com/questions/22718744/why-does-split-in-java-8-sometimes-remove-empty-strings-at-start-of-result-array Finally I marked these parse methods as @lucene.internal.
        Hide
        Uwe Schindler added a comment -

        Can we change the unreadable (nextToken() == false) to (!nextToken())?

        Show
        Uwe Schindler added a comment - Can we change the unreadable (nextToken() == false) to (!nextToken())?
        Hide
        Ryan Ernst added a comment -

        I like the == false, it is so easy to lose a single ! character when reading..

        Show
        Ryan Ernst added a comment - I like the == false , it is so easy to lose a single ! character when reading..
        Hide
        Uwe Schindler added a comment -

        I am happy with == false for more complex constructs, but this makes it hard to read because the API developer had another idea of how the code should look like. When I read that code it tooks me several branches in my own main processor to understand what that meen, because its unnatureal. The offical Sun/Oracle Code guidelines for Java also forbid this.

        If you are afraid to loose the Unable to render embedded object: File (, just invert the if construct completely (thats what I mostly do) not found.)

        FYI: If I see the first one writing a TokenStream like:

        if (incrementToken() == false) ...
        

        I will stop working on Lucene! Trust me! Really!

        Show
        Uwe Schindler added a comment - I am happy with == false for more complex constructs, but this makes it hard to read because the API developer had another idea of how the code should look like. When I read that code it tooks me several branches in my own main processor to understand what that meen, because its unnatureal. The offical Sun/Oracle Code guidelines for Java also forbid this. If you are afraid to loose the Unable to render embedded object: File (, just invert the if construct completely (thats what I mostly do) not found. ) FYI: If I see the first one writing a TokenStream like: if (incrementToken() == false ) ... I will stop working on Lucene! Trust me! Really!
        Hide
        Ryan Ernst added a comment -

        I agree. If there is an if/else, putting the "positive" logic first is good.

        Show
        Ryan Ernst added a comment - I agree. If there is an if/else, putting the "positive" logic first is good.
        Hide
        Michael McCandless added a comment -

        In general I agree that positive logic should go first but I think in
        this case it makes the code worse: it would add 2 levels of nesting,
        with the else (to throw the exceptions) way at the end of the method.

        Uwe Schindler do you feel so strongly about this code styling that it will
        cause you to otherwise veto this change? If so I will switch to
        !.

        And are there any technical (not just code styling) feedback on the
        last patch? I'd like to commit this soon: it is blocking 4.10.1.

        Show
        Michael McCandless added a comment - In general I agree that positive logic should go first but I think in this case it makes the code worse: it would add 2 levels of nesting, with the else (to throw the exceptions) way at the end of the method. Uwe Schindler do you feel so strongly about this code styling that it will cause you to otherwise veto this change? If so I will switch to ! . And are there any technical (not just code styling) feedback on the last patch? I'd like to commit this soon: it is blocking 4.10.1.
        Hide
        Ryan Ernst added a comment -

        Patch looks good to me. One minor favor: can you change Version.LUCENE_CURRENT to Version.LATEST? So that the forward port to branch_4x/trunk does not add a new use of it (there all uses have been removed).

        Show
        Ryan Ernst added a comment - Patch looks good to me. One minor favor: can you change Version.LUCENE_CURRENT to Version.LATEST ? So that the forward port to branch_4x/trunk does not add a new use of it (there all uses have been removed).
        Hide
        Michael McCandless added a comment -

        Thank Ryan Ernst will do.

        Show
        Michael McCandless added a comment - Thank Ryan Ernst will do.
        Hide
        Uwe Schindler added a comment -

        Uwe Schindler do you feel so strongly about this code styling that it will cause you to otherwise veto this change? If so I will switch to "!".

        I don't veto that change, I am just unhappy. Its hard to read, thats all, and is not the intention of the API developer. I have no problem with commiting it, but be prepared to have it changed the next I time I have to somehow touch that code!

        Negating a simple method call should always be done with "!", especially if the method name is something that reads "fluent", Ryan's argument is not an issue (in my opinion - if it is an issue, define a method:

        Utils.not(boolean b) { return !b; }
        

        I agree, for stuff like if (!(a instanceof b)), I tend to invert the logic instead.

        Show
        Uwe Schindler added a comment - Uwe Schindler do you feel so strongly about this code styling that it will cause you to otherwise veto this change? If so I will switch to "!". I don't veto that change, I am just unhappy. Its hard to read, thats all, and is not the intention of the API developer. I have no problem with commiting it, but be prepared to have it changed the next I time I have to somehow touch that code! Negating a simple method call should always be done with "!", especially if the method name is something that reads "fluent", Ryan's argument is not an issue (in my opinion - if it is an issue, define a method: Utils.not( boolean b) { return !b; } I agree, for stuff like if (!(a instanceof b)) , I tend to invert the logic instead.
        Hide
        Uwe Schindler added a comment -

        Hi,
        one thing about the new Code using StringTokenizer. Beware of this issue: http://stackoverflow.com/questions/11409320/java-stringtokenizer-nexttoken-skips-over-empty-fiels. The reason for this is "A token is a maximal sequence of consecutive characters that are not delimiters."

        In fact, StringTokenizer removes empty tokens. Wth the new code this "4..9.0" would be parsed as "4.9.0".

        If you think there might be a bug in Java 8's String.split, I would use a regex and capturing groups. Like the Parser for LUCENE_xxx constants. Or alternatively, loop over the string and do if (ch == '.') ... like stuff. I have a custom StringTokenizer using a StringBuilder internally, which does not remove empty tokens.

        Show
        Uwe Schindler added a comment - Hi, one thing about the new Code using StringTokenizer. Beware of this issue: http://stackoverflow.com/questions/11409320/java-stringtokenizer-nexttoken-skips-over-empty-fiels . The reason for this is "A token is a maximal sequence of consecutive characters that are not delimiters." In fact, StringTokenizer removes empty tokens. Wth the new code this "4..9.0" would be parsed as "4.9.0". If you think there might be a bug in Java 8's String.split, I would use a regex and capturing groups. Like the Parser for LUCENE_xxx constants. Or alternatively, loop over the string and do if (ch == '.') ... like stuff. I have a custom StringTokenizer using a StringBuilder internally, which does not remove empty tokens.
        Hide
        Uwe Schindler added a comment -

        I found this one I implemented long time ago (its more than 10 years old and still in use, because faster than String.split):

        public final class StrictStringTokenizer {
        
            public StrictStringTokenizer(String s, char delimiter) {
                this.s=s;
                this.delimiter=delimiter;
            }
        	
            public final String nextToken() {
                if (pos<0) return "";
                int pos1=s.indexOf(delimiter,pos);
                String s1;
                if (pos1>=0) {
                    s1=s.substring(pos,pos1);
                    pos=pos1+1;
                } else {
                    s1=s.substring(pos);pos=-1;
                }
                return s1;
            }
        	
            public final boolean hasMoreTokens() {
                return (pos>=0);
            }
        	
            // add your data members here
            private final String s;
            private final char delimiter;
            private int pos=0;
        }
        

        Its a drop-in replacement for StringTokenizer. Maybe remove the "return empty token if after last" and throw NoSuchElement.

        Show
        Uwe Schindler added a comment - I found this one I implemented long time ago (its more than 10 years old and still in use, because faster than String.split): public final class StrictStringTokenizer { public StrictStringTokenizer( String s, char delimiter) { this .s=s; this .delimiter=delimiter; } public final String nextToken() { if (pos<0) return ""; int pos1=s.indexOf(delimiter,pos); String s1; if (pos1>=0) { s1=s.substring(pos,pos1); pos=pos1+1; } else { s1=s.substring(pos);pos=-1; } return s1; } public final boolean hasMoreTokens() { return (pos>=0); } // add your data members here private final String s; private final char delimiter; private int pos=0; } Its a drop-in replacement for StringTokenizer. Maybe remove the "return empty token if after last" and throw NoSuchElement.
        Hide
        Uwe Schindler added a comment -

        At least we should also add a test that "4..9.0" fails to parse.

        Show
        Uwe Schindler added a comment - At least we should also add a test that "4..9.0" fails to parse.
        Hide
        Michael McCandless added a comment -

        I like the StrictStringTokenizer! I'll cutover ... and I'll add 4..9.0 test case.

        Show
        Michael McCandless added a comment - I like the StrictStringTokenizer! I'll cutover ... and I'll add 4..9.0 test case.
        Hide
        Michael McCandless added a comment -

        New patch, cutting over to Uwe's tokenizer ... I think it's ready.

        Show
        Michael McCandless added a comment - New patch, cutting over to Uwe's tokenizer ... I think it's ready.
        Hide
        Ryan Ernst added a comment -

        +1

        Show
        Ryan Ernst added a comment - +1
        Hide
        ASF subversion and git services added a comment -

        Commit 1625990 from Uwe Schindler in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1625990 ]

        LUCENE-5944: Remove useless test (will be fixed soon by LUCENE-5952)

        Show
        ASF subversion and git services added a comment - Commit 1625990 from Uwe Schindler in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1625990 ] LUCENE-5944 : Remove useless test (will be fixed soon by LUCENE-5952 )
        Hide
        ASF subversion and git services added a comment -

        Commit 1625991 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1625991 ]

        LUCENE-5944: Remove useless test (will be fixed soon by LUCENE-5952)

        Show
        ASF subversion and git services added a comment - Commit 1625991 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1625991 ] LUCENE-5944 : Remove useless test (will be fixed soon by LUCENE-5952 )
        Hide
        Uwe Schindler added a comment -

        +1
        (it would be good to commit this soon, because in Lucene trunk - version 6 now - I had to hack Version a bit to prevent failures).

        Show
        Uwe Schindler added a comment - +1 (it would be good to commit this soon, because in Lucene trunk - version 6 now - I had to hack Version a bit to prevent failures).
        Hide
        ASF subversion and git services added a comment -

        Commit 1626006 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1626006 ]

        LUCENE-5952: better error messages when version fails to parse; use simpler string tokenizer; don't check major versions at such a low level

        Show
        ASF subversion and git services added a comment - Commit 1626006 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1626006 ] LUCENE-5952 : better error messages when version fails to parse; use simpler string tokenizer; don't check major versions at such a low level
        Hide
        ASF subversion and git services added a comment -

        Commit 1626038 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1626038 ]

        LUCENE-5952: use short-cutting boolean ops

        Show
        ASF subversion and git services added a comment - Commit 1626038 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1626038 ] LUCENE-5952 : use short-cutting boolean ops
        Hide
        Michael McCandless added a comment -

        Here's the patch to port the fix to 5.x.

        The only change vs 4.10.x is that I changed Lucene46SegmentInfoFormat/Writer/Reader to explicitly encode the 4 ints in Version, instead of relying on String parsing at read time. I added Version.write/Version.read static methods for this, though it makes me somewhat nervous having them there outside of the scope of a "format"... I put a NOTE that this format can never be changed.

        Show
        Michael McCandless added a comment - Here's the patch to port the fix to 5.x. The only change vs 4.10.x is that I changed Lucene46SegmentInfoFormat/Writer/Reader to explicitly encode the 4 ints in Version, instead of relying on String parsing at read time. I added Version.write/Version.read static methods for this, though it makes me somewhat nervous having them there outside of the scope of a "format"... I put a NOTE that this format can never be changed.
        Hide
        Uwe Schindler added a comment -

        Why is Version.write static? Version.read() is static like Version.parse() because it creates a new Version, but write should write "this" as version?

        In any case, I dont think a Format change is needed. To me this version number is static like the codec header, what should you change?
        An alternative would be to put both static methods into CodecUtils, but this would also not help with changes in format.

        Show
        Uwe Schindler added a comment - Why is Version.write static? Version.read() is static like Version.parse() because it creates a new Version, but write should write "this" as version? In any case, I dont think a Format change is needed. To me this version number is static like the codec header, what should you change? An alternative would be to put both static methods into CodecUtils, but this would also not help with changes in format.
        Hide
        Robert Muir added a comment -

        thanks for beefing this up. the .si file is really centric to the segment, so any safety we can add is good.

        A few questions:

        • Can we encode 3 ints instead of 4? As far as I know, the 'prerelease' was added to support 4.0-alpha/4.0-beta. This was confusing (my fault), and this confusion ultimately worked its way into an index corruption bug. I think we should try to contain it to 4.0 instead and not keep things complicated like that.
        • Can we consider just making a new 5.0 si writer? its a pain to bump the codec version, but I'll do the work here. We can remove conditionals like 'supports checksums' as well.
        • I agree we should put these methods in CodecUtil (CodecUtil.readVersion, writeVersion). To answer Uwe's questions about why a format change is needed for the version, IMO its way better to encode this in a way that does not require parsing,.

        We can followup with this by improving the exceptions for tiny "slurp-in" classes like this (I would personally, as in do the work, also fix .fnm, segments_N, .nvm, .dvm, .fdt, .tvx as well). I would add a CodecUtil.addSuppressedChecksum or something, to easily allow these guys to 'annotate' any exc on init with checksum failure information. These are small but important and it would help considering we are dodging challenges like JVM bugs here.

        I also want to bump 5.0 codec anyway, to fix the bug where Lucene42TermVectorsFormat uses the same codecName as Lucene41StoredFieldsFormat in the codec header, thats a stupid bug we should fix.

        Show
        Robert Muir added a comment - thanks for beefing this up. the .si file is really centric to the segment, so any safety we can add is good. A few questions: Can we encode 3 ints instead of 4? As far as I know, the 'prerelease' was added to support 4.0-alpha/4.0-beta. This was confusing (my fault), and this confusion ultimately worked its way into an index corruption bug. I think we should try to contain it to 4.0 instead and not keep things complicated like that. Can we consider just making a new 5.0 si writer? its a pain to bump the codec version, but I'll do the work here. We can remove conditionals like 'supports checksums' as well. I agree we should put these methods in CodecUtil (CodecUtil.readVersion, writeVersion). To answer Uwe's questions about why a format change is needed for the version, IMO its way better to encode this in a way that does not require parsing,. We can followup with this by improving the exceptions for tiny "slurp-in" classes like this (I would personally, as in do the work, also fix .fnm, segments_N, .nvm, .dvm, .fdt, .tvx as well). I would add a CodecUtil.addSuppressedChecksum or something, to easily allow these guys to 'annotate' any exc on init with checksum failure information. These are small but important and it would help considering we are dodging challenges like JVM bugs here. I also want to bump 5.0 codec anyway, to fix the bug where Lucene42TermVectorsFormat uses the same codecName as Lucene41StoredFieldsFormat in the codec header, thats a stupid bug we should fix.
        Hide
        Ryan Ernst added a comment -

        I agree we should put these methods in CodecUtil (CodecUtil.readVersion, writeVersion)

        IMO they should be part of the Version class, so that the constructor can stay private. No one should be constructing Versions, they should be using the constants, or parsing/reading them.

        Show
        Ryan Ernst added a comment - I agree we should put these methods in CodecUtil (CodecUtil.readVersion, writeVersion) IMO they should be part of the Version class, so that the constructor can stay private. No one should be constructing Versions, they should be using the constants, or parsing/reading them.
        Hide
        Robert Muir added a comment -

        That does not really matter, or hurt anything. Scattering our index format across tons of files? No thanks, that is a real price.

        Show
        Robert Muir added a comment - That does not really matter, or hurt anything. Scattering our index format across tons of files? No thanks, that is a real price.
        Hide
        Michael McCandless added a comment -

        An alternative would be to put both static methods into CodecUtils, but this would also not help with changes in format.

        Or I can make the SIWriter do its own (private) thing. Yeah, that's an "abstraction violation" (public Version ctor), and, yeah, future places that need to write/read versions constants (e.g. LUCENE-5954) will have to dup this code, but then the format is clearly owned by that writer/reader. Already we are debating 4 vs 3 ints (format change...).

        Can we encode 3 ints instead of 4? As far as I know, the 'prerelease' was added to support 4.0-alpha/4.0-beta. This was confusing (my fault), and this confusion ultimately worked its way into an index corruption bug. I think we should try to contain it to 4.0 instead and not keep things complicated like that.

        OK... but should we never expect to use prerelease anymore (e.g 5.0)?

        Can we consider just making a new 5.0 si writer? its a pain to bump the codec version, but I'll do the work here. We can remove conditionals like 'supports checksums' as well.

        +1

        Separately we should make it easier to roll a new Codec version ... it's bad if it's "daunting" since it pressures us to hide biggish changes under the existing writers.

        We can followup with this by improving the exceptions for tiny "slurp-in" classes like this (I would personally, as in do the work, also fix .fnm, segments_N, .nvm, .dvm, .fdt, .tvx as well). I would add a CodecUtil.addSuppressedChecksum or something, to easily allow these guys to 'annotate' any exc on init with checksum failure information. These are small but important and it would help considering we are dodging challenges like JVM bugs here.

        Big +1: this would mean on any strange exc when reading these files, we would also see if (in addition) their checksum did or did not match? This saves the extra hassle of asking user to run CheckIndex to figure out if that file was corrupt...

        I also want to bump 5.0 codec anyway, to fix the bug where Lucene42TermVectorsFormat uses the same codecName as Lucene41StoredFieldsFormat in the codec header, thats a stupid bug we should fix.

        OK.

        I think I'll break out the format change from this issue, and leave this as just improving the Version error messages, having it not judge major version, etc... I'll open a new issue for Lucene50Codec.

        Show
        Michael McCandless added a comment - An alternative would be to put both static methods into CodecUtils, but this would also not help with changes in format. Or I can make the SIWriter do its own (private) thing. Yeah, that's an "abstraction violation" (public Version ctor), and, yeah, future places that need to write/read versions constants (e.g. LUCENE-5954 ) will have to dup this code, but then the format is clearly owned by that writer/reader. Already we are debating 4 vs 3 ints (format change...). Can we encode 3 ints instead of 4? As far as I know, the 'prerelease' was added to support 4.0-alpha/4.0-beta. This was confusing (my fault), and this confusion ultimately worked its way into an index corruption bug. I think we should try to contain it to 4.0 instead and not keep things complicated like that. OK... but should we never expect to use prerelease anymore (e.g 5.0)? Can we consider just making a new 5.0 si writer? its a pain to bump the codec version, but I'll do the work here. We can remove conditionals like 'supports checksums' as well. +1 Separately we should make it easier to roll a new Codec version ... it's bad if it's "daunting" since it pressures us to hide biggish changes under the existing writers. We can followup with this by improving the exceptions for tiny "slurp-in" classes like this (I would personally, as in do the work, also fix .fnm, segments_N, .nvm, .dvm, .fdt, .tvx as well). I would add a CodecUtil.addSuppressedChecksum or something, to easily allow these guys to 'annotate' any exc on init with checksum failure information. These are small but important and it would help considering we are dodging challenges like JVM bugs here. Big +1: this would mean on any strange exc when reading these files, we would also see if (in addition) their checksum did or did not match? This saves the extra hassle of asking user to run CheckIndex to figure out if that file was corrupt... I also want to bump 5.0 codec anyway, to fix the bug where Lucene42TermVectorsFormat uses the same codecName as Lucene41StoredFieldsFormat in the codec header, thats a stupid bug we should fix. OK. I think I'll break out the format change from this issue, and leave this as just improving the Version error messages, having it not judge major version, etc... I'll open a new issue for Lucene50Codec.
        Hide
        Michael McCandless added a comment -

        I opened LUCENE-5969 for the 5.0 codec.

        Show
        Michael McCandless added a comment - I opened LUCENE-5969 for the 5.0 codec.
        Hide
        ASF subversion and git services added a comment -

        Commit 1626536 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1626536 ]

        LUCENE-5952: String.split seems to be OK in 1.8.0_20

        Show
        ASF subversion and git services added a comment - Commit 1626536 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1626536 ] LUCENE-5952 : String.split seems to be OK in 1.8.0_20
        Hide
        ASF subversion and git services added a comment -

        Commit 1626546 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1626546 ]

        LUCENE-5952: better error messages when version fails to parse; use simpler string tokenizer; don't check major versions at such a low level

        Show
        ASF subversion and git services added a comment - Commit 1626546 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1626546 ] LUCENE-5952 : better error messages when version fails to parse; use simpler string tokenizer; don't check major versions at such a low level
        Hide
        ASF subversion and git services added a comment -

        Commit 1626550 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1626550 ]

        LUCENE-5952: better error messages when version fails to parse; use simpler string tokenizer; don't check major versions at such a low level

        Show
        ASF subversion and git services added a comment - Commit 1626550 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1626550 ] LUCENE-5952 : better error messages when version fails to parse; use simpler string tokenizer; don't check major versions at such a low level
        Hide
        ASF subversion and git services added a comment -

        Commit 1626551 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1626551 ]

        LUCENE-5952: remove dead writer.version parsing code

        Show
        ASF subversion and git services added a comment - Commit 1626551 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1626551 ] LUCENE-5952 : remove dead writer.version parsing code
        Hide
        ASF subversion and git services added a comment -

        Commit 1626552 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1626552 ]

        LUCENE-5952: remove dead writer.version parsing code

        Show
        ASF subversion and git services added a comment - Commit 1626552 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1626552 ] LUCENE-5952 : remove dead writer.version parsing code
        Hide
        ASF subversion and git services added a comment -

        Commit 1626553 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1626553 ]

        LUCENE-5952: remove dead writer.version parsing code

        Show
        ASF subversion and git services added a comment - Commit 1626553 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1626553 ] LUCENE-5952 : remove dead writer.version parsing code
        Hide
        Michael McCandless added a comment -

        Fixed ... we can iterate on the format change in LUCENE-5969.

        Show
        Michael McCandless added a comment - Fixed ... we can iterate on the format change in LUCENE-5969 .
        Hide
        Michael McCandless added a comment -

        Bulk close for Lucene/Solr 4.10.1 release

        Show
        Michael McCandless added a comment - Bulk close for Lucene/Solr 4.10.1 release
        Hide
        Dave Borowitz added a comment -

        This patch in Lucene 4.10.1 breaks code that used to compile under 4.10.0, which could safely assume Version.parse(Leniently) throws no exceptions. Is backwards incompatibility in a bugfix release common, or was this an oversight?

        Show
        Dave Borowitz added a comment - This patch in Lucene 4.10.1 breaks code that used to compile under 4.10.0, which could safely assume Version.parse(Leniently) throws no exceptions. Is backwards incompatibility in a bugfix release common, or was this an oversight?
        Hide
        Michael McCandless added a comment -

        Hi Dave,

        You're right, and I'm sorry about this violation of our back compat policy.

        But it was in fact done intentionally and we made an exception here: the API is really intended for internal use (marked @lucene.internal as of 4.10.2).

        Also, in 4.10.1 it was already throwing an IllegalArgumentException if you pass it an un-parseable version string; the change to the checked ParseException just means you must now explicitly handle this possible serious error.

        Show
        Michael McCandless added a comment - Hi Dave, You're right, and I'm sorry about this violation of our back compat policy. But it was in fact done intentionally and we made an exception here: the API is really intended for internal use (marked @lucene.internal as of 4.10.2). Also, in 4.10.1 it was already throwing an IllegalArgumentException if you pass it an un-parseable version string; the change to the checked ParseException just means you must now explicitly handle this possible serious error.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development