Lucene - Core
  1. Lucene - Core
  2. LUCENE-5992

Version should not be encoded as a String in the index

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The version is really "just" 3 (maybe 4) ints under-the-hood, but today we write it as a String which then requires spooky string tokenization/parsing when we open the index. I think it should be encoded directly as ints.

      In LUCENE-5952 I had tried to make this change, but it was controversial, and got booted.

      Then in LUCENE-5969, I tried again, but that issue has morphed (nicely!) into fixing all sorts of things except these three ints.

      Maybe 3rd time's a charm

      1. LUCENE-5992_tests.patch
        25 kB
        Robert Muir
      2. LUCENE-5992.patch
        3 kB
        Robert Muir
      3. LUCENE-5992.patch
        3 kB
        Robert Muir
      4. LUCENE-5992.patch
        8 kB
        Michael McCandless
      5. LUCENE-5992.patch
        6 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Here's a starting patch so the factions can start duking it out:

        I put the readVersion/writeVersion methods inside CodecUtil.java.
        Yes, it's not great that this means Version's ctor is public, but in
        CodecUtil it's at least more apparent that these methods are
        committing to a particular format, and that format is really "scoped"
        by the caller (meaning if ever we change what ints we write / how we
        encode them, the caller's own format tracking must handle that
        change). I put spooky javadocs and comments to convey this
        situation.

        I'm also defensively writing 4 (not 3) ints since it seems at least
        possible we may once again release alpha/beta releases and it seems
        safest to at least not preclude the possibility ... but some factions
        want 3 ints instead.

        Finally, I'm encoding using vInt, not unsigned byte. I know this is
        ridiculously defensive, and surely Lucene will never get to major or
        minor or bugfix version 256 (it would be a strange world), but the
        fact is that we are using java's "int" for these instance vars in
        Version.java... I can't remember whether there were factions
        upset by this.

        But as is I (my faction) think the patch is ready!

        Show
        Michael McCandless added a comment - Here's a starting patch so the factions can start duking it out: I put the readVersion/writeVersion methods inside CodecUtil.java. Yes, it's not great that this means Version's ctor is public, but in CodecUtil it's at least more apparent that these methods are committing to a particular format, and that format is really "scoped" by the caller (meaning if ever we change what ints we write / how we encode them, the caller's own format tracking must handle that change). I put spooky javadocs and comments to convey this situation. I'm also defensively writing 4 (not 3) ints since it seems at least possible we may once again release alpha/beta releases and it seems safest to at least not preclude the possibility ... but some factions want 3 ints instead. Finally, I'm encoding using vInt, not unsigned byte. I know this is ridiculously defensive, and surely Lucene will never get to major or minor or bugfix version 256 (it would be a strange world), but the fact is that we are using java's "int" for these instance vars in Version.java... I can't remember whether there were factions upset by this. But as is I (my faction) think the patch is ready!
        Hide
        Uwe Schindler added a comment - - edited

        About the "fixed format" issue (if we for example in future remove add components): We could theoretically write another prefix byte as marker for the version format? Not sure if this is really good, it would just be a possibility to change the encoding format.

        About the public version ctor: I don't like this. Initially I thougth CodecUtil is part of util package, too, so it could be pkg-protected, but this is not the case In general I would instead prefer to have the ctor hidden again, but add some static method with good comment not to use it. A ctor always makes people try to use it, especially because of Eclipse autocomplete. So better use Version.buildFromComponents(major, minor,...). This would also be consistent with parse* API.

        Show
        Uwe Schindler added a comment - - edited About the "fixed format" issue (if we for example in future remove add components): We could theoretically write another prefix byte as marker for the version format? Not sure if this is really good, it would just be a possibility to change the encoding format. About the public version ctor: I don't like this. Initially I thougth CodecUtil is part of util package, too, so it could be pkg-protected, but this is not the case In general I would instead prefer to have the ctor hidden again, but add some static method with good comment not to use it. A ctor always makes people try to use it, especially because of Eclipse autocomplete. So better use Version.buildFromComponents(major, minor,...). This would also be consistent with parse* API.
        Hide
        Michael McCandless added a comment -

        If we add the format byte then we could move read/write back into Version.java and keep the ctor private ... I think I like that option.

        Show
        Michael McCandless added a comment - If we add the format byte then we could move read/write back into Version.java and keep the ctor private ... I think I like that option.
        Hide
        Uwe Schindler added a comment -

        I am fine with both options. For me it is important that Version has no public ctor (under no circumstances), so people using Eclipse autocomplete do not "naturally" use the ctor to pass version constants to analyzers or other stuff in other public APIs. This will cause endless bug reports. A separate static method should be preferred here, because it has a "method name" that explains what it does. The eclipse user has to choose carefully and cannot automatically use the worst option.

        Show
        Uwe Schindler added a comment - I am fine with both options. For me it is important that Version has no public ctor (under no circumstances), so people using Eclipse autocomplete do not "naturally" use the ctor to pass version constants to analyzers or other stuff in other public APIs. This will cause endless bug reports. A separate static method should be preferred here, because it has a "method name" that explains what it does. The eclipse user has to choose carefully and cannot automatically use the worst option.
        Hide
        Robert Muir added a comment -

        What are you defending against? In LUCENE-5969: we fixed addIndexes(Dir) to just copy the .si file too, so old SI writers are read-only, instead files() is just always stripped like CFS.

        So if we need to write 4 ints, we just make a new SI that does that? But no longer must an SI "write for versions in the future"

        Show
        Robert Muir added a comment - What are you defending against? In LUCENE-5969 : we fixed addIndexes(Dir) to just copy the .si file too, so old SI writers are read-only, instead files() is just always stripped like CFS. So if we need to write 4 ints, we just make a new SI that does that? But no longer must an SI "write for versions in the future"
        Hide
        Jack Krupansky added a comment -

        What about "versions" of an index during the development process, like each time a change to the index format is committed? Such as the alpha and beta stages in 4.0?

        I'd be happier with four version ints: major, minor, patch, "change".

        Although, in theory, we shouldn't be changing the index format in either minor or patch releases, but bug fixes for indexing can be valid changes as well.

        Now, the question is whether "change" should reset to zero each time we branch, or should really just be an ever-increasing "index format version number." The latter may make sense, but either is fine. The latter also makes sense from the perspective of the potential of successive releases which don't introduce index incompatibilities. I lean towards the latter, but still makes sense to defensively record which release wrote an index.

        Show
        Jack Krupansky added a comment - What about "versions" of an index during the development process, like each time a change to the index format is committed? Such as the alpha and beta stages in 4.0? I'd be happier with four version ints: major, minor, patch, "change". Although, in theory, we shouldn't be changing the index format in either minor or patch releases, but bug fixes for indexing can be valid changes as well. Now, the question is whether "change" should reset to zero each time we branch, or should really just be an ever-increasing "index format version number." The latter may make sense, but either is fine. The latter also makes sense from the perspective of the potential of successive releases which don't introduce index incompatibilities. I lean towards the latter, but still makes sense to defensively record which release wrote an index.
        Hide
        Michael McCandless added a comment -

        What about "versions" of an index during the development process, like each time a change to the index format is committed?

        We are freely allowed to completely break the index format within one release.

        So it's only on releasing, that we commit to a published version of the format ... I don't think we should add the 4th "change" int, at least not on this issue. It's hard enough just writing and reading the 3 ints!

        Show
        Michael McCandless added a comment - What about "versions" of an index during the development process, like each time a change to the index format is committed? We are freely allowed to completely break the index format within one release. So it's only on releasing, that we commit to a published version of the format ... I don't think we should add the 4th "change" int, at least not on this issue. It's hard enough just writing and reading the 3 ints!
        Hide
        Michael McCandless added a comment -

        New patch.

        I moved write/read into Version.java, made its ctor private again, added a single-byte format header, only write 3 ints (and throw an exc if anyone ever tries to write a version whose prerelease != 0), switched to unsigned byte from vInt, and added some tests.

        I think it's ready.

        Show
        Michael McCandless added a comment - New patch. I moved write/read into Version.java, made its ctor private again, added a single-byte format header, only write 3 ints (and throw an exc if anyone ever tries to write a version whose prerelease != 0), switched to unsigned byte from vInt, and added some tests. I think it's ready.
        Hide
        Ryan Ernst added a comment -

        Patch LGTM.

        Show
        Ryan Ernst added a comment - Patch LGTM.
        Hide
        Uwe Schindler added a comment - - edited

        The documentation of Version#write is now incorrect. It still says "vint". Also this comment is now obsolete (I think): "// NOTE: do not change this format! If you do, you must fix all callers of this API, everywhere in the world to handle the format change:"
        In fact we can change it now, because of the marker byte.

        I had another idea: the first byte could also be the "number of version components" (like when writing a string). If we write version with 3 components, it is 3, 4 otherwise. Then we have full flexibility, too.

        Show
        Uwe Schindler added a comment - - edited The documentation of Version#write is now incorrect. It still says "vint". Also this comment is now obsolete (I think): "// NOTE: do not change this format! If you do, you must fix all callers of this API, everywhere in the world to handle the format change:" In fact we can change it now, because of the marker byte. I had another idea: the first byte could also be the "number of version components" (like when writing a string). If we write version with 3 components, it is 3, 4 otherwise. Then we have full flexibility, too.
        Hide
        Robert Muir added a comment -

        Here is my patch. I dont want to sound like a complainer, so i just offer another solution.

        When i see version encoding itself being versioned, thats a clear sign this is going in a worse direction, not better. Keep it simple please!

        Show
        Robert Muir added a comment - Here is my patch. I dont want to sound like a complainer, so i just offer another solution. When i see version encoding itself being versioned, thats a clear sign this is going in a worse direction, not better. Keep it simple please!
        Hide
        Robert Muir added a comment -

        Simpler iteration. Since major/minor/bugfix etc are 'public' in Version (its clearly a thin wrapper around these and they are not an internal detail), then we don't need the toArray method.

        Again, the important stuff is this: if we want to do something crazy like write massive numbers, or write alpha/beta, just make a new SIFormat. We don't need to future proof things inside the codec API.

        Show
        Robert Muir added a comment - Simpler iteration. Since major/minor/bugfix etc are 'public' in Version (its clearly a thin wrapper around these and they are not an internal detail), then we don't need the toArray method. Again, the important stuff is this: if we want to do something crazy like write massive numbers, or write alpha/beta, just make a new SIFormat. We don't need to future proof things inside the codec API.
        Hide
        Michael McCandless added a comment -

        +1 for this patch: it's much simpler, and takes away the whole "versioning of the version" problem!

        Show
        Michael McCandless added a comment - +1 for this patch: it's much simpler, and takes away the whole "versioning of the version" problem!
        Hide
        Uwe Schindler added a comment -

        +1, I like the static method name!

        Show
        Uwe Schindler added a comment - +1, I like the static method name!
        Hide
        Ryan Ernst added a comment -

        +1 to the latest patch. Can we add @lucene.internal to docs for Version.fromBits method?

        Show
        Ryan Ernst added a comment - +1 to the latest patch. Can we add @lucene.internal to docs for Version.fromBits method?
        Hide
        Michael McCandless added a comment -

        I'll commit Rob's last patch, and add @lucene.internal...

        Show
        Michael McCandless added a comment - I'll commit Rob's last patch, and add @lucene.internal...
        Hide
        ASF subversion and git services added a comment -

        Commit 1629769 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1629769 ]

        LUCENE-5992: encode version using 3 ints, not String, for Lucene 5.x indices

        Show
        ASF subversion and git services added a comment - Commit 1629769 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1629769 ] LUCENE-5992 : encode version using 3 ints, not String, for Lucene 5.x indices
        Hide
        Robert Muir added a comment -

        Thanks Mike (i am intermittent), i just wanted to add tests and so on too.

        I also hate that we pass FIS to SIWriter (it just made tests hard and seems bogus??? so i removed that too).

        see patch

        Show
        Robert Muir added a comment - Thanks Mike (i am intermittent), i just wanted to add tests and so on too. I also hate that we pass FIS to SIWriter (it just made tests hard and seems bogus??? so i removed that too). see patch
        Hide
        Michael McCandless added a comment -

        Thanks Rob, those tests and the removal of FIS from SIWriter look great ... I'll commit shortly & backport to 5.x

        Show
        Michael McCandless added a comment - Thanks Rob, those tests and the removal of FIS from SIWriter look great ... I'll commit shortly & backport to 5.x
        Hide
        ASF subversion and git services added a comment -

        Commit 1629774 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1629774 ]

        LUCENE-5992: add SI tests; remove FieldInfos from SegmentInfosWrite.write

        Show
        ASF subversion and git services added a comment - Commit 1629774 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1629774 ] LUCENE-5992 : add SI tests; remove FieldInfos from SegmentInfosWrite.write
        Hide
        ASF subversion and git services added a comment -

        Commit 1629775 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1629775 ]

        LUCENE-5992: encode version using 3 ints, not String, for Lucene 5.x indices; add SI tests; remove FieldInfos from SegmentInfosWrite.write

        Show
        ASF subversion and git services added a comment - Commit 1629775 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1629775 ] LUCENE-5992 : encode version using 3 ints, not String, for Lucene 5.x indices; add SI tests; remove FieldInfos from SegmentInfosWrite.write
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development