Details

    • Type: Task Task
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      PackedInts are more and more used to save/restore small arrays, but given that they are long-aligned, up to 63 bits are wasted per array. We should try to make PackedInts storage byte-aligned so that only 7 bits are wasted in the worst case.

      1. LUCENE-4536.patch
        36 kB
        Adrien Grand
      2. LUCENE-4536.patch
        36 kB
        Adrien Grand

        Activity

        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1406660

        LUCENE-4536: Make PackedInts on-disk format byte-aligned (merged from r1406651).

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1406660 LUCENE-4536 : Make PackedInts on-disk format byte-aligned (merged from r1406651).
        Hide
        Adrien Grand added a comment -

        Committed:

        • trunk: r1406651
        • branch 4.x: r1406660
        Show
        Adrien Grand added a comment - Committed: trunk: r1406651 branch 4.x: r1406660
        Hide
        Adrien Grand added a comment -

        New patch including Mike's suggestions:

        • VERSION_LONG_ALIGNED renamed to VERSION_START,
        • VERSION_CURRENT aliased to VERSION_BYTE_ALIGNED,
        • added an assert that the sneaky direct reader impl can only be instantiated if the stream has been produced with VERSION_START.
        Show
        Adrien Grand added a comment - New patch including Mike's suggestions: VERSION_LONG_ALIGNED renamed to VERSION_START, VERSION_CURRENT aliased to VERSION_BYTE_ALIGNED, added an assert that the sneaky direct reader impl can only be instantiated if the stream has been produced with VERSION_START.
        Hide
        Michael McCandless added a comment -

        These were my first ideas, but the truth is that I was very scared to break something (for example doc values rely on the assumption that after reading the last value of a direct array, the whole stream is consumed).

        It's hard to know what's best

        I like the explicitness / transparency / no sneaky code solution of
        .skipTrailingBytes().

        But then I don't like that skipTrailingBytes would only be for back
        compat (ie, we will remove it eventually, unless somehow we go back to
        wasted trailing bytes) ... annoying to add essentially a deprecated
        API.

        But then really it's "presumptuous" of the consumers of PackedInts to
        expect all bytes are consumed after iterating all values ... like
        that's making a sometimes invalid assumption about the file format of
        PackedInts.

        And this is an internal API so we are free to change things ..

        But net/net I think we should stick w/ your current patch?

        Show
        Michael McCandless added a comment - These were my first ideas, but the truth is that I was very scared to break something (for example doc values rely on the assumption that after reading the last value of a direct array, the whole stream is consumed). It's hard to know what's best I like the explicitness / transparency / no sneaky code solution of .skipTrailingBytes(). But then I don't like that skipTrailingBytes would only be for back compat (ie, we will remove it eventually, unless somehow we go back to wasted trailing bytes) ... annoying to add essentially a deprecated API. But then really it's "presumptuous" of the consumers of PackedInts to expect all bytes are consumed after iterating all values ... like that's making a sometimes invalid assumption about the file format of PackedInts. And this is an internal API so we are free to change things .. But net/net I think we should stick w/ your current patch?
        Hide
        Adrien Grand added a comment -

        This patch only changes the on-disk format right? The specialized in-memory readers are still backed by native arrays (short[]/int[]/long[], etc.)?

        Exactly.

        Ie, in general, I think the version constants should be created once and then not changed (write once), and VERSION_CURRENT changes to point to whichever is most recent.

        Ok, I'll change it.

        That careful anonymous subclass in PackedInts to handle seeking to the end when the last value is read is sort of sneaky ... this should only kick in when reading the old (long-aligned) format right?

        This only happens when reading the old format AND the number of bytes used to serialized the array is not a multiple of 8. I'll add an assert to make sure that this condition can only be true with the old format.

        Or ... maybe... we should not "promise" this (no trailing wasted bytes) in the API?

        Or maybe we expose a new explicit method to "seek to the end of this packed ints" or something (eg maybe "skipTrailingBytes").

        These were my first ideas, but the truth is that I was very scared to break something (for example doc values rely on the assumption that after reading the last value of a direct array, the whole stream is consumed). Fixing PackedInts to make sure those assumptions are still true looked easier to me as I was able to create "fake" long-aligned packed ints and make sure that the whole stream was consumed after reading the last value.

        But your option makes perfect sense to me and I will do it if you think it is cleaner.

        Thanks for the review!

        Show
        Adrien Grand added a comment - This patch only changes the on-disk format right? The specialized in-memory readers are still backed by native arrays (short[]/int[]/long[], etc.)? Exactly. Ie, in general, I think the version constants should be created once and then not changed (write once), and VERSION_CURRENT changes to point to whichever is most recent. Ok, I'll change it. That careful anonymous subclass in PackedInts to handle seeking to the end when the last value is read is sort of sneaky ... this should only kick in when reading the old (long-aligned) format right? This only happens when reading the old format AND the number of bytes used to serialized the array is not a multiple of 8. I'll add an assert to make sure that this condition can only be true with the old format. Or ... maybe... we should not "promise" this (no trailing wasted bytes) in the API? Or maybe we expose a new explicit method to "seek to the end of this packed ints" or something (eg maybe "skipTrailingBytes"). These were my first ideas, but the truth is that I was very scared to break something (for example doc values rely on the assumption that after reading the last value of a direct array, the whole stream is consumed). Fixing PackedInts to make sure those assumptions are still true looked easier to me as I was able to create "fake" long-aligned packed ints and make sure that the whole stream was consumed after reading the last value. But your option makes perfect sense to me and I will do it if you think it is cleaner. Thanks for the review!
        Hide
        Michael McCandless added a comment -

        This patch only changes the on-disk format right? The specialized
        in-memory readers are still backed by native arrays
        (short[]/int[]/long[], etc.)?

        Instead of PackedInts.VERSION_CURRENT = 1 can we add
        VERSION_BYTE_ALIGNED = 1 and then set VERSION_CURRENT =
        VERSION_BYTE_ALIGNED?

        Also, can we leave VERSION_START = 0 (ie don't rename that to
        VERSION_LONG_ALIGNED)? But we should put a comment saying that one
        was long aligned ...

        Ie, in general, I think the version constants should be created once
        and then not changed (write once), and VERSION_CURRENT changes to
        point to whichever is most recent.

        That careful anonymous subclass in PackedInts to handle seeking to the
        end when the last value is read is sort of sneaky ... this should only
        kick in when reading the old (long-aligned) format right? Can you add
        an assert checking that the version is VERSION_START? Or ... maybe
        ... we should not "promise" this (no trailing wasted bytes) in the
        API? Or maybe we expose a new explicit method to "seek to the end of
        this packed ints" or something (eg maybe "skipTrailingBytes").

        Show
        Michael McCandless added a comment - This patch only changes the on-disk format right? The specialized in-memory readers are still backed by native arrays (short[]/int[]/long[], etc.)? Instead of PackedInts.VERSION_CURRENT = 1 can we add VERSION_BYTE_ALIGNED = 1 and then set VERSION_CURRENT = VERSION_BYTE_ALIGNED? Also, can we leave VERSION_START = 0 (ie don't rename that to VERSION_LONG_ALIGNED)? But we should put a comment saying that one was long aligned ... Ie, in general, I think the version constants should be created once and then not changed (write once), and VERSION_CURRENT changes to point to whichever is most recent. That careful anonymous subclass in PackedInts to handle seeking to the end when the last value is read is sort of sneaky ... this should only kick in when reading the old (long-aligned) format right? Can you add an assert checking that the version is VERSION_START? Or ... maybe ... we should not "promise" this (no trailing wasted bytes) in the API? Or maybe we expose a new explicit method to "seek to the end of this packed ints" or something (eg maybe "skipTrailingBytes").
        Hide
        Adrien Grand added a comment -

        Patch. I added a test so that two assumptions that were made by callers are still true, even if you are reading packed ints streams generated with 4.0 with this patch applied:

        • after reading all values of a Readeriterator, the IndexInput is positioned at the end of the stream,
        • after reading the last value of a direct Reader, the IndexInput is positioned at the end of the stream.
        Show
        Adrien Grand added a comment - Patch. I added a test so that two assumptions that were made by callers are still true, even if you are reading packed ints streams generated with 4.0 with this patch applied: after reading all values of a Readeriterator, the IndexInput is positioned at the end of the stream, after reading the last value of a direct Reader, the IndexInput is positioned at the end of the stream.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development