Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.

      The problem is that the serialization format is self-contained and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings lists, this is a lot of overhead (at least ~60 bytes per term if you only use one Writer per term, more otherwise).

      Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write it once (or more!) per term.

      1. LUCENE-4161.patch
        533 kB
        Adrien Grand
      2. LUCENE-4161.patch
        513 kB
        Adrien Grand

        Issue Links

          Activity

          Michael McCandless added a comment -

          +1

          We could really use this on LUCENE-3892 where the FOR postings format is really just packed ints. Would be best if we could simply efficiently use oal.util.packed.* for this.

          Adrien Grand added a comment -

          First version of the patch.

          A few things that were internal now need to be exposed, so I tried to do some clean up:

          • CODEC_NAME and CODEC_VERSION_{START,CURRENT} are public,
          • the format is an enum (PackedInts.Format.{PACKED,PACKED_SINGLE_BLOCK}),

          • improved docs overall.

          There are new factory methods get{Reader,ReaderIterator,Writer}NoHeader that do the same as their get{Reader,ReaderIterator,Writer} counterparts, but with no header writing/checking.

          Improved performance of Reader/Mutable bulk methods (using code generation, see http://people.apache.org/~jpountz/packed_ints.html vs. http://people.apache.org/~jpountz/packed_ints2.html).

          ReaderIterator and Writer now use the same code as Reader/Mutable bulk methods so they are likely to be much faster too. In addition, ReaderIterator now allows consumers to retrieve several values at the same time.

          Direct* and Packed*ThreeBlocks had a lot of duplicate code that could not be factored out, so I created scripts to generate them.

          Something that might still slow down ReaderIterator (probably the most useful class for codecs) a bit is that it always reads one long at a time. Adding a method to bulk-read longs to DataInput (similar to readBytes) might improve performance. This probably deserves another JIRA issue and can be done later.
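          The bulk-read idea suggested above can be sketched in plain Java (a hypothetical stand-in, not Lucene's actual DataInput API): composing each long from 8 big-endian bytes one call at a time, versus filling a long[] in a single pass over the underlying bytes.

```java
import java.nio.ByteBuffer;

// Sketch (hypothetical, not Lucene code): per-long reads vs. a bulk variant.
public class ReadLongsSketch {
    // Compose one big-endian long from 8 bytes, as a per-long read would.
    static long readLong(byte[] src, int offset) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (src[offset + i] & 0xFFL);
        }
        return v;
    }

    // The bulk variant the comment suggests: fill dst with len longs at once.
    static void readLongs(byte[] src, int offset, long[] dst, int len) {
        ByteBuffer bb = ByteBuffer.wrap(src, offset, len * 8); // big-endian by default
        for (int i = 0; i < len; i++) {
            dst[i] = bb.getLong();
        }
    }
}
```

          The bulk variant avoids one method call (and its bounds checks) per long, which is the overhead the comment is pointing at.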

          Michael McCandless added a comment -

          Wow, this patch is impressive! Lots of amazing changes... very cool
          how you factored out a simple common API for bulk write/read of all
          the formats.

          Should computeN be a non-static method on the BulkOperation base class?
          (It just seems odd to make a static method whose first arg is an instance
          of that class anyway...)

          Can we find a better name for computeN? I think n is the number of
          blocks we can buffer up given the RAM "budget"? computeNumBlocks?
          computeNumBufferedBlocks? computeBufferedBlocksCount? Something
          else...?

          I suspect, to use these for codecs, we will want to have versions that
          work on int[] values instead (everything we encode are ints:
          docIDs/deltas, term freqs, offsets, positions).

          Code styling: can we use three lines, i.e.:

          -    if (valueCount > MAX_SIZE) {
          -      throw new ArrayIndexOutOfBoundsException("MAX_SIZE exceeded");
          -    }
          

          instead of one line:

          +    if (valueCount > MAX_SIZE) { throw new ArrayIndexOutOfBoundsException("MAX_SIZE exceeded"); }
          

          in general?

          Does this change the on-disk format? I think no? (if those
          Format.getId()s match?) If it does change, we need back-compat (4.0.0-alpha
          has left the station...).

          Adrien Grand added a comment -

          Can we find a better name for computeN?

          The meaning of n is actually a bit complicated. For every number of bits per value, there is a minimum number of blocks (b) / values (v) you need to write in order to reach the next block boundary:

          • 16 bits per value -> b=1, v=4
          • 24 bits per value -> b=3, v=8
          • 50 bits per value -> b=25, v=32
          • 63 bits per value -> b=63, v=64
          • ...

          A bulk read consists of copying n*v values contained in n*b blocks into a long[] (higher values of n are likely to yield better throughput), so it requires n * (b + v) longs in memory; this is why I compute n as ramBudget / (8 * (b + v)) (since a long is 8 bytes). I called it n in the method name because I have no idea how to name it... "iterations", maybe?
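          The b/v relationship above can be sketched in plain Java (a hypothetical helper, not code from the patch): b and v fall out of the greatest common divisor of bitsPerValue and the 64-bit block size, and n is then the RAM budget divided by 8 * (b + v).

```java
// Sketch (not Lucene code): derive b (blocks) and v (values) for a given
// bitsPerValue, then compute n = ramBudget / (8 * (b + v)) as described above.
public class IterationsSketch {
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

    // Smallest number of 64-bit blocks holding a whole number of values.
    static long blocks(int bitsPerValue) { return bitsPerValue / gcd(bitsPerValue, 64); }

    // Number of values stored in exactly that many blocks.
    static long values(int bitsPerValue) { return 64 / gcd(bitsPerValue, 64); }

    static long iterations(int bitsPerValue, long ramBudgetBytes) {
        long b = blocks(bitsPerValue), v = values(bitsPerValue);
        return ramBudgetBytes / (8 * (b + v)); // a long is 8 bytes
    }
}
```

          This reproduces the examples in the list: 16 bits gives b=1, v=4; 24 bits gives b=3, v=8; 50 bits gives b=25, v=32; 63 bits gives b=63, v=64.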

          I suspect, to use these for codecs, we will want to have versions that work on int[] values instead (everything we encode are ints: docIDs/deltas, term freqs, offsets, positions).

          I hesitated to do this since it would involve some code duplication, but I guess it can't be avoided if we want this API to be actually used... What additional methods do you think we need?

          • PackedReaderIterator.nextInts(int count)
          • others?

          [static computeN], [code style]

          You are right, I will fix it!

          Does this change the on-disk format?

          No, it doesn't. I will add unit tests for that...

          Michael McCandless added a comment -

          The meaning of n is actually a bit complicated.

          Thank you for the explanation! That makes sense. I think "iterations" is good? Or ... maybe we simply leave it as n and then put this nice explanation in there as a comment?

          Naming is the hardest part

          What additional methods do you think we need?

          I'm not sure off-hand yet ... we've been iterating in LUCENE-3892 to find the least-cost way to decode from the underlying byte-based storage from the IndexInput, but with no clear fastest solution yet. Logically we are currently storing an int[] and decoding into int[], so I guess encode/decode to/from int[]? We should probably try long[] as the backing too ... but I think we should explore this (adding int[]-based methods) under a new issue? This patch is already great progress.

          Adrien Grand added a comment -

          New patch. I renamed 'n' to 'iterations', fixed the style issues and improved documentation. All core tests pass, including the backward-compatibility tests I added in r1356228.

          I think it is a good idea to work on int[] encoding/decoding in a separate issue given how big this patch already is.

          Michael McCandless added a comment -

          +1, patch looks great!

          Adrien Grand added a comment -

          Thanks, Mike! Committed (r1357159 on trunk and r1357166 on branch 4.x).

          Han Jiang added a comment -

          Hi Adrien, we're trying to use PackedInts in LUCENE-3892 to compress/decompress an int array. Currently, to support block skipping, we need to know the on-disk size of a compressed block before it is written to the output stream. It seems that PackedInts.Reader.ramBytesUsed() doesn't meet this requirement? Do we have other methods to solve this problem?

          Adrien Grand added a comment -

          Hi! I think PackedInts.Format.PACKED.nblocks(bitsPerValue, n) should help. It returns the number of long blocks required to store n bitsPerValue-bit values.
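          Assuming the PACKED format simply concatenates values across 64-bit block boundaries, the computation nblocks performs should amount to a ceiling division (a plain-Java sketch, not Lucene's actual implementation):

```java
// Sketch (not Lucene code): number of 64-bit blocks needed to store
// n values of bitsPerValue bits each, assuming values are packed
// contiguously across block boundaries (the PACKED format).
public class NBlocksSketch {
    static long nblocks(int bitsPerValue, long n) {
        long totalBits = n * bitsPerValue;
        return (totalBits + 63) / 64; // ceiling division by the block size
    }
}
```

          Multiplying the result by 8 then gives the on-disk size in bytes, which is what block skipping needs to know up front.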

          On your pfor branch, I'm afraid instantiating a PackedInts.ReaderIterator for every block would hurt performance, since blocks only contain 128 values. Maybe the easiest way to go would be to try replacing calls to PackedIntsDecompress.decode* with calls to PackedInts.BulkOperation.get/set. These methods are not exposed yet and only work with longs, but if it looks good to you, I should be able to come up with something in the next few days...


            People

            • Assignee: Adrien Grand
            • Reporter: Adrien Grand
            • Votes: 0
            • Watchers: 3
