Lucene - Core
LUCENE-4062

More fine-grained control over the packed integer implementation that is chosen

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 5.0
    • Component/s: core/other
    • Labels:
    • Lucene Fields:
      New, Patch Available

      Description

      In order to save space, Lucene has two main PackedInts.Mutable implementations: one that is very fast and based on a byte/short/int/long array (Direct*), and another one which packs bits in a memory-efficient manner (Packed*).

      The packed implementation tends to be much slower than the direct one, which discourages some Lucene components from using it. On the other hand, if you store 21-bit integers in a Direct32, the space loss is (32-21)/32 ≈ 34%.

      If you are willing to trade some space for speed, you could store 3 of these 21-bit integers in a single long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is that you never need to read more than one block to read or write a value, so it can be significantly faster than Packed32 and Packed64, which always need to read/write two blocks in order to avoid costly branches.
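
      As a rough, hypothetical sketch of this idea (not the actual patch; the class and method names below are made up for illustration), packing three 21-bit values into each 64-bit block could look like this:

        // Hypothetical sketch: 3 values per long, 1 unused (padding) bit per block,
        // and never more than one block touched per get/set.
        final class SingleBlock21Sketch {
          private static final long MASK = (1L << 21) - 1; // 21-bit mask
          private final long[] blocks;

          SingleBlock21Sketch(int valueCount) {
            blocks = new long[(valueCount + 2) / 3]; // 3 values per 64-bit block
          }

          long get(int index) {
            final int block = index / 3;        // which long holds the value
            final int shift = (index % 3) * 21; // bit offset inside that long
            return (blocks[block] >>> shift) & MASK;
          }

          void set(int index, long value) {
            final int block = index / 3;
            final int shift = (index % 3) * 21;
            blocks[block] = (blocks[block] & ~(MASK << shift)) // clear the old bits
                | ((value & MASK) << shift);                   // write the new value
          }
        }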

      I ran some tests, and for 10,000,000 21-bit values, this implementation takes less than 2% more space and has 44% faster writes and 30% faster reads. The 12-bit version (5 values per block) shows the same performance improvement and a 6% memory overhead compared to the packed implementation.

      In order to select the best implementation for a given integer size, I wrote the PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue) method. This method selects the fastest implementation that wastes less than acceptableOverheadPerValue bits per value. For example, if you accept an overhead of 20% (acceptableOverheadPerValue = 0.2f * bitsPerValue), which is pretty reasonable, here is which implementation would be selected for each number of bits per value:

      • 1: Packed64SingleBlock1
      • 2: Packed64SingleBlock2
      • 3: Packed64SingleBlock3
      • 4: Packed64SingleBlock4
      • 5: Packed64SingleBlock5
      • 6: Packed64SingleBlock6
      • 7: Direct8
      • 8: Direct8
      • 9: Packed64SingleBlock9
      • 10: Packed64SingleBlock10
      • 11: Packed64SingleBlock12
      • 12: Packed64SingleBlock12
      • 13: Packed64
      • 14: Direct16
      • 15: Direct16
      • 16: Direct16
      • 17: Packed64
      • 18: Packed64SingleBlock21
      • 19: Packed64SingleBlock21
      • 20: Packed64SingleBlock21
      • 21: Packed64SingleBlock21
      • 22: Packed64
      • 23: Packed64
      • 24: Packed64
      • 25: Packed64
      • 26: Packed64
      • 27: Direct32
      • 28: Direct32
      • 29: Direct32
      • 30: Direct32
      • 31: Direct32
      • 32: Direct32
      • 33: Packed64
      • 34: Packed64
      • 35: Packed64
      • 36: Packed64
      • 37: Packed64
      • 38: Packed64
      • 39: Packed64
      • 40: Packed64
      • 41: Packed64
      • 42: Packed64
      • 43: Packed64
      • 44: Packed64
      • 45: Packed64
      • 46: Packed64
      • 47: Packed64
      • 48: Packed64
      • 49: Packed64
      • 50: Packed64
      • 51: Packed64
      • 52: Packed64
      • 53: Packed64
      • 54: Direct64
      • 55: Direct64
      • 56: Direct64
      • 57: Direct64
      • 58: Direct64
      • 59: Direct64
      • 60: Direct64
      • 61: Direct64
      • 62: Direct64

      Under 32 bits per value, only 13, 17 and 22-26 bits per value would still choose the slower Packed64 implementation. Allowing a 50% overhead would prevent the packed implementation from being selected for any bits per value under 32. Allowing an overhead of 32 bits per value would make sure that a Direct* implementation is always selected.
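
      A hedged usage sketch of the proposed method (same signature as described above, assuming org.apache.lucene.util.packed.PackedInts is imported):

        // Hypothetical call site: accept up to 20% wasted bits per value in exchange
        // for a faster implementation (21 bits would then map to Packed64SingleBlock21).
        int valueCount = 10000000;
        int bitsPerValue = 21;
        float acceptableOverheadPerValue = 0.2f * bitsPerValue;
        PackedInts.Mutable mutable =
            PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue);
        mutable.set(0, 1234567L);
        long value = mutable.get(0);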

      Next steps would be to:

      • make Lucene components use this getMutable method and let users decide which trade-off better suits them,
      • write a Packed32SingleBlock implementation if necessary (I didn't do it because I have no 32-bit computer to test the performance improvements on).

      I think this would allow more fine-grained control over the speed/space trade-off. What do you think?

      Attachments

      1. PackedZero.java
        1 kB
        Toke Eskildsen
      2. PackedIntsBenchmark.java
        6 kB
        Adrien Grand
      3. PackedIntsBenchmark.java
        7 kB
        Toke Eskildsen
      4. PackedIntsBenchmark.java
        7 kB
        Toke Eskildsen
      5. PackedIntsBenchmark.java
        10 kB
        Toke Eskildsen
      6. Packed64Strategy.java
        14 kB
        Toke Eskildsen
      7. Packed64SingleBlock.java
        12 kB
        Adrien Grand
      8. Packed64calc.java
        9 kB
        Toke Eskildsen
      9. measurements_te_xeon.txt
        132 kB
        Toke Eskildsen
      10. measurements_te_p4.txt
        128 kB
        Toke Eskildsen
      11. measurements_te_i7.txt
        131 kB
        Toke Eskildsen
      12. measurements_te_graphs.pdf
        57 kB
        Toke Eskildsen
      13. LUCENE-4062-2.patch
        10 kB
        Adrien Grand
      14. LUCENE-4062.patch
        13 kB
        Adrien Grand
      15. LUCENE-4062.patch
        38 kB
        Adrien Grand
      16. LUCENE-4062.patch
        44 kB
        Adrien Grand
      17. LUCENE-4062.patch
        61 kB
        Adrien Grand
      18. LUCENE-4062.patch
        61 kB
        Adrien Grand
      19. LUCENE-4062.patch
        74 kB
        Adrien Grand
      20. LUCENE-4062.patch
        89 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Here is the patch.

        Michael McCandless added a comment -

        This is a great idea! I don't think it's necessary to create a Packed32SingleBlock right now ... this (Packed64SingleBlock) is already a great improvement. We can do it later...

        Somehow we need to fix the fasterButMoreRAM places (FieldCache, DocValues) to make use of this; maybe we change them from a boolean to the float acceptableOverhead instead?

        Adrien Grand added a comment -

        Hi Mike, thanks for your quick feedback.

        Agreed on Packed32SingleBlock. Regarding fasterButMoreRAM/roundFixedSize, the acceptableOverhead parameter currently is in bits per value. However I think most use-cases would be more interested in the overhead ratio than in its absolute value. So I think they should take a float acceptableOverheadRatio and call getMutable(valueCount, bitsPerValue, bitsPerValue * acceptableOverheadRatio). I'm interested in doing this, would you like me to include these changes in the patch or does it deserve a separate commit?

        Michael McCandless added a comment -

        Hi Adrien,

        I think we should do the cutover as part of this issue, so yes, please, take a stab at that!

        I agree the overhead should be measured in %tg not as bits-per-value. Maybe we also have public constants for the two extremes (zero overhead but possibly slow; super-fast but possibly lots of overhead)?

        Michael McCandless added a comment -

        Can we also move the concrete impls (eg Packed64SingleBlock21) either as static inner classes inside Packed64SingleBlock, or as separate source files? Thanks. We've had trouble w/ multiple classes in a single java source file in the past...

        Michael McCandless added a comment -

        As a quick test I temporarily forced the PackedInts impl to always use Packed64SingleBlock when possible and all tests passed! Good

        Adrien Grand added a comment -

        I added the changes we discussed in the patch, replacing fasterButMoreRam=true by PackedInts.FAST and fasterButMoreRam=false by PackedInts.DEFAULT = 20%. All Lucene tests passed.
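
        A minimal sketch of how such constants could look; only the 20% DEFAULT value is stated in this comment, the other values and the helper are illustrative assumptions (this is not the code from the patch):

        // Hypothetical constants and helper (assumes org.apache.lucene.util.packed.PackedInts is imported).
        final class AcceptableOverheadSketch {
          static final float COMPACT = 0f;   // zero wasted bits, possibly slower impls
          static final float DEFAULT = 0.2f; // accept up to 20% wasted bits per value
          static final float FAST = 0.5f;    // assumed value: favor speed over memory

          // translate a ratio into the absolute per-value overhead the factory expects
          static PackedInts.Mutable newMutable(int valueCount, int bitsPerValue, float ratio) {
            return PackedInts.getMutable(valueCount, bitsPerValue, bitsPerValue * ratio);
          }
        }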

        Adrien Grand added a comment -

        Updated patch with improved performance when bitsPerValue is close (depending on the acceptable overhead) to 24 or 48. The two new implementations store values on three contiguous blocks (byte and short) and have performance very close to the Direct* implementations. Their maximum capacity is lower (by a factor of 3), so PackedInts.getMutable decides whether they should be selected based on valueCount. All tests still pass.

        Adrien Grand added a comment -

        Mike, I just noticed that I should have updated the PackedInts#getReader method as well, but I have no time to work on this issue until next week. I will try to finish this patch next Monday.

        Michael McCandless added a comment -

        New patch looks great (but yeah we should do getReader too...)! I like the new impls for 24/48 cases. Thanks Adrien!

        Adrien Grand added a comment -

        New patch with updated getReader method. All tests still pass.

        Adrien Grand added a comment -

        Updated javadocs (there were still a few references to `fasterButMoreRam`).

        Michael McCandless added a comment -

        Thanks Adrien!

        Hmm I'm seeing some tests failing with this patch (eg TestRollingUpdates). I haven't tried to track it down...

        This new copyData worries me: I don't think, at read time, we should be rewriting all packed ints? I think we should sort that out at write time and then reading just use whatever reader can best decode what was written? Also, it seems like the logic may be broken? Eg it uses Direct16 when bitsPerValue is < Short.SIZE? (Same for Direct32)... maybe the if cases are just swapped...

        Adrien Grand added a comment -

        Oh right, cases are swapped, I uploaded the wrong patch. Here is a correct version.

        I guess there are two options:
        1. always writing the most compact form to disk and deserializing it in a fast PackedInts.Mutable implementation (depending on acceptableOverheadRatio),
        2. writing a possibly aligned version on disk (depending on acceptableOverheadRatio) and then selecting the fastest PackedInts.Mutable implementation that exactly matches bitsPerValue (with no padding bits).

        A third option could be to write padding bits (Packed64SingleBlock subclasses may have such padding bits) as well, but I really dislike the fact that the on-disk format is implementation-dependent.

        Option 1 is likely to make deserialization slower since some decoding might occur (the copyData method), but on the other hand, option 2 would prevent us from using implementations that add padding bits (such as Packed64SingleBlock21, which has one padding bit for every 3 21-bit integers, or Packed64SingleBlock12, which has 4 padding bits for every 5 12-bit integers, but not Packed64SingleBlock{1,2,4} since 64 % {1,2,4} = 0).

        I initially chose option 1 because I think it is nice to have Packed64SingleBlock21 when bitsPerValue is close to 21, since it might be significantly faster than Packed64.

        In order to know how much slower deserialization is (option 1), I ran a micro-benchmark which:

        • loads a PackedInts.Reader from a ByteArrayDataInput,
        • performs n random reads on the reader.

        Here is the code for the micro-benchmark in case you would like to run it.

        // assumed imports (not shown in the original snippet): java.util.Random,
        // org.apache.lucene.store.{ByteArrayDataInput, ByteArrayDataOutput, DataInput, DataOutput},
        // org.apache.lucene.util.packed.PackedInts
        int valueCount = 10000000;
        int bitsPerValue = 21;
        int[] offsets = new int[valueCount];
        Random random = new Random();
        for (int i = 0; i < valueCount; ++i) {
          offsets[i] = random.nextInt(valueCount);
        }
        byte[] bytes = new byte[valueCount * 4];
        DataOutput out = new ByteArrayDataOutput(bytes);
        PackedInts.Writer writer = PackedInts.getWriter(out, valueCount, bitsPerValue);
        for (int i = 0; i < valueCount; ++i) {
          writer.add(random.nextInt(1 << bitsPerValue));
        }
        writer.finish();
        for (int i = 0; i < 50; ++i) {
          long start = System.nanoTime();
          DataInput in = new ByteArrayDataInput(bytes);
          PackedInts.Reader reader = PackedInts.getReader(in, 0f); // Packed64
          // PackedInts.Reader reader = PackedInts.getReader(in, 0.1f); // Packed64SingleBlock
          for (int j = 0, n = valueCount; j < n; ++j) {
            reader.get(offsets[j]);
          }
          long end = System.nanoTime();
          System.out.println(end - start);
        }
        

        I ran this microbenchmark for bitsPerValue=21 and valueCount in (1 000 000, 10 000 000). The loading time (n = 0) is 2x to 3x slower with Packed64SingleBlock21. However, as soon as you perform valueCount/4 or more random reads (n >= valueCount/4), the total time is better for Packed64SingleBlock21. When n=valueCount, the total time is even 2x better for Packed64SingleBlock21.

        The loading overhead doesn't seem too bad.

        However, I guess that some people might still be more interested in the loading time than in the query time (for write-intensive applications); in this case they could still request a reader in its compact form (Packed64 or Direct* when bitsPerValue is 8, 16, 32 or 64). If they want to be sure to have a fast reader too, they could also make sure that they used 8, 16, 32 or 64 as the bitsPerValue parameter of PackedInts.getWriter.

        Mike, what do you think?

        Dawid Weiss added a comment -

        Hi Adrien. I wrote back to the dev list, don't know if you caught that – the benchmark could probably be improved by calculating something off reader.get() because otherwise it's a pure read without side effects and can be removed by the JIT (or at least optimized in unpredictable ways). A rolling sum or something will probably do.

        Adrien Grand added a comment -

        Good catch, Dawid. I modified the code to perform a rolling sum and the results are actually a little different. Packed64SingleBlock21 is now faster than Packed64 when n >= valueCount (instead of valueCount/4).

        Dawid Weiss added a comment -

        You didn't attach the updated benchmark – I didn't say it explicitly, but you should do something with the resulting value (the JIT optimizer is quite smart). A field store (write the result to a field) should do the trick. So does System.out.println, of course...

        All this may sound paranoid but really isn't. This is a source of many problems with microbenchmarks – the compiler just throws away (or optimizes) loops/branches in a way that doesn't happen later on in real code. My recent favorite example of such a problem in real-life code (it's a bug in the JDK) is this one:

        http://hg.openjdk.java.net/jdk8/tl/jdk/rev/332bebb463d1

        Adrien Grand added a comment -

        Hi Dawid. Thanks for the link, it's very interesting!

        I added a print statement to make sure that the sum is actually computed. Here is the code (for values of n > valueCount, just modify the k loop):

        // assumed imports (not shown in the original snippet): java.util.Random,
        // org.apache.lucene.store.{ByteArrayDataInput, ByteArrayDataOutput, DataInput, DataOutput},
        // org.apache.lucene.util.packed.PackedInts
        int valueCount = 10000000;
        int bitsPerValue = 21;
        int[] offsets = new int[valueCount];
        Random random = new Random();
        for (int i = 0; i < valueCount; ++i) {
          offsets[i] = random.nextInt(valueCount);
        }
        byte[] bytes = new byte[valueCount * 4];
        DataOutput out = new ByteArrayDataOutput(bytes);
        PackedInts.Writer writer = PackedInts.getWriter(out, valueCount, bitsPerValue);
        for (int i = 0; i < valueCount; ++i) {
          writer.add(random.nextInt(1 << bitsPerValue));
        }
        writer.finish();
        long sum = 0L;
        for (int i = 0; i < 50; ++i) {
          long start = System.nanoTime();
          DataInput in = new ByteArrayDataInput(bytes);
          // PackedInts.Reader reader = PackedInts.getReader(in, 0f); // Packed64
          PackedInts.Reader reader = PackedInts.getReader(in, 0.1f); // Packed64SingleBlock
          for (int k = 0; k < 1; ++k) {
              for (int j = 0, n = valueCount / 2; j < n; ++j) {
                sum += reader.get(offsets[j]);
              }
          }
          long end = System.nanoTime();
          System.out.println("sum is " + sum);
          System.out.println(end - start);
        }
        

        I'm on a different computer today and n >= valueCount/3 is enough to make the benchmark faster with Packed64SingleBlock.

        Michael McCandless added a comment -

        A third option could be to write padding bits (Packed64SingleBlock subclasses may have such padding bits) as well, but I really dislike the fact that the on-disk format is implementation-dependent.

        Actually, I think we should stop specializing based on 32-bit vs 64-bit JRE, and always use the impls backed by long[] (Packed64*). Then, I think it's fine if we write the long[] image (with padding bits) directly to disk?

        Adrien Grand added a comment -

        Mike, I am not sure how we should do it. For 21-bits values how would the reader know whether it should use a Packed64SingleBlock21 or a Packed64? Should we add a flag to the data stream in order to know what implementation serialized the integers?

        Michael McCandless added a comment -

        Should we add a flag to the data stream in order to know what implementation serialized the integers?

        I think so?

        Adrien Grand added a comment -

        Isn't it a problem to break compatibility? Or should we use special (> 64) values of bitsPerValue so that current trunk indexes will still work after the patch is applied?

        Michael McCandless added a comment -

        Isn't it a problem to break compatibility?

        It isn't.

        3x indices never store packed ints ... so we are only breaking doc values in 4.0, and we are allowed (for only a bit more time!) to break 4.0's index format. So we should just break it and not pollute 4.0's sources with false back compat code...

        Separately, if somehow we did need to preserve back compat for packed ints file format... we should use the version in the codec header to accomplish that (ie, we don't have to stuff version information inside the bitsPerValue).

        Adrien Grand added a comment -

        New patch. I added a VInt flag to the streams generated by writers so that the readers can know how to parse the stream. All tests passed.
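
        A hedged illustration of the idea (not the actual header layout from the patch; format ids and field order below are assumptions):

        import java.io.IOException;
        import org.apache.lucene.store.DataInput;
        import org.apache.lucene.store.DataOutput;

        // Hypothetical header: a VInt identifies which serialization format was used,
        // so the reader can pick a matching decoder. Ids and field order are made up.
        final class FormatFlagSketch {
          static final int PACKED = 0;              // contiguous bits, no padding
          static final int PACKED_SINGLE_BLOCK = 1; // padded, one block per value group

          static void writeHeader(DataOutput out, int format, int bitsPerValue, int valueCount)
              throws IOException {
            out.writeVInt(format);
            out.writeVInt(bitsPerValue);
            out.writeVInt(valueCount);
          }

          static int readFormat(DataInput in) throws IOException {
            return in.readVInt(); // dispatch to the matching reader impl based on this id
          }
        }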

        Michael McCandless added a comment -

        Thanks Adrien, this looks great! I'll commit soon...

        Michael McCandless added a comment -

        Thanks Adrien!

        Adrien Grand added a comment -

        I have run more tests on PackedInts impls over the last days to test their relative performance.

        It appears that the specializations in Packed64SingleBlock don't help much and even hurt performance in some cases. Moreover, replacing the naive bulk operations by a System.arraycopy in Direct64 is a big win. (See attached patch.)

        You can look at the details of the tests here: http://people.apache.org/~jpountz/packed_ints.html (contiguous=Packed64, padding=Packed64SingleBlock, 3 blocks=Packed*ThreeBlocks, direct=Direct*).

        The tests were run on a 64-bit computer (Core 2 Duo E5500) with valueCount=10,000,000. "Memory overhead" is {unused space in bits} / {bits per value}, while the other charts measure the number of gets/sets per second.

        The random get/set results are very good for the packed versions, probably because they manage to fit many more values into the CPU caches than the other impls. The reason why bulk get/set is faster when bitsPerValue > 32 is that Direct64 uses System.arraycopy instead of a naive copy (in a for loop).
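
        A hedged sketch of that last point (illustrative names, not the exact Lucene code):

        // When each value occupies exactly one long (as in Direct64), a bulk get
        // degenerates to an array copy instead of a per-value loop.
        final class Direct64BulkSketch {
          private final long[] values;

          Direct64BulkSketch(int valueCount) {
            values = new long[valueCount];
          }

          int get(int index, long[] arr, int off, int len) {
            final int gets = Math.min(values.length - index, Math.min(len, arr.length - off));
            System.arraycopy(values, index, arr, off, gets); // the whole bulk "decode"
            return gets; // number of values copied
          }
        }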

        Interestingly, the different impls have very close random get performance.

        Dawid Weiss added a comment -

        What's on the axes in those plots? System.arraycopy is an intrinsic – it'll be much faster than any other loop that doesn't eliminate bounds checks (and I think with more complex logic this will not be done).

        Adrien Grand added a comment -

        The x axis is the number of bits per value while the y axis is the number of values that are read or written per second. For every bitsPerValue and bit-packing scheme, I took the impl with the lowest working bitsPerValue. (For example, bitsPerValue=19 would give a Direct32, a Packed64(bitsPerValue=19), a Packed8ThreeBlocks(24 bits per value) and a Packed64SingleBlock(bitsPerValue=21)). There are 4 lines because we currently have 4 different bit-packing schemes.

        In the first two cases, values are read at random offsets, while the two bulk tests read/write a large number of values sequentially. I didn't want to test System.arraycopy against a naive for-loop; I just noticed that Direct64 bulk operations didn't use arraycopy, so I fixed that and added a few words about it so that people understand why the throughput increases when bitsPerValue > 32, which is counter-intuitive.

        Dawid Weiss added a comment -

        Ok, thanks - makes sense. Is the code for these benchmarks somewhere?

        Michael McCandless added a comment -

        Very cool graphs! Somehow you should turn them into a blog post

        Adrien Grand added a comment -

        @Dawid I attached the file I used to compute these numbers. There is a lot of garbage in the output because I print some stuff to prevent the JVM from doing optimizations.

        @Mike Yes, I was actually thinking of it too...

        Dawid Weiss added a comment -

        I thought about the full cycle you used to produce the pretty graphs

        Adrien Grand added a comment -

        Oh, sorry! When the program ends (it takes a lot of time to run), there are 4 sections at the end (SET, GET, BULKSET and BULKGET) that contain a list of arrays (which are actually JSON arrays). To generate the charts, just take the HTML source of http://people.apache.org/~jpountz/packed_ints.html (which uses the Google Charts API https://developers.google.com/chart/) and replace the JSON arrays with the ones you just generated (lines 101, 184, 267 and 349).

        Dawid Weiss added a comment -

        Ah, nice. We were using google charts but then switched to this one – http://code.google.com/p/flot/. It is nice but of course comes with its own limitations.

        As for the benchmark – you may want to peek at Google Caliper (http://code.google.com/p/caliper/) or JUnitBenchmarks (http://labs.carrotsearch.com/junit-benchmarks.html) instead of writing your own code for doing warmups, etc. Caliper is nice in that it spawns a separate JVM so you can minimize the influence of code order execution on hotspot optimizations. Just a thought.

        Adrien Grand added a comment -

        We were using google charts but then switched to this one – http://code.google.com/p/flot/.

        What was your motivation?

        As for the benchmark – you may want to peek at Google Caliper (http://code.google.com/p/caliper/) or JUnitBenchmarks (http://labs.carrotsearch.com/junit-benchmarks.html)

        Thanks for the JUnitBenchmarks link, I didn't know it. How does it compare to Caliper?

        Dawid Weiss added a comment -

        What was your motivation?

        Mostly aesthetics. I guess same things can be done with both.

        Thanks for the JUnitBenchmarks link, I didn't know it. How does it compare to Caliper?

        They're different. I wanted something I could use with JUnit (existing toolchains) without much extra overhead. Caliper is more advanced, supports running under different jvms (forks the task), supports cartesian sets of attributes. It's nice. I use both from time to time.

        Adrien Grand added a comment -

        Committed (r1351682 on trunk and r1351698 on branch 4.x). Thanks Mike and Dawid for your comments.

        Toke Eskildsen added a comment -

        I tried running the PackedIntsBenchmark on an i7 processor and I agree with the overall conclusion with regard to the speed, or rather lack of speed, of Packed64. However, the winners for the different BPVs were not always the same as Adrien observed. I suspect that CPU architecture and the memory system, especially caching, play a very large role here.

        Re-thinking the no-branching idea for Packed64, it seems that in reality it is slower to perform two memory requests (where the second will most probably be cached) than take the pipeline flush. Therefore I have created Packed64calc (see attachment), which is a full replacement for Packed64.
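
        A hedged sketch of the branch-on-spanning idea described above (the attached Packed64calc.java is the authoritative version; names here are illustrative):

        // Compute the bit position arithmetically and branch on whether the value
        // spans one or two longs, instead of unconditionally touching two blocks.
        final class Packed64CalcSketch {
          private final long[] blocks;
          private final int bitsPerValue;
          private final long mask;

          Packed64CalcSketch(int valueCount, int bitsPerValue) {
            this.bitsPerValue = bitsPerValue;
            this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
            this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) >>> 6)];
          }

          long get(int index) {
            final long majorBitPos = (long) index * bitsPerValue;        // absolute bit position
            final int elementPos = (int) (majorBitPos >>> 6);            // index of the backing long
            final long endBits = (majorBitPos & 63) + bitsPerValue - 64; // bits spilling into the next long
            if (endBits <= 0) { // the value fits in a single long
              return (blocks[elementPos] >>> -endBits) & mask;
            }
            return ((blocks[elementPos] << endBits)                      // the value spans two longs
                | (blocks[elementPos + 1] >>> (64 - endBits))) & mask;
          }
        }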

        My own tests show Packed64calc to be significantly faster than Packed64 and in many cases faster than Packed64SingleBlock. I suspect the latter is caused either by caching or by the fact that Packed64SingleBlock uses division and modulo for set & get. While the modulo can be avoided in Packed64SingleBlock, I never did find a reliable way to bypass the division when I experimented with it.

        I have attached an updated PackedIntsBenchmark which can be used with Packed64calc and hope that Adrien will take a look.

        Adrien Grand added a comment -

        Thanks for your patch, Toke. All tests seem to pass, I'll try to generate graphs for your impl as soon as possible!

        Toke Eskildsen added a comment -

        I ran the test on three different machines. Results are attached as measurements*.txt along with a PDF with graphs generated from iteration #6 (which should probably be the mean or max of runs 2-5). The setter graph for the P4 looks extremely strange for Direct, but I tried generating a graph for iteration #5 instead and it looked the same. In the same vein, the Direct performance for the Xeon is suspiciously low, so I wonder if there's some freaky JITting happening to the test code.

        Unfortunately I did not find an AMD machine to test on. For the three tested Intels, it seems that the Packed64calc does perform very well.

        Adrien Grand added a comment (edited) -

        Thanks for sharing your results. Here are mine: http://people.apache.org/~jpountz/packed_ints_calc.html (E5500 @ 2.80GHz, Java 1.7.0_02, HotSpot build 22.0-b10). Funny to see those little bumps when the number of bits per value is 8, 16, 32 or 64 (24 as well, although it is smaller)!

        It is not clear whether this impl is faster or slower than the single-block impl (or even the 3-blocks impl; I am happily surprised by the read throughput on the intel 4 machine) depending on the hardware. However, this new impl seems to be consistently better than the current Packed64 class, so I think we should replace it with your new impl. What do you think? Can you write a patch?

        Toke Eskildsen added a comment -

        Making Packed64calc the new Packed64 seems like a safe bet. I'd be happy to create a patch for it. Should I open a new issue or add the patch here? If I do it here, how do we avoid confusing the original fine-grained-control patch with the Packed64 replacement?

        I think it is hard to see a clear pattern, with the currently available measurements, as to which Mutable implementation should be selected for the different size & bpv requirements. I'll perform some more experiments with JRE 1.6/JRE 1.7 on different hardware and see if the picture gets clearer.

        Adrien Grand added a comment -

        Yes, a new issue will make things clearer. Thanks, Toke!

        Toke Eskildsen added a comment -

        I have tried running the performance test again with different setups. Graphs at http://ekot.dk/misc/packedints/

        Most of the tests under JRE 1.7 are still running, but from the single JRE 1.6 vs. JRE 1.7 (at the bottom), it would seem that there is a difference in the performance of Packed64SingleBlock vs. Packed64. Take that with a grain of salt though, as the results from that machine are quite strange (especially for Direct sets).

        If the Java version influences which choice is best and if we throw valueCount into the mix (or rather cache utilization, which is very much non-determinable when a Mutable is requested), I am afraid that it will be very hard to auto-recommend a specific implementation.

        I will update the page and notify here when more tests have finished.

        Adrien Grand added a comment -

        Very interesting. Out of curiosity, could you try to run the tests on your machines with the attached Packed64SingleBlock.java? It generates sub-classes that might help the JVM optimize some multiplications/remainders. I'd be interested to see what the difference is on your machines (on mine, it is slightly faster than the current Packed64SingleBlock impl).
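
        A hedged sketch of the sub-classing idea (purely illustrative; the attached Packed64SingleBlock.java is the real implementation):

        // Specializing per bitsPerValue lets the JIT see constant divisors and shifts
        // instead of generic division/modulo on a field.
        abstract class SingleBlockSketch {
          final long[] blocks;

          SingleBlockSketch(long[] blocks) {
            this.blocks = blocks;
          }

          abstract long get(int index);

          // Specialization for 21 bits per value: 3 values per block, constant mask/shift.
          static final class For21Bits extends SingleBlockSketch {
            For21Bits(long[] blocks) {
              super(blocks);
            }

            @Override
            long get(int index) {
              final int block = index / 3;        // constant divisor, easy to strength-reduce
              final int shift = (index % 3) * 21;
              return (blocks[block] >>> shift) & 0x1FFFFFL;
            }
          }
        }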

        Toke Eskildsen added a comment -

        Nice trick with the inheritance in Packed64SingleBlock.java. It really helped a lot on the machines I tested with. I tried to use the Strategy pattern myself, using a switch on the final bitsPerValue in set & get, but it seems that HotSpot is not keen on optimizing such a conditional away. I have now used the inheritance method to provide faster implementations in Packed64 for 1, 2, 4, 8, 16, 32 and 64 bpv. Due to the existence of the Direct classes, this is only relevant for 1, 2 & 4 bpv though.

        Some new measurements at http://ekot.dk/misc/packedints/ with the following implementations:

        • contiguous: The Packed64 that is being reviewed at LUCENE-4171
        • padding: The new Packed64SingleBlock.java with sub-classes
        • 3 blocks: As before
        • direct: As before
        • old padding: Packed64SingleBlock.java without sub-classes
        • contiguous strategy: Packed64Strategy.java, which uses sub-classes to optimize for bpv 2^n

        I changed PackedIntsBenchmark to report the highest measured performance from the iterations instead of the mean. The rationale is that the mean is vulnerable to machine load fluctuations and GC throughout the whole measurement process.

        My observations are that the measurements from atria & pc254, which are relatively old machines, are very similar and fit well with the theory: padding gives performance higher than or equal to Packed64 at all bpvs other than 2^n.

        The story is not so clear for mars, which is a very fast machine: For the current test case of 10M values, the padding only provides better performance after 18-20 bpv and for some bpv, Packed64 is a bit faster. I suspect that point would be moved if other processes were competing for memory cache. Still pending are measurements from my i7-machine. I hope to add them this evening or tomorrow.

        Also pending is why the performance of Direct set is so low for mars. It makes no sense: Even if there is some freaky hit for requesting bytes, shorts or ints on that machine, performance should be >= everything else at 64bpv.

        When should we stop optimizing? For 1 bpv, a list of booleans would probably be faster, so should we make a PackedBoolean? Similarly, Packed8, Packed16 and Packed32 would also make sense if we were really determined. None of the implementations are hard to make, but it becomes a hassle to update when a new feature is needed.

        Adrien Grand added a comment -

        This is very interesting!

        Results computed on mars are very strange so I think we should not take them into account to make decisions...

        When should we stop optimizing?

        This is a relevant question. I think there is no problem going further provided that we have evidence that the changes improve performance in most cases and that the code remains maintainable. For example, it is probably not necessary to work on specialized Packed64 for the 2^n bits cases since there are direct impls for the 8, 16, 32 and 64 bits cases and since the new Packed64SingleBlock impl seems to be as fast for 1, 2 and 4 bits per value.

        However, I think it makes sense to replace Packed64SingleBlock with the inheritance-based impl since it seems faster than the current impl and than Packed64. I will open an issue if the graphs computed on your core i7 computer confirm this trend.

        Toke Eskildsen added a comment - edited

        +1 for replacing the Packed64SingleBlock with the optimized version. It is consistently better.

        I have updated the charts at http://ekot.dk/misc/packedints/ with i7 measurements. For the three non-i7-based Intel processors, Packed64SingleBlock seems the clear-cut choice for "Best possible performance without going all-out with Direct".

        I tried running two tests in parallel on te-prime (see http://ekot.dk/misc/packedints/te-prime_parallel.html) and got very consistent results:

        • For set, contiguous strategy is faster than or equal to padded for bpvs 3-12 and 16-21 (with bpv 1 and 2 being hard to measure). Since padded only supports bpv 1-10, 12, 16, 21 & 32 with non-conforming bpvs rounded up, this effectively means that there is zero gain in using padded over contiguous with regard to set on that machine.
        • For get, the picture is nearly the same, except for bpv 17-21 where contiguous is slower than padded (observed in process 2 as well as the single-thread run). The difference is less than 10% though. The same pattern, although noisier, can be seen on mars.

        My preliminary conclusion for i7-processors is thus that Packed64Strategy is the right choice for "Best possible performance without going all-out with Direct". I am getting very curious about the untested AMD architecture now.

        A by-product of the te-prime parallel test is that the amount of cache seems to matter little when it comes to selecting the most fitting implementation. Thank $deity for small favors.

        Toke Eskildsen added a comment - edited

        Another angle on the auto-selection problem: Packed64SingleBlock supports 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32 as bpv's. The Direct classes handle 8, 16 & 32 bpv, and with the inheritance trick, Packed64 handles 1, 2 & 4 exactly like Packed64SingleBlock. This leaves us with 3, 5, 6, 7, 9, 10, 12 and 21 bpv as "unique" implementations. In the hope of getting a better overview, I tried making bar charts with those bpv's: http://ekot.dk/misc/packedints/padding.html
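
        For illustration only, a minimal sketch of the rounding described above, where a requested bpv is bumped up to the nearest value the padded implementation supports. The helper name and SUPPORTED_BPV array below are made up for this sketch (the array simply mirrors the bpv's listed above), not code from the attached patches.

          // Hypothetical helper: round bitsPerValue up to the nearest bpv
          // supported by the padded (Packed64SingleBlock-style) implementation.
          static final int[] SUPPORTED_BPV = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32};

          static int roundUpToSupported(int bitsPerValue) {
            for (int supported : SUPPORTED_BPV) {
              if (supported >= bitsPerValue) {
                return supported;   // e.g. 11 -> 12, 17 -> 21
              }
            }
            throw new IllegalArgumentException("padded impl only goes up to 32 bits per value");
          }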

        Again, it would be illuminating to get measurements from an older as well as a newer AMD machine. I also hope to get the time to run the test on another i7 machine to see if the trend from te-prime and mars can be confirmed, but I am leaving for vacation on Tuesday. As of now, the most fitting implementation depends on the machine, and since the choice of implementation affects the persistence format, I find this problematic. It is not optimal for a heterogeneous distributed setup.

        Toke Eskildsen added a comment -

        The graphs are updated with data from my i7 desktop machine. The data looks a lot like the measurements from the other two i7-machines.

        Adrien Grand added a comment -

        Thanks for your graphs, Toke. It is very interesting to see how the different impls compare on various computers!

        I created LUCENE-4184 to replace Packed64SingleBlock with the inheritance-based impl.

        Initially, I chose valueCount=10,000,000 because it sounded like a reasonable segment size and because it would yield a lot of cache misses on my 2M-cache CPU (when bitsPerValue >= 4 at least). Given that mars and your i7 machines have big CPU caches, I'd be interested to know whether the graphs still have the same shape with valueCount=100,000,000, for example.
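
        To make the cache argument concrete, a rough back-of-the-envelope estimate (a sketch, ignoring per-implementation padding and object headers): 10M values at 4 bpv already take about 5 MB, more than a 2M cache, while 100M values at 4 bpv take roughly 50 MB.

          // Rough size of a packed array: valueCount * bitsPerValue bits, rounded up to bytes.
          static long approxBytes(long valueCount, int bitsPerValue) {
            long bits = valueCount * (long) bitsPerValue;
            return (bits + 7) / 8;
          }

          // approxBytes(10000000, 4)  ~= 5,000,000 bytes  (~4.8 MiB)
          // approxBytes(100000000, 4) ~= 50,000,000 bytes (~48 MiB)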

        Otherwise, I am a little bit reluctant to try to improve the impl selection routine used in PackedInts.getMutable given how different the results are across architectures, unless we manage to find some clear (and easily accessible from Java) patterns that make one impl faster than another.

        Toke Eskildsen added a comment -

        Attached is an improved benchmark that allows for user-specified value counts, bpv's and implementations. Also attached is PackedZero, a dummy Mutable that is used for measuring the maximum possible throughput.
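
        The idea behind such a dummy store, sketched below from the description only (this is not the attached PackedZero and does not implement the real PackedInts.Mutable interface; names are illustrative): get always returns 0 and set discards its argument, so benchmarking it measures only loop and call overhead, i.e. an upper bound on throughput.

          // Sketch of a "do nothing" value store used as a throughput ceiling.
          final class ZeroStore {
            private final int valueCount;

            ZeroStore(int valueCount) {
              this.valueCount = valueCount;
            }

            long get(int index) {
              return 0L;              // nothing is ever stored
            }

            void set(int index, long value) {
              // intentionally empty: no memory traffic at all
            }

            int size() {
              return valueCount;
            }
          }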

        Toke Eskildsen added a comment -

        Using the old benchmark, I've measured performance for 100M values on the i7 server. The graphs are at http://ekot.dk/misc/packedints/100m.html

        The "unique" bpvs for padded are 3, 5, 6, 7, 9, 10, 12 and 21. Rounding up to nearest valid bpv extends these to 3, 5, 6, 7, 9-12 and 17-21. For the set method on the i7 server, padding is slower than contiguous for all of these bpvs. For the get method, padding is a smidgen faster for 17-21 bpvs. These results mirrors the 10M test on the i7 laptop.

        I have started measurements with 100M values on the i7 laptop, but I am running out of time. I expect to check back in 1-2 weeks. Note that I have uploaded a version of PackedIntsBenchmark that makes it easy to experiment with different valueCounts and bpvs.

        Adrien Grand added a comment -

        Thanks, Toke! Really strange how the direct impl remains the slowest one at setting values...

        Toke Eskildsen added a comment -

        One solution later and the mystery is still there: tricking the JVM into reading the old memory value before setting the new one in Direct64 does increase its speed to the expected level on the i7 server mars:

          // MASK is always 0, so (values[index] & MASK) is 0 and the net effect is a
          // plain store of `value`. Because MASK is a non-final field, the JIT keeps
          // the read of values[index], which pulls the cache line in before the write.
          public static long MASK = 0L;

          public void set(final int index, final long value) {
            values[index] = values[index] & MASK | value;
          }
        

        I am guessing it somehow triggers either a pre-fetch of the relevant 4K page in main memory, or marks the cache as hotter and thus delays the cache flush. Alas, I am not a hardware guy and this is a bit beyond me.

        Anyway, the solution is not usable as it slows execution on other machines. I can get access to another i7 server of the same generation, but it is identical to mars so I doubt the results would be different. Unless someone steps up with another current-generation i7 server to test on, I think we should file this under pixies.

        Also unresolved, but more relevant, is the question about auto-selection of Mutable implementation. Based on the data so far, the current auto-selector will result in sub-optimal selection on i7-machines. Without a reliable architecture detection, another way of viewing the issue is whether to optimize the selector towards older machines (padded) or newer (contiguous).

        I've looked around for a way to determine the CPU and cache size, but so far it seems that the only fairly reliable way is to determine the OS, then call some platform-specific native code and parse the output. Besides the ugliness of this solution, I guess it gets disqualified by the extra permissions needed from the security manager.
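
        For reference, on Linux specifically the cache sizes are exposed through sysfs and can be read with plain file I/O, no native code needed. The sketch below (class and method names are made up for illustration) is Linux-only and would still fail under a restrictive security manager, which is exactly the limitation described above.

          import java.io.IOException;
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;

          // Linux-only sketch: read a cache size such as "8192K" from sysfs.
          final class CacheProbe {
            static long lastLevelCacheBytes() {
              // index3 is usually L3; fall back to index2 (L2) when absent.
              for (String index : new String[] {"index3", "index2"}) {
                Path p = Paths.get("/sys/devices/system/cpu/cpu0/cache/" + index + "/size");
                try {
                  String s = new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim();
                  long factor = 1L;
                  if (s.endsWith("K")) { factor = 1024L; s = s.substring(0, s.length() - 1); }
                  else if (s.endsWith("M")) { factor = 1024L * 1024L; s = s.substring(0, s.length() - 1); }
                  return Long.parseLong(s) * factor;
                } catch (IOException | RuntimeException e) {
                  // missing file, no permission, unexpected format: try the next index
                }
              }
              return -1L; // unknown
            }
          }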


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand
          Time Tracking

          • Original Estimate: 4h
          • Remaining Estimate: 4h
          • Time Spent: Not Specified
