Lucene - Core / LUCENE-1990

Add unsigned packed int impls in oal.util

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      There are various places in Lucene that could take advantage of an
      efficient packed unsigned int/long impl. EG the terms dict index in
      the standard codec in LUCENE-1458 could substantially reduce its RAM
      usage. FieldCache.StringIndex could as well. And I think "load into
      RAM" codecs like the one in TestExternalCodecs could use this too.

      I'm picturing something very basic like:

      interface PackedUnsignedLongs  {
        long get(long index);
        void set(long index, long value);
      }
      

      Plus maybe an iterator for getting and maybe also for setting. If it
      helps, most of the usages of this inside Lucene will be "write once"
      so eg the set could make that an assumption/requirement.

      And a factory somewhere:

        PackedUnsignedLongs create(int count, long maxValue);
      

      I think we should simply autogen the code (we can start from the
      autogen code in LUCENE-1410), or, if there is a good existing impl
      that has a compatible license, that'd be great.
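
      For context, here is a hedged sketch of how such a factory might pick
      the bit width from maxValue (bitsRequired is a hypothetical helper,
      not an existing API):

      // Number of bits needed to store any value in [0, maxValue].
      static int bitsRequired(long maxValue) {
        // numberOfLeadingZeros(0) == 64, so clamp to at least 1 bit
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
      }
      // e.g. maxValue=7 -> 3 bits/value; maxValue=255 -> 8 bits/value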

      I don't have time near-term to do this... so if anyone has the itch,
      please jump!

      Attachments

      1. LUCENE-1990_PerformanceMeasurements20100104.zip
        18 kB
        Toke Eskildsen
      2. LUCENE-1990.patch
        251 kB
        Michael McCandless
      3. LUCENE-1990-te20100122.patch
        281 kB
        Toke Eskildsen
      4. LUCENE-1990-te20100210.patch
        71 kB
        Toke Eskildsen
      5. LUCENE-1990-te20100212.patch
        99 kB
        Toke Eskildsen
      6. LUCENE-1990-te20100223.patch
        103 kB
        Toke Eskildsen
      7. LUCENE-1990-te20100226.patch
        130 kB
        Toke Eskildsen
      8. LUCENE-1990-te20100226b.patch
        134 kB
        Toke Eskildsen
      9. performance-te20100226.txt
        33 kB
        Toke Eskildsen
      10. LUCENE-1990-te20100226c.patch
        356 kB
        Toke Eskildsen
      11. generated_performance-te20100226.txt
        23 kB
        Toke Eskildsen
      12. perf-mkm-20100227.txt
        7 kB
        Michael McCandless
      13. performance-20100301.txt
        45 kB
        Toke Eskildsen
      14. LUCENE-1990-te20100301.patch
        364 kB
        Toke Eskildsen
      15. LUCENE-1990.patch
        73 kB
        Michael McCandless
      16. LUCENE-1990.patch
        4 kB
        Michael McCandless


          Activity

          Fuad Efendi added a comment - edited

          Specifically for FieldCache, let's see... suppose a field may have 8 different values, and the number of documents is high.

          Value0  0  1  0  0  0  0  0   0  1  0  0  0  0  0 ...  
          Value1  1  0  1  0  0  0  0   0  0  0  0  0  0  0 ...  
          Value2  0  0  0  1  1  0  0   0  0  0  0  0  0  0 ...  
          Value3  0  0  0  0  0  0  0   0  0  0  0  1  0  0 ...  
          Value4  0  0  0  0  0  0  1   0  0  0  0  0  0  0 ...  
          Value5  0  0  0  0  0  1  0   0  0  0  1  0  1  0 ...  
          Value6  0  0  0  0  0  0  0   1  0  1  0  0  0  0 ...  
          Value7  0  0  0  0  0  0  0   0  0  0  0  0  0  1 ...  
          
          • Represented as a matrix (or as a vector); for instance, the first row means that Document1 and Document8 have Value0.

          And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? The field could be encoded as 3 bits (8 different values).

          CONSTRAINT: specifically for FieldCache, each column must have exactly one "1".

          And we end up with an array of 3-bit values storing the position in a column! The size of the array is IndexReader.maxDoc().

          hope I am reinventing the bicycle

          P.S.
          Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases.

          1. For a given document ID, find the value of a field
          2. For a given query result, sort it by a field's values
          3. For a given query result, count "facets" for each field value

          I don't think such naive compression is slower than plain int[] arrays... and we need to change the public API of FieldCache too: if a method returns int[], we are not saving any RAM.

          It would be better to compare with Solr use cases and to make the API closer to real requirements; Solr operates with bitsets instead of arrays...
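
          A quick back-of-the-envelope sketch of the saving described above
          (numbers are illustrative, assuming maxDoc = 10M and 8 distinct values):

          long maxDoc = 10000000L;
          int bitsPerValue = 3;                           // ceil(log2(8 values))
          long packedBytes = maxDoc * bitsPerValue / 8;   // ~3.75 MB of packed ordinals
          long intArrayBytes = maxDoc * 4;                // ~40 MB for a plain int[]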

          Earwin Burrfoot added a comment -

          hope I am reinventing the bicycle

          I believe some databases do that.

          Fuad Efendi added a comment -

          Suttiwat sent me a link:
          http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

          This is vendor-specific and may cause unexpected problems, but we can try it in some specific cases:
          "Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode."

          -XX:+UseCompressedOops

          There are other vendors around too, such as Oracle JRockit, which is much faster server-side...

          Toke Eskildsen added a comment -

          I have very recently done some experiments with random-access packed positive integer arrays (we could also call them unsigned). The basic approach is to put the bits after each other in an int- or long-array, then extract the value by shifting and masking. Unfortunately I have not been able to determine a single winning strategy for space/speed trade-offs.

          I've tried four simple implementations:

          1. Pack the bits right after each other.
          2. Pack the bits right after each other but allow only 1, 2, 4, 8, 16 or 32 as value-bits.
          3. Wrap an int[] or long[] and store the values directly (for comparison with 1 & 2).
          4. Use int[] directly without any wrapping.

          (Code can be found at http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/src/dk/statsbiblioteket/summa/common/util/bits/ where #1 is BitsArrayPacked, #2 is BitsArrayAligned and #3 is BitsArrayInt)
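
          As a rough illustration of approach #1 (not the Summa code itself), a
          get() on a long[]-backed packed array reduces to shifts and masks, with
          a branch for values that straddle two blocks:

          // Minimal sketch, assuming bitsPerValue <= 64 and little-endian
          // packing of values into the long[] blocks.
          long get(long[] blocks, int bitsPerValue, int index) {
            long bitPos = (long) index * bitsPerValue;
            int block = (int) (bitPos >>> 6);   // bitPos / 64
            int offset = (int) (bitPos & 63);   // bitPos % 64
            long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
            if (offset + bitsPerValue <= 64) {  // value fits in one block
              return (blocks[block] >>> offset) & mask;
            }
            // value spans two blocks: low bits here, high bits in the next block
            return ((blocks[block] >>> offset)
                | (blocks[block + 1] << (64 - offset))) & mask;
          }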

          The obvious benefit from #2 is that the bits for values are always contained in a single block (int or long), which reduces the amount of operations for set and get considerably. Unfortunately this means that as soon as a value greater than 65535 needs to be stored, the internal representation will require 32 bits/value. This means no space-benefit compared to #3 while the speed penalty remains. A hybrid approach might be considered where the implementation is determined by the number of bits needed to store a value.

          I've done some performance tests of the implementations on an aging 1.8GHz single-core laptop (Dell 820). The code can be found at http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/test/dk/statsbiblioteket/summa/common/util/bits/BitsArrayPerformanceTest.java?revision=2038&view=markup

          In the tables below, the measurements for Packed, Aligned, Int, Constant and int[] are the total time in milliseconds to perform actionCount actions (either read or write). Random numbers (using the same seed) were used for index as well as value, and the overhead of constructing the random numbers was subtracted from the measurements. "Constant" is a dummy where no values are stored and the same value is always returned on a get. It is used to measure method-call overhead.

          For arrays of 10M values, 6-33 million values can be read or written per second. If this is within acceptable limits, I'd be happy to try and make a contribution to Lucene based on the code. However, I probably won't find the time before February or March 2010.

          actionCount arrayLength  actionType    valueMax  Packed(#1) Aligned(#2)     Int(#3)    Constant   int[](#4)
             10000000     1000000       write           1         583         416         984         254         635
             10000000     1000000       write          15         594         499        1172         286         843
             10000000     1000000       write         255         604         478        1057         213         656
             10000000     1000000       write         256         734         765        1062         109         729
             10000000     1000000       write       65535        1036         802        1124         417         734
             10000000     1000000       write       65536        1015        1130        1072         172         781
             10000000     1000000       write     2097151        1020        1187        1052         223         614
             10000000     1000000       write  2147483646        1136        1073         839          73         719
             10000000     1000000        read           1         286         203         786         104           0
             10000000     1000000        read          15         291         182         859          78           0
             10000000     1000000        read         255         672         494         989          67           0
             10000000     1000000        read         256         568         729        1104          93           0
             10000000     1000000        read       65535         833         755        1088          99           0
             10000000     1000000        read       65536         947         974        1062         104           0
             10000000     1000000        read     2097151         999         963        1062          88           0
             10000000     1000000        read  2147483646        1349         869        1260          84           0
          
             10000000    10000000       write           1         833         568        1458         239        1229
             10000000    10000000       write          15        1427        1255        1432         276        1615
             10000000    10000000       write         255        1599        1578        1448         250        1244
             10000000    10000000       write         256        1656        1520        1317         109        1109
             10000000    10000000       write       65535        1734        1630        1385         245        1307
             10000000    10000000       write       65536        1718        1640        1395         182        1208
             10000000    10000000       write     2097151        1807        1781        1447         250        1291
             10000000    10000000       write  2147483646        1718        1599        1281          73        1099
             10000000    10000000        read           1         562         296        1301         109           0
             10000000    10000000        read          15        1187        1198        1322          83           0
             10000000    10000000        read         255        1421        1432        1526          99           0
             10000000    10000000        read         256        1521        1583        1588         104           0
             10000000    10000000        read       65535        1578        1427        1374          31           0
             10000000    10000000        read       65536        1634        1499        1489         113           0
             10000000    10000000        read     2097151        1625        1374        1468         224           0
             10000000    10000000        read  2147483646        1609        1349        1499          83           0
          
          Michael McCandless added a comment -

          These are great results – thanks Toke!

          What's the difference between the two "sections" of your results? The bottom section seems to have slower times overall than the top one.

          It's curious that "aligned" isn't always a win.

          For the packed case, did you generate specialized code for each of the cases? I wonder how much / if that'd really help in practice.

          For the standard codec's terms dict, my guess is we'd just go with packed ints.

          Toke Eskildsen added a comment -

          The first section is for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache misses.

          Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.

          I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value). The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.

          The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.

          A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.

          All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.

          Toke Eskildsen added a comment -

          I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM speed, but I do not know how to determine that without root access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache.

          There is a whole lot of measurements and it is getting hard to digest. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

          1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: the long[] version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed.
          On the other hand, 64 bit machines do not seem to care that much whether int[] or long[] is used: there was a 10% win for arrays below 1M for one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) for another, and a small loss of below 1% for all lengths above 10K for a third.

          2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.

          3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.

          Observation #3 supports Mike McCandless' choice of going for the packed approach, and #1 suggests using int[] as the internal structure for now. Using int[] as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

          Michael McCandless added a comment -

          The first section is for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache misses.

          Got it.

          Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.

          OK, though, I think we should somehow exclude these CPU cache effects
          from the test. In a real production situation, lots of other things
          are competing for that cache, so I think the mostly-cache-miss case is
          most realistic here.

          Actually a good way to do that is to test it in a wider context, eg
          once I can get LUCENE-2186 "integrated", we can test sorting by int
          field with these different ways of encoding.

          I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value).

          OK I have a start at this under LUCENE-2186. I'll try to clean up &
          post here...

          The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.

          The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.

          I think the writing API can be a write-once API, where I declare the
          maxValue I will write. This fits with the indexing chain, where we
          buffer values in RAM and then flush to the segment files.

          This way we don't need the complexity of having to resize the bits
          internally when a new max value is seen.

          So, the factory would take number of values we will write, the max
          value, and I guess an enum for Packed/Aligned/DedicatedArray, and
          return a simple writer that just exposes an "add(long value)" method?

          A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.

          Hmm... I guess we could still support values requiring more than 32
          bits, encoded into int[], but it's more hairy as the value could span
          2 or 3 ints.

          Probably we should specialize/default the factory selection based on
          whether the JVM is 32/64 bit? And, whether the bit size is <= 32.

          All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.

          Yes.

          I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM speed, but I do not know how to determine that without root access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache.
          There is a whole lot of measurements and it is getting hard to digest. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

          1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: the long[] version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed.
          On the other hand, 64 bit machines do not seem to care that much whether int[] or long[] is used: there was a 10% win for arrays below 1M for one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) for another, and a small loss of below 1% for all lengths above 10K for a third.

          2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.

          3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.

          Observation #3 supports Mike McCandless' choice of going for the packed approach, and #1 suggests using int[] as the internal structure for now. Using int[] as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

          I think we should agree on an API that LUCENE-2186 can use to
          write/read packed ints, then get our two patches talking. I'll pull
          out the barebones packed ints I'm currently using, and post
          here... then let's merge/iterate to a common API, so I can cutover to
          the patch from here in LUCENE-2186?

          Michael McCandless added a comment -

          How about something like this API, for writing packed ints:

          abstract class Writer {
            public abstract void add(long v) throws IOException;
            public abstract void finish() throws IOException;
          }
          

          then a factory:

          enum Mode {Packed, Aligned, FixedArray};
          
          public static Writer getWriter(IndexOutput out, int valueCount, long maxValue, Mode mode);
          

          (we can iterate on the names... always the hardest part).

          Packed means full bit packing (most space efficient, but slowest
          decode time), Aligned might waste some bits (eg for nbits=4, that's
          naturally aligned, but for nbits=7, we'd waste 1 bit per long), and
          FixedArray (which'd use byte[], short[], int[], long[]) would
          potentially waste the most bits but have the fastest decode.

          If nbits happens to be 8, 16, 32, 64, the factory should just always
          use FixedArray I think? And of course powers of two will automatically
          be Aligned (with the per-nbits specialized code).

          We can also default impls to an underlying int[] vs long[] backing
          store depending on 64/32 bit JRE, and, nbits. If the JRE is 32 bit but
          nbits is > 32 I think we just use long[] backing.
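
          As a sketch of that defaulting (sun.arch.data.model is a real Sun JVM
          system property; everything else uses the names proposed above):

          boolean jre64 = "64".equals(System.getProperty("sun.arch.data.model"));
          Mode mode = (nbits == 8 || nbits == 16 || nbits == 32 || nbits == 64)
              ? Mode.FixedArray   // natural widths: plain arrays, fastest decode
              : Mode.Packed;      // otherwise trade decode speed for space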

          For reading, a similar API:

          abstract class Reader {
            public abstract long get(int index);
          }
          
          public static Reader getReader(IndexInput in);
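
          A hypothetical usage sketch of this proposed API (Directory, IndexOutput
          and IndexInput are real Lucene classes; getWriter/getReader and Mode are
          only the proposal above, so treat this as pseudocode):

          IndexOutput out = dir.createOutput("values.pck");   // dir is a Directory
          Writer w = getWriter(out, 1000, 255, Mode.Packed);  // 1000 values, max 255 -> 8 bits each
          for (int i = 0; i < 1000; i++) {
            w.add(i % 256);
          }
          w.finish();
          out.close();

          IndexInput in = dir.openInput("values.pck");
          Reader r = getReader(in);
          long v = r.get(42);                                 // random access by index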
          
          Michael McCandless added a comment -

          Attached patch with my current roughed up approach for packed ints
          (from LUCENE-2186).

          Let's try to standardize the API, then merge the two approaches, then
          I'll cutover with LUCENE-2186.

          It includes gen.py, which autogens dedicated decoders for each of the
          nbits cases, excluding 8, 16, 32, 64 bits, since these are done with
          dedicated array reader impls.

          It uses a single writer (I don't think we need specialized writers),
          but the writer encodes in the same byte order as
          IndexOutput.writeLong, so that the byte order matches the dedicated
          array reader impls.

          It only encodes into long[] – we should create cases for int[]
          (selected by the factory depending on 32 vs 64 bit jre).

          We should also explore just reading in a full byte[] and using
          Int/Short/Long buffer to decode. This API should also allow for a
          future mmap impl as well.

          Probably we should name all of these UnsignedPackedInts, since they
          require values >= 0. (Hmm, though, the 64 bit case is tricky – I
          guess we make an exception for that case).

          Toke Eskildsen added a comment -

          Introducing yet another level of indirection and making a byte/short/int/long provider detached from the implementation of the packed values is tempting. I'm fairly afraid of the overhead of the extra method calls, but I'll try it and see what happens.

          I've read your (Michael McCandless) code and I can see that the tiny interfaces for Reader and Writer work well for your scenario. However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values? That way the same code can be used to hold ords for sorting and similar structures.

          Instead of Reader, we could use

          abstract class Mutator {
            public abstract long get(int index);
            public abstract long set(int index, long value);
          }
          

          ...should the index also be a long? No need to be bound by Java's 31-bit limit on array-length, although I might very well be over-engineering here.

          The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

          Michael McCandless added a comment -

          Introducing yet another level of indirection and making a byte/short/int/long provider detached from the implementation of the packed values is tempting.

          You mean the layer that stores the minValue, so that the full range is
          supported? I actually think we should absorb that into packed ints,
          so it's only one method call per lookup, and specialize the "positive
          only" cases to avoid the extra add per lookup.

          With that fix, it's still a method call per lookup, but I don't see
          how we can get away from that, unless we allow for exposure of the raw
          array for the no-packing cases (which we could consider...).

          Remember we use packed ints in places where we can accept some loss of
          CPU perf. for improvements in RAM usage (see the comment I just added
          to LUCENE-2186).

          However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values?

          Yeah, we do eventually want CSF to be updateable, but I don't think we
          need this for phase 1? Likewise, I think all we need now for Lucene
          is a "WriteOnceWriter", not a "RandomAccessWriter". Ie, you open a
          writer, you add (sequentially) all values, you close.

          ...should the index also be a long?

          I would stick with int now (we are doing this for Lucene, whose docIDs
          are still ints...). Design for today.

          The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

          I'm confused – a future mmap impl shouldn't put pressure on the file
          format used by packed ints today? Ie, a future mmap impl can use a
          totally different format than the designed-to-be-slurped-into-RAM
          format for packed ints, today?

          Also, what do you mean by optimized for building not searching?

          Note that on 32 bit machines, if there is actually a gain, we can make
          a backing store with ints yet still allow for storage of nbits>32? It
          "just" means a value may be split across 2 or 3 values?

          Paul Elschot added a comment -

          I've made a remark at LUCENE-1410 (a first attempt at a PFOR implementation) about the header structure for encoding this.
          One thing that is not covered here is how to deal with input arrays of intermediate length: shorter than 32 but longer than 3 or 4. Shorter ones can easily be encoded as vByte.
          Simple9 might be a solution, but it has only 28 data bits and 9 different encoding cases, so it appears to be somewhat small.
          There is a first attempt at Simple9 at LUCENE-2189.

          Since the discussion here is on alignment (int/long) I'm wondering how (and whether) to go from the current byte aligned structures to int aligned. Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.
          Simple9 can be int aligned and I'd like to add bigger variations of that, but preferably only ones that add a multiple of 4 bytes.

          So would it make sense to add functionality to IndexInput and IndexOutput to allow int-aligned access?
          Are java's data streams and/or nio buffers smart enough to avoid the byte shifting for ints in such cases?
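
          For reference, a rough sketch of the Simple9 idea (not the LUCENE-2189
          patch): each 32-bit word holds a 4-bit selector plus 28 data bits, and
          the selector picks one of 9 (count, bits-per-value) layouts:

          // Selector -> (number of values, bits per value); 9 cases total.
          static final int[] COUNT = {28, 14, 9, 7, 5, 4, 3,  2,  1};
          static final int[] BITS  = { 1,  2, 3, 4, 5, 7, 9, 14, 28};

          static int decode(int word, int[] dest, int destPos) {
            int selector = word >>> 28;           // top 4 bits choose the layout
            int n = COUNT[selector], b = BITS[selector];
            int mask = (1 << b) - 1;
            for (int i = 0; i < n; i++) {
              dest[destPos + i] = (word >>> (i * b)) & mask;
            }
            return n;                             // number of values decoded
          }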

          Yonik Seeley added a comment -

          Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.

          How's that? Is there some JVM intrinsic?

          Paul Elschot added a comment -

          For the record: on the flex branch I just saw IntIndexInput and IntIndexOutput.

          Paul Elschot added a comment -
          Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.

          How's that? Is there some JVM intrinsic?

          In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time.

          Is that what you mean by JVM intrinsic?

          Yonik Seeley added a comment -

          In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time. Is that what you mean by JVM intrinsic?

          Yes. But are you aware of that having been implemented in any JVMs?

          Toke Eskildsen added a comment -

          Toke:
Introducing yet another level of indirection and making a byte/short/int/long provider detached from the implementation of the packed values is tempting.

          You mean the layer that stores the minValue, so that the full range is
          supported? I actually think we should absorb that into packed ints,
          so it's only one method call per lookup, and specialize the "positive
          only" cases to avoid the extra add per lookup.

          No, I mean the backing array of ints or longs. For memory, the obvious choice is int[] or long[], but designing for flexibility calls for an API, which coincidentally is identical to Reader. So, a 4-bit Aligned could be backed by a DirectInt (which will contain an int[]) or a Persistent (doing mmap or some other persistent-oriented lookup).

          ...I should show this in code instead. I'll try and find the time.
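
Until then, a minimal sketch of that indirection (all names here, BlockProvider, DirectIntProvider and Aligned4, are invented for illustration):

interface BlockProvider {
  int getBlock(int index); // one 32-bit block of packed data
}

final class DirectIntProvider implements BlockProvider {
  private final int[] blocks;
  DirectIntProvider(int[] blocks) { this.blocks = blocks; }
  public int getBlock(int index) { return blocks[index]; }
}

// A 4-bit aligned reader backed by any provider (heap array, mmap, ...).
final class Aligned4 {
  private final BlockProvider provider;
  Aligned4(BlockProvider provider) { this.provider = provider; }
  long get(int index) {
    final int block = provider.getBlock(index >>> 3); // 8 values per block
    final int shift = (index & 7) << 2;               // 4 bits per value
    return (block >>> (28 - shift)) & 0xF;            // first value in high bits
  }
}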

          Yeah, we do eventually want CSF to be updateable, but I don't think we
          need this for phase 1?

          Not in the specific scenario, no.

          Toke:
          The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

          I'm confused - a future mmap impl shouldn't put pressure on the file
          format used by packed ints today? Ie, a future mmap impl can use a
          totally different format than the designed-to-be-slurped-into-RAM
          format for packed ints, today?

          Also, what do you mean by optimized for building not searching?

When we're using Aligned, the choice of int or long for the backing array dictates what the persistent bitstream will look like. If the index builder is a 64 bit machine and 7 bits/value is used, the result will be longs, each containing 9 values.

When the structure is read into memory by the searcher, the backing array will again be long. But if the searcher happens to be a 32 bit machine, the performance will be worse than if ints were used for the backing array.

One way to handle this is to do a conversion when loading into memory. If the searcher is 64 bit, it will always convert into longs. If the searcher is 32 bit, it will convert into ints if the bits/value is <= 32. The conversion is cheap, so this is no problem in itself.

          However, if we're planning for the future (or for flexibility, depending on point of view), we would very much like the persistent format to be directly usable, so that we don't need to load the whole structure into memory. This rules out conversion and sets us back to step 1: The index will be optimized for either 32bit or 64 bit searchers.

          Oh well, we could always just ignore it and say that Aligned is 64 bit based. As it is more memory-efficient than Aligned on 32 bit machines, maybe the slightly smaller number of backing longs will compensate for the overhead of retrieving longs on a 64 bit machine.

          Note that on 32 bit machines, if there is actually a gain, we can make
          a backing store with ints yet still allow for storage of nbits>32? It
          "just" means a value may be split across 2 or 3 values?

          My guess is that the number of values vs. the available level 2 cache will play a big role here: For a relatively small number of values, the added logic will be costly. For a larger number of values, the cache-misses will dwarf that.

          Paul Elschot added a comment -

          Paul: In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time. Is that what you mean by JVM intrinsic?

          Yonik: Yes. But are you aware of that having been implemented in any JVMs?

I remember reading about an implementation doing that, but I can't find it now.

If one cannot have such ints directly, there is not much point in unpacking via IndexInput.getInt(); it would be better to unpack directly from the bytes.
Also, in that case, I'd prefer to use single byte increments (rather than 4 byte increments) for the size of the encoded data for variations on Simple9.
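
For reference, this is the shifting under discussion; assembling a big-endian int from 4 bytes looks like this (a sketch mirroring what readInt() does, not quoted from it):

static int intFromBytes(byte[] b, int off) {
  return ((b[off]     & 0xFF) << 24)
       | ((b[off + 1] & 0xFF) << 16)
       | ((b[off + 2] & 0xFF) <<  8)
       |  (b[off + 3] & 0xFF);
}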

          Toke Eskildsen added a comment -

Looking at bit patterns and persistence, I see 3 different ones: packed, aligned32 and aligned64. Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed. As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identical for both packed and aligned. This leads me to propose a header designating one of the three structures mentioned.

The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format. It seems redundant to have both, but I might be missing something here? Either way, bitsPerValue is ambiguous, as it does not translate to memory usage the same way for packed, aligned32 and aligned64. Should I choose, I'd go for maxValue.

          What about a header stating

          format (String "packed", "aligned32" or "aligned64")
          valueCount (vInt)
          maxValue (vLong)
          

          ?

          I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.
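
Writing that header would only be a few lines against the existing IndexOutput primitives (a sketch; writeHeader is a hypothetical name):

// Hypothetical header writer for the proposed persistent format.
static void writeHeader(org.apache.lucene.store.IndexOutput out, String format,
                        int valueCount, long maxValue) throws java.io.IOException {
  out.writeString(format);   // "packed", "aligned32" or "aligned64"
  out.writeVInt(valueCount);
  out.writeVLong(maxValue);
}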

          Paul Elschot added a comment -

          How about encoding the header something like this:

• VByte: 1 bit.
• Simple9: 4 bits. These cases imply valueCount and maxValue.
• For around 4-16 numbers: encode the complete header in 5-6 bits, also implying valueCount and maxValue.
• For FrameOfRef, encoding 32 or more numbers: a larger header can be used, 4 bytes for example; maxValue is implied from the number of frame bits. valueCount could be 32, 64 or 128 (i.e. 2 bits). The number of exceptions could also be put there.

          This header "type" can be chosen depending on the given length of the encoded sequence.

          Michael McCandless added a comment -

          I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.

          Nice! Sounds like good progress Toke!

          The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format

          I was only storing maxValue as a convenience for the layer above – we don't need to do that – I think storing format (packed, aligned32, aligned64) and bitsPerValue makes sense.

          Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed.

Right, but with the challenge (if we use a 32bit backing array) of properly handling the nbits>32 case (this is perfectly doable... "it's just software").

As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identical for both packed and aligned.

          I had chosen to match IndexOutput/Inputs's byte order (big-endian) so that the packed format naturally reads back with IndexInput's readLong/Int/Short (I added a readShort).

I'm assuming for these special cases that dedicated Reader impls, with byte[], short[], int[], long[] backing arrays, are faster than eg backing with a long[] and shift/masking per lookup.

          But eg for the nbits=3 case, aligned 32/64 would ensure that no value spans across two underlying entries in the backing array (wasting some bits of storage in exchange). Whereas the nbits=2 or 4 cases would naturally be aligned anyway...

          One question: the Reader api is now this:

          long get(int index);
          

          Which is convenient since obviously long can accommodate all of the underlying possible nbits, but... for small nbits values, this logically entails a cast. EG say nbits=8, so it's a direct byte[] backing array. get() must cast up to long, and caller must operate with long... I'm wondering whether that forced casting is going to hurt performance enough to make us want to have dedicated precision (8, 16, 32, 64) Reader interfaces....
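
Concretely, the nbits=8 case would look something like this sketch, with the widening on every get():

final class DirectByteReader {
  private final byte[] values;
  DirectByteReader(byte[] values) { this.values = values; }
  long get(int index) {
    return values[index] & 0xFFL; // zero-extend: the stored value is unsigned
  }
}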

          Michael McCandless added a comment -

          This header "type" can be chosen depending on the given length of the encoded sequence.

          I think we should separate the header needed for this issue (single header stored once at the beginning of a potentially massive file), from the per-block header used by int block based formats like SimpleN/PForDelta/etc?

          I think Toke's proposed header, if we swap in bitsPerValue in place of maxValue, is good for this issue?

          Toke Eskildsen added a comment -

I've uploaded a preliminary patch with packed32, packed64, directByte, directShort, directInt and directLong implementations. I've used Michael McCandless' patch as a foundation, but the new patch is generated to be independent of the old one. It uses maxValue instead of bitsPerValue for the header, there's no test of packed32 yet, and there's a general need for cleanup. The main missing components are aligned32 and aligned64.

          I've done quite a bit of refactoring and (cheater that I am) added setters to all implementations of Reader, although not to the interface. Besides the nitty-gritty details of the implementation, I suspect that the code for selecting which implementation to use is a prime candidate for discussion. It is located in PackedInts and tries to select the best implementation based on preference for packed, aligned and direct paired with preference for 32bit and 64bit.

  private static IMPLEMENTATION getImplementation(
          long maxValue, PRIORITY priority, BLOCK_PREFERENCE block) {
    int bits = bitsRequired(maxValue);
    switch (priority) {
      case direct: {
        bits = getNextFixedSize(bits);
        break;
      }
      case aligned: {
        if (block == BLOCK_PREFERENCE.bit32) {
          if (bits == 7 || bits >= 11) {
            bits = getNextFixedSize(bits); // Align to byte, short, int or long
          }
        } else {
          if ((bits >= 13 && bits <= 15) || (bits >= 22)) {
            bits = getNextFixedSize(bits); // Align to short, int or long
          }
        }
        break;
      }
    }
    switch (bits) { // The safe choices
      case 8: return IMPLEMENTATION.directByte;
      case 16: return IMPLEMENTATION.directShort;
      case 32: return IMPLEMENTATION.directInt;
      case 63:
      case 64: return IMPLEMENTATION.directLong;
    }

    if (priority == PRIORITY.aligned || bits == 1 || bits == 2 || bits == 4) {
      return block == BLOCK_PREFERENCE.bit32 && bits < 32 ?
              IMPLEMENTATION.aligned32 : IMPLEMENTATION.aligned64;
    }
    return block == BLOCK_PREFERENCE.bit32 && bits < 32 ?
            IMPLEMENTATION.packed32 : IMPLEMENTATION.packed64;
  }
          

          I think that an "auto"-value for priority is worth considering: For 9, 17 and 33 bits/value, packed is often faster than aligned due to only using half the memory and thus having lower risk of level 2 cache misses. For high bits/value, such as 30, 31, 62, 63 and 64 (guesstimating here), choosing direct seems to be the best choice for most situations. Users of PackedInts should not be expected to know this.

          I'll start work on aligned32 and aligned64, but I will leave the rest of the patch alone for now, as I suspect that there'll be some changes to the current draft.

          Paul Elschot added a comment -

          The generated code in the patches has quite a few switch statements to decode a single value.
          These switch statements could be avoided by using something like this (adapted from the 1410b patch):

/** Decode a value from the compressed array of numBits-bit values by retrieving the corresponding bits.
 * Since numBits is always smaller than the number of bits in an int,
 * at most two ints in the buffer will be used.
 */
public int decodeCompressedValueBase(int compressedPos, int numBits) {
  int compressedBitPos = numBits * compressedPos;
  int intIndex = (compressedBitPos >> 5);
  int firstBitPosition = compressedBitPos & 31;
  int value = intBuffer.get(intIndex) >>> firstBitPosition;
  if ((firstBitPosition + numBits) > 32) { // value does not fit in the first int
    intIndex++;
    value |= (intBuffer.get(intIndex) << (32 - firstBitPosition));
  }
  final int maxValue = (int) ((1L << numBits) - 1);
  return value & maxValue;
}
          

          As maxValue is essentially a mask, it could also be looked up in an array.

          Could that be faster than these generated switch statements?
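
The mask-in-an-array variant would be a one-time table, along these lines (illustrative only, not from any patch):

private static final int[] MASKS = new int[33]; // MASKS[b] keeps the low b bits
static {
  for (int b = 1; b <= 32; b++) {
    MASKS[b] = (int) ((1L << b) - 1);
  }
}
// ...the decode would then end with: return value & MASKS[numBits];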

          Toke Eskildsen added a comment -

I think Michael's generated code was meant as a temporary solution until a handcrafted version was available. In packed32 and packed64, the code for decoding a value is

  public long get(final int index) {
    final long majorBitPos = (long) index * elementBits; // long math avoids int overflow
    final int elementPos = (int) (majorBitPos >>> BLOCK_BITS); // majorBitPos / BLOCK_SIZE
    final int bitPos = (int) (majorBitPos & MOD_MASK);         // majorBitPos % BLOCK_SIZE

    final int base = bitPos * FAC_BITPOS;

    return ((blocks[elementPos] << shifts[base]) >>> shifts[base + 1]) |
            ((blocks[elementPos + 1] >>> shifts[base + 2]) & readMasks[bitPos]);
  }
          

          which looks a lot like your (Paul Elschot) suggestion. It avoids all conditionals at the cost of more bit-operations and a dummy element at the end of the backing array. I must admit that my performance testing of the different solutions has been fairly ad hoc (measure, tweak, repeat), so an appropriate test would be in order.

          Paul Elschot added a comment -

          Nice to see a more mature alternative.
The trade-off between (dummy element/unconditioned extra access) and (conditioned extra access) is a concern for later.
          Both the extra access and the condition take cycles, so it's not clear which one will be faster. It might even depend on the value of elementBits.

          Paul Elschot added a comment - - edited

As to whether to use int or long in the unsigned packed int interface, the only numbers that will probably need to be long in the foreseeable future are docids. However, this change can be delayed by not allowing an index segment to grow beyond 2^32 or 2^31-1 docs, and by only implementing the long docids for multiple index segments.
So as long as it is ok to assume that an index segment can have at most MAXINT docs, we could use an int interface here.
          Do Nutch and/or Solr already have long docids implemented on multiple index readers/writers or segments?

          The other border is the max size of a document field. If that goes beyond MAXINT, the positions and maybe even the frequencies would need to be changed from int to long. But for now I can't think of a real use case with a document field that has more than MAXINT positions. That would be like a book with ten million pages of text. Did anyone ever run into this limitation?

          Michael McCandless added a comment -

Good progress!

          I think Michaels generated code was meant as a temporary solution, until a handcrafted version was available

          Actually that was intended to be a fast impl... the switch should be
          compiled to a direct lookup (maybe plus a conditional to catch the
          "default" case even though it will never happen...ugh). But I like
          your impl with no conditional at all. We should test both.

          As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids.

          Also the file offsets into the terms dict, possibly the offsets in RAM
          into the terms dict character data (UTF8 byte[]). Also, when we do
          column stride fields, we allow storing values > int. I think we
          should stick with long get(index) for now.

          Other comments:

          • Maybe we should move all of this under oal.util.packed?
            (packedints? ints?)
          • I think we should remove getMaxValue() from the Reader interface?
          • Why create the IMPLEMENTATION enum? Why not simply return an
            [anonymous] instance of Writer?
          • Why not store bitsPerValue in the header instead of maxValue? EG
            maybe my maxValue is 7000, but because I'm using directShort,
            bitsPerValue is 16. Also, the maxValue at write time should not
            have to be known – eg the factory API should let me ask for a
            direct short writer without declaring the maxValue I will store.
          • I wonder if we should add an optional Object
  getDirectBackingArray() (see the sketch after this list). The packed/aligned impls would return
            null, but the direct byte/short/int/long impls would return their
            array. This would allow callers to specialize upstream impls to
            do the direct array lookup without the cast-to-long (like how
            FieldComparator now has impls for byte,short,int,long). I suspect
            for column stride fields, when sorting by an integer field, on a
            32bit arch, this would be a perf win. But: let's wait until we
            have CSFs, and we can test whether there really is a gain here....
          • I think we shouldn't put a getWriter on every Reader
            impl... because it's a one to many mapping? Eg the format written
            by PackedWriter can be read by direct byte/short/int/long,
            Packed32/64.
          • For starters I don't think we should make reader impls that can
            read nbits > 31 bits with an int[] backing array. I think long[]
            backing array is fine.
          • I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
            Can't we have a single enum (STORAGE?) with: packed, aligned32,
            aligned64? "Direct" is really just packed with nbits rounded up
            to 8,16,32,64.
          • Aligned32/64 is very wasteful for certain nbits... I like the idea
            of "auto" to avoid risk that caller picks a bad combination.
          • I think for starters we should not make any reader impls that do
            remapping at load time.
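
A sketch of the getDirectBackingArray() idea from the list above (the interface shape is assumed for illustration, not taken from the patch):

interface Reader {
  long get(int index);
  Object getDirectBackingArray(); // null unless backed by a plain array
}

final class DirectIntReader implements Reader {
  private final int[] values;
  DirectIntReader(int[] values) { this.values = values; }
  public long get(int index) {
    return values[index] & 0xFFFFFFFFL; // zero-extend the unsigned 32-bit value
  }
  public Object getDirectBackingArray() {
    return values; // callers may specialize on int[] and skip the cast-to-long
  }
}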
          Paul Elschot added a comment -

          As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids.

          Also the file offsets into the terms dict, possibly the offsets in RAM
          into the terms dict character data (UTF8 byte[]). Also, when we do
          column stride fields, we allow storing values > int. I think we
          should stick with long get(index) for now.

          Indeed I missed the offsets. However, a long get(index) will probably get in the way, because there are far fewer offsets
          than normal data, and the 32 bit processors will be around for a long time. So I think we'll need both long (for the offsets) and int (for the rest) in the end.

          Toke Eskildsen added a comment -

          I think we should remove getMaxValue() from the Reader interface?

          Yes. I only left maxValue in the code because I ran out of time.

          Why create the IMPLEMENTATION enum? Why not simply return an
          [anonymous] instance of Writer?

          The IMPLEMENTATION enum is only used internally and is package private. It was introduced to separate decision-making from specific implementations - e.g. the packed writer is the same for packed32 and packed64, although the reader differs. But it could very well be that it confuses more than it helps.

          Why not store bitsPerValue in the header instead of maxValue?

          As above - I did not have the time to fix it and wanted to push the patch in order to move the discussion along.

          Also, the maxValue at write time should not
          have to be known - eg the factory API should let me ask for a
          direct short writer without declaring the maxValue I will store.

Since Packed and Aligned need maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.

          I wonder if we should add an optional Object
          getDirectBackingArray(). The packed/aligned impls would return
          null, but the direct byte/short/int/long impls would return their
          array. [...]

          Speaking of API additions, I find that

public int getBitsPerValue();
public int size();
public void set(int index, long value);
public void clear();
          

are trivial to implement for the known implementations. They open up possibilities such as auto-growing to fit higher values by using a delegating wrapper, re-using the structure for counting purposes, and sorting in-place.
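
For instance, the auto-growing delegating wrapper could be sketched as follows (Mutable and Factory are assumed shapes for this illustration, not the patch's API):

interface Mutable {
  long get(int index);
  void set(int index, long value);
  int size();
  int getBitsPerValue();
}

final class GrowingPacked {
  interface Factory { Mutable create(int valueCount, int bitsPerValue); }

  private final Factory factory;
  private Mutable current;

  GrowingPacked(Factory factory, Mutable initial) {
    this.factory = factory;
    this.current = initial;
  }

  void set(int index, long value) {
    final int needed = Math.max(1, 64 - Long.numberOfLeadingZeros(value));
    if (needed > current.getBitsPerValue()) {
      // Copy into a wider implementation before storing the new value.
      final Mutable wider = factory.create(current.size(), needed);
      for (int i = 0; i < current.size(); i++) {
        wider.set(i, current.get(i));
      }
      current = wider;
    }
    current.set(index, value);
  }

  long get(int index) { return current.get(index); }
}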

          I think we shouldn't put a getWriter on every Reader
          impl... because it's a one to many mapping? Eg the format written
          by PackedWriter can be read by direct byte/short/int/long,
          Packed32/64.

          Quite right.

          For starters I don't think we should make reader impls that can
          read nbits > 31 bits with an int[] backing array. I think long[]
          backing array is fine.

          The current patch limits nbits to 32 for Packed32. I am confident that an int-backed reader with nbits > 32 will be slower than a long-backed reader on a 32 bit machine.

          I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
          Can't we have a single enum (STORAGE?) with: packed, aligned32,
          aligned64? "Direct" is really just packed with nbits rounded up
          to 8,16,32,64.

I agree that it does complicate matters somewhat to have them separated. When calling getReader, the BLOCK_PREFERENCE should also be removed, as the block preference will always be the same as the architecture's. Removing the "direct" option would require the caller to do some of the logic in some cases: if low processing cost is a priority, direct is preferable, and when the bitsPerValue is calculated, the caller would have to do the if (bitsPerValue > 32) bitsPerValue = 64 and so on.

          Michael McCandless added a comment -

          Also, the maxValue at write time should not have to be known - eg the factory API should let me ask for a direct short writer without declaring the maxValue I will store.

          Since Packed and Aligned needs maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.

          Maybe the caller just always uses the bitsRequired method to get the
          required bit width per value?

          Though, when we enable specializing storing of negative values as
          well, that'll be a hassle...

          OK let's leave it as you must pass the maxValue for now.

          Speaking of API additions, I find that

public int getBitsPerValue();
public int size();
public void set(int index, long value);
public void clear();
          

are trivial to implement for the known implementations. They open up possibilities such as auto-growing to fit higher values by using a delegating wrapper, re-using the structure for counting purposes, and sorting in-place.

I think the first two make sense, but I'd rather not pursue the latter two
at this time. Ie, I think this API only needs write-once, and then
read-only.

          If we open up random writing (set/clear), with auto-growing, etc.,
          that does add complexities to the impl. EG the backing store can no
          longer be final, we'd have to do some locking (or mark the array
          volatile) for thread safety, etc.

          As far as I can tell... Lucene today doesn't yet need random write to
          the packed ints. The terms dict index and CSF are the two needs I
          think we have now. Someday (when CSF supports writing) we will... but
          not yet today?

          I don't think we need separate PRIORITY and BLOCK_PREFERENCE? Can't we have a single enum (STORAGE?) with: packed, aligned32, aligned64? "Direct" is really just packed with nbits rounded up to 8,16,32,64.

I agree that it does complicate matters somewhat to have them separated. When calling getReader, the BLOCK_PREFERENCE should also be removed, as the block preference will always be the same as the architecture's. Removing the "direct" option would require the caller to do some of the logic in some cases: if low processing cost is a priority, direct is preferable, and when the bitsPerValue is calculated, the caller would have to do the if (bitsPerValue > 32) bitsPerValue = 64 and so on.

          (There's a bug in the patch in PackedInts.getReader, where it switches
          the block size based on whether JRE is 64 bit: it's always choosing 64
          bit now).

          The "direct" option only applies during writing (ie, you round up to
          the nearest native type bit width). At read time it's just a packed
          8/16/32/64.

          Hmm... maybe we could just add an optional 2nd arg to bitsRequired, a
          boolean eg "roundUpToNative" or something, which if true does that
          rounding for you? (And then go back to caller computes bit width and
          passes it in?).
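
Such a two-argument bitsRequired could look like this sketch (the roundUpToNative flag is only a suggestion, not in the current patch):

static int bitsRequired(long maxValue, boolean roundUpToNative) {
  int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  if (!roundUpToNative) {
    return bits;
  }
  // Round up to the nearest native width: byte, short, int or long.
  if (bits <= 8) return 8;
  if (bits <= 16) return 16;
  if (bits <= 32) return 32;
  return 64;
}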

          Michael McCandless added a comment -

          Toke, are you still working on this...? If not, I can take a crack? I'd really like to get something online here before we land flex, so the terms dict index isn't so wasteful of RAM.

          Toke Eskildsen added a comment -

Changing the code to use bitsPerValue instead of maxValue for constructors and the persistent format took a bit longer than anticipated. To get things flowing, I've attached the code as it is now. I've moved the classes to o.a.l.util.packed and performed some cleanup too. It still needs aligned32 and aligned64 implementations and more cleanup, which I'll work on for the next hour today and hopefully some hours tomorrow.

One current use-case for mutable packed ints would be StringOrdValComparator (using an auto-grow wrapper), although the gain might be small as the overhead of the Strings is so large. I understand the problem of making all packed ints mutable, but a compromise might be to have a Mutable interface and a new factory method that returns the same implementations typed as Mutable instead of Reader. That way it is possible to use the implementations for things such as sorting instead of having to re-implement them. I've left the interface for Reader clean as you suggested, but kept the implementations of set in the classes for now, as the code has already been made.
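
The split could be as small as this sketch (interface and factory names follow the discussion here, not existing code):

interface Reader {
  long get(int index);
  int getBitsPerValue();
  int size();
}

interface Mutable extends Reader {
  void set(int index, long value);
}

// Factory sketch: read-only users call getReader and never see set();
// getMutable hands back the very same implementations, typed as Mutable.
// Reader  getReader(IndexInput in) throws IOException;
// Mutable getMutable(int valueCount, int bitsPerValue);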

          Toke Eskildsen added a comment -

          I've read through the comments on LUCENE-1990 and implemented most of what has been suggested. The attached patch contains implementations for all the variants we've talked about, including aligned. There's a known bug in persistence for aligned64 (and probably also for aligned32) that I haven't stomped yet. There's also a clear need for a more elaborate unit-test with regard to persistence.

Other outstanding issues, as I see them, are whether or not mutable packed arrays should be requestable (as general purpose data structures) and how the factory for creating a writer should work. I have added a getMutable method to the factory and not touched the return type Reader for the getReader method. That way read-only users will not be tempted to try and update the received structure. As for the arguments to the factory, Michael McCandless suggested that the preferences should be expressed with (packed | aligned32 | aligned64 | auto). As far as I can see, this should work. However, I've only just reached this conclusion and haven't had the time to implement it.

          A speed-test has been added and the results from my machine can be seen below. In order for it to be really usable, it should be tried on other machines too.

          I won't touch the code before sometime next week, but I'll keep an eye on LUCENE-1990 comments until then.

                  bitsPerValue          valueCount            getCount    PackedDirectByte   PackedDirectShort            Packed32     PackedAligned32     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                             1                1000            10000000                 167                 141                 258                 242                 172                 264                 242                 183
                             1             1000000            10000000                 224                 232                 266                 233                 246                 262                 238                 338
                             1            10000000            10000000                 359                 469                 280                 278                 508                 278                 272                 551
                             3                1000            10000000                 168                 166                 265                 241                 163                 262                 243                 166
                             3             1000000            10000000                 227                 226                 261                 251                 239                 274                 249                 330
                             3            10000000            10000000                 406                 476                 301                 304                 522                 300                 308                 547
                             4                1000            10000000                 167                 168                 266                 239                 164                 285                 239                 169
                             4             1000000            10000000                 228                 231                 294                 274                 262                 291                 269                 314
                             4            10000000            10000000                 385                 480                 308                 333                 514                 331                 315                 557
                             7                1000            10000000                 172                 174                 278                 248                 162                 271                 238                 177
                             7             1000000            10000000                 224                 236                 289                 281                 272                 278                 277                 345
                             7            10000000            10000000                 405                 473                 389                 447                 516                 399                 402                 553
                             8                1000            10000000                 192                 171                 268                 242                 174                 291                 240                 163
                             8             1000000            10000000                 226                 232                 291                 284                 286                 274                 265                 314
                             8            10000000            10000000                 381                 467                 406                 428                 512                 422                 419                 580
          
                  bitsPerValue          valueCount            getCount   PackedDirectShort            Packed32     PackedAligned32     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                             9                1000            10000000                 166                 274                 241                 170                 261                 237                 163
                             9             1000000            10000000                 229                 299                 273                 250                 284                 275                 327
                             9            10000000            10000000                 483                 443                 477                 519                 438                 455                 568
                            15                1000            10000000                 170                 265                 239                 174                 264                 235                 162
                            15             1000000            10000000                 232                 285                 274                 240                 278                 269                 339
                            15            10000000            10000000                 473                 518                 524                 523                 519                 521                 550
                            16                1000            10000000                 166                 263                 236                 172                 264                 235                 160
                            16             1000000            10000000                 229                 285                 278                 244                 293                 272                 332
                            16            10000000            10000000                 470                 513                 517                 509                 534                 529                 548
          
                  bitsPerValue          valueCount            getCount            Packed32     PackedAligned32     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                            17                1000            10000000                 262                 255                 177                 260                 234                 160
                            17             1000000            10000000                 290                 306                 273                 304                 290                 320
                            17            10000000            10000000                 532                 572                 533                 529                 556                 551
                            28                1000            10000000                 269                 256                 187                 267                 238                 163
                            28             1000000            10000000                 293                 295                 253                 293                 296                 312
                            28            10000000            10000000                 542                 567                 501                 548                 567                 542
                            31                1000            10000000                 260                 235                 177                 266                 232                 158
                            31             1000000            10000000                 292                 294                 244                 296                 297                 328
                            31            10000000            10000000                 552                 563                 516                 562                 568                 548
          
                  bitsPerValue          valueCount            getCount     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                            32                1000            10000000                 172                 263                 241                 166
                            32             1000000            10000000                 241                 291                 297                 320
                            32            10000000            10000000                 519                 556                 573                 546
          
                  bitsPerValue          valueCount            getCount            Packed64     PackedAligned64    PackedDirectLong
                            33                1000            10000000                 264                 239                 159
                            33             1000000            10000000                 293                 374                 319
                            33            10000000            10000000                 559                 595                 552
                            47                1000            10000000                 264                 242                 164
                            47             1000000            10000000                 319                 369                 322
                            47            10000000            10000000                 577                 601                 548
                            49                1000            10000000                 261                 243                 162
                            49             1000000            10000000                 323                 413                 319
                            49            10000000            10000000                 584                 610                 551
                            63                1000            10000000                 269                 235                 161
                            63             1000000            10000000                 396                 369                 313
                            63            10000000            10000000                 592                 596                 559
          

          (Java 1.6.0_15-b03, default settings on a Dell Precision M6500: Intel i7 Q 820 @ 1.73GHz, 8 MB level 2 cache, dual-channel PC 1333 RAM, running Ubuntu Karmic)

          Hide
          Michael McCandless added a comment -

          Great progress Toke!

          I guess we should do Mutable since you're so far along already.

          But, now that we have getMutable, can we make the concrete impls
          package private? Javadocs for Mutable.set should note that the size
          is fixed once you allocate it. We have no way to save a
          Mutable... should we add that? If so, we may want to rename Writer ->
          WriteOnceWriter. This way consumers can also get a Mutable, do random
          writes, then save, if the "write once" model isn't a good fit.

          Maybe we should just merge Mutable & Reader, then? (LongStore?
          LongArray? PackedLongs?)

          We should state clearly that these are all unsigned ints storage.

          Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
          Byte to 8). Because... while it is using a direct int[] under the hood,
          it's really using all 32 bits for the full positive int range. So
          PackedDirect32 can be used even for pos ints that would overflow a
          normal java "int". (Though, for long we obviously can't use that 64th
          bit for positive ints...).
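
          As a concrete illustration of those unsigned semantics (a sketch, not the patch code), a Direct32-style store only needs a mask when widening to long:

            final class Direct32Sketch {
              private final int[] blocks;

              Direct32Sketch(int valueCount) { blocks = new int[valueCount]; }

              void set(int index, long value) { blocks[index] = (int) value; }

              long get(int index) {
                // the mask prevents sign extension, so the full unsigned
                // 32-bit range survives the int -> long widening
                return blocks[index] & 0xFFFFFFFFL;
              }
            }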

          The @see in the new IndexInput.readShort is wrong (referencing
          writeInt).

          Can you add @lucene.internal to the javadocs?

          Seems like once we stomp the bugs, beef up the tests, and merge
          PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for
          the public API, we are nearly done? Thanks Toke!

          Hide
          Toke Eskildsen added a comment -

          I am sorry, but personal issues sapped my time and energy this week, so Lucene got bumped down my priority list. I am going to code4lib next week and I'll try and get some hacking done on the plane from Denmark to the USA, but that depends on whether or not there is a power socket near my seat. If I don't upload a patch late Monday, it will be early March before I get it done.

          But, now that we have getMutable, can we make the concrete impls
          package private? Javadocs for Mutable.set should note that the size
          is fixed once you allocate it.

          Agreed on both.

          We have no way to save a Mutable... should we add that?

          I don't know enough about persistence in Lucene to make that call. Since the writer is tied to Lucene, it would not work for general purposes, so making a writer for Mutables only seems to make sense if the user uses it to build index-structures?

          Maybe we should just merge Mutable & Reader, then? (LongStore?
          LongArray? PackedLongs?)

          I don't understand that one? You made a compelling argument for returning immutables to readers earlier (problems with concurrency and having all back ends support writes).

          As for the name... I don't know. None of them sounds right, but I have no other suggestion.

          We should state clearly that these are all unsigned ints storage.

          Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
          Byte to 8). Because... while it is using a direct int[] under the hood,
          it's really using all 32 bits for the full positive int range.

          Good point. The rest of your suggestions are also very valid.

          Hide
          Toke Eskildsen added a comment -

          I've renamed most of the classes to short form, as the "Packed" prefix was not that descriptive, and fixed some bugs. Still pending are the mutable writer and a bug in persistence for aligned64. Good news (for Lucene at least) is that an airplane-blocking snowdrift means that I have time this week for continued hacking.

          But, now that we have getMutable, can we make the concrete impls
          package private? Javadocs for Mutable.set should note that the size
          is fixed once you allocate it.

          The implementations are now package private, but I only put the note about fixed size on the getMutable-method. There's nothing wrong with creating a custom auto-growing Mutable.

          We should state clearly that these are all unsigned ints storage.

          Done.

          Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16,
          Byte to 8).

          Done (Direct8, Direct16, Direct32 and Direct64).

          The @see in the new IndexInput.readShort is wrong (referencing
          writeInt).

          Fixed.

          Can you add @lucene.internal to the javadocs?

          Should this also be applied to package private classes? Marking those as internal seems redundant.

          Seems like once we stomp the bugs, beef up the tests, and merge
          PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for
          the public API, we are nearly done?

          I've removed BLOCK_PREFERENCE from the API. It's still used internally, mainly to do controlled testing. Tests are beefed up (and currently fail for aligned, so clearly beefing worked).

          Hide
          Toke Eskildsen added a comment -

          Now we're getting somewhere. I finally squashed the persistence bug and the tests have been turned up another notch. Everything seems to run as it should. Pending issues, as I see them:

          • Review of the code
          • Should we make a MutableWriter?
          • Should we drop support for aligned?

          The last one is interesting. The code for getting a value from aligned uses division and a single RAM request:

            public long get(final int index) {
              final int blockPos = index / valuesPerBlock;
              final int bitPos = (index - (blockPos * valuesPerBlock)) * bitsPerValue;
              return (blocks[blockPos] >>> shifts[bitPos]) & readMask;
            }
          

          where the code for packed uses shifts and two RAM requests:

              final long majorBitPos = index * bitsPerValue;
              final int elementPos = (int)(majorBitPos >>> BLOCK_BITS); // / BLOCK_SIZE
              final int bitPos =     (int)(majorBitPos & MOD_MASK); // % BLOCK_SIZE);
          
              final int base = bitPos * FAC_BITPOS;
          
              return ((blocks[elementPos] << shifts[base]) >>> shifts[base+1]) |
                      ((blocks[elementPos+1] >>> shifts[base+2]) & readMasks[bitPos]);
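
          A worked example (numbers chosen for illustration) shows why two reads are needed: with bitsPerValue = 7 and index = 9, majorBitPos = 63, so elementPos = 0 and bitPos = 63. The 7-bit value straddles blocks[0] and blocks[1] (one bit in the first block, six in the second), hence the two array accesses ORed together.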
          

          I have done some tests (see the TODO-file in the attached patch) and on 64 bit machines, the difference in access speed for aligned vs. packed is not that great and not always in favor of aligned, probably because some space is wasted and the cache is not as well utilized. If this is also the case for 32 bit machines, I vote for removing aligned and using only packed with the special-case optimizations direct8, direct16, direct32 and direct64. This would also mean that there is only one persistent format.

          java -cp lucene-core-3.1-dev.jar org.apache.lucene.util.packed.PackedIntsPerformance
          

          Runs through the performance tests and delivers a simple report, so it should be very easy to test on different platforms. It only measures access speed.

          I consider this patch ready for review and will concentrate on other matters until I hear more.

          Hide
          Toke Eskildsen added a comment -

          I couldn't help making a tiny tweak to the performance test so that it outputs execution time means for the different implementations. I have attached measurements from 5 different 64 bit machines. Looking at the means, I observe the following:

          • i7 Q820 and Xeon L5420: Practically no difference between aligned and packed with a small edge to aligned
          • Core 2 and Xeon 5148: Aligned is consistently about 10% slower than packed
          • Xeon MP (old with just 1 MB CPU cache): Aligned ranges from 0-10% slower than packed, depending on bits/value

          The direct implementations outperform packed and aligned for all sane cases (using direct8 to hold only 1 bit/value is clearly a bad idea). No surprise there.

          Caveat: The tests were run without any other significantly resource-heavy processes disturbing them. This means that there was no fighting for the CPU cache.

          Major caveat: Tests are needed on processors other than 64 bit Intel.

          It would be great if someone could figure out how to make an aligned getter without using division, as that is surely what hampers aligned performance.

          Hide
          Michael McCandless added a comment -

          Great progress! I think this is very close.

          Airplane blocking snow drifts!? Where on earth are you anyway?

          Can you add @lucene.internal to the javadocs?

          Should this also be applied to package private classes? Marking those as internal seems redundant.

          Yeah I agree package private APIs don't need the @lucene.internal...

          It's very interesting that align is never a win – I think in that
          case removing it makes sense? It'll be a nice simplification.

          I think we don't need to make a MutableWriter, at least before
          committing? Nobody needs it now... (I think?).

          Other small things:

          • Can you use @lucene.internal instead of the NOTE that I had put on
            the classes?
          • We lost "final" in the RamUsageEstimator constants
          • Did we ever test performance of the specialized (generated)
            decoders using switch statements?
          Hide
          Toke Eskildsen added a comment -

          Airplane blocking snow drifts!? Where on earth are you anyway?

          In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi onto the runway in the first place. That made us miss our connecting flight.

          It's very interesting that align is never a win – I think in that case removing it makes sense? It'll be a nice simplification.

          Well, practically never wins for the machines I tested on and never wins with my implementation.

          Can you use @lucene.internal instead of the NOTE that I had put on the classes?

          Done... I think. I'm not very good at this part, so if someone else wants to do some cleanup in the JavaDoc and such, they are very welcome to.

          We lost "final" in the RamUsageEstimator constants

          Strange. Oh well, fixed.

          Did we ever test performance of the specialized (generated) decoders using switch statements?

          I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations perform so well. They're nearly on par with packed most of the time and exceed it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990-te20100226c.patch and is called when the performance test is executed.

          Attachment generated_performance-te20100226.txt contains measurements where the py-generated code is tested together with the other implementations.

          Note to self: performance-wise, a switch is not equivalent to a series of if-else statements when the cases have no omissions.

          Hide
          Michael McCandless added a comment -

          Airplane blocking snow drifts!? Where on earth are you anyway?

          In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi onto the runway in the first place. That made us miss our connecting flight.

          Good grief!

          It's very interesting that align is never a win - I think in that case removing it makes sense? It'll be a nice simplification.

          Well, practically never wins for the machines I tested on and never wins with my implementation.

          I think we should remove it...

          Did we ever test performance of the specialized (generated) decoders using switch statements?

          I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations performs so well. It's nearly on par with packed most of the time and exceeds it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990-te20100226c.patch and is called when the performance test is executed.

          Thanks for testing this! It is interesting.

          I ran the perf test on a CentOS 5.4 machine, java
          1.6.0_17-b04 64 bit server, Intel core 2 duo E8400 (3 ghz) – attached
          perf-mkm-20100227.txt. My results also show the switch impl close, though
          always a bit behind.

          Seems like we should just stick with the non-gen'd packed impl?

          Note to self: Switch is not equivalent to a series of if-else, when we're talking performance and when we switch without omissions in the cases.

          Right, if the switch cases are compact, it should compile into a fast jump
          table (though it may still do an unnecessary bounds check).
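
          For illustration only, a minimal sketch of the jump-table effect (not the generated decoder from the patch):

            // A dense, gap-free switch compiles to a tableswitch: one indexed
            // jump per call instead of a chain of compares. This decoder pulls
            // 16-bit values out of a 64-bit block purely as an example.
            static long decode16(final long block, final int slot) {
              switch (slot) { // contiguous cases 0..3
                case 0: return block & 0xFFFFL;
                case 1: return (block >>> 16) & 0xFFFFL;
                case 2: return (block >>> 32) & 0xFFFFL;
                case 3: return block >>> 48;
                default: throw new IllegalArgumentException("slot: " + slot);
              }
            }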

          I think, once we removed aligned, this is ready to commit? I think we
          should land this on flex branch? (It's using CodecUtil, BytesRef –
          I'll merge them when I commit). Then I can cutover the terms index to
          use packed ints.

          Hide
          Toke Eskildsen added a comment -

          I've tested on two 32 bit Windows machines: an Intel T2400 (32 bit only) running XP and an Athlon X2 4850e (64 bit capable) running 32 bit XP. The results can be seen in attachment performance-20100301.txt. Something curious happens with high (32+) bits/value for the T2400, as aligned overtakes packed. However, the overall picture is still that aligned only wins for a few special cases, so now I'll be happy to remove it from the patch. As a note, generated is also slower than packed on the AMD processor, although not as much as for Intel.

          I have removed all traces of aligned from PackedInts, but kept the classes in the patch, in case someone finds a faster way to handle aligned. PackedIntsPerformance still includes both the generated switch-implementation and Aligned32 and Aligned64. It should be possible to apply the patch without Aligned32, Aligned64, AlignedWriter and PackedIntsPerformance.

          Hide
          Toke Eskildsen added a comment -

          Some thoughts on avoiding the generic division by experimenting with reciprocal multiplication: For aligned, the sane number of values/block are [3, 5, 6, 7, 8, 9, 10, 16, 21, 32, 64]. I tried testing index from 0 to Integer.MAX_VALUE with these divisors and reciprocal multiplication. It worked perfectly for all divisors except [5, 7, 9, 10, 21]. Unfortunately it already fails for divisor 21 at index 252645140, which makes it useless as a full replacement. If one were so inclined, it would be possible to select the aligned implementation based on valueCount, with fallback to the "slow" version. The gain of using fast division seems quite substantial as it makes aligned 14-40% faster than packed (note: just tested on a single machine). However, re-introducing aligned with four different implementations (Aligned32, Aligned32Fast, Aligned64, Aligned64Fast) is rather daunting and it would make the selection code really messy.
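
          For illustration, here is the reciprocal-multiplication form for divisor 3, one of the divisors that did work; the magic constant is ceil(2^33 / 3) and the identity holds for every non-negative int index, while divisors such as 5, 7, 9, 10 and 21 need the correction tricks mentioned below:

            static int divideBy3(final int index) {
              // 0xAAAAAAABL == ceil(2^33 / 3); for any non-negative int index
              // the multiply-and-shift below equals index / 3 exactly
              return (int) ((index * 0xAAAAAAABL) >>> 33);
            }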

          I can see that there are well-known tricks to get around the rounding errors. Some are described at http://www.cs.uiowa.edu/~jones/bcd/divide.html#fixed . I don't know if these extra tricks would negate the 14-40% speed gain though. Since I would like to get the patch out of the door, I vote for keeping aligned disabled and just note that more bit fiddling might make it attractive at some point.

          Hide
          Michael McCandless added a comment -

          Patch looks great Toke – a few small things:

          • I think we shouldn't add Aligned*.java to svn? It'll just add
            unused bits to the JAR, and, we can always fallback to this issue
            to pull them in at a future time?
          • Can you resolve the remaining nocommits? EG (since we are
            unsigned) we can't get the 64 bit case working. I don't think we
            should rename to UnsignedXXX, nor, support minValue at this
            point, and remove the ComparableBytesRef, and I'll merge BytesRef
            into flex's when I commit.

          I can take these too – I think it's ready to commit on flex after
          this. Thanks!

          Hide
          Uwe Schindler added a comment -

          We should also add the @lucene.internal javadoc comments everywhere instead of the big NOTE. Why does one class have an all-uppercase class name?

          Hide
          Toke Eskildsen added a comment -

          Michael McCandless:

          • I think we shouldn't add Aligned*.java to svn? It'll just add
            unused bits to the JAR, and, we can always fallback to this issue
            to pull them in at a future time?

          I agree. In its current state, Aligned is just dead weight.

          This also means that the performance tester won't be part of the commit though. I can quickly make a performance tester that does not use aligned, if it is preferable to keep performance testing.

          • Can you resolve the remaining nocommits? EG (since we are
            unsigned) we can't get the 64 bit case working. I don't think we
            should rename to UnsignedXXX, nor, support minValue at this
            point, and remove the ComparableBytesRef, and I'll merge BytesRef
            into flex's when I commit.

          I can take these too - I think it's ready to commit on flex after this

          It will help a lot if you take care of these issues, thanks.

          Hide
          Toke Eskildsen added a comment -

          Uwe Schindler:

          We should also add the @lucene.internal javadoc comments everywhere instead of the big NOTE. Why does one class have an all-uppercase class name?

          Are you looking at patch LUCENE-1990-te20100301.patch? I don't see any NOTE or any all-uppercase class name?

          Hide
          Michael McCandless added a comment -

          It will help a lot if you take care of these issues, thanks.

          OK will do. I'll commit soon to flex... thanks Toke!

          Hide
          Michael McCandless added a comment -

          We should also add the @lucene.internal javadoc comments everywhere instead of the big NOTE.

          I found one more NOTE in CodecUtil – I'll fix before committing.

          Hide
          Michael McCandless added a comment -

          Why does one class have an all-uppercase class name?

          Uwe you mean eg STORAGE? (And also BLOCK, IMPLEMENTATION... but they are package private). These are enums – seems OK to make them all caps?

          Though I do think we can simplify some of this now that we're removing the aligned case... eg PERSISTENCE is an enum with only one value. I'll take a stab & post patch.

          Hide
          Michael McCandless added a comment -

          OK new patch attached:

          • Ported to flex, and cutover to CodecUtil. BytesRef required no
            changes...
          • Simplified the API/impl to not use STORAGE, PERSISTENCE,
            IMPLEMENTATION, etc. You just specify required bitsPerValue.
          • Removed Aligned*, and ConsumesRAM interface
          • Fixed the nocommits.

          I think it's ready! I'll wait a day or two...

          Hide
          Michael McCandless added a comment -

          Thanks Toke!

          Hide
          Toke Eskildsen added a comment -

          Thanks for rounding off, Michael. It's been a pleasure.

          Hide
          Robert Muir added a comment -

          By the way Toke, we have lately been benchmarking automaton queries, which are pretty intensive on the terms dictionary.
          I think we were expecting some acceptable slowdown once we switched to packed ints, but according to my benchmarks this is not the case.

          I think it's pretty impressive to see no measurable performance impact on this stuff at all, great work.

          Hide
          Toke Eskildsen added a comment -

          I am very happy to hear that, Robert. The benchmarks I made had the glaring flaw that they were ... well, benchmarks. With the CPU-cache being hammered in a real world scenario, your findings indicate that the slow round-trip to main memory dwarfs the extra logic for extracting the values from the packed structure. For a few scenarios, it might even be faster than plain arrays.

          Getting back to reality, my own findings indicate that using PackedInts for ord-based sorted search is not at all faster than plain arrays. The access pattern here is very sequential, so the chance that the needed value is already fetched from main memory is high for both plain and packed structures.

          Hide
          Michael McCandless added a comment -

          Toke, are sort ords that much slower than straight arrays for sorting?

          After flex lands I'd really like to make a variant of FieldCache.StringIndex that uses BytesRef for the values and packed ints for the ords.... should save a lot of memory in many cases (English text saves since utf8 is 1 byte per char; enumerated fields (eg country name) should save tons by using only a few bits instead of the 32 we always use now) when sorting by string.
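
          For illustration, the kind of layout being described might look like this sketch (field names and shapes are assumptions, not a committed design):

            // Unique terms stored once as concatenated UTF-8; each document
            // holds only a packed ord into the term table. An enumerated field
            // with 200 distinct values needs 8 bits per document instead of 32.
            final class PackedStringIndexSketch {
              final byte[] termBytes;       // all unique terms, UTF-8, concatenated
              final int[] termOffsets;      // start offset of each term in termBytes
              final PackedInts.Reader ords; // per-document ord, ceil(log2(#terms)) bits

              PackedStringIndexSketch(byte[] termBytes, int[] termOffsets, PackedInts.Reader ords) {
                this.termBytes = termBytes;
                this.termOffsets = termOffsets;
                this.ords = ords;
              }
            }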

          Hide
          Toke Eskildsen added a comment -

          In the original proof of concept for LUCENE-2335, I measured the time for extracting top-20 for ... 10 million? documents and got something like 600ms when using PackedInts, which is fairly slow in my book, and I recall getting better performance with straight arrays for that. This is all very fuzzy though and I'd love to be proven wrong. If PackedInts are faster for sorting too, it's getting very hard to see the downside of that representation.

          Since LUCENE-2335 relies heavily on arrays of sorted indexes into ordinal arrays, I'll make sure to performance test both PackedInts and straight arrays for sorted search.

          Hide
          Toke Eskildsen added a comment - - edited

          It seems like my unit-testing of PackedInts.Mutable wasn't good enough. There is a bug in Packed64 (and probably in Packed32 too) when using the set-method. In certain cases the secondary block is changed when it should be left alone: a value that fits entirely within one block can still clobber bits in the following block, which is exactly what the test below demonstrates. A simple unit-test is

              PackedInts.Mutable mutable = PackedInts.getMutable(26, 5);
              mutable.set(24, 31);
              mutable.set(4, 16);
              assertEquals("The value #24 should remain unchanged", 31, mutable.get(24));
          

          The PackedWriter uses a different algorithm for generating the bit stream and is unaffected by this bug.

          I expect the write-masks for the set-method to be at fault and I am working on a fix. ETA: Within an hour or sometime during the weekend, depending on difficulty.

          Michael McCandless added a comment -

          Good catch Toke! Flex actually uses Mutable on loading the terms index when indexDivisor is not 1...

          Michael McCandless added a comment -

          Reopen to fix issue Toke found with Mutable

          Toke Eskildsen added a comment -

          I've located the bug and fixed it. As expected, it was in the write-masks. Unfortunately I'm running out of time, so I cannot make a patch right now. The code for Packed64 is

            private static final long[][] WRITE_MASKS =
                    new long[ENTRY_SIZE][ENTRY_SIZE * FAC_BITPOS];
            static {
              for (int elementBits = 1 ; elementBits <= BLOCK_SIZE ; elementBits++) {
                long elementPosMask = ~(~0L << elementBits);
                int[] currentShifts = SHIFTS[elementBits];
                long[] currentMasks = WRITE_MASKS[elementBits];
                for (int bitPos = 0 ; bitPos < BLOCK_SIZE ; bitPos++) {
                  int base = bitPos * FAC_BITPOS;
                  currentMasks[base] = ~((elementPosMask
                          << currentShifts[base + 1])
                          >>> currentShifts[base]);
                  currentMasks[base+1] = ~(elementPosMask
                          << currentShifts[base + 2]);
                  currentMasks[base+2] = currentShifts[base + 2] == 0 ? 0 : ~0;
                  if (bitPos <= BLOCK_SIZE - elementBits) { // Second block not used
                    currentMasks[base+1] = ~0; // Keep all bits
                    currentMasks[base+2] = 0;  // Or with 0
                  }
                }
              }
            }
          

          The changed code is the addition of the last check for second block usage. Likewise the fix for Packed32 is

            private static final int[][] WRITE_MASKS =
                    new int[ENTRY_SIZE][ENTRY_SIZE * FAC_BITPOS];
            static {
              for (int elementBits = 1 ; elementBits <= BLOCK_SIZE ; elementBits++) {
                int elementPosMask = ~(~0 << elementBits);
                int[] currentShifts = SHIFTS[elementBits];
                int[] currentMasks = WRITE_MASKS[elementBits];
                for (int bitPos = 0 ; bitPos < BLOCK_SIZE ; bitPos++) {
                  int base = bitPos * FAC_BITPOS;
                  currentMasks[base] = ~((elementPosMask
                          << currentShifts[base + 1])
                          >>> currentShifts[base]);
                  currentMasks[base+1] = ~(elementPosMask
                          << currentShifts[base + 2]);
                  currentMasks[base+2] = currentShifts[base + 2] == 0 ? 0 : ~0;
                  if (bitPos <= BLOCK_SIZE - elementBits) { // Second block not used
                    currentMasks[base+1] = ~0; // Keep all bits
                    currentMasks[base+2] = 0;  // Or with 0
                  }
                }
              }
            }
          

          Without checking thoroughly, I'd expect the two pieces of code to be exactly the same, as the difference between Packed32 and Packed64 is just long vs. int and some constants. The unit-test from above can be used for Packed32 by explicitly creating a Packed32 instead of calling the factory, as sketched below.
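
          A minimal adaptation for Packed32 (assuming its constructor takes the same (valueCount, bitsPerValue) arguments as Packed64's):

            public void testSecondaryBlockChangePacked32() {
              PackedInts.Mutable mutable = new Packed32(26, 5);
              mutable.set(24, 31);
              mutable.set(4, 16);
              assertEquals("The value #24 should remain unchanged", 31, mutable.get(24));
            }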

          I'll be back behind the screen in a few days, where I can make a patch, but you are more than welcome to roll the patch if it is more convenient to get it immediately.

          Michael McCandless added a comment -

          I turned it into a patch (attached).

          But: without the fix, when I run the test, I don't see it failing (on current flex branch). I've fixed a few issues with packed ints on flex, but I don't think they would've fixed this. Confused....

          Toke Eskildsen added a comment -

          I did a checkout with

          svn co https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene lucene-flex
          

          and added the following method to TestPackedInts

            public void testSecondaryBlockChange() throws IOException {
              PackedInts.Mutable mutable = new Packed64(26, 5);
              mutable.set(24, 31);
              assertEquals("The value #24 should be correct", 31, mutable.get(24));
              mutable.set(4, 16);
              assertEquals("The value #24 should remain unchanged", 31, mutable.get(24));
            }
          

          after which I ran

          ant test-core
          

          which gave me

              [junit] Testsuite: org.apache.lucene.util.packed.TestPackedInts
              [junit] Testcase: testSecondaryBlockChange(org.apache.lucene.util.packed.TestPackedInts):	FAILED
              [junit] The value #24 should remain unchanged expected:<31> but was:<28>
              [junit] junit.framework.AssertionFailedError: The value #24 should remain unchanged expected:<31> but was:<28>
              [junit] 	at org.apache.lucene.util.packed.TestPackedInts.testSecondaryBlockChange(TestPackedInts.java:106)
              [junit] 	at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:276)
          

          then I added

                        if (bitPos <= BLOCK_SIZE - elementBits) { // Second block not used
                          currentMasks[base+1] = ~0; // Keep all bits
                          currentMasks[base+2] = 0;  // Or with 0
                        }
          

          to the relevant parts of Packed32 and Packed64, as described above, and ran

          ant test-core
          

          again, which gave me

              [junit] Testsuite: org.apache.lucene.util.packed.TestPackedInts
              [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 5.463 sec
          

          My initial unit-test contained an error, which I corrected after a minute or two (as far as I remember). Maybe you used the first version?

          It seems that the bug is indeed in trunk. It corrupts the value of the block after the current block in certain cases: sequential assignment of values works fine, but out-of-order assignments corrupt the array.

          Michael McCandless added a comment -

          OK indeed now I can see the failure when I add the test (I guess I did use the first version), and the changes fix it. I'll commit shortly. Thanks Toke!

          Michael McCandless added a comment -

          Thanks Toke!

          Toke Eskildsen added a comment -

          I have discovered a bug in Packed32 and Packed64: When the number of bits exceeds 2^31, the setters and getters fail. This is due to a missing cast in the calculation of the entry point into the backing int/long-arrays.

          In both Packed32 and Packed64 the line

              final long majorBitPos = index * bitsPerValue;
          

          is used in get as well as set. This should be

              final long majorBitPos = (long)index * bitsPerValue;
          

          in all 4 cases.
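
          To see the overflow concretely, a small illustration (not part of any patch; the variable names are made up):

            int index = (1 << 30) - 1;                // 1,073,741,823
            int bitsPerValue = 4;
            long broken = index * bitsPerValue;       // 32-bit multiply wraps to -4
            long fixed = (long) index * bitsPerValue; // 4,294,967,292, as intended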

          A unit test for this is

            /*
            Check if the structures properly handle the case where
            index * bitsPerValue > Integer.MAX_VALUE
             */
            public void testIntOverflow() {
              int INDEX = (int) Math.pow(2, 30);
              int BITS = 4;
          
              Packed32 p32 = new Packed32(INDEX, BITS);
              p32.set(INDEX-1, 1);
              assertEquals("The value at position " + (INDEX-1)
                  + " should be correct for Packed32", 1, p32.get(INDEX-1));
              p32 = null; // To free 512MB
          
              Packed64 p64 = new Packed64(INDEX, BITS);
              p64.set(INDEX-1, 1);
              assertEquals("The value at position " + (INDEX-1)
                  + " should be correct for Packed64", 1, p64.get(INDEX-1));
            }
          

          One big problem with the unit-test is that it requires 2^30*4/8 bytes = 512MB of heap. I am guessing that this makes it impossible to run in the standard test-suite.

          I am unsure as to how I should push this fix through. Should I create a new JIRA issue? Make a patch against trunk? Or maybe a committer could just try the test above and insert the fix in trunk?

          Simon Willnauer added a comment -

          One big problem with the unit-test is that it requires 2^30*4/8 bytes = 512MB of heap. I am guessing that this makes it impossible to run in the standard test-suite.

          Seems to be a bit high for a unit test, but you can't help it, right?

          I am unsure as to how I should push this fix through. Should I create a new JIRA issue? Make a patch against trunk? Or maybe a committer could just try the test above and insert the fix in trunk?

          I would suggest creating a new issue and attaching a patch with the fix, including your unit test. Since the test might break hudson etc., I would recommend adding an @Ignore on top of it (JUnit 4) until we decide how to include tests like that. Maybe we could introduce a special flag that enables tests like that via a JUnit Assume call, but that needs to be discussed further.
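
          A rough sketch of both mechanisms (the tests.nightly property name is made up for illustration):

            import org.junit.Ignore;
            import org.junit.Test;
            import static org.junit.Assume.assumeTrue;

            public class TestPackedIntsHuge {
              @Ignore("Requires ~256MB of heap; run manually or in nightly builds")
              @Test
              public void testIntOverflowIgnored() { /* ... */ }

              @Test
              public void testIntOverflowGated() {
                // Skipped (not failed) unless explicitly enabled, e.g. with -Dtests.nightly=true
                assumeTrue(Boolean.getBoolean("tests.nightly"));
                // ... memory-hungry assertions here ...
              }
            }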

          Toke Eskildsen added a comment -

          Remembering signed integer representation, a better test would be

            /*
            Check if the structures properly handle the case where
            index * bitsPerValue > Integer.MAX_VALUE
             */
            public void testIntOverflow() {
              int INDEX = (int)Math.pow(2, 30)+1;
              int BITS = 2;
          
              Packed32 p32 = new Packed32(INDEX, BITS);
              p32.set(INDEX-1, 1);
              assertEquals("The value at position " + (INDEX-1)
                  + " should be correct for Packed32", 1, p32.get(INDEX-1));
              p32 = null; // To free the 256MB used
          
              Packed64 p64 = new Packed64(INDEX, BITS);
              p64.set(INDEX-1, 1);
              assertEquals("The value at position " + (INDEX-1)
                  + " should be correct for Packed64", 1, p64.get(INDEX-1));
            }
          

          This still triggers the bug but requires "only" 256 MB. Is this acceptable in the Hudson environment?

          Show
          Toke Eskildsen added a comment - Remembering signed integer representation, a better test would be /* Check if the structures properly handle the case where index * bitsPerValue > Integer .MAX_VALUE */ public void testIntOverflow() { int INDEX = ( int ) Math .pow(2, 30)+1; int BITS = 2; Packed32 p32 = new Packed32(INDEX, BITS); p32.set(INDEX-1, 1); assertEquals( "The value at position " + (INDEX-1) + " should be correct for Packed32" , 1, p32.get(INDEX-1)); p32 = null ; // To free the 256MB used Packed64 p64 = new Packed64(INDEX, BITS); p64.set(INDEX-1, 1); assertEquals( "The value at position " + (INDEX-1) + " should be correct for Packed64" , 1, p64.get(INDEX-1)); } This still triggers the bug but requires "only" 256 MB. Is this acceptable in the Hudson environment?
          Hide
          Robert Muir added a comment -

          This still triggers the bug but requires "only" 256 MB. Is this acceptable in the Hudson environment?

          The default maxMemory size is wired at 512MB. I think this might be too large already: on my machine with 4 CPUs this means max 2GB for lucene and 4GB for solr tests.

          One way to do this would be to make this maxMemory configurable with a -D, set it accordingly in hudson, and also set up a way to write 'hudson-only' tests. Then stuff like this could run in the nightly build only.
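
          One possible shape for such a guard inside a JUnit 4 test (the 300MB threshold is purely illustrative):

            import static org.junit.Assume.assumeTrue;

            // Skip the test unless the JVM was started with enough heap,
            // e.g. via a -Xmx set by a hudson-only build profile.
            assumeTrue(Runtime.getRuntime().maxMemory() >= 300L * 1024 * 1024);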

          Show
          Robert Muir added a comment - This still triggers the bug but requires "only" 256 MB. Is this acceptable in the Hudson environment? The default maxMemory size is wired at 512MB. i think this might be too large already: on my machine with 4 cpus this means max 2GB for lucene and 4GB for solr tests. one way to do this would be to make this maxMemory configurable with a -D, set it accordingly in hudson, and also setup a way to write 'hudson-only' tests. then stuff like this could run in the nightly only.
          Hide
          Toke Eskildsen added a comment - - edited

          Correction: I have created LUCENE-2633 and uploaded a patch for the issue above.

          Show
          Toke Eskildsen added a comment - - edited Correction: I have created LUCENE-2633 and uploaded a patch for the issue above.
          Hide
          Uwe Schindler added a comment -

          Bulk close after release of 3.5

