Lucene - Core
  1. Lucene - Core
  2. LUCENE-2221

Micro-benchmarks for ntz and pop (BitUtils) operations.

    Details

    • Type: Task Task
    • Status: Resolved
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As suggested by Yonik, I performed a suite of micro-benchmarks to investigate the following:

      • pop() (bitCount) seems to be implemented in the same way ("hacker's delight") as in the BitUtils class (SUN's standard library from version 1.5).
      • native intrinsics have been recently added to the HotSpot that should speed up bitCount significantly.

      I have tried to run the code on various VMs and architectures, but of course the results may vary depending on the setting.

      Micro-benchmark code, evaluation

      • The micro-benchmarks were written as JUnit tests with a custom set of rules that repeats each test measuring execution time, gc use, etc.
      • There were 5 warmup runs before each test, followed by 15 benchmarked runs. The result contain overall times, round times and standard deviations where applicable.
      • There were several tests for isolated performance of BitUtil.pop(), JDK's bitCount, BitUtil.ntz(), BitUtil.ntz2(), BitUtil.ntz3() and JDK's numberOfTrailingZeros, the test code had the following loop:
        final long [] longs = NtzPopBenchmark.random.bits;
        int cnt = 0;
        for (int i = longs.length; --i >= 0;)
        {
           cnt += Long.bitCount(longs[i]);
        }
        volatileVariable = cnt; // to prevent dead code removal.
        
      • I also added another version of pop() based on a precomputed bit counts. This version was called pop2.
      • The input array of long values was initialized to a memory taking 200MB. There were two different sets: random (random values) and single (single bit rotated to the right in each long).
      • There were tests that called BitUtil.pop_xor between the two input bitsets (random, single).
      • Additional tests that used iterators and and other BitUtil operations showed similar performance to isolated loops, so I omit them here.

      Evaluation environment

      I tested on three different machines:

      • Pentium 4, 32-bit, 3GHZ, 2GB RAM (Windows)
      • AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, 64-bit, 4GB RAM (Ubuntu)
      • Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz, 64-bit, 4GB RAM (Ubuntu)

      and on various VMs:

      • 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.,
      • 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
      • 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.,
      • 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
      • BEA JRockit.
      • (ant other minor versions of the VMs above, depending on the computer).

      Results overview

      pop

      The times between BitUtil and JDK were mostly identical. However, on 32-bit systems, precached pop2 performed
      much better. Examples:

      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 15.61, time.warmup: 4.31, time.bench: 11.30, round: 0.75 [+- 0.02]
      test_POP_JDK_single       : 15/20 rounds, time.total: 15.67, time.warmup: 4.31, time.bench: 11.36, round: 0.76 [+- 0.02]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.55, time.warmup: 4.33, time.bench: 11.22, round: 0.75 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.55, time.warmup: 4.31, time.bench: 11.23, round: 0.75 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.69, time.warmup: 1.75, time.bench:  4.94, round: 0.33 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total:  4.66, time.warmup: 1.22, time.bench:  3.44, round: 0.23 [+- 0.01]
      

      Note the difference between random and single distributions – most probably due to more cache hits when referring to the
      lookup table. Other VMs on this 32-bit machine:

      # 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
      test_POP_JDK_single       : 15/20 rounds, time.total: 22.70, time.warmup: 5.63, time.bench: 17.08, round: 1.14 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 22.69, time.warmup: 5.63, time.bench: 17.06, round: 1.14 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.30, time.warmup: 1.63, time.bench:  4.67, round: 0.31 [+- 0.01]
      test_POP2_single          : 15/20 rounds, time.total:  4.33, time.warmup: 1.16, time.bench:  3.17, round: 0.21 [+- 0.01]
      
      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 15.28, time.warmup: 4.25, time.bench: 11.03, round: 0.74 [+- 0.03]
      test_POP_JDK_single       : 15/20 rounds, time.total: 15.16, time.warmup: 4.20, time.bench: 10.95, round: 0.73 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.12, time.warmup: 4.20, time.bench: 10.92, round: 0.73 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.13, time.warmup: 4.25, time.bench: 10.88, round: 0.73 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.78, time.warmup: 1.72, time.bench:  5.06, round: 0.34 [+- 0.01]
      test_POP2_single          : 15/20 rounds, time.total:  4.72, time.warmup: 1.20, time.bench:  3.52, round: 0.23 [+- 0.02]
      

      On 64-bit machines, the results are nearly equal, with pop2 performing slightly worse on SUN's 1.6 compared to JDK and BitUtil
      (but this difference is really tiny and not present on all VMs; see IBM J9 and SUN's 1.5 below).

      # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 3.11, time.warmup: 0.76, time.bench: 2.34, round: 0.16 [+- 0.02]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.03, time.warmup: 0.77, time.bench: 2.26, round: 0.15 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 3.63, time.warmup: 0.93, time.bench: 2.70, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.51, time.warmup: 0.89, time.bench: 2.62, round: 0.17 [+- 0.00]
      
      # 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
      test_POP_JDK_random       : 15/20 rounds, time.total: 4.80, time.warmup: 1.24, time.bench: 3.57, round: 0.24 [+- 0.01]
      test_POP_JDK_single       : 15/20 rounds, time.total: 5.00, time.warmup: 1.44, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 4.81, time.warmup: 1.24, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 4.75, time.warmup: 1.19, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total: 3.65, time.warmup: 0.90, time.bench: 2.76, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.82, time.warmup: 0.93, time.bench: 2.89, round: 0.19 [+- 0.01]
      
      # 1.5.0_18, Java HotSpot(TM) 64-Bit Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 3.72, time.warmup: 0.94, time.bench: 2.78, round: 0.19 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 5.96, time.warmup: 1.40, time.bench: 4.56, round: 0.30 [+- 0.00]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 6.16, time.warmup: 1.43, time.bench: 4.73, round: 0.31 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.62, time.warmup: 0.92, time.bench: 2.70, round: 0.18 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 3.70, time.warmup: 0.96, time.bench: 2.74, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.57, time.warmup: 0.93, time.bench: 2.65, round: 0.18 [+- 0.00]
      

      The other 64-bit machine (quad-core):

      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 2.49, time.warmup: 0.62, time.bench: 1.87, round: 0.12 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 2.47, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 2.82, time.warmup: 0.71, time.bench: 2.11, round: 0.14 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 2.15, time.warmup: 0.55, time.bench: 1.61, round: 0.11 [+- 0.00]
      

      I then replaced BitUtil.pop with BitUtil.pop2 in bit-counting methods like xor/and/or. The results are intriguing.
      On 32-bit systems, there is a measureable gain, like here:

      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_pop_xor              : 15/20 rounds, time.total:  9.78, time.warmup: 2.59, time.bench:  7.19, round: 0.48 [+- 0.01]
      test_pop2_hd_xor          : 15/20 rounds, time.total:  8.27, time.warmup: 2.22, time.bench:  6.05, round: 0.40 [+- 0.01]
      
      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_pop_xor              : 15/20 rounds, time.total:  9.89, time.warmup: 2.59, time.bench: 7.30, round: 0.49 [+- 0.02]
      test_pop2_hd_xor          : 15/20 rounds, time.total:  8.20, time.warmup: 2.24, time.bench: 5.97, round: 0.40 [+- 0.01]
      

      On 64-bit systems, when 64-bit values can be manipulated directly in registers, there was nearly no speedup or even
      a small performance penalty like in here:

      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_pop_xor              : 15/20 rounds, time.total: 1.76, time.warmup: 0.49, time.bench: 1.27, round: 0.09 [+- 0.00]
      test_pop2_hd_xor          : 15/20 rounds, time.total: 2.06, time.warmup: 0.55, time.bench: 1.51, round: 0.10 [+- 0.00]
      

      I'm guessing referencing memory on this fast processors is slower than manipulating registers.

      ntz

      On JVMs prior to 1.7, the {{ntz} version from Lucene was much faster in my tests than the one from JDK,
      but it also has a greater variance depending on the input bits' distribution (compare the same routine
      for random and single below).

      # 32-bit system;
      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_NTZ_JDK_random       : 15/20 rounds, time.total:  6.69, time.warmup: 1.73, time.bench: 4.95, round: 0.33 [+- 0.01]
      test_NTZ_JDK_single       : 15/20 rounds, time.total:  7.59, time.warmup: 1.94, time.bench: 5.66, round: 0.38 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.72, time.warmup: 0.73, time.bench: 1.98, round: 0.13 [+- 0.02]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  5.28, time.warmup: 1.34, time.bench: 3.94, round: 0.26 [+- 0.02]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.06, time.warmup: 0.81, time.bench: 2.25, round: 0.15 [+- 0.01]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  5.36, time.warmup: 1.34, time.bench: 4.02, round: 0.27 [+- 0.01]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.80, time.warmup: 1.48, time.bench: 4.31, round: 0.29 [+- 0.01]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  6.98, time.warmup: 1.81, time.bench: 5.17, round: 0.34 [+- 0.01]
      
      # 64-bit Athlon
      # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 4.59, time.warmup: 1.16, time.bench: 3.44, round: 0.23 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 6.64, time.warmup: 1.59, time.bench: 5.04, round: 0.34 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 2.09, time.warmup: 0.53, time.bench: 1.56, round: 0.10 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 3.87, time.warmup: 0.98, time.bench: 2.90, round: 0.19 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 2.09, time.warmup: 0.52, time.bench: 1.57, round: 0.10 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 3.31, time.warmup: 0.84, time.bench: 2.47, round: 0.16 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 3.31, time.warmup: 0.83, time.bench: 2.48, round: 0.17 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 5.71, time.warmup: 1.39, time.bench: 4.32, round: 0.29 [+- 0.00]
      

      But then comes the 1.7 HotSport and things change radically, on 32-bit system the JDK's version is much faster for nearly-empty
      long values:

      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_NTZ_JDK_random       : 15/20 rounds, time.total:  1.97, time.warmup: 0.61, time.bench: 1.36, round: 0.09 [+- 0.01]
      test_NTZ_JDK_single       : 15/20 rounds, time.total:  2.53, time.warmup: 0.77, time.bench: 1.77, round: 0.12 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.36, time.warmup: 0.66, time.bench: 1.70, round: 0.11 [+- 0.01]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  4.50, time.warmup: 1.19, time.bench: 3.31, round: 0.22 [+- 0.01]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.08, time.warmup: 0.81, time.bench: 2.27, round: 0.15 [+- 0.01]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  4.97, time.warmup: 1.28, time.bench: 3.69, round: 0.25 [+- 0.01]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.78, time.warmup: 1.48, time.bench: 4.30, round: 0.29 [+- 0.01]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  7.77, time.warmup: 1.91, time.bench: 5.86, round: 0.39 [+- 0.01]
      

      On the 64-bit quad core:

      # 1.6.0_13, Java HotSpot(TM) 64-Bit Server VM, 11.3-b02, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 3.92, time.warmup: 0.97, time.bench: 2.94, round: 0.20 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 3.80, time.warmup: 0.97, time.bench: 2.82, round: 0.19 [+- 0.00]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.96, time.warmup: 0.25, time.bench: 0.71, round: 0.05 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.74, time.warmup: 0.69, time.bench: 2.04, round: 0.14 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.22, time.warmup: 0.31, time.bench: 0.91, round: 0.06 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.18, time.warmup: 0.56, time.bench: 1.62, round: 0.11 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.76, time.warmup: 0.71, time.bench: 2.06, round: 0.14 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 3.47, time.warmup: 0.91, time.bench: 2.56, round: 0.17 [+- 0.01]
      

      And then comes the 1.7, compare JDK's implementation with anything else (especially the time.bench for
      the single input data. Looks like this is hardware-accelerated.

      # -server -Xbatch -Xmx1024m
      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 0.79, time.warmup: 0.21, time.bench: 0.58, round: 0.04 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 0.75, time.warmup: 0.20, time.bench: 0.55, round: 0.04 [+- 0.00]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.98, time.warmup: 0.25, time.bench: 0.72, round: 0.05 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.61, time.warmup: 0.66, time.bench: 1.95, round: 0.13 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.30, time.warmup: 0.33, time.bench: 0.97, round: 0.06 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.48, time.warmup: 0.61, time.bench: 1.88, round: 0.13 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.81, time.warmup: 0.70, time.bench: 2.11, round: 0.14 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 4.07, time.warmup: 1.02, time.bench: 3.05, round: 0.20 [+- 0.00]
      

      Conclusions

      It seems that any change introduced to these routines will hurt somebody in some configuration, so it's really hard
      for me to make choices. I would definitely opt for the precached pop2 version on 32-bit systems as it seems to
      be always faster or equally fast compared to other bit counting options. pop2 looked like this:

         private static byte [] bcounts = new byte [0x10000];
         static
         {
             for (int i = 0x10000; --i >= 0;)
                 bcounts[i] = (byte) Integer.bitCount(i);
         }
      
         public static int pop2(long v)
         {
             int t;
             return 
                   bcounts[(t = (int) v) & 0xffff]
                 + bcounts[t >>> 16]
                 + bcounts[(t = ((int) (v >>> 32))) & 0xffff]
                 + bcounts[t >>> 16];
         }
      

      As for the hardware-accelerated ntz, if one can detect 1.7, then using the JDK's version is speeding up things
      significantly. But I have not checked how this detection would affect speed if done at run-time (I assume a final
      static flag wouldn't cause any performance penalty) and it is definitely not worth replacing for folks with older
      VMs.

      Raw results data.

      I will attach raw results as part of the issue if you want to draw your own conclusions. Didn't have access to sparc-machine
      or to any machine with the newest Intels.

      1. benchmark.jar
        277 kB
        Dawid Weiss
      2. benchmarks.txt
        17 kB
        Dawid Weiss
      3. LUCENE-2221.patch
        31 kB
        Adrien Grand
      4. lucene-bitset-benchmarks.zip
        249 kB
        Dawid Weiss
      5. results-popntz.txt
        27 kB
        Dawid Weiss

        Activity

        Hide
        Dawid Weiss added a comment -

        Performance test results.

        Show
        Dawid Weiss added a comment - Performance test results.
        Hide
        Michael McCandless added a comment -

        We can switch the pop2 impl based on 32bit vs 64bit jre?

        Show
        Michael McCandless added a comment - We can switch the pop2 impl based on 32bit vs 64bit jre?
        Hide
        Dawid Weiss added a comment -

        Yes, this would be my initial suggestion. Although before this is committed to the code, I would first test such a patch:

        a) on a more diverse set of machines (and here the help of users would be invaluable), especially with different CPUs (sparc) and motherboard architectures. pop2 references memory heavily, I wonder how this affects performance.

        b) adding an if on a static final variable should effectively allow both implementations to coexist (hotspot should get rid of the dead code at runtime). I have not verified this, however.

        I'm sure Yonik has implemented precalced pop() though (since ntz is a precalc too), so I wonder if his implementation didn't show this improvement or if there were other reasons not to include it. I sometimes wish we went back to the simplicity of clock-cycle counting processors and assembly... everything was so straightforward back then

        Show
        Dawid Weiss added a comment - Yes, this would be my initial suggestion. Although before this is committed to the code, I would first test such a patch: a) on a more diverse set of machines (and here the help of users would be invaluable), especially with different CPUs (sparc) and motherboard architectures. pop2 references memory heavily, I wonder how this affects performance. b) adding an if on a static final variable should effectively allow both implementations to coexist (hotspot should get rid of the dead code at runtime). I have not verified this, however. I'm sure Yonik has implemented precalced pop() though (since ntz is a precalc too), so I wonder if his implementation didn't show this improvement or if there were other reasons not to include it. I sometimes wish we went back to the simplicity of clock-cycle counting processors and assembly... everything was so straightforward back then
        Hide
        Yonik Seeley added a comment -

        I'm sure Yonik has implemented precalced pop() though (since ntz is a precalc too),

        I never tried a 64K lookup table. The problem with something like that is that it completely blows out your L1 caches (or takes up half of it - depending on the processor). So a micro-benchmark could show a speedup while the whole program may show a slowdown. This stuff needs to be tested in context. The performance of pop() was also de-emphasized via the implementation of pop_array, pop_intersect, etc, which only call pop once for every 8 longs.

        So it's things like pop_intersect that should be the test target, not pop.
        Create 100 sets of size 10M bits and do intersectionCount between the pairs.
        That's the use case that will matter to Solr (OpenBitSet came from Solr, and Lucene doesn't use pop much AFAIK).

        Show
        Yonik Seeley added a comment - I'm sure Yonik has implemented precalced pop() though (since ntz is a precalc too), I never tried a 64K lookup table. The problem with something like that is that it completely blows out your L1 caches (or takes up half of it - depending on the processor). So a micro-benchmark could show a speedup while the whole program may show a slowdown. This stuff needs to be tested in context. The performance of pop() was also de-emphasized via the implementation of pop_array, pop_intersect, etc, which only call pop once for every 8 longs. So it's things like pop_intersect that should be the test target, not pop. Create 100 sets of size 10M bits and do intersectionCount between the pairs. That's the use case that will matter to Solr (OpenBitSet came from Solr, and Lucene doesn't use pop much AFAIK).
        Hide
        Dawid Weiss added a comment -

        Look closely at the results above, Yonik. I have done this – the test_pop_xor and test_pop2_hd_xor do exactly this. While I do agree that a lookup table blows the cache, it still seems to be more efficient than running 64-bit manipulations on a 32-bit processor, at least on those machines I had available.

        Show
        Dawid Weiss added a comment - Look closely at the results above, Yonik. I have done this – the test_pop_xor and test_pop2_hd_xor do exactly this. While I do agree that a lookup table blows the cache, it still seems to be more efficient than running 64-bit manipulations on a 32-bit processor, at least on those machines I had available.
        Hide
        Dawid Weiss added a comment -

        Plain ASCII results.

        Show
        Dawid Weiss added a comment - Plain ASCII results.
        Hide
        Yonik Seeley added a comment -

        Ah yeah, missed that.

        In fact it appears on 32 bit systems, even
        Integer.bitCount((int)v) + Integer.bitCount((int)(v>>>32))
        is faster than Long.bitCount

        Never looked into that since we were deploying on 64 bit systems (even way back in '04 in fact)

        Show
        Yonik Seeley added a comment - Ah yeah, missed that. In fact it appears on 32 bit systems, even Integer.bitCount((int)v) + Integer.bitCount((int)(v>>>32)) is faster than Long.bitCount Never looked into that since we were deploying on 64 bit systems (even way back in '04 in fact)
        Hide
        Dawid Weiss added a comment -

        I had a suspicion this must be the case. I even wondered if, given this, it wouldn't make sense to write a simple loop (loops are unrolled by HotSpot anyway ) for xor'ing long arrays and use int bitcounts twice.... when I did this, however, the times were slower. There is a lot of magic going on when HotSpot inlines stuff (register allocation, etc.), so I can only support your claim that a test in context would be much better; this context should be something that uses bitutils extensively, but on some real chunk of data and including a larger bit of Lucene code.

        Show
        Dawid Weiss added a comment - I had a suspicion this must be the case. I even wondered if, given this, it wouldn't make sense to write a simple loop (loops are unrolled by HotSpot anyway ) for xor'ing long arrays and use int bitcounts twice.... when I did this, however, the times were slower. There is a lot of magic going on when HotSpot inlines stuff (register allocation, etc.), so I can only support your claim that a test in context would be much better; this context should be something that uses bitutils extensively, but on some real chunk of data and including a larger bit of Lucene code.
        Hide
        Dawid Weiss added a comment -

        Benchmark results for array operations and iterators comparing the performance of ntz in JRE, BitUtil (Lucene trunk) and BitUtil (different implementations for 32/64 bit JVMs, runtime decision).

        Show
        Dawid Weiss added a comment - Benchmark results for array operations and iterators comparing the performance of ntz in JRE, BitUtil (Lucene trunk) and BitUtil (different implementations for 32/64 bit JVMs, runtime decision).
        Hide
        Dawid Weiss added a comment -

        Executable Java JAR with benchmarking code for anybody that wishes to repeat/ run these tests on other architectures. I'd be particularly interested in Intel I7 – these have HotSpot intrinsics.

        Show
        Dawid Weiss added a comment - Executable Java JAR with benchmarking code for anybody that wishes to repeat/ run these tests on other architectures. I'd be particularly interested in Intel I7 – these have HotSpot intrinsics.
        Hide
        Dawid Weiss added a comment -

        I wrote a set of micro-benchmarks comparing pop_* methods from BitUtil and an OpenBitSet iterator (nextSetBit) performance. Three different variations of BitUtil:

        • current Lucene trunk,
        • pop/ntz redirecting to corresponding JRE methods,
        • current Lucene trunk with pop doing an if depending on the current architecture (64/32 bits) and redirecting to table lookup or bit-fiddling version.

        The results I acquired from the machines I have access to confirm what I said earlier – the 32-bit version brings performance improvement to bit counting and does not degrade performance on 64-bit architectures. Compare time.bench and round fields:

        # Windows XP, 32-bit, Pentium 4
        
        # 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., 
        #
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.bench:  26.45, round:  1.77 [+- 0.02]
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.bench:  43.34, round:  2.89 [+- 0.00]
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.bench:  43.91, round:  2.93 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.bench:  40.61, round:  2.71 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.bench:  44.00, round:  2.93 [+- 0.01]
                                                                                                                    
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.bench:  19.86, round:  1.32 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.bench:  38.42, round:  2.56 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.bench:  37.95, round:  2.53 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.bench:  32.75, round:  2.18 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.bench:  37.87, round:  2.52 [+- 0.01]
                                                                                                                    
        #                                                                                                           
        # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems                                          
        #                                                                                                           
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.bench:  27.41, round:  1.83 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.bench:  40.69, round:  2.71 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.bench:  40.67, round:  2.71 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.bench:  40.36, round:  2.69 [+- 0.02]
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.bench:  40.69, round:  2.71 [+- 0.01]
                                                                                                                    
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.bench:  19.59, round:  1.31 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.bench:  33.23, round:  2.22 [+- 0.02]
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.bench:  33.05, round:  2.20 [+- 0.00]
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.bench:  33.66, round:  2.24 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.bench:  33.02, round:  2.20 [+- 0.01]
        

        The attachment benchmarks.txt has details for 64-bit systems. No performance degradation from a runtime selection of bit-counting routine. I had no access to a machine with hardware popcnt support.

        The other thing measured was bit-iterators and hardware support for ntz. I had access to a single machine with hardware support for ntz – bit iterators speed up a lot if Java 1.7 is available. Compare (same VM, same machine):

        Benchmark_BitUtil_trunk.test_ntz_iterator_int     :   time.bench: 26.24, round: 5.25 [+- 0.00]
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    :   time.bench: 31.53, round: 6.31 [+- 0.00]
        
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int :   time.bench: 17.52, round: 3.50 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long:   time.bench: 22.67, round: 4.53 [+- 0.00]
        

        Older VM (1.6, same machine (no support for hardware ntz):

        Benchmark_BitUtil_trunk.test_ntz_iterator_int     :   time.bench: 24.96, round: 4.99 [+- 0.00]
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    :   time.bench: 28.92, round: 5.78 [+- 0.00]
                                                                                                      
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int :   time.bench: 45.94, round: 9.19 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long:   time.bench: 48.59, round: 9.72 [+- 0.00]
        

        The binary used to acquire these numbers (and the source code) is attached to this issue.

        My idea for a patch here would be to have the following logic:

        • if the system is 32-bit, use table-based pop() – does not hurt and speeds up things on 32-bit systems.
        • if the system is 64-bit AND a system property lucene.popntz.jre is set, use JRE's implementation.
        • otherwise use the current implementation from the trunk.

        The system property above is an expert option – if somebody is running an Intel I7 and Java 1.7, enabling it will bring a measureable performance gain. I have no idea how to detect hardware support for pop/ntz other than by measuring execution time (easily affected by concurrent load) or using external utilities (cat /proc/cpuinfo; not too nice and platform-dependent).

        Show
        Dawid Weiss added a comment - I wrote a set of micro-benchmarks comparing pop_* methods from BitUtil and an OpenBitSet iterator ( nextSetBit ) performance. Three different variations of BitUtil : current Lucene trunk, pop/ntz redirecting to corresponding JRE methods, current Lucene trunk with pop doing an if depending on the current architecture (64/32 bits) and redirecting to table lookup or bit-fiddling version. The results I acquired from the machines I have access to confirm what I said earlier – the 32-bit version brings performance improvement to bit counting and does not degrade performance on 64-bit architectures. Compare time.bench and round fields: # Windows XP, 32-bit, Pentium 4 # 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., # Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.bench: 26.45, round: 1.77 [+- 0.02] Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.bench: 43.34, round: 2.89 [+- 0.00] Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.bench: 43.91, round: 2.93 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.bench: 40.61, round: 2.71 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.bench: 44.00, round: 2.93 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.bench: 19.86, round: 1.32 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.bench: 38.42, round: 2.56 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.bench: 37.95, round: 2.53 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.bench: 32.75, round: 2.18 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.bench: 37.87, round: 2.52 [+- 0.01] # # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems # Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.bench: 27.41, round: 1.83 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.bench: 40.69, round: 2.71 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.bench: 40.67, round: 2.71 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.bench: 40.36, round: 2.69 [+- 0.02] Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.bench: 40.69, round: 2.71 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.bench: 19.59, round: 1.31 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.bench: 33.23, round: 2.22 [+- 0.02] Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.bench: 33.05, round: 2.20 [+- 0.00] Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.bench: 33.66, round: 2.24 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.bench: 33.02, round: 2.20 [+- 0.01] The attachment benchmarks.txt has details for 64-bit systems. No performance degradation from a runtime selection of bit-counting routine. I had no access to a machine with hardware popcnt support. The other thing measured was bit-iterators and hardware support for ntz . I had access to a single machine with hardware support for ntz – bit iterators speed up a lot if Java 1.7 is available. Compare (same VM, same machine): Benchmark_BitUtil_trunk.test_ntz_iterator_int : time.bench: 26.24, round: 5.25 [+- 0.00] Benchmark_BitUtil_trunk.test_ntz_iterator_long : time.bench: 31.53, round: 6.31 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : time.bench: 17.52, round: 3.50 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: time.bench: 22.67, round: 4.53 [+- 0.00] Older VM (1.6, same machine (no support for hardware ntz): Benchmark_BitUtil_trunk.test_ntz_iterator_int : time.bench: 24.96, round: 4.99 [+- 0.00] Benchmark_BitUtil_trunk.test_ntz_iterator_long : time.bench: 28.92, round: 5.78 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : time.bench: 45.94, round: 9.19 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: time.bench: 48.59, round: 9.72 [+- 0.00] The binary used to acquire these numbers (and the source code) is attached to this issue. My idea for a patch here would be to have the following logic: if the system is 32-bit, use table-based pop() – does not hurt and speeds up things on 32-bit systems. if the system is 64-bit AND a system property lucene.popntz.jre is set, use JRE's implementation. otherwise use the current implementation from the trunk. The system property above is an expert option – if somebody is running an Intel I7 and Java 1.7, enabling it will bring a measureable performance gain. I have no idea how to detect hardware support for pop/ntz other than by measuring execution time (easily affected by concurrent load) or using external utilities (cat /proc/cpuinfo; not too nice and platform-dependent).
        Hide
        Dawid Weiss added a comment -

        Benchmarks, source code.

        Show
        Dawid Weiss added a comment - Benchmarks, source code.
        Hide
        Yonik Seeley added a comment - - edited

        If you only tested 32 bit on the P4, that can be the source of some of the difference too - they had slow shift operations. Current processors from both Intel and AMD do much better. So if you haven't, you might want to try a 32 bit JVM on one of your other processors (P4s are sort of going the way of the dinosaur anyway).

        edit: I did a quick test previously on my core2, and the lookup method was still faster (I forget what the numbers were though).

        Show
        Yonik Seeley added a comment - - edited If you only tested 32 bit on the P4, that can be the source of some of the difference too - they had slow shift operations. Current processors from both Intel and AMD do much better. So if you haven't, you might want to try a 32 bit JVM on one of your other processors (P4s are sort of going the way of the dinosaur anyway). edit: I did a quick test previously on my core2, and the lookup method was still faster (I forget what the numbers were though).
        Hide
        Dawid Weiss added a comment -

        I do have a bunch of dinosaur-age computers, such is life. I added the benchmark JAR to this issue so that other people can experiment on their own, I'm sure there is such a variety of different architectures and memory speed-to-processor speed ratios that there will be no clear winner. I just checked on two other 32-bit machines (AMD Athlon and Intel Core2 Duo U7700). The Intel shows table-lookup gain of about 10%. For AMD the results are nearly identical.

        Show
        Dawid Weiss added a comment - I do have a bunch of dinosaur-age computers, such is life. I added the benchmark JAR to this issue so that other people can experiment on their own, I'm sure there is such a variety of different architectures and memory speed-to-processor speed ratios that there will be no clear winner. I just checked on two other 32-bit machines (AMD Athlon and Intel Core2 Duo U7700). The Intel shows table-lookup gain of about 10%. For AMD the results are nearly identical.
        Hide
        Dawid Weiss added a comment -

        Results from Intel I7 – an improvement of about 20%. Significant, but I was silently hoping for more. I'm guessing memory throughput becomes the limit in these benchmarks.

        #
        # Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
        #
        
        # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., 
        
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.total:  6.91, time.warmup:  1.75, time.bench:  5.15, round: 0.34 [+- 0.00]
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.total: 10.51, time.warmup:  2.66, time.bench:  7.86, round: 0.52 [+- 0.00]
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.total: 10.49, time.warmup:  2.64, time.bench:  7.85, round: 0.52 [+- 0.00]
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.total:  9.91, time.warmup:  2.51, time.bench:  7.40, round: 0.49 [+- 0.00]
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.total: 10.48, time.warmup:  2.64, time.bench:  7.84, round: 0.52 [+- 0.00]
        Benchmark_BitUtil_trunk.test_ntz_iterator_int     :   5/7 rounds, time.total: 38.67, time.warmup: 11.18, time.bench: 27.49, round: 5.50 [+- 0.03]
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    :   5/7 rounds, time.total: 46.42, time.warmup: 13.24, time.bench: 33.19, round: 6.64 [+- 0.03]
         
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.total:  6.91, time.warmup:  1.76, time.bench:  5.15, round: 0.34 [+- 0.00]
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.total: 10.48, time.warmup:  2.64, time.bench:  7.84, round: 0.52 [+- 0.00]
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.total: 10.50, time.warmup:  2.64, time.bench:  7.86, round: 0.52 [+- 0.00]
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.total:  9.91, time.warmup:  2.51, time.bench:  7.41, round: 0.49 [+- 0.00]
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.total: 10.49, time.warmup:  2.64, time.bench:  7.85, round: 0.52 [+- 0.00]
        
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : 15/20 rounds, time.total:  4.56, time.warmup:  1.16, time.bench:  3.40, round: 0.23 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : 15/20 rounds, time.total:  8.81, time.warmup:  2.20, time.bench:  6.60, round: 0.44 [+- 0.02]
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : 15/20 rounds, time.total:  8.58, time.warmup:  2.16, time.bench:  6.42, round: 0.43 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : 15/20 rounds, time.total:  8.13, time.warmup:  2.06, time.bench:  6.07, round: 0.40 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : 15/20 rounds, time.total:  8.57, time.warmup:  2.16, time.bench:  6.41, round: 0.43 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int :   5/7 rounds, time.total: 28.68, time.warmup:  7.96, time.bench: 20.73, round: 4.15 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long:   5/7 rounds, time.total: 36.17, time.warmup: 10.30, time.bench: 25.88, round: 5.18 [+- 0.00]
        
        Show
        Dawid Weiss added a comment - Results from Intel I7 – an improvement of about 20%. Significant, but I was silently hoping for more. I'm guessing memory throughput becomes the limit in these benchmarks. # # Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz # # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.total: 6.91, time.warmup: 1.75, time.bench: 5.15, round: 0.34 [+- 0.00] Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 10.51, time.warmup: 2.66, time.bench: 7.86, round: 0.52 [+- 0.00] Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.total: 10.49, time.warmup: 2.64, time.bench: 7.85, round: 0.52 [+- 0.00] Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 9.91, time.warmup: 2.51, time.bench: 7.40, round: 0.49 [+- 0.00] Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.total: 10.48, time.warmup: 2.64, time.bench: 7.84, round: 0.52 [+- 0.00] Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 38.67, time.warmup: 11.18, time.bench: 27.49, round: 5.50 [+- 0.03] Benchmark_BitUtil_trunk.test_ntz_iterator_long : 5/7 rounds, time.total: 46.42, time.warmup: 13.24, time.bench: 33.19, round: 6.64 [+- 0.03] Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 6.91, time.warmup: 1.76, time.bench: 5.15, round: 0.34 [+- 0.00] Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.total: 10.48, time.warmup: 2.64, time.bench: 7.84, round: 0.52 [+- 0.00] Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 10.50, time.warmup: 2.64, time.bench: 7.86, round: 0.52 [+- 0.00] Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 9.91, time.warmup: 2.51, time.bench: 7.41, round: 0.49 [+- 0.00] Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 10.49, time.warmup: 2.64, time.bench: 7.85, round: 0.52 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_array : 15/20 rounds, time.total: 4.56, time.warmup: 1.16, time.bench: 3.40, round: 0.23 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 8.81, time.warmup: 2.20, time.bench: 6.60, round: 0.44 [+- 0.02] Benchmark_BitUtil_popNtzJRE.test_pop_intersect : 15/20 rounds, time.total: 8.58, time.warmup: 2.16, time.bench: 6.42, round: 0.43 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 8.13, time.warmup: 2.06, time.bench: 6.07, round: 0.40 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_union : 15/20 rounds, time.total: 8.57, time.warmup: 2.16, time.bench: 6.41, round: 0.43 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 28.68, time.warmup: 7.96, time.bench: 20.73, round: 4.15 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 36.17, time.warmup: 10.30, time.bench: 25.88, round: 5.18 [+- 0.00]
        Hide
        Yonik Seeley added a comment -

        Results from Intel I7 - an improvement of about 20%. Significant, but I was silently hoping for more.

        Remember, pop_array and friends do tricks to only call pop once for every 8 longs because pop was slow...
        for the intrinsic case, did you try a simple loop that calls pop for every element? For example:

          public static long pop_intersect_simple(long A[], long B[], int wordOffset, int numWords) {
            int end = wordOffset + numWords;
            long ret = 0;
            for (int i=wordOffset; i<end; i++) {
              ret += Long.bitCount(A[i]^B[i]);
            }
            return ret;
          }
        
        Show
        Yonik Seeley added a comment - Results from Intel I7 - an improvement of about 20%. Significant, but I was silently hoping for more. Remember, pop_array and friends do tricks to only call pop once for every 8 longs because pop was slow... for the intrinsic case, did you try a simple loop that calls pop for every element? For example: public static long pop_intersect_simple( long A[], long B[], int wordOffset, int numWords) { int end = wordOffset + numWords; long ret = 0; for ( int i=wordOffset; i<end; i++) { ret += Long .bitCount(A[i]^B[i]); } return ret; }
        Hide
        Cristian Vat added a comment -

        Ran the benchmark.jar also on my machine:

        # Windows 7 x64, Intel(R) Core(TM) i5 CPU 750 @ 2.67Ghz
        
        # JDK 1.6
        
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., 
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.total: 6.38, time.warmup: 1.67, time.bench: 4.72, round: 0.32 [+- 0.02]
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.total: 9.18, time.warmup: 2.28, time.bench: 6.90, round: 0.46 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.total: 9.24, time.warmup: 2.33, time.bench: 6.91, round: 0.46 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.total: 8.82, time.warmup: 2.17, time.bench: 6.66, round: 0.44 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.total: 9.13, time.warmup: 2.29, time.bench: 6.84, round: 0.46 [+- 0.01]
        Benchmark_BitUtil_trunk.test_ntz_iterator_int     : 5/7 rounds, time.total: 34.72, time.warmup: 10.00, time.bench: 24.72, round: 4.94 [+- 0.07]
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    : 5/7 rounds, time.total: 42.58, time.warmup: 12.39, time.bench: 30.19, round: 6.04 [+- 0.05]
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., 
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.total: 6.21, time.warmup: 1.58, time.bench: 4.63, round: 0.31 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.total: 9.19, time.warmup: 2.38, time.bench: 6.81, round: 0.45 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.total: 9.27, time.warmup: 2.28, time.bench: 6.99, round: 0.47 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.total: 8.74, time.warmup: 2.28, time.bench: 6.46, round: 0.43 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.total: 9.24, time.warmup: 2.31, time.bench: 6.93, round: 0.46 [+- 0.01]
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., 
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : 15/20 rounds, time.total: 4.16, time.warmup: 1.05, time.bench: 3.10, round: 0.21 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : 15/20 rounds, time.total: 7.32, time.warmup: 1.81, time.bench: 5.51, round: 0.37 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : 15/20 rounds, time.total: 7.42, time.warmup: 1.84, time.bench: 5.58, round: 0.37 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : 15/20 rounds, time.total: 6.74, time.warmup: 1.69, time.bench: 5.06, round: 0.34 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : 15/20 rounds, time.total: 7.13, time.warmup: 1.79, time.bench: 5.34, round: 0.36 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 25.39, time.warmup: 7.13, time.bench: 18.25, round: 3.65 [+- 0.03]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 32.25, time.warmup: 9.57, time.bench: 22.68, round: 4.54 [+- 0.03]
        
        # JDK 1.7
        
        # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., 
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.total: 6.45, time.warmup: 1.66, time.bench: 4.79, round: 0.32 [+- 0.02]
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.total: 9.35, time.warmup: 2.34, time.bench: 7.01, round: 0.47 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.total: 9.44, time.warmup: 2.37, time.bench: 7.08, round: 0.47 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.total: 8.62, time.warmup: 2.19, time.bench: 6.43, round: 0.43 [+- 0.01]
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.total: 9.34, time.warmup: 2.31, time.bench: 7.03, round: 0.47 [+- 0.01]
        Benchmark_BitUtil_trunk.test_ntz_iterator_int     : 5/7 rounds, time.total: 34.76, time.warmup: 10.16, time.bench: 24.60, round: 4.92 [+- 0.03]
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    : 5/7 rounds, time.total: 41.97, time.warmup: 12.02, time.bench: 29.95, round: 5.99 [+- 0.04]
        # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., 
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.total: 6.30, time.warmup: 1.61, time.bench: 4.68, round: 0.31 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.total: 9.33, time.warmup: 2.31, time.bench: 7.02, round: 0.47 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.total: 9.32, time.warmup: 2.33, time.bench: 6.99, round: 0.47 [+- 0.01]
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.total: 8.57, time.warmup: 2.17, time.bench: 6.39, round: 0.43 [+- 0.00]
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.total: 9.37, time.warmup: 2.31, time.bench: 7.06, round: 0.47 [+- 0.01]
        # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., 
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : 15/20 rounds, time.total: 4.19, time.warmup: 1.05, time.bench: 3.14, round: 0.21 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : 15/20 rounds, time.total: 7.26, time.warmup: 1.82, time.bench: 5.44, round: 0.36 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : 15/20 rounds, time.total: 7.29, time.warmup: 1.79, time.bench: 5.50, round: 0.37 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : 15/20 rounds, time.total: 6.66, time.warmup: 1.69, time.bench: 4.97, round: 0.33 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : 15/20 rounds, time.total: 7.13, time.warmup: 1.78, time.bench: 5.34, round: 0.36 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 25.47, time.warmup: 7.17, time.bench: 18.31, round: 3.66 [+- 0.03]
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 32.79, time.warmup: 9.22, time.bench: 23.57, round: 4.71 [+- 0.03]
        
        
        Show
        Cristian Vat added a comment - Ran the benchmark.jar also on my machine: # Windows 7 x64, Intel(R) Core(TM) i5 CPU 750 @ 2.67Ghz # JDK 1.6 # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.total: 6.38, time.warmup: 1.67, time.bench: 4.72, round: 0.32 [+- 0.02] Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 9.18, time.warmup: 2.28, time.bench: 6.90, round: 0.46 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.total: 9.24, time.warmup: 2.33, time.bench: 6.91, round: 0.46 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 8.82, time.warmup: 2.17, time.bench: 6.66, round: 0.44 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.total: 9.13, time.warmup: 2.29, time.bench: 6.84, round: 0.46 [+- 0.01] Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 34.72, time.warmup: 10.00, time.bench: 24.72, round: 4.94 [+- 0.07] Benchmark_BitUtil_trunk.test_ntz_iterator_long : 5/7 rounds, time.total: 42.58, time.warmup: 12.39, time.bench: 30.19, round: 6.04 [+- 0.05] # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 6.21, time.warmup: 1.58, time.bench: 4.63, round: 0.31 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.total: 9.19, time.warmup: 2.38, time.bench: 6.81, round: 0.45 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 9.27, time.warmup: 2.28, time.bench: 6.99, round: 0.47 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 8.74, time.warmup: 2.28, time.bench: 6.46, round: 0.43 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 9.24, time.warmup: 2.31, time.bench: 6.93, round: 0.46 [+- 0.01] # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE.test_pop_array : 15/20 rounds, time.total: 4.16, time.warmup: 1.05, time.bench: 3.10, round: 0.21 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 7.32, time.warmup: 1.81, time.bench: 5.51, round: 0.37 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_pop_intersect : 15/20 rounds, time.total: 7.42, time.warmup: 1.84, time.bench: 5.58, round: 0.37 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 6.74, time.warmup: 1.69, time.bench: 5.06, round: 0.34 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_pop_union : 15/20 rounds, time.total: 7.13, time.warmup: 1.79, time.bench: 5.34, round: 0.36 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 25.39, time.warmup: 7.13, time.bench: 18.25, round: 3.65 [+- 0.03] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 32.25, time.warmup: 9.57, time.bench: 22.68, round: 4.54 [+- 0.03] # JDK 1.7 # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.total: 6.45, time.warmup: 1.66, time.bench: 4.79, round: 0.32 [+- 0.02] Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 9.35, time.warmup: 2.34, time.bench: 7.01, round: 0.47 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.total: 9.44, time.warmup: 2.37, time.bench: 7.08, round: 0.47 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 8.62, time.warmup: 2.19, time.bench: 6.43, round: 0.43 [+- 0.01] Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.total: 9.34, time.warmup: 2.31, time.bench: 7.03, round: 0.47 [+- 0.01] Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 34.76, time.warmup: 10.16, time.bench: 24.60, round: 4.92 [+- 0.03] Benchmark_BitUtil_trunk.test_ntz_iterator_long : 5/7 rounds, time.total: 41.97, time.warmup: 12.02, time.bench: 29.95, round: 5.99 [+- 0.04] # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 6.30, time.warmup: 1.61, time.bench: 4.68, round: 0.31 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.total: 9.33, time.warmup: 2.31, time.bench: 7.02, round: 0.47 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 9.32, time.warmup: 2.33, time.bench: 6.99, round: 0.47 [+- 0.01] Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 8.57, time.warmup: 2.17, time.bench: 6.39, round: 0.43 [+- 0.00] Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 9.37, time.warmup: 2.31, time.bench: 7.06, round: 0.47 [+- 0.01] # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE.test_pop_array : 15/20 rounds, time.total: 4.19, time.warmup: 1.05, time.bench: 3.14, round: 0.21 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 7.26, time.warmup: 1.82, time.bench: 5.44, round: 0.36 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_pop_intersect : 15/20 rounds, time.total: 7.29, time.warmup: 1.79, time.bench: 5.50, round: 0.37 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 6.66, time.warmup: 1.69, time.bench: 4.97, round: 0.33 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_pop_union : 15/20 rounds, time.total: 7.13, time.warmup: 1.78, time.bench: 5.34, round: 0.36 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 25.47, time.warmup: 7.17, time.bench: 18.31, round: 3.66 [+- 0.03] Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 32.79, time.warmup: 9.22, time.bench: 23.57, round: 4.71 [+- 0.03]
        Hide
        Dawid Weiss added a comment -

        An updated set of benchmarks (simple loops and JRE ntz/pop).

        Show
        Dawid Weiss added a comment - An updated set of benchmarks (simple loops and JRE ntz/pop).
        Hide
        Dawid Weiss added a comment -

        Updated source code for the benchmarks.

        Show
        Dawid Weiss added a comment - Updated source code for the benchmarks.
        Hide
        Dawid Weiss added a comment -

        Confirmed, with a simple loop it is even faster.

        # Windows 7/64, Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
        
        #
        # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
        #
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : time.bench: 3.55, round: 0.24 [+- 0.02]
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : time.bench: 6.45, round: 0.43 [+- 0.01]
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : time.bench: 6.42, round: 0.43 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : time.bench: 6.10, round: 0.41 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : time.bench: 6.42, round: 0.43 [+- 0.00]
                                                                                                   
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : time.bench: 2.47, round: 0.16 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor   : time.bench: 4.95, round: 0.33 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_inters: time.bench: 4.96, round: 0.33 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: time.bench: 5.12, round: 0.34 [+- 0.00]
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : time.bench: 5.03, round: 0.34 [+- 0.01]
        
        Show
        Dawid Weiss added a comment - Confirmed, with a simple loop it is even faster. # Windows 7/64, Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz # # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc., # Benchmark_BitUtil_popNtzJRE.test_pop_array : time.bench: 3.55, round: 0.24 [+- 0.02] Benchmark_BitUtil_popNtzJRE.test_pop_xor : time.bench: 6.45, round: 0.43 [+- 0.01] Benchmark_BitUtil_popNtzJRE.test_pop_intersect : time.bench: 6.42, round: 0.43 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_andnot : time.bench: 6.10, round: 0.41 [+- 0.00] Benchmark_BitUtil_popNtzJRE.test_pop_union : time.bench: 6.42, round: 0.43 [+- 0.00] Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : time.bench: 2.47, round: 0.16 [+- 0.00] Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor : time.bench: 4.95, round: 0.33 [+- 0.00] Benchmark_BitUtil_popNtzJRE_simple.test_pop_inters: time.bench: 4.96, round: 0.33 [+- 0.00] Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: time.bench: 5.12, round: 0.34 [+- 0.00] Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : time.bench: 5.03, round: 0.34 [+- 0.01]
        Hide
        Dawid Weiss added a comment - - edited

        I'm done with these benchmarks. The results so far indicate that

        a) if an accelerated (intrinsic) pop/ntz is available, a simple loop with Long.* methods is much faster then the current Lucene trunk's implementation. This setting is difficult to detect reliably from within Java though.

        b) my tests on 32-bit systems indicate significant improvement in bit counting routines with table lookup implementation. This may vary between architectures.

        Show
        Dawid Weiss added a comment - - edited I'm done with these benchmarks. The results so far indicate that a) if an accelerated (intrinsic) pop/ntz is available, a simple loop with Long.* methods is much faster then the current Lucene trunk's implementation. This setting is difficult to detect reliably from within Java though. b) my tests on 32-bit systems indicate significant improvement in bit counting routines with table lookup implementation. This may vary between architectures.
        Hide
        Michael McCandless added a comment -

        Maybe Lucene could do its own micro-benchmark in the current env to determine which impl should be used? If the results are so stark depending on JVM/arch, presumably it wouldn't take much time for that test to run... and the determination would be saved statically.

        Show
        Michael McCandless added a comment - Maybe Lucene could do its own micro-benchmark in the current env to determine which impl should be used? If the results are so stark depending on JVM/arch, presumably it wouldn't take much time for that test to run... and the determination would be saved statically.
        Hide
        Stanislaw Osinski added a comment -

        I ran the benchmark on a 64bit Linux running an Intel(R) Xeon(R) E5520 @ 2.27GHz. I tried both Sun's JDK 1.7-ea as well as JDK 1.6.0_18, which also has support for native POPCNT.

        JDK 1.7-ea -server -XX:+UsePopCountInstruction

        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.total: 7.69, time.warmup: 1.96, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_int     : 5/7 rounds, time.total: 42.30, time.warmup: 12.02, time.bench: 30.29, round: 6.06 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    : 5/7 rounds, time.total: 55.48, time.warmup: 15.43, time.bench: 40.05, round: 8.01 [+- 0.06], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.total: 7.78, time.warmup: 2.05, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.total: 11.14, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : 15/20 rounds, time.total: 5.06, time.warmup: 1.29, time.bench: 3.77, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : 15/20 rounds, time.total: 7.81, time.warmup: 1.99, time.bench: 5.81, round: 0.39 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 33.55, time.warmup: 8.72, time.bench: 24.83, round: 4.97 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 39.61, time.warmup: 11.48, time.bench: 28.12, round: 5.62 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 3.25, time.warmup: 0.82, time.bench: 2.43, round: 0.16 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor   : 15/20 rounds, time.total: 5.05, time.warmup: 1.27, time.bench: 3.78, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 5.05, time.warmup: 1.27, time.bench: 3.78, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 5.34, time.warmup: 1.34, time.bench: 4.00, round: 0.27 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 5.05, time.warmup: 1.27, time.bench: 3.78, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        VM option '+UsePopCountInstruction'
        

        JDK 1.7-ea -server -XX:-UsePopCountInstruction

        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.total: 8.43, time.warmup: 2.29, time.bench: 6.13, round: 0.41 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.total: 11.97, time.warmup: 2.94, time.bench: 9.03, round: 0.60 [+- 0.03], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.total: 12.88, time.warmup: 3.25, time.bench: 9.63, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.total: 12.19, time.warmup: 3.09, time.bench: 9.11, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.total: 12.87, time.warmup: 3.25, time.bench: 9.63, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_int     : 5/7 rounds, time.total: 46.44, time.warmup: 13.03, time.bench: 33.41, round: 6.68 [+- 0.01], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    : 5/7 rounds, time.total: 56.51, time.warmup: 16.31, time.bench: 40.20, round: 8.04 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.total: 9.09, time.warmup: 2.38, time.bench: 6.71, round: 0.45 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.total: 12.75, time.warmup: 3.22, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.total: 12.75, time.warmup: 3.21, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.total: 12.08, time.warmup: 3.06, time.bench: 9.02, round: 0.60 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.total: 12.76, time.warmup: 3.22, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : 15/20 rounds, time.total: 8.98, time.warmup: 2.28, time.bench: 6.70, round: 0.45 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : 15/20 rounds, time.total: 12.75, time.warmup: 3.21, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : 15/20 rounds, time.total: 12.76, time.warmup: 3.22, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : 15/20 rounds, time.total: 12.09, time.warmup: 3.06, time.bench: 9.03, round: 0.60 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : 15/20 rounds, time.total: 11.94, time.warmup: 3.22, time.bench: 8.72, round: 0.58 [+- 0.04], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 32.77, time.warmup: 8.92, time.bench: 23.84, round: 4.77 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 39.61, time.warmup: 11.49, time.bench: 28.12, round: 5.62 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 18.03, time.warmup: 4.52, time.bench: 13.51, round: 0.90 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor   : 15/20 rounds, time.total: 19.65, time.warmup: 4.93, time.bench: 14.73, round: 0.98 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 19.65, time.warmup: 4.93, time.bench: 14.73, round: 0.98 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 20.49, time.warmup: 5.14, time.bench: 15.35, round: 1.02 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 19.65, time.warmup: 4.93, time.bench: 14.73, round: 0.98 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        VM option '-UsePopCountInstruction'
        

        JDK 1.6.0_18 -server -XX:+UsePopCountInstruction

        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.total: 8.42, time.warmup: 2.13, time.bench: 6.30, round: 0.42 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.total: 11.53, time.warmup: 2.90, time.bench: 8.63, round: 0.58 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_int     : 5/7 rounds, time.total: 46.85, time.warmup: 13.32, time.bench: 33.53, round: 6.71 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    : 5/7 rounds, time.total: 57.93, time.warmup: 16.79, time.bench: 41.15, round: 8.23 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.total: 8.43, time.warmup: 2.14, time.bench: 6.29, round: 0.42 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.total: 11.53, time.warmup: 2.90, time.bench: 8.63, round: 0.58 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : 15/20 rounds, time.total: 5.57, time.warmup: 1.42, time.bench: 4.15, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : 15/20 rounds, time.total: 9.43, time.warmup: 2.36, time.bench: 7.07, round: 0.47 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : 15/20 rounds, time.total: 9.43, time.warmup: 2.36, time.bench: 7.07, round: 0.47 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : 15/20 rounds, time.total: 8.64, time.warmup: 2.18, time.bench: 6.46, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : 15/20 rounds, time.total: 9.42, time.warmup: 2.36, time.bench: 7.07, round: 0.47 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 34.36, time.warmup: 9.70, time.bench: 24.66, round: 4.93 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 43.89, time.warmup: 13.01, time.bench: 30.88, round: 6.18 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 3.58, time.warmup: 0.90, time.bench: 2.69, round: 0.18 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor   : 15/20 rounds, time.total: 5.56, time.warmup: 1.39, time.bench: 4.16, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 5.56, time.warmup: 1.39, time.bench: 4.16, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 5.89, time.warmup: 1.48, time.bench: 4.42, round: 0.29 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 5.56, time.warmup: 1.39, time.bench: 4.16, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        

        JDK 1.6.0_18 -server -XX:-UsePopCountInstruction

        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_trunk.test_pop_array            : 15/20 rounds, time.total: 7.67, time.warmup: 1.94, time.bench: 5.74, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_xor              : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_intersect        : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_andnot           : 15/20 rounds, time.total: 10.49, time.warmup: 2.64, time.bench: 7.85, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_pop_union            : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_int     : 5/7 rounds, time.total: 42.61, time.warmup: 12.11, time.bench: 30.49, round: 6.10 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_trunk.test_ntz_iterator_long    : 5/7 rounds, time.total: 52.69, time.warmup: 15.27, time.bench: 37.42, round: 7.48 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_pop3264.test_pop_array          : 15/20 rounds, time.total: 7.69, time.warmup: 1.95, time.bench: 5.74, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_xor            : 15/20 rounds, time.total: 11.02, time.warmup: 2.76, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_intersect      : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_andnot         : 15/20 rounds, time.total: 10.48, time.warmup: 2.64, time.bench: 7.84, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_pop3264.test_pop_union          : 15/20 rounds, time.total: 11.02, time.warmup: 2.76, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE.test_pop_array        : 15/20 rounds, time.total: 7.67, time.warmup: 1.94, time.bench: 5.74, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_xor          : 15/20 rounds, time.total: 11.02, time.warmup: 2.76, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_intersect    : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_andnot       : 15/20 rounds, time.total: 10.48, time.warmup: 2.64, time.bench: 7.84, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_pop_union        : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 31.27, time.warmup: 8.83, time.bench: 22.44, round: 4.49 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 39.90, time.warmup: 11.84, time.bench: 28.06, round: 5.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc.,
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 18.03, time.warmup: 4.51, time.bench: 13.52, round: 0.90 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor   : 15/20 rounds, time.total: 21.25, time.warmup: 5.01, time.bench: 16.24, round: 1.08 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 21.67, time.warmup: 5.43, time.bench: 16.25, round: 1.08 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 22.43, time.warmup: 5.62, time.bench: 16.81, round: 1.12 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 21.65, time.warmup: 5.42, time.bench: 16.23, round: 1.08 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        
        Show
        Stanislaw Osinski added a comment - I ran the benchmark on a 64bit Linux running an Intel(R) Xeon(R) E5520 @ 2.27GHz. I tried both Sun's JDK 1.7-ea as well as JDK 1.6.0_18, which also has support for native POPCNT . JDK 1.7-ea -server -XX:+UsePopCountInstruction # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.total: 7.69, time.warmup: 1.96, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 42.30, time.warmup: 12.02, time.bench: 30.29, round: 6.06 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_long : 5/7 rounds, time.total: 55.48, time.warmup: 15.43, time.bench: 40.05, round: 8.01 [+- 0.06], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 7.78, time.warmup: 2.05, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 11.14, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE.test_pop_array : 15/20 rounds, time.total: 5.06, time.warmup: 1.29, time.bench: 3.77, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_intersect : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 7.81, time.warmup: 1.99, time.bench: 5.81, round: 0.39 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_union : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 33.55, time.warmup: 8.72, time.bench: 24.83, round: 4.97 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 39.61, time.warmup: 11.48, time.bench: 28.12, round: 5.62 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 3.25, time.warmup: 0.82, time.bench: 2.43, round: 0.16 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor : 15/20 rounds, time.total: 5.05, time.warmup: 1.27, time.bench: 3.78, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 5.05, time.warmup: 1.27, time.bench: 3.78, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 5.34, time.warmup: 1.34, time.bench: 4.00, round: 0.27 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 5.05, time.warmup: 1.27, time.bench: 3.78, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 VM option '+UsePopCountInstruction' JDK 1.7-ea -server -XX:-UsePopCountInstruction # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.total: 8.43, time.warmup: 2.29, time.bench: 6.13, round: 0.41 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 11.97, time.warmup: 2.94, time.bench: 9.03, round: 0.60 [+- 0.03], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.total: 12.88, time.warmup: 3.25, time.bench: 9.63, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 12.19, time.warmup: 3.09, time.bench: 9.11, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.total: 12.87, time.warmup: 3.25, time.bench: 9.63, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 46.44, time.warmup: 13.03, time.bench: 33.41, round: 6.68 [+- 0.01], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_long : 5/7 rounds, time.total: 56.51, time.warmup: 16.31, time.bench: 40.20, round: 8.04 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 9.09, time.warmup: 2.38, time.bench: 6.71, round: 0.45 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.total: 12.75, time.warmup: 3.22, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 12.75, time.warmup: 3.21, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 12.08, time.warmup: 3.06, time.bench: 9.02, round: 0.60 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 12.76, time.warmup: 3.22, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE.test_pop_array : 15/20 rounds, time.total: 8.98, time.warmup: 2.28, time.bench: 6.70, round: 0.45 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 12.75, time.warmup: 3.21, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_intersect : 15/20 rounds, time.total: 12.76, time.warmup: 3.22, time.bench: 9.54, round: 0.64 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 12.09, time.warmup: 3.06, time.bench: 9.03, round: 0.60 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_union : 15/20 rounds, time.total: 11.94, time.warmup: 3.22, time.bench: 8.72, round: 0.58 [+- 0.04], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 32.77, time.warmup: 8.92, time.bench: 23.84, round: 4.77 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 39.61, time.warmup: 11.49, time.bench: 28.12, round: 5.62 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 18.03, time.warmup: 4.52, time.bench: 13.51, round: 0.90 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor : 15/20 rounds, time.total: 19.65, time.warmup: 4.93, time.bench: 14.73, round: 0.98 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 19.65, time.warmup: 4.93, time.bench: 14.73, round: 0.98 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 20.49, time.warmup: 5.14, time.bench: 15.35, round: 1.02 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 19.65, time.warmup: 4.93, time.bench: 14.73, round: 0.98 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 VM option '-UsePopCountInstruction' JDK 1.6.0_18 -server -XX:+UsePopCountInstruction # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.total: 8.42, time.warmup: 2.13, time.bench: 6.30, round: 0.42 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 11.53, time.warmup: 2.90, time.bench: 8.63, round: 0.58 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 46.85, time.warmup: 13.32, time.bench: 33.53, round: 6.71 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_long : 5/7 rounds, time.total: 57.93, time.warmup: 16.79, time.bench: 41.15, round: 8.23 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 8.43, time.warmup: 2.14, time.bench: 6.29, round: 0.42 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 11.53, time.warmup: 2.90, time.bench: 8.63, round: 0.58 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 12.12, time.warmup: 3.04, time.bench: 9.08, round: 0.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE.test_pop_array : 15/20 rounds, time.total: 5.57, time.warmup: 1.42, time.bench: 4.15, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 9.43, time.warmup: 2.36, time.bench: 7.07, round: 0.47 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_intersect : 15/20 rounds, time.total: 9.43, time.warmup: 2.36, time.bench: 7.07, round: 0.47 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 8.64, time.warmup: 2.18, time.bench: 6.46, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_union : 15/20 rounds, time.total: 9.42, time.warmup: 2.36, time.bench: 7.07, round: 0.47 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 34.36, time.warmup: 9.70, time.bench: 24.66, round: 4.93 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 43.89, time.warmup: 13.01, time.bench: 30.88, round: 6.18 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 3.58, time.warmup: 0.90, time.bench: 2.69, round: 0.18 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor : 15/20 rounds, time.total: 5.56, time.warmup: 1.39, time.bench: 4.16, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 5.56, time.warmup: 1.39, time.bench: 4.16, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 5.89, time.warmup: 1.48, time.bench: 4.42, round: 0.29 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 5.56, time.warmup: 1.39, time.bench: 4.16, round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 JDK 1.6.0_18 -server -XX:-UsePopCountInstruction # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_trunk.test_pop_array : 15/20 rounds, time.total: 7.67, time.warmup: 1.94, time.bench: 5.74, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_intersect : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 10.49, time.warmup: 2.64, time.bench: 7.85, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_pop_union : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 42.61, time.warmup: 12.11, time.bench: 30.49, round: 6.10 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_trunk.test_ntz_iterator_long : 5/7 rounds, time.total: 52.69, time.warmup: 15.27, time.bench: 37.42, round: 7.48 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 7.69, time.warmup: 1.95, time.bench: 5.74, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_xor : 15/20 rounds, time.total: 11.02, time.warmup: 2.76, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 10.48, time.warmup: 2.64, time.bench: 7.84, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 11.02, time.warmup: 2.76, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE.test_pop_array : 15/20 rounds, time.total: 7.67, time.warmup: 1.94, time.bench: 5.74, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 11.02, time.warmup: 2.76, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_intersect : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 10.48, time.warmup: 2.64, time.bench: 7.84, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_pop_union : 15/20 rounds, time.total: 11.02, time.warmup: 2.77, time.bench: 8.25, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 31.27, time.warmup: 8.83, time.bench: 22.44, round: 4.49 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 39.90, time.warmup: 11.84, time.bench: 28.06, round: 5.61 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 # 1.6.0_18, Java HotSpot(TM) 64-Bit Server VM, 16.0-b13, Sun Microsystems Inc., Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 18.03, time.warmup: 4.51, time.bench: 13.52, round: 0.90 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor : 15/20 rounds, time.total: 21.25, time.warmup: 5.01, time.bench: 16.24, round: 1.08 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_intersect: 15/20 rounds, time.total: 21.67, time.warmup: 5.43, time.bench: 16.25, round: 1.08 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_andnot: 15/20 rounds, time.total: 22.43, time.warmup: 5.62, time.bench: 16.81, round: 1.12 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00 Benchmark_BitUtil_popNtzJRE_simple.test_pop_union : 15/20 rounds, time.total: 21.65, time.warmup: 5.42, time.bench: 16.23, round: 1.08 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
        Hide
        Adrien Grand added a comment -

        Now that we have dropped support for Java 5, maybe it would make sense to make Lucene use the JDK impl of ntz? According to the release notes, the numberOfTrailingZeros method was made an intrinsic in Java 6u18[1] which is nearly 3 years old now, so this sounds like a safe bet?

        [1] http://www.oracle.com/technetwork/java/javase/6u18-142093.html

        Show
        Adrien Grand added a comment - Now that we have dropped support for Java 5, maybe it would make sense to make Lucene use the JDK impl of ntz? According to the release notes, the numberOfTrailingZeros method was made an intrinsic in Java 6u18 [1] which is nearly 3 years old now, so this sounds like a safe bet? [1] http://www.oracle.com/technetwork/java/javase/6u18-142093.html
        Hide
        Dawid Weiss added a comment -

        I think it makes sense to simplify the existing code. It's going to be faster and simpler on modern JVMs/ hardware where these are inlined as intrinsics.

        Show
        Dawid Weiss added a comment - I think it makes sense to simplify the existing code. It's going to be faster and simpler on modern JVMs/ hardware where these are inlined as intrinsics.
        Hide
        Yonik Seeley added a comment -

        Now that we have dropped support for Java 5, maybe it would make sense to make Lucene use the JDK impl of ntz?

        +1

        Show
        Yonik Seeley added a comment - Now that we have dropped support for Java 5, maybe it would make sense to make Lucene use the JDK impl of ntz? +1
        Hide
        Uwe Schindler added a comment -

        +1, we could have done this already for 4.0! I think there was a proposal to do this, but some people said it might be too early (see above).

        Show
        Uwe Schindler added a comment - +1, we could have done this already for 4.0! I think there was a proposal to do this, but some people said it might be too early (see above).
        Hide
        Adrien Grand added a comment -

        Patch:

        • replace BitUtil.ntz with Long.numberOfTrailingZeros
        • replace BitUtil.pop with Long.bitCount
        • replace the pop_* methods with simple loops using Long.bitCount

        All lucene & solr tests passed except some zk tests whose failure seem unrelated.

        Show
        Adrien Grand added a comment - Patch: replace BitUtil.ntz with Long.numberOfTrailingZeros replace BitUtil.pop with Long.bitCount replace the pop_* methods with simple loops using Long.bitCount All lucene & solr tests passed except some zk tests whose failure seem unrelated.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1411712

        LUCENE-2221: Use JVM intrinsics instead of BitUtil.

        {ntz,pop}

        .

        Show
        Commit Tag Bot added a comment - [trunk commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1411712 LUCENE-2221 : Use JVM intrinsics instead of BitUtil. {ntz,pop} .
        Hide
        Dawid Weiss added a comment -

        Thanks Adrien.

        Show
        Dawid Weiss added a comment - Thanks Adrien.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1411719

        LUCENE-2221: Use JVM intrinsics instead of BitUtil.

        {ntz,pop}

        (merged from r1411712).

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1411719 LUCENE-2221 : Use JVM intrinsics instead of BitUtil. {ntz,pop} (merged from r1411712).
        Hide
        Adrien Grand added a comment -

        Thanks Adrien.

        Thanks Dawid for having done the hard work!

        I'll give a look at Mike's benchs tonight to make sure this change didn't make anything worse...

        Show
        Adrien Grand added a comment - Thanks Adrien. Thanks Dawid for having done the hard work! I'll give a look at Mike's benchs tonight to make sure this change didn't make anything worse...
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Adrien Grand
        http://svn.apache.org/viewvc?view=revision&revision=1411719

        LUCENE-2221: Use JVM intrinsics instead of BitUtil.

        {ntz,pop}

        (merged from r1411712).

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1411719 LUCENE-2221 : Use JVM intrinsics instead of BitUtil. {ntz,pop} (merged from r1411712).

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development