Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-2221

Micro-benchmarks for ntz and pop (BitUtils) operations.

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As suggested by Yonik, I performed a suite of micro-benchmarks to investigate the following:

      • pop() (bitCount) seems to be implemented in the same way ("hacker's delight") as in the BitUtils class (SUN's standard library from version 1.5).
      • native intrinsics have been recently added to the HotSpot that should speed up bitCount significantly.

      I have tried to run the code on various VMs and architectures, but of course the results may vary depending on the setting.

      Micro-benchmark code, evaluation

      • The micro-benchmarks were written as JUnit tests with a custom set of rules that repeats each test measuring execution time, gc use, etc.
      • There were 5 warmup runs before each test, followed by 15 benchmarked runs. The result contain overall times, round times and standard deviations where applicable.
      • There were several tests for isolated performance of BitUtil.pop(), JDK's bitCount, BitUtil.ntz(), BitUtil.ntz2(), BitUtil.ntz3() and JDK's numberOfTrailingZeros, the test code had the following loop:
        final long [] longs = NtzPopBenchmark.random.bits;
        int cnt = 0;
        for (int i = longs.length; --i >= 0;)
        {
           cnt += Long.bitCount(longs[i]);
        }
        volatileVariable = cnt; // to prevent dead code removal.
        
      • I also added another version of pop() based on a precomputed bit counts. This version was called pop2.
      • The input array of long values was initialized to a memory taking 200MB. There were two different sets: random (random values) and single (single bit rotated to the right in each long).
      • There were tests that called BitUtil.pop_xor between the two input bitsets (random, single).
      • Additional tests that used iterators and and other BitUtil operations showed similar performance to isolated loops, so I omit them here.

      Evaluation environment

      I tested on three different machines:

      • Pentium 4, 32-bit, 3GHZ, 2GB RAM (Windows)
      • AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, 64-bit, 4GB RAM (Ubuntu)
      • Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz, 64-bit, 4GB RAM (Ubuntu)

      and on various VMs:

      • 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.,
      • 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
      • 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.,
      • 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
      • BEA JRockit.
      • (ant other minor versions of the VMs above, depending on the computer).

      Results overview

      pop

      The times between BitUtil and JDK were mostly identical. However, on 32-bit systems, precached pop2 performed
      much better. Examples:

      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 15.61, time.warmup: 4.31, time.bench: 11.30, round: 0.75 [+- 0.02]
      test_POP_JDK_single       : 15/20 rounds, time.total: 15.67, time.warmup: 4.31, time.bench: 11.36, round: 0.76 [+- 0.02]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.55, time.warmup: 4.33, time.bench: 11.22, round: 0.75 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.55, time.warmup: 4.31, time.bench: 11.23, round: 0.75 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.69, time.warmup: 1.75, time.bench:  4.94, round: 0.33 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total:  4.66, time.warmup: 1.22, time.bench:  3.44, round: 0.23 [+- 0.01]
      

      Note the difference between random and single distributions – most probably due to more cache hits when referring to the
      lookup table. Other VMs on this 32-bit machine:

      # 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
      test_POP_JDK_single       : 15/20 rounds, time.total: 22.70, time.warmup: 5.63, time.bench: 17.08, round: 1.14 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 22.69, time.warmup: 5.63, time.bench: 17.06, round: 1.14 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.30, time.warmup: 1.63, time.bench:  4.67, round: 0.31 [+- 0.01]
      test_POP2_single          : 15/20 rounds, time.total:  4.33, time.warmup: 1.16, time.bench:  3.17, round: 0.21 [+- 0.01]
      
      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 15.28, time.warmup: 4.25, time.bench: 11.03, round: 0.74 [+- 0.03]
      test_POP_JDK_single       : 15/20 rounds, time.total: 15.16, time.warmup: 4.20, time.bench: 10.95, round: 0.73 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.12, time.warmup: 4.20, time.bench: 10.92, round: 0.73 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.13, time.warmup: 4.25, time.bench: 10.88, round: 0.73 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.78, time.warmup: 1.72, time.bench:  5.06, round: 0.34 [+- 0.01]
      test_POP2_single          : 15/20 rounds, time.total:  4.72, time.warmup: 1.20, time.bench:  3.52, round: 0.23 [+- 0.02]
      

      On 64-bit machines, the results are nearly equal, with pop2 performing slightly worse on SUN's 1.6 compared to JDK and BitUtil
      (but this difference is really tiny and not present on all VMs; see IBM J9 and SUN's 1.5 below).

      # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 3.11, time.warmup: 0.76, time.bench: 2.34, round: 0.16 [+- 0.02]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.03, time.warmup: 0.77, time.bench: 2.26, round: 0.15 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 3.63, time.warmup: 0.93, time.bench: 2.70, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.51, time.warmup: 0.89, time.bench: 2.62, round: 0.17 [+- 0.00]
      
      # 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
      test_POP_JDK_random       : 15/20 rounds, time.total: 4.80, time.warmup: 1.24, time.bench: 3.57, round: 0.24 [+- 0.01]
      test_POP_JDK_single       : 15/20 rounds, time.total: 5.00, time.warmup: 1.44, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 4.81, time.warmup: 1.24, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 4.75, time.warmup: 1.19, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total: 3.65, time.warmup: 0.90, time.bench: 2.76, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.82, time.warmup: 0.93, time.bench: 2.89, round: 0.19 [+- 0.01]
      
      # 1.5.0_18, Java HotSpot(TM) 64-Bit Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 3.72, time.warmup: 0.94, time.bench: 2.78, round: 0.19 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 5.96, time.warmup: 1.40, time.bench: 4.56, round: 0.30 [+- 0.00]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 6.16, time.warmup: 1.43, time.bench: 4.73, round: 0.31 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.62, time.warmup: 0.92, time.bench: 2.70, round: 0.18 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 3.70, time.warmup: 0.96, time.bench: 2.74, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.57, time.warmup: 0.93, time.bench: 2.65, round: 0.18 [+- 0.00]
      

      The other 64-bit machine (quad-core):

      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 2.49, time.warmup: 0.62, time.bench: 1.87, round: 0.12 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 2.47, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 2.82, time.warmup: 0.71, time.bench: 2.11, round: 0.14 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 2.15, time.warmup: 0.55, time.bench: 1.61, round: 0.11 [+- 0.00]
      

      I then replaced BitUtil.pop with BitUtil.pop2 in bit-counting methods like xor/and/or. The results are intriguing.
      On 32-bit systems, there is a measureable gain, like here:

      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_pop_xor              : 15/20 rounds, time.total:  9.78, time.warmup: 2.59, time.bench:  7.19, round: 0.48 [+- 0.01]
      test_pop2_hd_xor          : 15/20 rounds, time.total:  8.27, time.warmup: 2.22, time.bench:  6.05, round: 0.40 [+- 0.01]
      
      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_pop_xor              : 15/20 rounds, time.total:  9.89, time.warmup: 2.59, time.bench: 7.30, round: 0.49 [+- 0.02]
      test_pop2_hd_xor          : 15/20 rounds, time.total:  8.20, time.warmup: 2.24, time.bench: 5.97, round: 0.40 [+- 0.01]
      

      On 64-bit systems, when 64-bit values can be manipulated directly in registers, there was nearly no speedup or even
      a small performance penalty like in here:

      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_pop_xor              : 15/20 rounds, time.total: 1.76, time.warmup: 0.49, time.bench: 1.27, round: 0.09 [+- 0.00]
      test_pop2_hd_xor          : 15/20 rounds, time.total: 2.06, time.warmup: 0.55, time.bench: 1.51, round: 0.10 [+- 0.00]
      

      I'm guessing referencing memory on this fast processors is slower than manipulating registers.

      ntz

      On JVMs prior to 1.7, the {{ntz} version from Lucene was much faster in my tests than the one from JDK,
      but it also has a greater variance depending on the input bits' distribution (compare the same routine
      for random and single below).

      # 32-bit system;
      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_NTZ_JDK_random       : 15/20 rounds, time.total:  6.69, time.warmup: 1.73, time.bench: 4.95, round: 0.33 [+- 0.01]
      test_NTZ_JDK_single       : 15/20 rounds, time.total:  7.59, time.warmup: 1.94, time.bench: 5.66, round: 0.38 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.72, time.warmup: 0.73, time.bench: 1.98, round: 0.13 [+- 0.02]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  5.28, time.warmup: 1.34, time.bench: 3.94, round: 0.26 [+- 0.02]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.06, time.warmup: 0.81, time.bench: 2.25, round: 0.15 [+- 0.01]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  5.36, time.warmup: 1.34, time.bench: 4.02, round: 0.27 [+- 0.01]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.80, time.warmup: 1.48, time.bench: 4.31, round: 0.29 [+- 0.01]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  6.98, time.warmup: 1.81, time.bench: 5.17, round: 0.34 [+- 0.01]
      
      # 64-bit Athlon
      # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 4.59, time.warmup: 1.16, time.bench: 3.44, round: 0.23 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 6.64, time.warmup: 1.59, time.bench: 5.04, round: 0.34 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 2.09, time.warmup: 0.53, time.bench: 1.56, round: 0.10 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 3.87, time.warmup: 0.98, time.bench: 2.90, round: 0.19 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 2.09, time.warmup: 0.52, time.bench: 1.57, round: 0.10 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 3.31, time.warmup: 0.84, time.bench: 2.47, round: 0.16 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 3.31, time.warmup: 0.83, time.bench: 2.48, round: 0.17 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 5.71, time.warmup: 1.39, time.bench: 4.32, round: 0.29 [+- 0.00]
      

      But then comes the 1.7 HotSport and things change radically, on 32-bit system the JDK's version is much faster for nearly-empty
      long values:

      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_NTZ_JDK_random       : 15/20 rounds, time.total:  1.97, time.warmup: 0.61, time.bench: 1.36, round: 0.09 [+- 0.01]
      test_NTZ_JDK_single       : 15/20 rounds, time.total:  2.53, time.warmup: 0.77, time.bench: 1.77, round: 0.12 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.36, time.warmup: 0.66, time.bench: 1.70, round: 0.11 [+- 0.01]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  4.50, time.warmup: 1.19, time.bench: 3.31, round: 0.22 [+- 0.01]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.08, time.warmup: 0.81, time.bench: 2.27, round: 0.15 [+- 0.01]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  4.97, time.warmup: 1.28, time.bench: 3.69, round: 0.25 [+- 0.01]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.78, time.warmup: 1.48, time.bench: 4.30, round: 0.29 [+- 0.01]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  7.77, time.warmup: 1.91, time.bench: 5.86, round: 0.39 [+- 0.01]
      

      On the 64-bit quad core:

      # 1.6.0_13, Java HotSpot(TM) 64-Bit Server VM, 11.3-b02, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 3.92, time.warmup: 0.97, time.bench: 2.94, round: 0.20 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 3.80, time.warmup: 0.97, time.bench: 2.82, round: 0.19 [+- 0.00]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.96, time.warmup: 0.25, time.bench: 0.71, round: 0.05 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.74, time.warmup: 0.69, time.bench: 2.04, round: 0.14 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.22, time.warmup: 0.31, time.bench: 0.91, round: 0.06 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.18, time.warmup: 0.56, time.bench: 1.62, round: 0.11 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.76, time.warmup: 0.71, time.bench: 2.06, round: 0.14 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 3.47, time.warmup: 0.91, time.bench: 2.56, round: 0.17 [+- 0.01]
      

      And then comes the 1.7, compare JDK's implementation with anything else (especially the time.bench for
      the single input data. Looks like this is hardware-accelerated.

      # -server -Xbatch -Xmx1024m
      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 0.79, time.warmup: 0.21, time.bench: 0.58, round: 0.04 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 0.75, time.warmup: 0.20, time.bench: 0.55, round: 0.04 [+- 0.00]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.98, time.warmup: 0.25, time.bench: 0.72, round: 0.05 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.61, time.warmup: 0.66, time.bench: 1.95, round: 0.13 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.30, time.warmup: 0.33, time.bench: 0.97, round: 0.06 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.48, time.warmup: 0.61, time.bench: 1.88, round: 0.13 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.81, time.warmup: 0.70, time.bench: 2.11, round: 0.14 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 4.07, time.warmup: 1.02, time.bench: 3.05, round: 0.20 [+- 0.00]
      

      Conclusions

      It seems that any change introduced to these routines will hurt somebody in some configuration, so it's really hard
      for me to make choices. I would definitely opt for the precached pop2 version on 32-bit systems as it seems to
      be always faster or equally fast compared to other bit counting options. pop2 looked like this:

         private static byte [] bcounts = new byte [0x10000];
         static
         {
             for (int i = 0x10000; --i >= 0;)
                 bcounts[i] = (byte) Integer.bitCount(i);
         }
      
         public static int pop2(long v)
         {
             int t;
             return 
                   bcounts[(t = (int) v) & 0xffff]
                 + bcounts[t >>> 16]
                 + bcounts[(t = ((int) (v >>> 32))) & 0xffff]
                 + bcounts[t >>> 16];
         }
      

      As for the hardware-accelerated ntz, if one can detect 1.7, then using the JDK's version is speeding up things
      significantly. But I have not checked how this detection would affect speed if done at run-time (I assume a final
      static flag wouldn't cause any performance penalty) and it is definitely not worth replacing for folks with older
      VMs.

      Raw results data.

      I will attach raw results as part of the issue if you want to draw your own conclusions. Didn't have access to sparc-machine
      or to any machine with the newest Intels.

        Attachments

        1. LUCENE-2221.patch
          31 kB
          Adrien Grand
        2. lucene-bitset-benchmarks.zip
          249 kB
          Dawid Weiss
        3. benchmark.jar
          277 kB
          Dawid Weiss
        4. benchmarks.txt
          17 kB
          Dawid Weiss
        5. results-popntz.txt
          27 kB
          Dawid Weiss

          Activity

            People

            • Assignee:
              jpountz Adrien Grand
              Reporter:
              dawidweiss Dawid Weiss
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: