Lucene - Core
LUCENE-2221

Micro-benchmarks for ntz and pop (BitUtils) operations.


Details

    • Type: Task
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/other
    • Labels: None
    • Lucene Fields: New

    Description

      As suggested by Yonik, I performed a suite of micro-benchmarks to investigate the following:

      • pop() in BitUtil and bitCount() in Sun's standard library (available since 1.5) seem to be implemented in the same way ("Hacker's Delight"); the shared scheme is sketched below.
      • native intrinsics have recently been added to HotSpot that should speed up bitCount significantly.
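
      For reference, a minimal sketch of the Hacker's Delight style parallel bit count that both implementations are variants of (an illustration, not a copy of either routine):

         static int popSketch(long v) {
             v = v - ((v >>> 1) & 0x5555555555555555L);                         // 2-bit partial sums
             v = (v & 0x3333333333333333L) + ((v >>> 2) & 0x3333333333333333L); // 4-bit partial sums
             v = (v + (v >>> 4)) & 0x0F0F0F0F0F0F0F0FL;                         // 8-bit partial sums
             v += v >>> 8;
             v += v >>> 16;
             v += v >>> 32;
             return (int) v & 0x7F;                                             // at most 64 bits set
         }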

      I have tried to run the code on various VMs and architectures, but of course the results may vary depending on the setting.

      Micro-benchmark code, evaluation

      • The micro-benchmarks were written as JUnit tests with a custom set of rules that repeats each test, measuring execution time, GC use, etc.
      • There were 5 warmup runs before each test, followed by 15 benchmarked runs. The results contain overall times, round times and standard deviations where applicable.
      • There were several tests for the isolated performance of BitUtil.pop(), the JDK's bitCount, BitUtil.ntz(), BitUtil.ntz2(), BitUtil.ntz3() and the JDK's numberOfTrailingZeros; the test code had the following loop:
        final long [] longs = NtzPopBenchmark.random.bits;
        int cnt = 0;
        for (int i = longs.length; --i >= 0;)
        {
           cnt += Long.bitCount(longs[i]);
        }
        volatileVariable = cnt; // to prevent dead code removal.
        
      • I also added another version of pop() based on precomputed bit counts. This version was called pop2 (its code is shown in the conclusions below).
      • The input array of long values was initialized to take 200MB of memory. There were two different sets: random (random values) and single (a single bit rotated to the right in each successive long); see the sketch after this list.
      • There were tests that called BitUtil.pop_xor between the two input bitsets (random, single).
      • Additional tests that used iterators and other BitUtil operations showed performance similar to the isolated loops, so I omit them here.
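
      A minimal sketch of how the two input sets could be generated (the sizes, names and exact bit placement here are my assumptions, not taken from the benchmark code):

         // Hypothetical input generators; ~200MB of longs is about 26 million values.
         static long[] randomInput(int size, java.util.Random rnd) {
             final long[] bits = new long[size];
             for (int i = 0; i < size; i++) bits[i] = rnd.nextLong();
             return bits;
         }

         static long[] singleInput(int size) {
             final long[] bits = new long[size];
             for (int i = 0; i < size; i++) bits[i] = Long.rotateRight(1L, i); // one set bit, rotated right for each long
             return bits;
         }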

      Evaluation environment

      I tested on three different machines:

      • Pentium 4, 32-bit, 3GHz, 2GB RAM (Windows)
      • AMD Athlon(tm) 64 X2 Dual Core Processor 5200+, 64-bit, 4GB RAM (Ubuntu)
      • Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz, 64-bit, 4GB RAM (Ubuntu)

      and on various VMs:

      • 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc.,
      • 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
      • 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc.,
      • 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
      • BEA JRockit.
      • (and other minor versions of the VMs above, depending on the computer).

      Results overview

      pop

      The times for BitUtil and the JDK were mostly identical. However, on 32-bit systems, the precached pop2 performed
      much better. Examples:

      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 15.61, time.warmup: 4.31, time.bench: 11.30, round: 0.75 [+- 0.02]
      test_POP_JDK_single       : 15/20 rounds, time.total: 15.67, time.warmup: 4.31, time.bench: 11.36, round: 0.76 [+- 0.02]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.55, time.warmup: 4.33, time.bench: 11.22, round: 0.75 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.55, time.warmup: 4.31, time.bench: 11.23, round: 0.75 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.69, time.warmup: 1.75, time.bench:  4.94, round: 0.33 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total:  4.66, time.warmup: 1.22, time.bench:  3.44, round: 0.23 [+- 0.01]
      

      Note the difference between random and single distributions – most probably due to more cache hits when referring to the
      lookup table. Other VMs on this 32-bit machine:

      # 1.5.0_18, Java HotSpot(TM) Server VM, 1.5.0_18-b02, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
      test_POP_JDK_single       : 15/20 rounds, time.total: 22.70, time.warmup: 5.63, time.bench: 17.08, round: 1.14 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 22.69, time.warmup: 5.63, time.bench: 17.06, round: 1.14 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 20.67, time.warmup: 5.19, time.bench: 15.48, round: 1.03 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.30, time.warmup: 1.63, time.bench:  4.67, round: 0.31 [+- 0.01]
      test_POP2_single          : 15/20 rounds, time.total:  4.33, time.warmup: 1.16, time.bench:  3.17, round: 0.21 [+- 0.01]
      
      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_POP_JDK_random       : 15/20 rounds, time.total: 15.28, time.warmup: 4.25, time.bench: 11.03, round: 0.74 [+- 0.03]
      test_POP_JDK_single       : 15/20 rounds, time.total: 15.16, time.warmup: 4.20, time.bench: 10.95, round: 0.73 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 15.12, time.warmup: 4.20, time.bench: 10.92, round: 0.73 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 15.13, time.warmup: 4.25, time.bench: 10.88, round: 0.73 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total:  6.78, time.warmup: 1.72, time.bench:  5.06, round: 0.34 [+- 0.01]
      test_POP2_single          : 15/20 rounds, time.total:  4.72, time.warmup: 1.20, time.bench:  3.52, round: 0.23 [+- 0.02]
      

      On 64-bit machines, the results are nearly equal, with pop2 performing slightly worse on SUN's 1.6 compared to JDK and BitUtil
      (but this difference is really tiny and not present on all VMs; see IBM J9 and SUN's 1.5 below).

      # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 3.11, time.warmup: 0.76, time.bench: 2.34, round: 0.16 [+- 0.02]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 3.27, time.warmup: 0.81, time.bench: 2.46, round: 0.16 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.03, time.warmup: 0.77, time.bench: 2.26, round: 0.15 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 3.63, time.warmup: 0.93, time.bench: 2.70, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.51, time.warmup: 0.89, time.bench: 2.62, round: 0.17 [+- 0.00]
      
      # 1.6.0, IBM J9 VM, 2.4, IBM Corporation,
      test_POP_JDK_random       : 15/20 rounds, time.total: 4.80, time.warmup: 1.24, time.bench: 3.57, round: 0.24 [+- 0.01]
      test_POP_JDK_single       : 15/20 rounds, time.total: 5.00, time.warmup: 1.44, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 4.81, time.warmup: 1.24, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 4.75, time.warmup: 1.19, time.bench: 3.56, round: 0.24 [+- 0.01]
      test_POP2_random          : 15/20 rounds, time.total: 3.65, time.warmup: 0.90, time.bench: 2.76, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.82, time.warmup: 0.93, time.bench: 2.89, round: 0.19 [+- 0.01]
      
      # 1.5.0_18, Java HotSpot(TM) 64-Bit Server VM, 1.5.0_18-b02, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 3.72, time.warmup: 0.94, time.bench: 2.78, round: 0.19 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 5.96, time.warmup: 1.40, time.bench: 4.56, round: 0.30 [+- 0.00]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 6.16, time.warmup: 1.43, time.bench: 4.73, round: 0.31 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 3.62, time.warmup: 0.92, time.bench: 2.70, round: 0.18 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 3.70, time.warmup: 0.96, time.bench: 2.74, round: 0.18 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 3.57, time.warmup: 0.93, time.bench: 2.65, round: 0.18 [+- 0.00]
      

      The other 64-bit machine (quad-core):

      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_POP_JDK_random       : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP_JDK_single       : 15/20 rounds, time.total: 2.49, time.warmup: 0.62, time.bench: 1.87, round: 0.12 [+- 0.01]
      test_POP_BitUtil_random   : 15/20 rounds, time.total: 2.47, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP_BitUtil_single   : 15/20 rounds, time.total: 2.46, time.warmup: 0.62, time.bench: 1.84, round: 0.12 [+- 0.00]
      test_POP2_random          : 15/20 rounds, time.total: 2.82, time.warmup: 0.71, time.bench: 2.11, round: 0.14 [+- 0.00]
      test_POP2_single          : 15/20 rounds, time.total: 2.15, time.warmup: 0.55, time.bench: 1.61, round: 0.11 [+- 0.00]
      

      I then replaced BitUtil.pop with BitUtil.pop2 in bit-counting methods like xor/and/or. The results are intriguing.
      On 32-bit systems, there is a measurable gain, like here:

      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_pop_xor              : 15/20 rounds, time.total:  9.78, time.warmup: 2.59, time.bench:  7.19, round: 0.48 [+- 0.01]
      test_pop2_hd_xor          : 15/20 rounds, time.total:  8.27, time.warmup: 2.22, time.bench:  6.05, round: 0.40 [+- 0.01]
      
      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_pop_xor              : 15/20 rounds, time.total:  9.89, time.warmup: 2.59, time.bench: 7.30, round: 0.49 [+- 0.02]
      test_pop2_hd_xor          : 15/20 rounds, time.total:  8.20, time.warmup: 2.24, time.bench: 5.97, round: 0.40 [+- 0.01]
      

      On 64-bit systems, where 64-bit values can be manipulated directly in registers, there was nearly no speedup, or even
      a small performance penalty, as here:

      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_pop_xor              : 15/20 rounds, time.total: 1.76, time.warmup: 0.49, time.bench: 1.27, round: 0.09 [+- 0.00]
      test_pop2_hd_xor          : 15/20 rounds, time.total: 2.06, time.warmup: 0.55, time.bench: 1.51, round: 0.10 [+- 0.00]
      

      I'm guessing that referencing memory on these fast processors is slower than manipulating registers.
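
      For illustration, the xor-counting loops in these tests are conceptually of this shape (a simplified sketch, not Lucene's unrolled BitUtil.pop_xor; pop2 is the table-based routine shown in the conclusions):

         // Count bits that differ between two equally sized bitsets,
         // applying the table-based pop2 to each xor-ed word.
         static long popXorWithPop2(long[] a, long[] b) {
             long total = 0;
             for (int i = a.length; --i >= 0;) {
                 total += pop2(a[i] ^ b[i]);
             }
             return total;
         }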

      ntz

      On JVMs prior to 1.7, the ntz version from Lucene was much faster in my tests than the one from the JDK,
      but it also had a greater variance depending on the distribution of the input bits (compare the same routine
      for random and single below) – most likely because its longer branch paths are taken when the lowest set bit sits high in the word.

      # 32-bit system;
      # 1.6.0_17, Java HotSpot(TM) Server VM, 14.3-b01, Sun Microsystems Inc., 
      test_NTZ_JDK_random       : 15/20 rounds, time.total:  6.69, time.warmup: 1.73, time.bench: 4.95, round: 0.33 [+- 0.01]
      test_NTZ_JDK_single       : 15/20 rounds, time.total:  7.59, time.warmup: 1.94, time.bench: 5.66, round: 0.38 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.72, time.warmup: 0.73, time.bench: 1.98, round: 0.13 [+- 0.02]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  5.28, time.warmup: 1.34, time.bench: 3.94, round: 0.26 [+- 0.02]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.06, time.warmup: 0.81, time.bench: 2.25, round: 0.15 [+- 0.01]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  5.36, time.warmup: 1.34, time.bench: 4.02, round: 0.27 [+- 0.01]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.80, time.warmup: 1.48, time.bench: 4.31, round: 0.29 [+- 0.01]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  6.98, time.warmup: 1.81, time.bench: 5.17, round: 0.34 [+- 0.01]
      
      # 64-bit Athlon
      # 1.6.0_16, Java HotSpot(TM) 64-Bit Server VM, 14.2-b01, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 4.59, time.warmup: 1.16, time.bench: 3.44, round: 0.23 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 6.64, time.warmup: 1.59, time.bench: 5.04, round: 0.34 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 2.09, time.warmup: 0.53, time.bench: 1.56, round: 0.10 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 3.87, time.warmup: 0.98, time.bench: 2.90, round: 0.19 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 2.09, time.warmup: 0.52, time.bench: 1.57, round: 0.10 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 3.31, time.warmup: 0.84, time.bench: 2.47, round: 0.16 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 3.31, time.warmup: 0.83, time.bench: 2.48, round: 0.17 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 5.71, time.warmup: 1.39, time.bench: 4.32, round: 0.29 [+- 0.00]
      

      But then comes the 1.7 HotSpot and things change radically: on the 32-bit system the JDK's version is much faster for nearly empty
      long values:

      # 1.7.0-ea, Java HotSpot(TM) Server VM, 17.0-b06, Sun Microsystems Inc., 
      test_NTZ_JDK_random       : 15/20 rounds, time.total:  1.97, time.warmup: 0.61, time.bench: 1.36, round: 0.09 [+- 0.01]
      test_NTZ_JDK_single       : 15/20 rounds, time.total:  2.53, time.warmup: 0.77, time.bench: 1.77, round: 0.12 [+- 0.01]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total:  2.36, time.warmup: 0.66, time.bench: 1.70, round: 0.11 [+- 0.01]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total:  4.50, time.warmup: 1.19, time.bench: 3.31, round: 0.22 [+- 0.01]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total:  3.08, time.warmup: 0.81, time.bench: 2.27, round: 0.15 [+- 0.01]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total:  4.97, time.warmup: 1.28, time.bench: 3.69, round: 0.25 [+- 0.01]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total:  5.78, time.warmup: 1.48, time.bench: 4.30, round: 0.29 [+- 0.01]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total:  7.77, time.warmup: 1.91, time.bench: 5.86, round: 0.39 [+- 0.01]
      

      On the 64-bit quad core:

      # 1.6.0_13, Java HotSpot(TM) 64-Bit Server VM, 11.3-b02, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 3.92, time.warmup: 0.97, time.bench: 2.94, round: 0.20 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 3.80, time.warmup: 0.97, time.bench: 2.82, round: 0.19 [+- 0.00]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.96, time.warmup: 0.25, time.bench: 0.71, round: 0.05 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.74, time.warmup: 0.69, time.bench: 2.04, round: 0.14 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.22, time.warmup: 0.31, time.bench: 0.91, round: 0.06 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.18, time.warmup: 0.56, time.bench: 1.62, round: 0.11 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.76, time.warmup: 0.71, time.bench: 2.06, round: 0.14 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 3.47, time.warmup: 0.91, time.bench: 2.56, round: 0.17 [+- 0.01]
      

      And then comes 1.7: compare the JDK's implementation with anything else (especially time.bench for
      the single input data). It looks like this is hardware-accelerated.

      # -server -Xbatch -Xmx1024m
      # 1.7.0-ea, Java HotSpot(TM) 64-Bit Server VM, 17.0-b06, Sun Microsystems Inc.,
      test_NTZ_JDK_random       : 15/20 rounds, time.total: 0.79, time.warmup: 0.21, time.bench: 0.58, round: 0.04 [+- 0.00]
      test_NTZ_JDK_single       : 15/20 rounds, time.total: 0.75, time.warmup: 0.20, time.bench: 0.55, round: 0.04 [+- 0.00]
      test_NTZ_BitUtil_random   : 15/20 rounds, time.total: 0.98, time.warmup: 0.25, time.bench: 0.72, round: 0.05 [+- 0.00]
      test_NTZ_BitUtil_single   : 15/20 rounds, time.total: 2.61, time.warmup: 0.66, time.bench: 1.95, round: 0.13 [+- 0.00]
      test_NTZ2_BitUtil_random  : 15/20 rounds, time.total: 1.30, time.warmup: 0.33, time.bench: 0.97, round: 0.06 [+- 0.00]
      test_NTZ2_BitUtil_single  : 15/20 rounds, time.total: 2.48, time.warmup: 0.61, time.bench: 1.88, round: 0.13 [+- 0.00]
      test_NTZ3_BitUtil_random  : 15/20 rounds, time.total: 2.81, time.warmup: 0.70, time.bench: 2.11, round: 0.14 [+- 0.00]
      test_NTZ3_BitUtil_single  : 15/20 rounds, time.total: 4.07, time.warmup: 1.02, time.bench: 3.05, round: 0.20 [+- 0.00]
      

      Conclusions

      It seems that any change introduced to these routines will hurt somebody in some configuration, so it's really hard
      for me to make choices. I would definitely opt for the precached pop2 version on 32-bit systems, as it seems to
      always be faster than, or as fast as, the other bit counting options. pop2 looked like this:

         // Lookup table holding the bit count of every 16-bit value.
         private static byte [] bcounts = new byte [0x10000];
         static
         {
             for (int i = 0x10000; --i >= 0;)
                 bcounts[i] = (byte) Integer.bitCount(i);
         }

         // Sum the table entries for the four 16-bit chunks of the input word.
         public static int pop2(long v)
         {
             int t;
             return
                   bcounts[(t = (int) v) & 0xffff]
                 + bcounts[t >>> 16]
                 + bcounts[(t = ((int) (v >>> 32))) & 0xffff]
                 + bcounts[t >>> 16];
         }
      

      As for the hardware-accelerated ntz: if one can detect 1.7, then using the JDK's version speeds things up
      significantly. But I have not checked how this detection would affect speed if done at run-time (I assume a
      static final flag wouldn't cause any performance penalty), and it is definitely not worth replacing for folks
      with older VMs.
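
      A minimal sketch of the static-final-flag idea, assuming a naive version check (the class name and property handling are my own, not something in Lucene):

         public final class NtzDispatch {
             // Decided once at class-load time so the JIT can fold the constant branch away.
             // The string comparison is naive but sufficient for the 1.5-1.7 versions discussed here.
             private static final boolean USE_JDK_NTZ =
                 System.getProperty("java.specification.version", "1.5").compareTo("1.7") >= 0;

             public static int ntz(long v) {
                 return USE_JDK_NTZ
                     ? Long.numberOfTrailingZeros(v)           // intrinsified on 1.7 HotSpot
                     : org.apache.lucene.util.BitUtil.ntz(v);  // Lucene's table/branch version
             }
         }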

      Raw results data

      I will attach the raw results to the issue in case you want to draw your own conclusions. I didn't have access to a SPARC
      machine or to any machine with the newest Intel CPUs.

      Attachments

        1. benchmark.jar (277 kB, Dawid Weiss)
        2. benchmarks.txt (17 kB, Dawid Weiss)
        3. LUCENE-2221.patch (31 kB, Adrien Grand)
        4. lucene-bitset-benchmarks.zip (249 kB, Dawid Weiss)
        5. results-popntz.txt (27 kB, Dawid Weiss)

          People

            Assignee: Adrien Grand (jpountz)
            Reporter: Dawid Weiss (dawidweiss)
            Votes: 0
            Watchers: 4
