  Apache Arrow / ARROW-9965

[Java] Buffer capacity calculations are slow for fixed-width vectors


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: Java

    Description

      It turns out that setSafe performs a very expensive integer division when computing the buffer capacity: it divides the buffer's capacity by the type width (the per-element field size), which isn't hardcoded. Although the width is typically a power of two for alignment reasons, the JIT does not compile the division down to a bit shift.

      This is done here: https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
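
      For context, the linked line boils down to the following division (paraphrased; capAtMaxInt is LargeMemoryUtil's clamp to Integer.MAX_VALUE):

      private int getValueBufferValueCapacity() {
        // typeWidth is an ordinary instance field, so the JIT emits a real 64-bit divide (idivq) here
        return capAtMaxInt(valueBuffer.capacity() / typeWidth);
      }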

      Forcing a bit-shift operation instead results in a large speedup in benchmarks. With bounds checks turned off (which affects a different portion of set), microbenchmarks show that setting the elements of a vector via setSafe becomes ~174% faster (almost 3x as fast). With bounds checks on, the improvement drops to ~73% (Amdahl's law).

      We currently use setSafe in a hot loop to populate Arrow vectors in an internal data-intensive service, although in the future we would prefer a more specialized vector-append interface that skips the remaining indirection and bit-manipulation instructions without manipulating the exposed (native) memory directly.
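
      For illustration, here is a minimal sketch of the kind of hot loop in question (the class name, vector name, and element count are made up; only the setSafe call matters):

      import org.apache.arrow.memory.RootAllocator;
      import org.apache.arrow.vector.IntVector;

      public class SetSafeHotLoop {
        public static void main(String[] args) {
          final int n = 1_000_000;
          try (RootAllocator allocator = new RootAllocator();
               IntVector vector = new IntVector("values", allocator)) {
            vector.allocateNew(n);
            for (int i = 0; i < n; i++) {
              // Each call goes through handleSafe -> getValueCapacity -> getValueBufferValueCapacity,
              // which divides the buffer capacity by the (non-constant) type width.
              vector.setSafe(i, i);
            }
            vector.setValueCount(n);
          }
        }
      }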

      Here is the detailed analysis:
      Tests were run on a machine with an Intel Core i7-8700K, compiled with JDK 8 and run with the latest repository-provided JDK 14 on Ubuntu 20.04.

      Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied
      # JMH version: 1.21
      # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
      # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
      # VM options: -Darrow.enable_unsafe_memory_access=false
      # Warmup: 5 iterations, 10 s each
      # Measurement: 5 iterations, 10 s each
      # Timeout: 10 min per iteration
      # Threads: 1 thread, will synchronize iterations
      # Benchmark mode: Average time, time/op
      # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
      *snip*
      Benchmark                         Mode  Cnt   Score   Error  Units
      IntBenchmarks.setIntDirectly      avgt   15  13.853 ± 0.058  us/op
      IntBenchmarks.setWithValueHolder  avgt   15  15.045 ± 0.040  us/op
      IntBenchmarks.setWithWriter       avgt   15  21.621 ± 0.197  us/op
      
      Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied
      
      # JMH version: 1.21
      # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
      # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
      # VM options: -Darrow.enable_unsafe_memory_access=false
      # Warmup: 5 iterations, 10 s each
      # Measurement: 5 iterations, 10 s each
      # Timeout: 10 min per iteration
      # Threads: 1 thread, will synchronize iterations
      # Benchmark mode: Average time, time/op
      # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
      *snip*
      Benchmark                         Mode  Cnt  Score   Error  Units
      IntBenchmarks.setIntDirectly      avgt   15  7.964 ± 0.030  us/op
      IntBenchmarks.setWithValueHolder  avgt   15  9.145 ± 0.031  us/op
      IntBenchmarks.setWithWriter       avgt   15  8.029 ± 0.051  us/op
      
      Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied
      
      # JMH version: 1.21
      # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
      # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
      # VM options: -Darrow.enable_unsafe_memory_access=true
      # Warmup: 5 iterations, 10 s each
      # Measurement: 5 iterations, 10 s each
      # Timeout: 10 min per iteration
      # Threads: 1 thread, will synchronize iterations
      # Benchmark mode: Average time, time/op
      # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
      
      Benchmark                         Mode  Cnt   Score   Error  Units
      IntBenchmarks.setIntDirectly      avgt   15   9.563 ± 0.335  us/op
      IntBenchmarks.setWithValueHolder  avgt   15   9.266 ± 0.064  us/op
      IntBenchmarks.setWithWriter       avgt   15  18.806 ± 0.154  us/op
      
      Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied
      # JMH version: 1.21
      # VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
      # VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
      # VM options: -Darrow.enable_unsafe_memory_access=true
      # Warmup: 5 iterations, 10 s each
      # Measurement: 5 iterations, 10 s each
      # Timeout: 10 min per iteration
      # Threads: 1 thread, will synchronize iterations
      # Benchmark mode: Average time, time/op
      # Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
      Benchmark                         Mode  Cnt  Score   Error  Units
      IntBenchmarks.setIntDirectly      avgt   15  3.490 ± 0.175  us/op
      IntBenchmarks.setWithValueHolder  avgt   15  3.806 ± 0.015  us/op
      IntBenchmarks.setWithWriter       avgt   15  5.490 ± 0.304  us/op
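
      (For reference, the speedup percentages quoted earlier follow from the setIntDirectly rows above:
        unsafe memory access enabled:  9.563 us/op / 3.490 us/op ≈ 2.74x, i.e. ~174% faster;
        unsafe memory access disabled: 13.853 us/op / 7.964 us/op ≈ 1.74x, i.e. roughly the 73% figure.)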
      

      I determined this by running the built-in Arrow JMH benchmarks on the 8700K, with CPU frequency scaling left in its default state. The numbers for setting a value seemed off, so I reran the benchmarks with JMH's `-prof perfasm` option, which emits annotated assembly for the detected hotspots. Here is the relevant section:

      0.06% │ ││ 0x00007f5a7f7beb6f: mov 0x30(%r12,%rsi,8),%rax ; implicit exception: dispatches to 0x00007f5a7f7bef28
      │ ││ ;*getfield length {reexecute=0 rethrow=0 return_oop=0}
      │ ││ ; - org.apache.arrow.memory.ArrowBuf::capacity@1 (line 138)
      │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@4 (line 189)
      │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
      │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
      │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
      │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
      │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
      0.14% │ ││ 0x00007f5a7f7beb74: movsxd 0x10(%r12,%rdi,8),%rcx ;*i2l {reexecute=0 rethrow=0 return_oop=0}
      │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@11 (line 189)
      │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
      │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
      │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
      │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
      │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
      *snip*
      1.43% ││ │ ││ 0x00007f5a7f7beb9b: idivq %rcx,%rax ;*ldiv {reexecute=0 rethrow=0 return_oop=0}
      ││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@12 (line 189)
      ││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
      ││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
      ││ │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
      ││ │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
      ││ │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
      68.16% ││ ↘ ││ 0x00007f5a7f7beb9e: cmp $0x7fffffff,%rax
      ││ ││ 0x00007f5a7f7beba5: jnle 0x7f5a7f7bec8c ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
      ││ ││ ; - java.lang.Math::min@3 (line 1552)
      ││ ││ ; - org.apache.arrow.memory.util.LargeMemoryUtil::capAtMaxInt@4 (line 44)
      ││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@13 (line 189)
      ││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
      ││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
      ││ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
      ││ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
      ││ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
      

      The hot instruction is misattributed, probably due to instruction skid on the sampling event, but integer division is known to be expensive in hardware. We can verify this with Agner Fog's instruction tables: https://www.agner.org/optimize/instruction_tables.pdf . Searching for idiv shows high latency and reciprocal throughput figures, as expected.

      With that established, we can apply the following patch, which produces the speedups reported above:

      diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
      index ee47f6dd8..0c9a57bf9 100644
      --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
      +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
      @@ -47,6 +47,7 @@ import io.netty.util.internal.PlatformDependent;
       public abstract class BaseFixedWidthVector extends BaseValueVector
           implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
         private final int typeWidth;
      +  private final int typeWidthAsExactBitShiftOrNegOne;
      
         protected int lastValueCapacity;
      
      @@ -66,6 +67,7 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
         public BaseFixedWidthVector(Field field, final BufferAllocator allocator, final int typeWidth) {
           super(allocator);
           this.typeWidth = typeWidth;
      +    this.typeWidthAsExactBitShiftOrNegOne = log2ExactOrNeg1(typeWidth);
           this.field = field;
           valueCount = 0;
           allocationMonitor = 0;
      @@ -186,7 +188,13 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
         }
      
         private int getValueBufferValueCapacity() {
      -    return capAtMaxInt(valueBuffer.capacity() / typeWidth);
      +    if (typeWidthAsExactBitShiftOrNegOne == -1) {
      +      // Slow path - integral division is very expensive, and this code is part of
      +      // setSafe's hot loop.
      +      // The JVM did not optimize the integral division into a bit shift.
      +      return capAtMaxInt(valueBuffer.capacity() / typeWidth);
      +    }
      +    return capAtMaxInt(valueBuffer.capacity() >> typeWidthAsExactBitShiftOrNegOne);
         }
      
         private int getValidityBufferValueCapacity() {
      @@ -903,4 +911,12 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
           return visitor.visit(this, value);
         }
      
      +  private int log2ExactOrNeg1(int x) {
      +    final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
      +    if (!isPowerOfTwo) {
      +      return -1;
      +    }
      +    return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
      +  }
      +
       }
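
      As a quick sanity check (not part of the patch; the class name and values below are made up), the shift-based fast path agrees with the division it replaces for power-of-two widths, and the helper falls back to -1 for everything else:

      public class ShiftEquivalenceCheck {
        // Same helper as in the patch: log2(x) if x is a power of two, otherwise -1.
        private static int log2ExactOrNeg1(int x) {
          final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
          if (!isPowerOfTwo) {
            return -1;
          }
          return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
        }

        public static void main(String[] args) {
          final long capacity = 1L << 20;  // stand-in for valueBuffer.capacity(), in bytes
          for (int width : new int[] {1, 2, 4, 8, 16, 3, 12}) {
            int shift = log2ExactOrNeg1(width);
            long viaDivision = capacity / width;
            long viaShift = (shift == -1) ? viaDivision : (capacity >> shift);
            // For power-of-two widths the two paths must match; otherwise the slow path is taken.
            System.out.printf("width=%d shift=%d division=%d shiftPath=%d%n",
                width, shift, viaDivision, viaShift);
          }
        }
      }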
      

      Attached is the generated assembly as printed by JMH, before and after. I renamed some variables for clarity after generating the results, but the logic is unchanged.

      I also did a quick test with JDK 8, which is where I originally ran the benchmarks; the idiv instruction was present there too.

      An initial version of this patch cached the value rather than recomputing it; that produced about the same speedup.

      Are people fine with this approach?

            People

              Assignee: Unassigned
              Reporter: Josiah (josiahyan)
              Votes: 0
              Watchers: 4

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 10.5h