Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Description
It turns out that setSafe performs a very expensive integer division when computing buffer capacity: it divides by the type width, which is not a compile-time constant. Although the width is typically a power of two for alignment reasons, the JIT does not strength-reduce the division to a bit shift.
This is done here: https://github.com/apache/arrow/blob/175c53d0b17708312bfd1494c65824f690a6cc9a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L189
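For context, the computation in question looks roughly like the sketch below (a simplified standalone version, not the actual Arrow source; the real method reads `valueBuffer.capacity()` and `typeWidth` as instance fields and caps the result via `LargeMemoryUtil.capAtMaxInt`):

```java
// Simplified sketch of the hot-path capacity computation (hypothetical class name).
public class CapacitySketch {
    static int getValueBufferValueCapacity(long bufferCapacity, int typeWidth) {
        // This division compiles to an idiv instruction because typeWidth is a
        // runtime value, not a constant the JIT can strength-reduce to a shift.
        return (int) Math.min(Integer.MAX_VALUE, bufferCapacity / typeWidth);
    }

    public static void main(String[] args) {
        // e.g. a 4096-byte buffer of 4-byte ints holds 1024 values
        System.out.println(getValueBufferValueCapacity(4096, 4)); // 1024
    }
}
```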
Forcing a bit-shift operation yields a large speedup in benchmarks. With bounds checks disabled (bounds checking affects another portion of set), microbenchmarks show the throughput of setting vector elements via setSafe improving by ~174% (almost 3x faster). With bounds checks enabled, the gain drops to a 73% increase, as Amdahl's law would predict.
We currently use setSafe in a hot loop to populate Arrow vectors in an internal data-intensive service. In the future we would prefer a more specialized vector-append interface that skips the remaining indirection and bit-manipulation instructions, without manipulating the exposed (native) memory directly.
Here is the detailed analysis:
Tests were run on a machine with an Intel 8700K. The code was compiled with JDK 8 and run with the latest repo-provided JDK 14 on Ubuntu 20.04.
Benchmark results with arrow.enable_unsafe_memory_access=false, patch NOT applied:

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15  13.853 ± 0.058  us/op
IntBenchmarks.setWithValueHolder  avgt   15  15.045 ± 0.040  us/op
IntBenchmarks.setWithWriter       avgt   15  21.621 ± 0.197  us/op

Benchmark results with arrow.enable_unsafe_memory_access=false, patch applied:

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=false
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
*snip*
Benchmark                         Mode  Cnt  Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15  7.964 ± 0.030  us/op
IntBenchmarks.setWithValueHolder  avgt   15  9.145 ± 0.031  us/op
IntBenchmarks.setWithWriter       avgt   15  8.029 ± 0.051  us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch NOT applied:

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15   9.563 ± 0.335  us/op
IntBenchmarks.setWithValueHolder  avgt   15   9.266 ± 0.064  us/op
IntBenchmarks.setWithWriter       avgt   15  18.806 ± 0.154  us/op

Benchmark results with arrow.enable_unsafe_memory_access=true, patch applied:

# JMH version: 1.21
# VM version: JDK 14.0.1, OpenJDK 64-Bit Server VM, 14.0.1+7-Ubuntu-1ubuntu1
# VM invoker: /usr/lib/jvm/java-14-openjdk-amd64/bin/java
# VM options: -Darrow.enable_unsafe_memory_access=true
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.arrow.vector.IntBenchmarks.setIntDirectly
Benchmark                         Mode  Cnt  Score   Error  Units
IntBenchmarks.setIntDirectly      avgt   15  3.490 ± 0.175  us/op
IntBenchmarks.setWithValueHolder  avgt   15  3.806 ± 0.015  us/op
IntBenchmarks.setWithWriter       avgt   15  5.490 ± 0.304  us/op
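The percentage gains quoted in the description follow directly from the setIntDirectly rows of these tables; as a quick sanity check of the arithmetic (hypothetical standalone class, values copied from the results above):

```java
// Derive the quoted speedups from the measured average times (us/op).
public class SpeedupCheck {
    public static void main(String[] args) {
        // setIntDirectly, unsafe_memory_access=false (bounds checks on)
        double checkedBefore = 13.853, checkedAfter = 7.964;
        // setIntDirectly, unsafe_memory_access=true (bounds checks off)
        double unsafeBefore = 9.563, unsafeAfter = 3.490;

        // throughput increase = (old time / new time) - 1
        double checkedGain = checkedBefore / checkedAfter - 1; // ~0.74, i.e. ~73-74%
        double unsafeGain = unsafeBefore / unsafeAfter - 1;    // ~1.74, i.e. ~174%

        System.out.printf("bounds checks on: +%.0f%%, off: +%.0f%%%n",
                checkedGain * 100, unsafeGain * 100);
    }
}
```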
I determined this by running the built-in Arrow JMH benchmarks on the 8700K, with CPU frequency scaling left in its default state. The numbers for setting a value seemed off, so I reran the benchmarks with JMH's `prof=perfasm` option, which emits annotated assembly for detected hotspots. Here is the relevant section:
0.06% │ ││ 0x00007f5a7f7beb6f: mov 0x30(%r12,%rsi,8),%rax ; implicit exception: dispatches to 0x00007f5a7f7bef28
│ ││ ;*getfield length {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.memory.ArrowBuf::capacity@1 (line 138)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@4 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
0.14% │ ││ 0x00007f5a7f7beb74: movsxd 0x10(%r12,%rdi,8),%rcx ;*i2l {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@11 (line 189)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
│ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
│ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
│ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
│ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
*snip*
1.43% ││ │ ││ 0x00007f5a7f7beb9b: idivq %rcx,%rax ;*ldiv {reexecute=0 rethrow=0 return_oop=0}
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@12 (line 189)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ │ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ │ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ │ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ │ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
68.16% ││ ↘ ││ 0x00007f5a7f7beb9e: cmp $0x7fffffff,%rax
││ ││ 0x00007f5a7f7beba5: jnle 0x7f5a7f7bec8c ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - java.lang.Math::min@3 (line 1552)
││ ││ ; - org.apache.arrow.memory.util.LargeMemoryUtil::capAtMaxInt@4 (line 44)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueBufferValueCapacity@13 (line 189)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::getValueCapacity@1 (line 185)
││ ││ ; - org.apache.arrow.vector.BaseFixedWidthVector::handleSafe@2 (line 817)
││ ││ ; - org.apache.arrow.vector.IntVector::setSafe@2 (line 223)
││ ││ ; - org.apache.arrow.vector.IntBenchmarks::setWithValueHolder@51 (line 77)
││ ││ ; - org.apache.arrow.vector.generated.IntBenchmarks_setWithValueHolder_jmhTest::setWithValueHolder_avgt_jmhStub@15 (line 234)
The hot instruction is misattributed, probably due to event skid (perf samples land a few instructions after the true culprit), but integer division is known to be expensive in hardware. We can verify this with Agner Fog's instruction tables: https://www.agner.org/optimize/instruction_tables.pdf . Searching for idiv shows high latency and reciprocal throughput figures across the listed microarchitectures, as expected.
With all of this in mind, we can apply the following patch, which produces the speedups reported above:
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
index ee47f6dd8..0c9a57bf9 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
@@ -47,6 +47,7 @@ import io.netty.util.internal.PlatformDependent;
 public abstract class BaseFixedWidthVector extends BaseValueVector
     implements FixedWidthVector, FieldVector, VectorDefinitionSetter {
   private final int typeWidth;
+  private final int typeWidthAsExactBitShiftOrNegOne;
 
   protected int lastValueCapacity;
 
@@ -66,6 +67,7 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
   public BaseFixedWidthVector(Field field, final BufferAllocator allocator, final int typeWidth) {
     super(allocator);
     this.typeWidth = typeWidth;
+    this.typeWidthAsExactBitShiftOrNegOne = log2ExactOrNeg1(typeWidth);
     this.field = field;
     valueCount = 0;
     allocationMonitor = 0;
@@ -186,7 +188,13 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
   }
 
   private int getValueBufferValueCapacity() {
-    return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+    if (typeWidthAsExactBitShiftOrNegOne == -1) {
+      // Slow path - integral division is very very expensive, and code here is part of
+      // setSafe's hot loop
+      // The JVM did not optimize integral division into bit shifts
+      return capAtMaxInt(valueBuffer.capacity() / typeWidth);
+    }
+    return capAtMaxInt(valueBuffer.capacity() >> typeWidthAsExactBitShiftOrNegOne);
   }
 
   private int getValidityBufferValueCapacity() {
@@ -903,4 +911,12 @@ public abstract class BaseFixedWidthVector extends BaseValueVector
     return visitor.visit(this, value);
   }
 
+  private int log2ExactOrNeg1(int x) {
+    final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
+    if (!isPowerOfTwo) {
+      return -1;
+    }
+    return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
+  }
+
 }
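The helper's behavior can be checked in isolation: for any power-of-two width the shift reproduces the division exactly, and non-powers fall back to -1 so the division path is taken. Below is a standalone sketch mirroring the patch's log2ExactOrNeg1 logic (hypothetical class name):

```java
public class Log2Check {
    // Same logic as the log2ExactOrNeg1 helper in the patch.
    static int log2ExactOrNeg1(int x) {
        // x is a power of two iff it is positive and has exactly one bit set.
        final boolean isPowerOfTwo = x > 0 & (x & (x - 1)) == 0;
        if (!isPowerOfTwo) {
            return -1;
        }
        // 31 - numberOfLeadingZeros(x) = floor(log2(x)); exact for powers of two.
        return (Integer.SIZE - 1) - Integer.numberOfLeadingZeros(x);
    }

    public static void main(String[] args) {
        long capacity = 4096;
        for (int width : new int[] {1, 2, 4, 8, 16}) {
            int shift = log2ExactOrNeg1(width);
            // The shift and the division agree for power-of-two widths.
            assert capacity >> shift == capacity / width;
        }
        System.out.println(log2ExactOrNeg1(4));  // 2
        System.out.println(log2ExactOrNeg1(12)); // -1: not a power of two
    }
}
```

Note that the shift is only valid because the capacity is non-negative; for negative dividends, `>>` and `/` round differently.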
Attached is the generated assembly as printed by JMH, before and after. I renamed some variables for clarity after generating the results, but the logic is unchanged.
I also did a quick test with JDK 8 - this was where I originally ran the benchmarks. The idiv instruction was present there too.
An initial version of this patch cached the computed value instead; that approach produced about the same speedup.
Are people fine with this approach?
Attachments