Description
There is a performance regression when calculating the hash code for UTF8String:
test("hashing") {
  import org.apache.spark.unsafe.hash.Murmur3_x86_32
  import org.apache.spark.unsafe.types.UTF8String
  val hasher = new Murmur3_x86_32(0)
  val str = UTF8String.fromString("b" * 10001)
  val numIter = 100000
  val start = System.nanoTime
  for (i <- 0 until numIter) {
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
  }
  val duration = (System.nanoTime() - start) / 1000 / numIter
  println(s"duration $duration us")
}
To run this test on 2.3, the following method needs to be added to `Murmur3_x86_32`:
public static int hashUTF8String(UTF8String str, int seed) {
  return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed);
}
On my laptop, the result for master vs. 2.3 is 120 us vs. 40 us per iteration, so master is about 3x slower.
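For anyone re-checking these numbers, here is a minimal sketch of the same timing arithmetic with an untimed warm-up phase added, so that JIT compilation is less likely to skew the comparison. It is not part of Spark: the class `HashTimingSketch`, the method `averageMicros`, and the use of `Arrays.hashCode` as a stand-in for `Murmur3_x86_32.hashUTF8String` are all assumptions made for illustration, chosen so the snippet runs without Spark on the classpath.

```java
import java.util.Arrays;

public class HashTimingSketch {
    // Accumulates results so the JIT is less likely to discard the hashing as dead code.
    static int sink;

    /**
     * Runs `work` `warmup` times untimed, then `iters` times timed, and returns
     * the mean time per timed run in microseconds (integer division, as in the test).
     */
    public static long averageMicros(Runnable work, int warmup, int iters) {
        for (int i = 0; i < warmup; i++) {
            work.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            work.run();
        }
        return (System.nanoTime() - start) / 1000 / iters;
    }

    public static void main(String[] args) {
        // Same input shape as the test: 10001 bytes of 'b'.
        byte[] bytes = new byte[10001];
        Arrays.fill(bytes, (byte) 'b');

        // Stand-in workload (assumption): replace with the real hash call when
        // running against a Spark build.
        Runnable work = () -> { sink += Arrays.hashCode(bytes); };

        long duration = averageMicros(work, 10000, 100000);
        System.out.println("duration " + duration + " us");
    }
}
```

Swapping the stand-in workload for the real `hashUTF8String` call and running it on both branches should give figures comparable to the ones above. A dedicated harness such as JMH would be more rigorous than this hand-rolled loop.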