Spark / SPARK-16699

Fix performance bug in hash aggregate on long string keys


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      In the following code in `VectorizedHashMapGenerator.scala`:

          def hashBytes(b: String): String = {
            val hash = ctx.freshName("hash")
            s"""
               |int $result = 0;
               |for (int i = 0; i < $b.length; i++) {
               |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
               |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
               |}
             """.stripMargin
          }
      

      When `b = input.getBytes()`, the 2.0 code above calls `getBytes()` n times, where n is the length of the input, because the call is embedded in the loop condition. `getBytes()` involves a memory copy and is thus expensive, causing a performance degradation.
      The fix is to evaluate `getBytes()` once, before the for loop.
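      A minimal sketch of the described fix (not the actual Spark patch): the generated Java hoists the byte array into a fresh local variable, so the `getBytes()` expression is evaluated once before the loop. `freshName` here is a simplified stand-in for `CodegenContext.freshName`, and the inlined hash step replaces the `genComputeHash` call from the snippet above.

```scala
// Sketch only: hoist the byte-producing expression out of the loop so the
// generated Java evaluates it exactly once.
object HashBytesFixed {
  private var counter = 0
  // Stand-in for CodegenContext.freshName: unique variable names per call.
  private def freshName(prefix: String): String = { counter += 1; s"$prefix$counter" }

  def hashBytes(b: String, result: String): String = {
    val hash = freshName("hash")
    val bytes = freshName("bytes") // holds the one-time result of getBytes()
    s"""
       |int $result = 0;
       |byte[] $bytes = $b;
       |for (int i = 0; i < $bytes.length; i++) {
       |  int $hash = $bytes[i];
       |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
       |}
     """.stripMargin
  }
}
```

      With this shape, `hashBytes("input.getBytes()", "result")` emits Java in which `input.getBytes()` appears only in the single assignment to the hoisted `byte[]` local, never in the loop condition or body.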


          People

            Assignee: Qifan Pu
            Reporter: Qifan Pu
            Votes: 0
            Watchers: 4
