Spark / SPARK-16699

Fix performance bug in hash aggregate on long string keys


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      In the following code in `VectorizedHashMapGenerator.scala`:

          def hashBytes(b: String): String = {
            val hash = ctx.freshName("hash")
            s"""
               |int $result = 0;
               |for (int i = 0; i < $b.length; i++) {
               |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
               |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
               |}
             """.stripMargin
          }
      

      When `b = input.getBytes()`, the 2.0 code above calls `getBytes()` n times, where n is the length of the input, because the call is embedded in the loop condition. `getBytes()` involves a memory copy and is thus expensive, causing a performance degradation.
      The fix is to evaluate `getBytes()` once, before the for loop.
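      A minimal sketch of the described fix (not the actual Spark patch): the generated Java hoists the byte array into a fresh local variable, so the `getBytes()` expression is evaluated once before the loop. `freshName` here is a simplified stand-in for `CodegenContext.freshName`, and the inlined hash step replaces the `genComputeHash` call from the snippet above.

```scala
// Sketch only: hoist the byte-producing expression out of the loop so the
// generated Java evaluates it exactly once.
object HashBytesFixed {
  private var counter = 0
  // Stand-in for CodegenContext.freshName: unique variable names per call.
  private def freshName(prefix: String): String = { counter += 1; s"$prefix$counter" }

  def hashBytes(b: String, result: String): String = {
    val hash = freshName("hash")
    val bytes = freshName("bytes") // holds the one-time result of getBytes()
    s"""
       |int $result = 0;
       |byte[] $bytes = $b;
       |for (int i = 0; i < $bytes.length; i++) {
       |  int $hash = $bytes[i];
       |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
       |}
     """.stripMargin
  }
}
```

      With this shape, `hashBytes("input.getBytes()", "result")` emits Java in which `input.getBytes()` appears only in the single assignment to the hoisted `byte[]` local, never in the loop condition or body.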


          People

            Assignee: Qifan Pu
            Reporter: Qifan Pu
            Votes: 0
            Watchers: 4
