SPARK-597: HashPartitioner incorrectly partitions RDD[Array[_]]

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.2, 0.7.0
    • Component/s: None
    • Labels: None

    Description

      Java arrays have hashCodes that are based on the arrays' identities rather than their contents [1]. As a result, attempting to partition an RDD[Array[_]] using a HashPartitioner will produce an unexpected/incorrect result.
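A quick demonstration of the underlying behavior described above (class name is illustrative):

```java
import java.util.Arrays;

public class ArrayHashDemo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};

        // Object.hashCode() on an array is identity-based, so two distinct
        // instances with identical contents almost never hash the same.
        System.out.println(a.hashCode() == b.hashCode()); // almost certainly false

        // java.util.Arrays.hashCode() is content-based: equal contents
        // always produce an equal hash, which is what partitioning needs.
        System.out.println(Arrays.hashCode(a) == Arrays.hashCode(b)); // true
    }
}
```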

      This was the cause of a bug in PySpark, where I hash-partition PairRDDs with Array[Byte] keys. In PySpark, I fixed this by using a custom Partitioner that calls Arrays.hashCode(byte[]) when passed an Array[Byte] [2].
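The PySpark-style workaround can be sketched as follows. This is a minimal, self-contained illustration (class name hypothetical, and it does not extend Spark's actual Partitioner API): it partitions byte[] keys by content hash rather than identity hash.

```java
import java.util.Arrays;

// Hypothetical sketch of a content-aware partitioner for byte[] keys,
// in the spirit of the PySpark fix referenced in [2].
public class ByteArrayPartitioner {
    private final int numPartitions;

    public ByteArrayPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int getPartition(byte[] key) {
        // Arrays.hashCode hashes the contents, so equal-content keys
        // always land in the same partition. floorMod keeps the result
        // non-negative even when the hash code is negative.
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }
}
```

With this, two separately allocated arrays holding the same bytes map to the same partition, which is not guaranteed under an identity-based hash.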

      I would like to address this issue more generally in Spark.

      We could add logic to HashPartitioner to perform special handling for arrays, but I'm not sure whether the additional branching would add a significant performance overhead. The logic could become messy because the java.util.Arrays class defines separate Arrays.hashCode() overloads for each primitive array type and Arrays.deepHashCode() for Object arrays. Perhaps Guava or Apache Commons has an implementation of this.
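The special-case logic described above might look something like this (a sketch, not Spark's actual code; class and method names are hypothetical). It shows why the branching gets messy: Arrays.hashCode is overloaded once per primitive element type, and Object arrays (including nested arrays) need Arrays.deepHashCode:

```java
import java.util.Arrays;

public class ArrayAwareHash {
    // Content-based hash for any key; falls through to the key's own
    // hashCode() for non-array types, preserving existing behavior.
    public static int contentHashCode(Object key) {
        if (key == null) return 0;
        if (key instanceof byte[])    return Arrays.hashCode((byte[]) key);
        if (key instanceof short[])   return Arrays.hashCode((short[]) key);
        if (key instanceof int[])     return Arrays.hashCode((int[]) key);
        if (key instanceof long[])    return Arrays.hashCode((long[]) key);
        if (key instanceof char[])    return Arrays.hashCode((char[]) key);
        if (key instanceof float[])   return Arrays.hashCode((float[]) key);
        if (key instanceof double[])  return Arrays.hashCode((double[]) key);
        if (key instanceof boolean[]) return Arrays.hashCode((boolean[]) key);
        // Object[] covers String[], nested arrays, etc.; deepHashCode
        // recurses into nested primitive and Object arrays.
        if (key instanceof Object[])  return Arrays.deepHashCode((Object[]) key);
        return key.hashCode();
    }
}
```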

      An alternative would be to keep the current HashPartitioner and add logic to print warnings (or to fail with an error) when shuffling an RDD[Array[_]] using the default HashPartitioner.

      [1] http://stackoverflow.com/questions/744735/java-array-hashcode-implementation
      [2] https://github.com/JoshRosen/spark/commit/2ccf3b665280bf5b0919e3801d028126cb070dbd


People

    Assignee: Josh Rosen
    Reporter: Josh Rosen
