SPARK-597: HashPartitioner incorrectly partitions RDD[Array[_]]

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.2, 0.7.0
    • Component/s: None
    • Labels: None

    Description

      Java arrays have hashCodes that are based on the arrays' identities rather than their contents [1]. As a result, attempting to partition an RDD[Array[_]] using a HashPartitioner will produce an unexpected/incorrect result.
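A quick demonstration of the underlying behavior described above (class name is illustrative):

```java
import java.util.Arrays;

public class ArrayHashDemo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};

        // Object.hashCode() on an array is identity-based, so two distinct
        // instances with identical contents almost never hash the same.
        System.out.println(a.hashCode() == b.hashCode()); // almost certainly false

        // java.util.Arrays.hashCode() is content-based: equal contents
        // always produce an equal hash, which is what partitioning needs.
        System.out.println(Arrays.hashCode(a) == Arrays.hashCode(b)); // true
    }
}
```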

      This was the cause of a bug in PySpark, where I hash-partition PairRDDs with Array[Byte] keys. In PySpark, I fixed this by using a custom Partitioner that calls Arrays.hashCode(byte[]) when passed an Array[Byte] [2].
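The PySpark-style workaround can be sketched as follows. This is a minimal, self-contained illustration (class name hypothetical, and it does not extend Spark's actual Partitioner API): it partitions byte[] keys by content hash rather than identity hash.

```java
import java.util.Arrays;

// Hypothetical sketch of a content-aware partitioner for byte[] keys,
// in the spirit of the PySpark fix referenced in [2].
public class ByteArrayPartitioner {
    private final int numPartitions;

    public ByteArrayPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int getPartition(byte[] key) {
        // Arrays.hashCode hashes the contents, so equal-content keys
        // always land in the same partition. floorMod keeps the result
        // non-negative even when the hash code is negative.
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }
}
```

With this, two separately allocated arrays holding the same bytes map to the same partition, which is not guaranteed under an identity-based hash.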

      I would like to address this issue more generally in Spark.

      We could add logic to HashPartitioner to perform special handling for arrays, but I'm not sure whether the additional branching would add a significant performance overhead. The logic could become messy because the java.util.Arrays class defines separate Arrays.hashCode() overloads for each primitive array type and Arrays.deepHashCode() for Object arrays. Perhaps Guava or Apache Commons has an implementation of this.
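The special-case logic described above might look something like this (a sketch, not Spark's actual code; class and method names are hypothetical). It shows why the branching gets messy: Arrays.hashCode is overloaded once per primitive element type, and Object arrays (including nested arrays) need Arrays.deepHashCode:

```java
import java.util.Arrays;

public class ArrayAwareHash {
    // Content-based hash for any key; falls through to the key's own
    // hashCode() for non-array types, preserving existing behavior.
    public static int contentHashCode(Object key) {
        if (key == null) return 0;
        if (key instanceof byte[])    return Arrays.hashCode((byte[]) key);
        if (key instanceof short[])   return Arrays.hashCode((short[]) key);
        if (key instanceof int[])     return Arrays.hashCode((int[]) key);
        if (key instanceof long[])    return Arrays.hashCode((long[]) key);
        if (key instanceof char[])    return Arrays.hashCode((char[]) key);
        if (key instanceof float[])   return Arrays.hashCode((float[]) key);
        if (key instanceof double[])  return Arrays.hashCode((double[]) key);
        if (key instanceof boolean[]) return Arrays.hashCode((boolean[]) key);
        // Object[] covers String[], nested arrays, etc.; deepHashCode
        // recurses into nested primitive and Object arrays.
        if (key instanceof Object[])  return Arrays.deepHashCode((Object[]) key);
        return key.hashCode();
    }
}
```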

      An alternative would be to keep the current HashPartitioner and add logic to print warnings (or to fail with an error) when shuffling an RDD[Array[_]] using the default HashPartitioner.

      [1] http://stackoverflow.com/questions/744735/java-array-hashcode-implementation
      [2] https://github.com/JoshRosen/spark/commit/2ccf3b665280bf5b0919e3801d028126cb070dbd


People

    Assignee: Josh Rosen
    Reporter: Josh Rosen
