  1. Spark
  2. SPARK-1468

The hash method used by partitionBy in Pyspark doesn't deal with None correctly.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0, 1.0.0
    • Fix Version/s: 0.9.2, 1.0.1
    • Component/s: PySpark
    • Labels:
      None

      Description

      In Python, the default hash method for objects is based on the object's memory address. Since None is an object, its hash can differ between Python processes, so None keys get assigned to different partitions depending on which worker process computes the hash. This causes some very odd results when None keys are used with partitionBy.

      I've created a fix using a consistent hashing method that sends None to 0. The PR lives at https://github.com/apache/spark/pull/371
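
      A minimal sketch of the idea behind the fix (not the actual PySpark implementation; the function name `portable_hash` and the tuple-mixing details here are illustrative assumptions): replace the default, process-dependent hash with a deterministic one that maps None to a fixed value, so every worker sends a None key to the same partition.

```python
def portable_hash(x):
    """Hash that is consistent across Python worker processes (sketch).

    The default hash of None is derived from its memory address, which
    can differ between processes. Mapping None to a fixed value (0)
    makes partition assignment deterministic. Tuples are hashed
    element-wise so keys containing None are also stable.
    """
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678  # arbitrary seed for tuple mixing (illustrative)
        for item in x:
            h = (h * 31 + portable_hash(item)) & 0xFFFFFFFF
        return h
    return hash(x)


# Partition assignment then becomes stable for None keys:
num_partitions = 4
partition_for_none = portable_hash(None) % num_partitions  # always 0
```

      With this, `rdd.partitionBy(n, portable_hash)`-style usage would place all None keys in the same partition regardless of which worker evaluates the hash.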


            People

            • Assignee:
              Erik Selin (tyro89)
            • Reporter:
              Erik Selin (tyro89)
            • Votes:
              0
            • Watchers:
              1
