  1. Spark
  2. SPARK-1468

The hash method used by partitionBy in Pyspark doesn't deal with None correctly.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0, 1.0.0
    • Fix Version/s: 0.9.2, 1.0.1
    • Component/s: PySpark
    • Labels:
      None

      Description

      In Python, the default hash method for objects is based on the object's memory address. Since None is an object, its hash can differ between Python processes, so None keys get assigned to different partitions depending on which worker process computes the hash. This causes some very odd results when None keys are used with partitionBy.

      I've created a fix using a consistent hashing method that sends None to 0. The PR lives at https://github.com/apache/spark/pull/371
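
      A minimal sketch of the idea behind the fix (not the actual PySpark implementation; the function name `portable_hash` and the tuple-mixing details here are illustrative assumptions): replace the default, process-dependent hash with a deterministic one that maps None to a fixed value, so every worker sends a None key to the same partition.

```python
def portable_hash(x):
    """Hash that is consistent across Python worker processes (sketch).

    The default hash of None is derived from its memory address, which
    can differ between processes. Mapping None to a fixed value (0)
    makes partition assignment deterministic. Tuples are hashed
    element-wise so keys containing None are also stable.
    """
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678  # arbitrary seed for tuple mixing (illustrative)
        for item in x:
            h = (h * 31 + portable_hash(item)) & 0xFFFFFFFF
        return h
    return hash(x)


# Partition assignment then becomes stable for None keys:
num_partitions = 4
partition_for_none = portable_hash(None) % num_partitions  # always 0
```

      With this, `rdd.partitionBy(n, portable_hash)`-style usage would place all None keys in the same partition regardless of which worker evaluates the hash.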


            People

            • Assignee:
              Erik Selin (tyro89)
            • Reporter:
              Erik Selin (tyro89)
            • Votes:
              0
            • Watchers:
              1
