[SPARK-27560] HashPartitioner uses Object.hashCode which is not seeded - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 2.4.0
Fix Version/s: None
Component/s: Java API
Labels: None
Environment:

Notebook is running spark v2.4.0 local[*]

Python 3.6.6 (default, Sep 6 2018, 13:10:03)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin

I imagine this would reproduce on all operating systems and most versions of Spark, though.


Description

Forgive the quality of the bug report here. I am a PySpark user and not very familiar with Spark internals, but I seem to have hit a strange corner case with the HashPartitioner.

This may already be known, but repartition with HashPartitioner seems to assign every row to the same partition when the input is only a partial read (say, one partition) of data that was previously partitioned by the same column. I suppose this is an obvious consequence of Object.hashCode being deterministic, but it took a while to track down.

Steps to repro:

1. Get a dataframe with a bunch of UUIDs, say 10000
2. repartition(100, 'uuid_column')
3. Save to parquet
4. Read from parquet
5. collect()[:100], then filter on those values using Column.isin (yes, I know this is bad and sampleBy should probably be used here)
6. repartition(10, 'uuid_column')
7. The resulting dataframe will have all of its data in one single partition
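
The collapse in the steps above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the Spark implementation: `zlib.crc32` is used as a stand-in for any deterministic, unseeded hash, and `partition_of` is a hypothetical helper that assigns a key to `hash % num_partitions`. Under that assumption, every key taken from one of 100 partitions has the same value of `hash % 100`, and because 10 divides 100, the same value of `hash % 10` as well.

```python
import uuid
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Hypothetical helper: deterministic, unseeded hash partitioning.

    crc32 is only a stand-in here; it is not the hash Spark uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Steps 1-2: generate 10000 UUIDs and split them into 100 partitions.
keys = [str(uuid.uuid4()) for _ in range(10000)]
by_100 = {}
for k in keys:
    by_100.setdefault(partition_of(k, 100), []).append(k)

# Steps 4-5: keep only the keys that came from a single partition.
source_partition, subset = max(by_100.items(), key=lambda kv: len(kv[1]))

# Step 6: repartition that subset into 10 partitions with the same hash.
counts = Counter(partition_of(k, 10) for k in subset)

# Step 7: every key lands in partition (source_partition % 10).
print(source_partition, dict(counts))
```

The same effect should appear for any new partition count that shares a common factor with the original one; with 100 and 10 it is total because 10 divides 100 exactly.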

Jupyter notebook for the above: https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e

I think an easy fix would be to seed the HashPartitioner, as many hash table libraries do to avoid denial-of-service attacks. It may also be that this is expected behavior that is obvious to more experienced Spark users.
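
What seeding would change can be sketched in plain Python. This is only an illustration of the suggestion, not something Spark does or exposes: `seeded_partition` is a hypothetical helper, and `hashlib.blake2b` with its `salt` parameter is an assumed stand-in for a seeded hash. Keys that share a partition under one seed stay together when repartitioned with the same seed, but spread out when the second repartition uses a different seed.

```python
import hashlib
import uuid
from collections import Counter


def seeded_partition(key: str, num_partitions: int, seed: int) -> int:
    """Hypothetical helper: hash partitioning with an explicit seed."""
    salt = seed.to_bytes(8, "little")  # blake2b accepts a salt of up to 16 bytes
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big") % num_partitions


keys = [str(uuid.uuid4()) for _ in range(10000)]

# Keep the keys that fall into partition 0 of 100 under seed 1.
subset = [k for k in keys if seeded_partition(k, 100, seed=1) == 0]

# Same seed: the subset collapses into a single partition again.
same_seed = Counter(seeded_partition(k, 10, seed=1) for k in subset)

# Different seed: the subset is spread across several partitions.
new_seed = Counter(seeded_partition(k, 10, seed=2) for k in subset)

print(len(same_seed), len(new_seed))
```

A trade-off worth noting: if the seed differed between datasets or runs, two dataframes partitioned on the same column would no longer place equal keys in the same partition, which is a property that deterministic hashing provides.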

Issue Links

links to

GitHub Pull Request #25034

People

Assignee: Unassigned

Reporter: Andrew McHarg

Votes: 0

Watchers: 1

Dates

Created: 24/Apr/19 21:21

Updated: 10/Jul/19 14:56

Resolved: 10/Jul/19 14:56