[SPARK-9872] Allow passing of 'numPartitions' to DataFrame joins - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 1.4.1
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

When I join two normal RDDs, I can set the number of shuffle partitions in the 'numPartitions' argument. When I join two DataFrames I do not have this option.

My spark job loads in 2 large files and 2 small files. When I perform a join, this will use the "spark.sql.shuffle.partitions" to determine the number of partitions. This means that the join with my small files will use exactly the same number of partitions as the join with my large files.

I can either use a low number of partitions and run out of memory on my large join, or use a high number of partitions and my small join will take far too long.

If we were able to specify the number of shuffle partitions in a DataFrame join like in an RDD join, this would not be an issue.

My long term ideal solution would be dynamic partition determination as described in ~~SPARK-4630~~. However I appreciate that it is not particularly easy to do.

Attachments

Issue Links

relates to

SPARK-4630 Dynamically determine optimal number of partitions

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Al M

Votes:: 5 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 12/Aug/15 07:57

Updated:: 12/Mar/17 07:23

Resolved:: 12/Mar/17 07:23