[SPARK-14044] Allow configuration of DynamicPartitionWriterContainer#writeRows to bypass sort step

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.6.1
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

It would be very useful to be able to disable this block of code in DynamicPartitionWriterContainer#writeRows at runtime:

https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala#L379-L418

The use case is that an upstream groupBy has already sorted a great many fine-grained groups, which are the target of the partitionBy. The partitionBy shares the same keys as the groupBy. Currently, we can't even get the Spark job to succeed because of the sort step combined with data skew in the partitions. In general, skipping the redundant sort would make more efficient use of cluster resources.

For this to work, there needs to be a way to communicate between the groupBy and the partitionBy by way of some runtime configuration. This is very similar in function to Hive's hive.optimize.sort.dynamic.partition parameter.
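
To make the request concrete, here is a minimal sketch in plain Python of what such a switch could look like. It is not Spark's implementation and uses no Spark API: the function name write_partitions and the flag already_grouped are hypothetical, invented for illustration only. The sketch assumes the writer keeps one output partition open at a time, so it only needs rows with the same key to arrive contiguously, which an upstream groupBy on the same keys would already guarantee.

```python
from itertools import groupby
from operator import itemgetter


def write_partitions(rows, key, already_grouped=False):
    """Collect rows per partition key, one partition "open" at a time.

    `already_grouped` is a hypothetical runtime flag: when True, the
    caller promises that rows sharing a key are contiguous, so the
    sort is skipped.
    """
    if not already_grouped:
        # The step the issue asks to make optional.
        rows = sorted(rows, key=key)

    partitions = {}
    for part_key, group in groupby(rows, key=key):
        if part_key in partitions:
            # The key reappeared after its partition was closed, so the
            # caller's promise was wrong. Fail instead of corrupting output.
            raise ValueError(f"rows for {part_key!r} are not contiguous")
        partitions[part_key] = list(group)
    return partitions


# Upstream output that is grouped by key but not globally sorted.
grouped_rows = [("b", 1), ("b", 2), ("a", 3)]
result = write_partitions(grouped_rows, key=itemgetter(0), already_grouped=True)
```

The check for a reappearing key is a design choice in this sketch: bypassing the sort is only safe when the promise holds, so a wrong promise is reported rather than silently producing split partitions.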

Issue Links

duplicates: SPARK-19563 avoid unnecessary sort in FileFormatWriter (Resolved)

is related to: HIVE-6455 Scalable dynamic partitioning and bucketing optimization (Closed)

People

Assignee: Unassigned

Reporter: Bob Tiernay

Votes: 0

Watchers: 2

Dates

Created: 21/Mar/16 21:31

Updated: 30/Mar/18 18:58

Resolved: 30/Mar/18 18:58