[SPARK-24940] Coalesce and Repartition Hint for SQL Queries - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.1
Fix Version/s: 2.4.0
Component/s: SQL
Labels: None
Target Version/s: 2.4.0

Description

Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. These users prefer not to use the repartition(n) or coalesce(n) functions, which require them to write and deploy Scala/Java/Python code.

The DataFrame API has had repartition/coalesce for a long time, but SQL queries have no equivalent functionality. We propose adding the following Hive-style Coalesce and Repartition hints to Spark SQL:

INSERT ... SELECT /*+ COALESCE(numPartitions) */ ...
INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...

Hint names are case-insensitive.
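To make the syntax and the case-insensitivity rule concrete, here is a minimal sketch in Python. It is not Spark's parser: `HINT_PATTERN` and `parse_partition_hint` are hypothetical names introduced only for illustration, and the pattern assumes a single integer argument, as in the two forms shown above.

```python
import re

# Hypothetical helper, NOT Spark's actual hint parser. It recognizes only the
# two hints proposed in this issue, each with one integer argument.
HINT_PATTERN = re.compile(
    r"/\*\+\s*(COALESCE|REPARTITION)\s*\(\s*(\d+)\s*\)\s*\*/",
    re.IGNORECASE,  # hint names are case-insensitive
)


def parse_partition_hint(sql):
    """Return (HINT_NAME, num_partitions) for the first hint found, else None."""
    match = HINT_PATTERN.search(sql)
    if match is None:
        return None
    # Normalize the name so COALESCE, coalesce, and Coalesce compare equal.
    return match.group(1).upper(), int(match.group(2))
```

With this sketch, `/*+ coalesce(5) */` and `/*+ COALESCE(5) */` both resolve to the same hint name, and a query without a hint comment yields `None`.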

The Coalesce hint reduces the number of partitions. It only merges existing partitions, which minimizes data movement.

The Repartition hint can either increase or decrease the number of partitions. It performs a full shuffle of the data and distributes it evenly across the new partitions.

Repartition adds a new stage, so it does not affect the parallelism of the existing stage. In contrast, Coalesce does affect the parallelism of the existing stage since it does not add a new stage.
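The difference between the two hints can be illustrated with a toy model in plain Python, where a dataset is a list of partitions and each partition is a list of rows. This is only a sketch under simplifying assumptions: the functions `coalesce` and `repartition` below are hypothetical stand-ins, the merge rule (grouping adjacent partitions) and the round-robin redistribution are simplifications, and nothing here reflects how Spark actually schedules stages or chooses which partitions to merge.

```python
def coalesce(partitions, n):
    """Toy coalesce: merge whole partitions into at most n groups.

    Rows never leave the group their partition lands in, so there is no
    shuffle, but skew in the input carries over to the output.
    """
    if n < 1:
        raise ValueError("n must be positive")
    n = min(n, len(partitions))  # coalesce cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Adjacent input partitions map to the same output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Toy repartition: send every row round-robin to one of n partitions.

    Every row may move (a full shuffle), which evens out partition sizes
    and allows the partition count to grow as well as shrink.
    """
    if n < 1:
        raise ValueError("n must be positive")
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out
```

For a skewed input such as `[[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]`, the toy `coalesce(..., 2)` produces partitions of 7 and 3 rows, while `repartition(..., 2)` produces two partitions of 5 rows each. Asking the toy `coalesce` for 8 partitions still returns 4, whereas `repartition` can grow the count.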

Multi-insert queries and named subqueries are also supported.

Issue Links

links to

[Github] Pull Request #21911 (jzhuge)

[Github] Pull Request #21998 (jzhuge)

People

Assignee: John Zhuge
Reporter: John Zhuge
Votes: 0
Watchers: 12

Dates

Created: 26/Jul/18 23:54
Updated: 12/Dec/22 18:10
Resolved: 04/Aug/18 06:28