[SPARK-10216] Avoid creating empty files during overwrite into Hive table with group by query - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

The Exchange produced by a GROUP BY query results in at least the number of partitions specified in 'spark.sql.shuffle.partitions'.
Hence, even when the number of distinct group-by keys is small,
an INSERT INTO with a GROUP BY query tries to write at least 200 files (the default value of 'spark.sql.shuffle.partitions'),
which results in many empty files.
I think this is undesirable because later queries on the resulting table will also create zero-size partitions and launch unnecessary tasks that do no work.
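
To illustrate the effect described above, here is a minimal sketch in plain Python (not Spark code). It assumes a simple "hash(key) modulo partition count" assignment, which only approximates Spark's hash partitioning (Spark uses a different hash function), and the helper names partition_sizes and count_empty are hypothetical. It shows how a handful of distinct group-by keys leaves most of the 200 shuffle partitions empty, each of which would become an empty output file.

```python
# Hypothetical simulation; Spark's real partitioner uses a different hash.
NUM_SHUFFLE_PARTITIONS = 200  # default value of spark.sql.shuffle.partitions


def partition_sizes(keys, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Return how many distinct group-by keys land in each partition."""
    sizes = [0] * num_partitions
    # After GROUP BY, each distinct key produces exactly one output row.
    for key in set(keys):
        sizes[hash(key) % num_partitions] += 1
    return sizes


def count_empty(keys, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Count partitions that receive no rows (each would be an empty file)."""
    return sum(1 for n in partition_sizes(keys, num_partitions) if n == 0)


if __name__ == "__main__":
    # 5000 input rows, but only 5 distinct group-by keys.
    keys = [0, 1, 2, 3, 4] * 1000
    print(count_empty(keys))  # 195 of the 200 partitions are empty
```

With only 5 distinct keys, at most 5 partitions can hold data, so at least 195 of the 200 output files would be empty regardless of the hash function used.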