[SPARK-10216] Avoid creating empty files during overwrite into Hive table with group by query - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

The Exchange produced by a GROUP BY query results in at least the number of partitions specified in 'spark.sql.shuffle.partitions'.
Hence, even when the number of distinct group-by keys is small,
an INSERT INTO with a GROUP BY query tries to write at least 200 files (the default value of 'spark.sql.shuffle.partitions'),
which results in many empty files.
I think this is undesirable because subsequent queries on the resulting table will also create zero-size partitions and launch unnecessary tasks that do no work.
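
The arithmetic behind the empty files can be sketched without a cluster. The snippet below is only an illustration: it uses Python's built-in hash modulo the partition count as a stand-in for Spark's hash partitioner (Spark's real hash function differs), and `partition_sizes` and `group_keys` are hypothetical names, not Spark APIs. It assumes the aggregated output has one row per distinct group-by key.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per shuffle partition under a simplified hash partitioner.

    Assumption: partition = hash(key) % num_partitions. Spark uses a
    different hash function, but the counting argument is the same.
    """
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical aggregated result: one output row per distinct group-by key.
group_keys = [1, 2, 3]

# 200 is the default value of spark.sql.shuffle.partitions.
sizes = partition_sizes(group_keys, 200)

non_empty = sum(1 for s in sizes if s > 0)
empty = len(sizes) - non_empty

# Each partition becomes one output file, so empty partitions mean empty files.
print(f"{non_empty} non-empty partitions, {empty} empty partitions")
```

With k distinct keys, at most k partitions can receive a row, so at least 200 - k of the output files are empty no matter how the keys hash.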

Issue Links

breaks

SPARK-15393 Writing empty Dataframes doesn't save any _metadata files

Resolved

is duplicated by

SPARK-15393 Writing empty Dataframes doesn't save any _metadata files

Resolved

SPARK-21105 Useless empty files in hive table

Resolved

relates to

SPARK-21435 Empty files should be skipped while write to file

Resolved

links to

[Github] Pull Request #8411 (sirpkt)

[Github] Pull Request #12855 (HyukjinKwon)

[Github] Pull Request #13181 (marmbrus)

[Github] Pull Request #18654 (xuanyuanking)

People

Assignee: Hyukjin Kwon

Reporter: Keuntae Park

Votes: 0

Watchers: 6

Dates

Created: 25/Aug/15 04:04

Updated: 12/Dec/22 18:10

Resolved: 19/Jul/17 13:41