Description
The Exchange introduced by a GROUP BY query produces at least the number of partitions specified in 'spark.sql.shuffle.partitions'.
Hence, even when the number of distinct group-by keys is small,
an INSERT INTO with a GROUP BY query tries to write at least 200 files (the default value of 'spark.sql.shuffle.partitions'),
which results in many empty files.
I think this is undesirable because subsequent queries on the resulting table will also create zero-size partitions and launch unnecessary tasks that do nothing.
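A minimal sketch of why so many partitions end up empty, written in plain Python rather than Spark. It assumes one output row per distinct group-by key and uses `zlib.crc32` as a stand-in for Spark's actual partitioning hash; the function name `partition_sizes` and the three sample keys are hypothetical.

```python
import zlib


def partition_sizes(keys, num_partitions=200):
    """Count how many rows land in each shuffle partition.

    zlib.crc32 is only a stand-in for Spark's real hash function (assumption).
    """
    sizes = [0] * num_partitions
    for key in keys:
        index = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        sizes[index] += 1
    return sizes


# After GROUP BY there is one row per distinct key; assume 3 distinct keys.
sizes = partition_sizes(["a", "b", "c"])
empty = sum(1 for size in sizes if size == 0)
print(f"{empty} of {len(sizes)} partitions are empty")
```

With only three rows spread over 200 partitions, at most three partitions can hold data, so at least 197 of them are empty, and each would be written as an empty file if every partition produces one output file.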
Attachments
Issue Links
- breaks
  - SPARK-15393 Writing empty Dataframes doesn't save any _metadata files (Resolved)
- is duplicated by
  - SPARK-15393 Writing empty Dataframes doesn't save any _metadata files (Resolved)
  - SPARK-21105 Useless empty files in hive table (Resolved)
- relates to
  - SPARK-21435 Empty files should be skipped while write to file (Resolved)
- links to