  1. Spark
  2. SPARK-10216

Avoid creating empty files during overwrite into Hive table with group by query

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels:
      None

      Description

      The Exchange introduced by a GROUP BY query produces at least the number of partitions specified in 'spark.sql.shuffle.partitions'.
      Hence, even when the number of distinct group-by keys is small,
      an INSERT INTO with a GROUP BY query tries to write at least 200 files (the default value of 'spark.sql.shuffle.partitions'),
      most of which are empty.
      I think this is undesirable because subsequent queries on the resulting table will also create zero-size partitions and launch unnecessary tasks that do no work.
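
      The effect described above can be illustrated with a small simulation that does not use Spark. It is only a sketch under stated assumptions: `partition_sizes` is a hypothetical helper, and `zlib.crc32` stands in for Spark's own partitioning hash, which is a different function. The point is only that hashing a handful of distinct keys into 200 buckets leaves nearly all buckets empty, and each empty bucket would correspond to an empty output file.

```python
import zlib

# Default shuffle partition count cited in the description above.
NUM_SHUFFLE_PARTITIONS = 200


def partition_sizes(keys, num_partitions):
    """Count how many distinct group-by keys land in each partition.

    Hypothetical helper: crc32 is a stand-in hash chosen for determinism,
    not the hash function Spark actually uses.
    """
    sizes = [0] * num_partitions
    for key in set(keys):  # GROUP BY leaves one output row per distinct key
        bucket = zlib.crc32(key.encode("utf-8")) % num_partitions
        sizes[bucket] += 1
    return sizes


# Six input rows, but only three distinct group-by keys.
rows = ["a", "b", "c", "a", "b", "a"]
sizes = partition_sizes(rows, NUM_SHUFFLE_PARTITIONS)
empty = sum(1 for size in sizes if size == 0)
print(f"{len(sizes)} partitions, {empty} empty")
```

      With three distinct keys, at most three partitions can hold data, so at least 197 of the 200 partitions are empty regardless of the hash function used.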

              People

              • Assignee:
                hyukjin.kwon Hyukjin Kwon
                Reporter:
                sirpkt Keuntae Park