Spark / SPARK-10216

Avoid creating empty files during overwrite into Hive table with group by query

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      The Exchange produced by a GROUP BY query yields at least as many partitions as specified in 'spark.sql.shuffle.partitions'.
      Hence, even when the number of distinct group-by keys is small,
      an INSERT INTO with a GROUP BY query tries to write at least 200 files (the default value of 'spark.sql.shuffle.partitions'),
      most of which are empty.
      I think this is undesirable because subsequent queries on the resulting table will also create zero-size partitions and launch unnecessary tasks that do nothing.
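
The effect described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Spark code: `partition_for` and `simulate_group_by` are hypothetical helpers, and CRC32 stands in for whatever hash the real partitioner uses. It assumes each shuffle partition is written out as one file, so a partition that receives no keys becomes an empty file.

```python
import zlib
from collections import defaultdict

# Default value of spark.sql.shuffle.partitions.
NUM_SHUFFLE_PARTITIONS = 200


def partition_for(key: str, num_partitions: int) -> int:
    """Toy stand-in for a hash partitioner (assumption: not Spark's actual hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def simulate_group_by(rows, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Mimic SELECT key, count(*) ... GROUP BY key, one dict per shuffle partition."""
    partitions = [defaultdict(int) for _ in range(num_partitions)]
    for key in rows:
        partitions[partition_for(key, num_partitions)][key] += 1
    return partitions


# Only three distinct group-by keys, repeated many times.
rows = ["a", "b", "c"] * 1000
partitions = simulate_group_by(rows)

non_empty = sum(1 for p in partitions if p)
empty = len(partitions) - non_empty
print(f"{non_empty} non-empty output files, {empty} empty output files")
```

With three distinct keys, at most three partitions can hold data, so at least 197 of the 200 simulated output files are empty regardless of how many input rows there are.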

            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: Keuntae Park (sirpkt)
              Votes: 0
              Watchers: 6
