SPARK-10216: Avoid creating empty files during overwrite into Hive table with group by query


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      The Exchange introduced by a GROUP BY query produces at least the number of partitions specified in 'spark.sql.shuffle.partitions'.
      Hence, even when the number of distinct group-by keys is small,
      an INSERT INTO with a GROUP BY query tries to write at least 200 files (the default value of 'spark.sql.shuffle.partitions'),
      most of which are empty.
      I think this is undesirable because subsequent queries on the resulting table will also create zero-size partitions and launch unnecessary tasks that do nothing.
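
      The arithmetic behind the empty files can be illustrated with a small standalone sketch. It is not Spark code: the function name count_partition_rows is hypothetical, and zlib.crc32 is only a stand-in for Spark's internal hash partitioning. It assumes, as described above, that each shuffle partition becomes one output file.

```python
import zlib


def count_partition_rows(keys, num_partitions=200):
    """Return how many aggregated rows land in each shuffle partition.

    Hypothetical helper: crc32 stands in for Spark's real hash function.
    After GROUP BY there is one row per distinct key.
    """
    counts = [0] * num_partitions
    for key in keys:
        # Route each key to a partition by hashing it modulo the partition count.
        idx = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        counts[idx] += 1
    return counts


# Three distinct group-by keys spread over the default 200 partitions.
distinct_keys = ["a", "b", "c"]
counts = count_partition_rows(distinct_keys)

non_empty = sum(1 for c in counts if c > 0)
empty = len(counts) - non_empty
print(f"{non_empty} non-empty partitions, {empty} empty partitions")
```

      With k distinct keys and 200 partitions, at most k partitions can receive a row, so at least 200 - k of the written files would be empty under this assumption.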


          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Keuntae Park (sirpkt)
            Votes: 0
            Watchers: 6
