Spark / SPARK-20236

Overwrite a partitioned data source table should only overwrite related partitions


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.2.0
    • Fix Version: 2.3.0
    • Component: SQL
    Description

      When we overwrite a partitioned data source table, Spark currently truncates the entire table before writing the new data, or truncates a set of partitions according to the given static partition values.

      For example, INSERT OVERWRITE tbl ... truncates the entire table, and INSERT OVERWRITE tbl PARTITION (a=1, b) truncates all partitions that start with a=1.

      This behavior is somewhat reasonable, since we can determine which partitions will be overwritten before runtime. However, Hive behaves differently: it only overwrites the partitions actually present in the new data. For example, INSERT OVERWRITE tbl SELECT 1,2,3 overwrites only the partition a=2, b=3, assuming tbl has a single data column and is partitioned by a and b.

      It seems better to follow Hive's behavior here.
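      The difference between the two strategies can be sketched in plain Python (an illustrative model only, not Spark's or Hive's actual implementation; the function names and partition representation are made up for this sketch):

      ```python
      # Illustrative model: which partitions does each overwrite strategy truncate?
      # A partition is modeled as a dict of partition-column values, e.g. {"a": 1, "b": 2}.

      def static_overwrite_truncates(existing_partitions, static_spec):
          """Current Spark behavior: truncate every existing partition matching
          the static partition spec (the whole table if the spec is empty)."""
          return [p for p in existing_partitions
                  if all(p.get(k) == v for k, v in static_spec.items())]

      def dynamic_overwrite_truncates(written_partitions):
          """Hive-style behavior: truncate only the partitions that actually
          appear in the data being written."""
          seen = []
          for p in written_partitions:
              if p not in seen:
                  seen.append(p)
          return seen

      # Table partitioned by (a, b) with four existing partitions.
      existing = [{"a": 1, "b": 1}, {"a": 1, "b": 2},
                  {"a": 2, "b": 3}, {"a": 2, "b": 4}]

      # INSERT OVERWRITE tbl PARTITION (a=1, b) ...  -> static spec {a: 1}:
      # truncates both partitions starting with a=1, regardless of the new data.
      print(static_overwrite_truncates(existing, {"a": 1}))

      # Hive-style: the new data contains only partition a=2, b=3,
      # so only that one partition is overwritten.
      print(dynamic_overwrite_truncates([{"a": 2, "b": 3}]))
      ```

      With no static spec at all (plain INSERT OVERWRITE tbl ...), the static model truncates every partition, which is exactly the table-wide truncation this issue proposes to avoid.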

      Attachments

        Issue Links

        Activity


          People

            Assignee: cloud_fan Wenchen Fan
            Reporter: cloud_fan Wenchen Fan
            Votes: 0
            Watchers: 21

            Dates

              Created:
              Updated:
              Resolved:
