Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20236

Overwrite a partitioned data source table should only overwrite related partitions

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels:

      Description

      When we overwrite a partitioned data source table, currently Spark will truncate the entire table to write new data, or truncate a bunch of partitions according to the given static partitions.

      For example, INSERT OVERWRITE tbl ... will truncate the entire table, INSERT OVERWRITE tbl PARTITION (a=1, b) will truncate all the partitions that starts with a=1.

      This behavior is kind of reasonable as we can know which partitions will be overwritten before runtime. However, hive has a different behavior that it only overwrites related partitions, e.g. INSERT OVERWRITE tbl SELECT 1,2,3 will only overwrite partition a=2, b=3, assuming tbl has only one data column and is partitioned by a and b.

      It seems better if we can follow hive's behavior.

        Attachments

          Activity

            People

            • Assignee:
              cloud_fan Wenchen Fan
              Reporter:
              cloud_fan Wenchen Fan
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: