[SPARK-20236] Overwrite a partitioned data source table should only overwrite related partitions - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels:
- releasenotes

Description

When we overwrite a partitioned data source table, currently Spark will truncate the entire table to write new data, or truncate a bunch of partitions according to the given static partitions.

For example, INSERT OVERWRITE tbl ... will truncate the entire table, INSERT OVERWRITE tbl PARTITION (a=1, b) will truncate all the partitions that starts with a=1.

This behavior is kind of reasonable as we can know which partitions will be overwritten before runtime. However, hive has a different behavior that it only overwrites related partitions, e.g. INSERT OVERWRITE tbl SELECT 1,2,3 will only overwrite partition a=2, b=3, assuming tbl has only one data column and is partitioned by a and b.

It seems better if we can follow hive's behavior.

Attachments

Issue Links

is related to

SPARK-28945 Allow concurrent writes to different partitions with dynamic partition overwrite

Resolved

SPARK-30510 Publicly document options under spark.sql.*

Resolved

links to

[Github] Pull Request #18714 (cloud-fan)

[Github] Pull Request #20824 (steveloughran)

Activity

People

Assignee:: Wenchen Fan

Reporter:: Wenchen Fan

Votes:: 0 Vote for this issue

Watchers:: 20 Start watching this issue

Dates

Created:: 06/Apr/17 06:44

Updated:: 12/Feb/22 05:48

Resolved:: 03/Jan/18 14:20