Spark / SPARK-31375

Overwriting into dynamic partitions is appending data in pyspark


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: databricks, s3, EMR, PySpark

    Description

      When overwriting data in specific partitions using insertInto, Spark appends data to those partitions even though the save mode is overwrite. The property below is set in the config to ensure that we don't overwrite all partitions. If this property is set to static, the table is truncated and the data is inserted.

      spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

      df.write.mode('overwrite').format('parquet').insertInto('<db>.<tbl>')

      However, if the statement above is changed to

      df.write.mode('overwrite').format('parquet').insertInto('<db>.<tbl>', overwrite=True)

      it behaves correctly, i.e. it overwrites the data only in the specific partitions.

      It seems that even though the save mode is set on the writer, precedence is given to the overwrite parameter of the insertInto call: insertInto('<db>.<tbl>', overwrite=True)

      This happens in PySpark.
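      The difference between the static and dynamic modes described above can be sketched with a toy model (plain Python, not Spark internals): a "table" is a dict mapping a partition value to its rows. In static mode an overwrite insert truncates every partition first; in dynamic mode only the partitions present in the incoming data are replaced. The function name insert_overwrite and the date-string partitions are illustrative assumptions, not Spark API.

```python
def insert_overwrite(table, new_data, mode):
    """Toy model of INSERT OVERWRITE under spark.sql.sources.partitionOverwriteMode.

    table: dict mapping partition value -> list of rows (mutated in place)
    new_data: dict of incoming rows, keyed by partition value
    mode: 'static' or 'dynamic'
    """
    if mode == "static":
        # Static mode: the whole table is truncated before the insert.
        table.clear()
    else:
        # Dynamic mode: only partitions touched by the incoming data are
        # replaced; all other partitions keep their existing rows.
        for part in new_data:
            table.pop(part, None)
    for part, rows in new_data.items():
        table.setdefault(part, []).extend(rows)
    return table


table = {"2020-01-01": ["a"], "2020-01-02": ["b"]}
insert_overwrite(table, {"2020-01-02": ["c"]}, mode="dynamic")
# Dynamic: partition 2020-01-01 survives, 2020-01-02 is replaced.
# table == {"2020-01-01": ["a"], "2020-01-02": ["c"]}
```

      The bug report amounts to the writer behaving as if neither branch ran (a plain append into the matching partitions) unless overwrite=True is passed to insertInto explicitly.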

          People

            Assignee: Unassigned
            Reporter: Sai Krishna Chaitanya Chaganti
            Votes: 0
            Watchers: 3
