Spark / SPARK-31375

Overwriting into dynamic partitions is appending data in pyspark


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: databricks, s3, EMR, PySpark

    Description

      When overwriting data in specific partitions using insertInto, Spark appends data to those partitions even though the save mode is overwrite. The property below is set in the config to ensure that we don't overwrite all partitions. If this property is set to static, the table is truncated and the data is inserted.

      spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

      df.write.mode('overwrite').format('parquet').insertInto('<db>.<tbl>')

      However, if the statement above is changed to

      df.write.mode('overwrite').format('parquet').insertInto('<db>.<tbl>', overwrite=True)

      it behaves correctly, i.e. it overwrites the data only in the specific partitions.

      It seems that even though the save mode is set on the writer, precedence is given to the overwrite parameter of the insertInto call: insertInto('<db>.<tbl>', overwrite=True)

      This happens in PySpark.
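      The difference between the static and dynamic modes described above can be sketched with a toy model (plain Python, not Spark internals): a "table" is a dict mapping a partition value to its rows. In static mode an overwrite insert truncates every partition first; in dynamic mode only the partitions present in the incoming data are replaced. The function name insert_overwrite and the date-string partitions are illustrative assumptions, not Spark API.

```python
def insert_overwrite(table, new_data, mode):
    """Toy model of INSERT OVERWRITE under spark.sql.sources.partitionOverwriteMode.

    table: dict mapping partition value -> list of rows (mutated in place)
    new_data: dict of incoming rows, keyed by partition value
    mode: 'static' or 'dynamic'
    """
    if mode == "static":
        # Static mode: the whole table is truncated before the insert.
        table.clear()
    else:
        # Dynamic mode: only partitions touched by the incoming data are
        # replaced; all other partitions keep their existing rows.
        for part in new_data:
            table.pop(part, None)
    for part, rows in new_data.items():
        table.setdefault(part, []).extend(rows)
    return table


table = {"2020-01-01": ["a"], "2020-01-02": ["b"]}
insert_overwrite(table, {"2020-01-02": ["c"]}, mode="dynamic")
# Dynamic: partition 2020-01-01 survives, 2020-01-02 is replaced.
# table == {"2020-01-01": ["a"], "2020-01-02": ["c"]}
```

      The bug report amounts to the writer behaving as if neither branch ran (a plain append into the matching partitions) unless overwrite=True is passed to insertInto explicitly.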

          People

            Assignee: Unassigned
            Reporter: Sai Krishna Chaitanya Chaganti
            Votes: 0
            Watchers: 3
