SPARK-35299: DataFrame overwrite on S3A does not delete old files when an S3 object is put at the table path

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core

      Description

      To reproduce:

      test_table path: s3a://test_bucket/test_table/

       

      import boto3

      # test_table is backed by s3a://test_bucket/test_table/
      df = spark_session.sql("SELECT * FROM test_table")
      df.count()  # returns 1000

      ##### S3 operation #####
      # Put an empty object at the table's directory key.
      s3 = boto3.client("s3")
      s3.put_object(Bucket="test_bucket", Body="", Key="test_table/")
      ##### S3 operation #####

      # Overwrite the table with the same data.
      df.write.insertInto("test_table", overwrite=True)
      # The same happens with:
      # df.write.save(mode="overwrite", format="parquet", path="s3a://test_bucket/test_table")

      df = spark_session.sql("SELECT * FROM test_table")
      df.count()  # returns 2000: the old files were not removed

      The overwrite does not work correctly: the old data files under s3a://test_bucket/test_table/ are not deleted, so the table ends up containing both the old and the newly written rows (1000 rows become 2000).
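
      A quick way to verify this is to list the object keys under the table prefix before and after the overwrite. The following is a minimal sketch, assuming the test_bucket/test_table/ layout from the reproduction above and that boto3 credentials are configured; list_table_keys is just a helper for this sketch:

      import boto3

      def list_table_keys(bucket="test_bucket", prefix="test_table/"):
          # Helper for this sketch: return all object keys under the table prefix (handles pagination).
          s3 = boto3.client("s3")
          keys = []
          for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
              keys.extend(obj["Key"] for obj in page.get("Contents", []))
          return keys

      before = set(list_table_keys())
      df.write.insertInto("test_table", overwrite=True)
      after = set(list_table_keys())

      # A correct overwrite should leave none of the old part files behind;
      # in this reproduction the old keys are still present after the write.
      print("old keys still present:", sorted(before & after))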


              People

              • Assignee: Unassigned
              • Reporter: Yusheng Ding (yushengding)
              • Votes: 0
              • Watchers: 2
