
[SPARK-35299] Dataframe overwrite on S3A does not delete old files with S3 object-put to table path/


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      To reproduce:

      test_table path: s3a://test_bucket/test_table/
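
      For context, a minimal sketch of the setup this repro assumes: a 1000-row parquet table at that location. The schema and contents are hypothetical; the report does not specify them.

      from pyspark.sql import SparkSession

      spark_session = SparkSession.builder.getOrCreate()

      # Hypothetical 1000-row table; the real schema/contents are not given.
      spark_session.range(1000).write.saveAsTable(
          "test_table", format="parquet", path="s3a://test_bucket/test_table/"
      )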

       

      import boto3

      df = spark_session.sql("SELECT * FROM test_table")
      df.count()  # returns 1000 rows

      ##### S3 operation #####
      # Put an empty object at the table root, i.e. a zero-byte
      # "directory marker" object with key "test_table/".
      s3 = boto3.client("s3")
      s3.put_object(Bucket="test_bucket", Body="", Key="test_table/")
      ##### S3 operation #####

      df.write.insertInto("test_table", overwrite=True)
      # The same happens with:
      # df.write.save(mode="overwrite", format="parquet", path="s3a://test_bucket/test_table")

      df = spark_session.sql("SELECT * FROM test_table")
      df.count()  # returns 2000 rows: the old files were not deleted

       

      Overwrite is not functioning correctly: the old files are never deleted from S3, so the table ends up containing both the old and the new data (2000 rows instead of 1000).
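
      As a sanity check, the leftover objects can be listed with boto3 (a minimal sketch assuming the bucket and prefix from the repro above). After the overwrite, both the zero-byte marker and the superseded parquet part files are still present:

      import boto3

      s3 = boto3.client("s3")
      resp = s3.list_objects_v2(Bucket="test_bucket", Prefix="test_table/")
      for obj in resp.get("Contents", []):
          # Expect the zero-byte "test_table/" marker plus BOTH the old
          # and the new part files, since the overwrite deleted nothing.
          print(obj["Key"], obj["Size"])

      A plausible explanation, not confirmed in this report, is that the zero-byte marker changes how the S3A connector classifies the destination path during the overwrite's delete phase.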



People

    • Assignee: Unassigned
    • Reporter: Yusheng Ding
    • Votes: 0
    • Watchers: 1
