Details
- Task
- Status: Resolved
- Major
- Resolution: Duplicate
- 2.8.3, 3.2.1
- None
- None
Description
Writing data from Spark to S3 with the S3A URL scheme creates many folder-level delete markers.
Writing the same data with the S3 URL scheme does not create any delete markers at all.
Spark - 2.4.4
Hadoop - 3.2.1
EMR version - 6.0.0
Write Mode - Append
[hadoop@ip-192-0-161-212 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/27 07:37:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://ip-192-0-161-212.ec2.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1585294390030_0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("select 1 as a")
df: org.apache.spark.sql.DataFrame = [a: int]

scala> df.write.mode(org.apache.spark.sql.SaveMode.Append).save("s3://my-bucket/tmp/vijayant/test/s3/")

scala> df.write.mode(org.apache.spark.sql.SaveMode.Append).save("s3a://my-bucket/tmp/vijayant/test/s3a/")

scala>
Getting delete markers from `s3` write
aws s3api list-object-versions --bucket my-bucket --prefix tmp/vijayant/test/s3/
{
  "Versions": [
    {
      "LastModified": "2020-03-27T07:38:17.000Z",
      "VersionId": "V06OzeE7j221Tq7keSGj8bveCYyJFIcf",
      "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
      "StorageClass": "STANDARD",
      "Key": "tmp/vijayant/test/s3/_SUCCESS",
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "Size": 0
    },
    {
      "LastModified": "2020-03-27T07:38:16.000Z",
      "VersionId": "dLYtHDugLhFIdw2YHLFmoFOxXkm.21Wo",
      "ETag": "\"26e70a1e26c709e3e8498acd49cfaaa3-1\"",
      "StorageClass": "STANDARD",
      "Key": "tmp/vijayant/test/s3/part-00000-9d9a8925-f119-415d-b547-b742396e2ca7-c000.snappy.parquet",
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "Size": 384
    }
  ]
}
Getting delete markers from `s3a` write
aws s3api list-object-versions --bucket my-bucket --prefix tmp/vijayant/test/s3a/
{
  "DeleteMarkers": [
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "VersionId": "NJWRZMcb_eYYwCJh_isX4H1Ox6W362Wb",
      "Key": "tmp/vijayant/test/s3a/",
      "LastModified": "2020-03-27T07:39:11.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "F0h0mLcVVwkMtcHxd95Hj7BACL4Up_Q9",
      "Key": "tmp/vijayant/test/s3a/",
      "LastModified": "2020-03-27T07:39:10.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": ".sBcE6cXeggekOnSgZ4n7pyCDHnsLERK",
      "Key": "tmp/vijayant/test/s3a/",
      "LastModified": "2020-03-27T07:39:10.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "nzm39jiUPC4H0ZaS.5Shp0FYPnR8wNf9",
      "Key": "tmp/vijayant/test/s3a/",
      "LastModified": "2020-03-27T07:39:09.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "BPM65R1HkZngPDYtDL3zPZYPw_G_m9Ic",
      "Key": "tmp/vijayant/test/s3a/",
      "LastModified": "2020-03-27T07:39:08.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "VersionId": "LJt8_MVDOiD4UdgUqEMycxjvtinJlTNt",
      "Key": "tmp/vijayant/test/s3a/_temporary/",
      "LastModified": "2020-03-27T07:39:11.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "RqunJTn8Od0PgFR4yu44PX4kL54k6EDv",
      "Key": "tmp/vijayant/test/s3a/_temporary/",
      "LastModified": "2020-03-27T07:39:09.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "4vY8cnqUI5VJAk3VfEt_VD_KEczo3bmY",
      "Key": "tmp/vijayant/test/s3a/_temporary/",
      "LastModified": "2020-03-27T07:39:08.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "VersionId": "ln47YYy.yiE.k70cvqvfgYCEQoYFnKQW",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/",
      "LastModified": "2020-03-27T07:39:11.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "5Bsrt7s1caM90mzGNgk0MsTU9q8UjTTA",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/",
      "LastModified": "2020-03-27T07:39:09.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "VersionId": "pN3HzDfnmqIqrMwAL2jqKEBkvoHZALor",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/_temporary/",
      "LastModified": "2020-03-27T07:39:11.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "wg91poO1KXReXxvsZHzZXrHR1IgIX8t2",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/_temporary/",
      "LastModified": "2020-03-27T07:39:09.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "VersionId": "cv5Noykq3sMilQqJXAH3E.N7qAWnIBx7",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/_temporary/attempt_20200327073907_0001_m_000000_1/",
      "LastModified": "2020-03-27T07:39:11.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "VersionId": "6xzt9SxlCUJaOLD8krkE3yXfQU14rErX",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/_temporary/attempt_20200327073907_0001_m_000000_1/",
      "LastModified": "2020-03-27T07:39:09.000Z"
    },
    {
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "VersionId": "wGmJAo7x_gkLWAiHzxPGdPMVSus7Wcp1",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/_temporary/attempt_20200327073907_0001_m_000000_1/part-00000-3923e1b1-406c-4202-b9a8-3bd7cb2d97b2-c000.snappy.parquet",
      "LastModified": "2020-03-27T07:39:10.000Z"
    }
  ],
  "Versions": [
    {
      "LastModified": "2020-03-27T07:39:11.000Z",
      "VersionId": "2py_ZXKl7yh6fwhzksAx8Os1BriDJCBb",
      "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
      "StorageClass": "STANDARD",
      "Key": "tmp/vijayant/test/s3a/_SUCCESS",
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "Size": 0
    },
    {
      "LastModified": "2020-03-27T07:39:08.000Z",
      "VersionId": "lDqTnLCqDYtjrOiY.V7E6AKTRQLKrqUT",
      "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
      "StorageClass": "STANDARD",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/",
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "Size": 0
    },
    {
      "LastModified": "2020-03-27T07:39:10.000Z",
      "VersionId": "g.rGoTDdmrGrNjrLchvwz3jMmGePkgiD",
      "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
      "StorageClass": "STANDARD",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/_temporary/attempt_20200327073907_0001_m_000000_1/",
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "Size": 0
    },
    {
      "LastModified": "2020-03-27T07:39:09.000Z",
      "VersionId": ".ZCpY2UW4hRlbLL87dFUJRuk021Hyq8p",
      "ETag": "\"3def7238a0858c17c62d7045290175cf\"",
      "StorageClass": "STANDARD",
      "Key": "tmp/vijayant/test/s3a/_temporary/0/_temporary/attempt_20200327073907_0001_m_000000_1/part-00000-3923e1b1-406c-4202-b9a8-3bd7cb2d97b2-c000.snappy.parquet",
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": false,
      "Size": 384
    },
    {
      "LastModified": "2020-03-27T07:39:10.000Z",
      "VersionId": "JSNjTDHSQqe9zSAV93bc6TXPuqA.vDJE",
      "ETag": "\"3def7238a0858c17c62d7045290175cf\"",
      "StorageClass": "STANDARD",
      "Key": "tmp/vijayant/test/s3a/part-00000-3923e1b1-406c-4202-b9a8-3bd7cb2d97b2-c000.snappy.parquet",
      "Owner": { "DisplayName": "sysops+stage", "ID": "08939105f417dc74b1fa237e211185ff2d9f528d54b1380501de07bd0657b5e1" },
      "IsLatest": true,
      "Size": 384
    }
  ]
}
This in turn makes listing objects slow and we have even noticed timeouts due to too many delete markers.
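To see which prefixes accumulate the most delete markers, the JSON printed by `aws s3api list-object-versions` can be tallied per key. The following is a minimal sketch, not part of the original report: the function name `count_delete_markers` and the abbreviated `sample` data are hypothetical, and it assumes the listing has already been parsed into a Python dict (for example with `json.load` on saved CLI output).

```python
from collections import Counter


def count_delete_markers(listing):
    """Count delete markers per key in a list-object-versions response dict."""
    # "DeleteMarkers" is absent when there are none, as in the s3 listing above.
    return Counter(marker["Key"] for marker in listing.get("DeleteMarkers", []))


# Hypothetical, abbreviated sample in the shape of the s3a listing above;
# only the "Key" field is kept because it is the only field the tally reads.
sample = {
    "DeleteMarkers": [
        {"Key": "tmp/vijayant/test/s3a/"},
        {"Key": "tmp/vijayant/test/s3a/"},
        {"Key": "tmp/vijayant/test/s3a/_temporary/"},
    ],
    "Versions": [{"Key": "tmp/vijayant/test/s3a/_SUCCESS"}],
}

counts = count_delete_markers(sample)
for key, n in counts.most_common():
    print(n, key)
print("total delete markers:", sum(counts.values()))
```

Running the same tally after each append job would show whether the marker count on the destination prefix keeps growing with every write.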
Issue Links
- duplicates
  - HADOOP-13230 S3A to optionally retain directory markers (Resolved)
- relates to
  - HADOOP-13811 s3a: getFileStatus fails with com.amazonaws.AmazonClientException: Failed to sanitize XML document destined for handler class (Resolved)
  - HADOOP-13421 Switch to v2 of the S3 List Objects API in S3A (Resolved)
  - HADOOP-16090 S3A Client to add explicit support for versioned stores (Resolved)