[HADOOP-13230] S3A to optionally retain directory markers - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.9.0
Fix Version/s: 3.3.1
Component/s: fs/s3
Labels:
- pull-request-available

Hadoop Flags:

Incompatible change
Release Note:

The S3A connector now has an option to stop deleting directory markers as files are written. This eliminates the IO throttling that these delete operations can cause, and avoids creating tombstone markers on versioned S3 buckets.

This feature is incompatible with all versions of Hadoop which lack the HADOOP-17199 change to list and getFileStatus calls.

Consult the S3A documentation for further details.

External issue URL:
https://issues.cloudera.org/browse/IMPALA-3558
External issue ID:
https://issues.cloudera.org/browse/IMPALA-3558

Description

Users of s3a may not realize that, in some cases, it does not interoperate well with other S3 tools, such as the AWS CLI (see HIVE-13778, IMPALA-3558).

Specifically, if a user:

1. Creates an empty directory with hadoop fs -mkdir s3a://bucket/path
2. Copies data into that directory via another tool, e.g. the AWS CLI.
3. Tries to access the data in that directory with any Hadoop software.

Then the last step fails: the fake empty-directory blob that s3a wrote in the first step causes s3a (listStatus() etc.) to keep treating the directory as empty, even though the second step populated it with data.
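
The failure mode can be sketched with a toy in-memory model. This is an illustration only, not S3A's actual code: ToyObjectStore, mkdir and both probe functions are hypothetical names, and the model assumes a directory marker is simply a zero-byte object whose key ends in "/".

```python
class ToyObjectStore:
    """In-memory stand-in for a flat object store; keys are plain strings."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data=b""):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # Return every key under the prefix, including a marker if present.
        return sorted(k for k in self.objects if k.startswith(prefix))


def mkdir(store, path):
    # Fake an empty directory with a zero-byte object whose key ends in "/".
    store.put(path.rstrip("/") + "/")


def is_empty_dir_trusting_marker(store, path):
    # Hypothetical probe that treats the marker as proof of emptiness.
    marker = path.rstrip("/") + "/"
    if marker in store.objects:
        return True
    return not store.list_prefix(marker)


def is_empty_dir_marker_aware(store, path):
    # Marker-tolerant probe: list under the prefix and ignore the marker itself.
    marker = path.rstrip("/") + "/"
    children = [k for k in store.list_prefix(marker) if k != marker]
    return not children


store = ToyObjectStore()
mkdir(store, "path")                         # step 1: marker "path/" is written
store.put("path/data.csv", b"a,b\n1,2\n")    # step 2: external tool leaves the marker
legacy_view = is_empty_dir_trusting_marker(store, "path")
aware_view = is_empty_dir_marker_aware(store, "path")
print(legacy_view, aware_view)  # prints: True False
```

In this model the marker-trusting probe still reports the directory as empty after the external write, while the probe that lists under the prefix and skips the marker sees the new object.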

I wanted to document this fact for users. We may mark this as won't-fix, "by design". It may also be interesting to brainstorm solutions and/or a config option to change the behavior, if folks care.

Attachments


2020-02-Fixing the S3A directory marker problem.pdf
26/Aug/20 09:48
128 kB
Steve Loughran

Issue Links

breaks

HADOOP-17403 S3A ITestPartialRenamesDeletes.testRenameDirFailsInDelete failure: missing directory marker

Resolved

causes

HADOOP-17244 HADOOP-17244. S3A directory delete tombstones dir markers prematurely.

Resolved

HADOOP-17261 s3a rename() now requires s3:deleteObjectVersion permission

Resolved

HADOOP-17293 S3A to always probe S3 in S3A getFileStatus on non-auth paths

Resolved

contains

HADOOP-17200 Renaming a file under a sibling empty directory doesn't delete dest dir's marker

Resolved

HADOOP-13430 Optimize getFileStatus in S3A

Resolved

HADOOP-16493 S3AFilesystem.initiateRename() can skip check on dest.parent status if src has same parent

Resolved

Dependency

HADOOP-17200 Renaming a file under a sibling empty directory doesn't delete dest dir's marker

Resolved

fixes

HADOOP-17217 S3A FileSystem does not correctly delete directories with fake entries

Resolved

is depended upon by

HADOOP-17217 S3A FileSystem does not correctly delete directories with fake entries

Resolved

SPARK-35299 Dataframe overwrite on S3A does not delete old files with S3 object-put to table path/

Resolved

is duplicated by

HADOOP-16846 add experimental optimization of s3a directory marker handling

Resolved

HADOOP-16942 S3A creating folder level delete markers

Resolved

is related to

HADOOP-14255 S3A to delete unnecessary fake directory objects in mkdirs()

Resolved

HADOOP-16804 s3a mkdir path/ can add 404 to S3 load balancers

Resolved

HADOOP-17199 Backport HADOOP-13230 list/getFileStatus changes for preserved directory markers

Resolved

HADOOP-17227 improve s3guard markers command line tool

Resolved

HADOOP-17228 Backport HADOOP-13230 listing changes for preserved directory markers to 3.1.x

Resolved

HADOOP-18752 Change fs.s3a.directory.marker.retention to "keep"

Resolved

HADOOP-14124 S3AFileSystem silently deletes "fake" directories when writing a file.

Resolved

HADOOP-17359 [Hadoop-Tools]S3A MultiObjectDeleteException after uploading a file

Resolved

relates to

IMPALA-3558 DROP TABLE PURGE on S3A table may not delete externally written files

Resolved

HADOOP-13164 Optimize S3AFileSystem::deleteUnnecessaryFakeDirectories

Resolved

supercedes

HADOOP-16090 S3A Client to add explicit support for versioned stores

Resolved

links to

commentable document

GitHub Pull Request #1861

GitHub Pull Request #2149

https://github.com/apache/hadoop/pull/2149


Activity

People

Assignee: Steve Loughran

Reporter: Aaron Fabbri

Votes: 0

Watchers: 21

Dates

Created: 01/Jun/16 19:37

Updated: 24/May/23 17:54

Resolved: 15/Aug/20 19:24

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

50m