[HADOOP-16085] S3Guard: use object version or etags to protect against inconsistent read after replace/overwrite - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.2.0
Fix Version/s: 3.3.0
Component/s: fs/s3
Labels: None

Release Note:

S3Guard will now track the etag of uploaded files and, if an S3 bucket is versioned, the object version. You can then control how to react to a mismatch between the data in the DynamoDB table and that in the store: warn, fail, or, when using versions, return the original value.

This adds two new columns to the table: etag and version. This is transparent to older S3A clients, but when such clients add or update data in the S3Guard table, they will not add these values. As a result, the etag/version checks will not work with files uploaded by older clients.

For a consistent experience, upgrade all clients to the latest Hadoop version.
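
The three reactions named in the release note (warn, fail, or return the originally recorded version) can be illustrated with a small, self-contained sketch. This is a toy model, not the S3A implementation: the class `MismatchPolicySketch`, the enum `Policy`, and the method `choose` are all hypothetical names invented for illustration. The `null` case models an entry written by an older client that did not record an etag or version.

```java
/** Toy model only: these names are hypothetical and are not part of S3A or S3Guard. */
public class MismatchPolicySketch {

    /** Hypothetical reactions to a table/store mismatch, mirroring the release note. */
    enum Policy { WARN, FAIL, USE_RECORDED_VERSION }

    /**
     * Decides which marker (etag or version id) a reader should trust.
     *
     * @param recorded marker stored in the metadata table, or null if an
     *                 older client wrote the entry without one
     * @param observed marker reported by the object store
     */
    static String choose(Policy policy, String recorded, String observed) {
        // No recorded marker: nothing to compare against, so no check is possible.
        if (recorded == null || recorded.equals(observed)) {
            return observed;
        }
        switch (policy) {
            case WARN:
                System.err.println("Mismatch: table has " + recorded
                        + " but store returned " + observed);
                return observed;
            case FAIL:
                throw new IllegalStateException("Mismatch: table has " + recorded
                        + " but store returned " + observed);
            case USE_RECORDED_VERSION:
                // Only meaningful for version ids on a versioned bucket,
                // where the recorded version can still be fetched.
                return recorded;
            default:
                throw new AssertionError("Unknown policy " + policy);
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(Policy.USE_RECORDED_VERSION, "v1", "v2"));
        System.out.println(choose(Policy.WARN, null, "v2"));
    }
}
```

In this sketch, `USE_RECORDED_VERSION` returns the recorded marker so the caller can request that exact version, which is only possible when the bucket keeps old versions.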

Description

Currently, S3Guard doesn't track S3 object versions. If a file is written in S3A with S3Guard and later overwritten, there is no protection against the next reader seeing the old version of the file instead of the new one.

It seems like the S3Guard metadata could track the S3 object version. When a file is created or updated, the object version could be written to the S3Guard metadata. When a file is read, the read out of S3 could be performed by object version, ensuring the correct version is retrieved.
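
The idea in the paragraph above can be sketched as a toy model. Everything here is an assumption made for illustration: `VersionedStore` stands in for a versioned bucket whose unversioned reads may lag, `MetadataTable` stands in for the S3Guard metadata, and none of the names are real S3A or AWS SDK APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: none of these names come from S3A or the AWS SDK. */
public class VersionTrackingSketch {

    /** Hypothetical stand-in for a versioned bucket. */
    static class VersionedStore {
        private final Map<String, String> dataByVersion = new HashMap<>();
        private final Map<String, String> latestVersion = new HashMap<>();
        private final Map<String, String> previousVersion = new HashMap<>();
        private int counter = 0;

        /** Stores a new version of the object and returns its generated version id. */
        String put(String key, String data) {
            String versionId = "v" + (++counter);
            String previous = latestVersion.put(key, versionId);
            if (previous != null) {
                // Remember the overwritten version to simulate a lagging read.
                previousVersion.put(key, previous);
            }
            dataByVersion.put(key + "@" + versionId, data);
            return versionId;
        }

        /** Unversioned read: simulates inconsistency by serving the overwritten version if any. */
        String getPossiblyStale(String key) {
            String v = previousVersion.getOrDefault(key, latestVersion.get(key));
            return dataByVersion.get(key + "@" + v);
        }

        /** Versioned read: always returns exactly the requested version. */
        String getByVersion(String key, String versionId) {
            return dataByVersion.get(key + "@" + versionId);
        }
    }

    /** Hypothetical stand-in for the S3Guard metadata table. */
    static class MetadataTable {
        private final Map<String, String> versionByKey = new HashMap<>();

        void record(String key, String versionId) { versionByKey.put(key, versionId); }

        String versionOf(String key) { return versionByKey.get(key); }
    }

    /** On create or update, write the object and record its version in the metadata. */
    static void write(VersionedStore store, MetadataTable table, String key, String data) {
        table.record(key, store.put(key, data));
    }

    /** On read, fetch by the recorded version rather than "whatever is latest". */
    static String read(VersionedStore store, MetadataTable table, String key) {
        return store.getByVersion(key, table.versionOf(key));
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        MetadataTable table = new MetadataTable();
        write(store, table, "data/part-0000", "old contents");
        write(store, table, "data/part-0000", "new contents");
        System.out.println("unversioned read: " + store.getPossiblyStale("data/part-0000"));
        System.out.println("versioned read:   " + read(store, table, "data/part-0000"));
    }
}
```

In this model the unversioned read returns the overwritten contents after the second write, while the read that goes through the recorded version id returns the new contents.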

I don't have a lot of direct experience with this yet, but this is my impression from looking through the code. My organization is looking to shift some datasets stored in HDFS over to S3 and is concerned about this potential issue as there are some cases in our codebase that would do an overwrite.

I imagine this idea may have been considered before but I couldn't quite track down any JIRAs discussing it. If there is one, feel free to close this with a reference to it.

Am I understanding things correctly? Is this idea feasible? Any feedback that could be provided would be appreciated. We may consider crafting a patch.

Attachments

- HADOOP-16085_002.patch (07/Feb/19 22:01, 63 kB, Ben Roling)
- HADOOP-16085_3.2.0_001.patch (01/Feb/19 16:30, 56 kB, Ben Roling)
- HADOOP-16085-003.patch (15/Mar/19 19:48, 129 kB, Ben Roling)

Issue Links

causes

HADOOP-16332 Remove S3A's depedency on http core

Resolved

contains

HADOOP-16370 S3AFileSystem copyFile to propagate etag/version from getObjectMetadata to copy request

Resolved

depends upon

HADOOP-16190 S3A copyFile operation to include source versionID or etag in the copy request

Resolved

is depended upon by

HADOOP-14936 S3Guard: remove "experimental" from documentation

Resolved

is related to

HADOOP-16313 multipart/huge file upload tests to look at checksums returned

Open

HADOOP-15625 S3A input stream to use etags/version number to detect changed source files

Resolved

HADOOP-15894 getFileChecksum() needs to adopt S3Guard

Resolved

HADOOP-16090 S3A Client to add explicit support for versioned stores

Resolved

HADOOP-16368 S3A list operation doesn't pick up etags from results

Resolved

supercedes

HADOOP-16190 S3A copyFile operation to include source versionID or etag in the copy request

Resolved

links to

GitHub Pull Request #646

GitHub Pull Request #675

GitHub Pull Request #794

GitHub Pull Request #803

GitHub Pull Request #807

GitHub Pull Request #824


People

Assignee: Ben Roling

Reporter: Ben Roling

Votes: 0

Watchers: 9

Dates

Created: 29/Jan/19 15:00

Updated: 05/Jan/22 16:58

Resolved: 19/May/19 21:39