Description
The S3A input stream doesn't handle changing source files any better than the other cloud store connectors. Specifically: it doesn't notice that the file has changed, it caches the length from startup, and whenever a seek triggers a new GET, you may get old data, new data, or perhaps even go from new data back to old data due to eventual consistency.
We can't do anything to stop this, but we could detect changes by:
- caching the etag of the first HEAD/GET (we don't get that HEAD on open with S3Guard, BTW)
- on future GET requests, verify the etag of the response
- raise an IOE if the remote file changed during the read.
It's a more dramatic failure, but it stops changes silently corrupting things.
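The three steps above could be sketched roughly as follows. This is a minimal illustration only, not the S3A implementation: the class `EtagTracker`, the exception `FileChangedDuringReadException`, and the `verify()` method are hypothetical names, and the sketch assumes the caller extracts the etag string from each HEAD/GET response and passes it in.

```java
import java.io.IOException;

/**
 * Hypothetical sketch: remembers the etag of the first response and
 * rejects any later response whose etag differs.
 */
class EtagTracker {

  /** Hypothetical IOException subclass raised when the remote file changed. */
  static class FileChangedDuringReadException extends IOException {
    FileChangedDuringReadException(String message) {
      super(message);
    }
  }

  private final String path;
  /** Etag seen on the first HEAD/GET; null until a response carries one. */
  private String expectedEtag;

  EtagTracker(String path) {
    this.path = path;
  }

  /**
   * Call once per HEAD/GET response.
   * @param etag etag from the response, or null if the store returned none
   * @param position stream offset of the request, used only in the message
   */
  synchronized void verify(String etag, long position)
      throws FileChangedDuringReadException {
    if (etag == null) {
      // Nothing to compare against: this sketch skips the check.
      return;
    }
    if (expectedEtag == null) {
      // First response with an etag: cache it.
      expectedEtag = etag;
      return;
    }
    if (!expectedEtag.equals(etag)) {
      throw new FileChangedDuringReadException(
          path + " changed during read at position " + position
              + ": expected etag " + expectedEtag + " but got " + etag);
    }
  }

  synchronized String getExpectedEtag() {
    return expectedEtag;
  }
}
```

In this sketch the stream would call `verify()` after the initial HEAD (when there is one) and after every GET that a seek or reopen triggers, so a mismatch surfaces as an IOException at the read that first observes it.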
Attachments
Issue Links
- is related to
  - HADOOP-16313 multipart/huge file upload tests to look at checksums returned (Open)
  - HADOOP-15999 S3Guard: Better support for out-of-band operations (Resolved)
  - HADOOP-16090 S3A Client to add explicit support for versioned stores (Resolved)
- is required by
  - HADOOP-15751 AWS Data read stack trace in S3a putObjectDirect (Resolved)
- relates to
  - HADOOP-16085 S3Guard: use object version or etags to protect against inconsistent read after replace/overwrite (Resolved)
  - HADOOP-15894 getFileChecksum() needs to adopt S3Guard (Resolved)
  - HADOOP-16202 Enhance openFile() for better read performance against object stores (Resolved)
- requires
  - HADOOP-15229 Add FileSystem builder-based openFile() API to match createFile(); S3A to implement S3 Select through this API. (Resolved)
- links to