Hadoop Common / HADOOP-9577

Actual data loss using s3n (against US Standard region)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.3
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

      The implementation of needsTaskCommit() assumes that the FileSystem used for writing temporary outputs is consistent. That happens not to be the case when using the S3 native filesystem in the US Standard region. It is actually quite common in larger jobs for the exists() call to return false even if the task attempt wrote output minutes earlier, which essentially cancels the commit operation with no error. That's real-life data loss right there, folks.
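
      For reference, the commit decision in question boils down to a single exists() probe. A simplified sketch of needsTaskCommit() as it appears in Hadoop 1.x's FileOutputCommitter (details trimmed, not the exact source):

          // Inside org.apache.hadoop.mapred.FileOutputCommitter (simplified).
          public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
              Path taskOutputPath = getTempTaskOutputPath(context);
              if (taskOutputPath != null) {
                  FileSystem fs = taskOutputPath.getFileSystem(context.getJobConf());
                  // On an eventually consistent store (s3n against US Standard),
                  // exists() can return false even though the attempt wrote its
                  // output, so the commit is silently skipped.
                  return fs.exists(taskOutputPath);
              }
              return false;
          }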

      The saddest part is that the Hadoop APIs do not seem to provide any legitimate means for the various RecordWriters to communicate with the OutputCommitter. In my projects I have created a static map of semaphores keyed by TaskAttemptID, which all my custom RecordWriters have to be aware of. That's pretty lame.
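
      A minimal sketch of that kind of workaround, with illustrative names and plain booleans standing in for the semaphores:

          // Hypothetical registry: RecordWriters record that they produced
          // output, and a custom OutputCommitter consults this map instead
          // of trusting FileSystem.exists() on an eventually consistent store.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import org.apache.hadoop.mapred.TaskAttemptID;

          public final class TaskOutputRegistry {
              private static final Map<TaskAttemptID, Boolean> WROTE_OUTPUT =
                  new ConcurrentHashMap<TaskAttemptID, Boolean>();

              /** Called by each custom RecordWriter once it has written output. */
              public static void markOutputProduced(TaskAttemptID attempt) {
                  WROTE_OUTPUT.put(attempt, Boolean.TRUE);
              }

              /** Called from a custom OutputCommitter.needsTaskCommit() override. */
              public static boolean outputProduced(TaskAttemptID attempt) {
                  return WROTE_OUTPUT.containsKey(attempt);
              }
          }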

           Activity

            qwertymaniac Harsh J added a comment -

            Thanks for the report Joshua! What would you suggest as a general solution for this? Disable use of temporary output locations (i.e. no FOC) for S3 automatically since it can never be consistent (which is the real bother)?

            pdeyhim Parviz Deyhim added a comment -

            Just as a clarification: US-Standard is eventually consistent. Other S3 regions/endpoints have read-after-write consistency: http://docs.aws.amazon.com/AmazonS3/latest/dev/LocationSelection.html

            stevel@apache.org Steve Loughran added a comment -

            Parviz, US-East makes no consistency guarantees; the others say "a PUT of a new blob is immediately visible" but make no stronger guarantees about DELETE and PUT over existing data. It's only slightly less risky.

            stevel@apache.org Steve Loughran added a comment -

            Joshua,

            The problem of blob store consistency is something I became more aware of during HADOOP-8545, where delete's eventual consistency was visible between test cases: some of the tests would get the data of the previous run. This was avoided by not reusing blob names, and by using OpenStack Swift's X-NEWEST header, which so far has met its promise of "returning the newest value".

            Even so, the other issue with blobstores is that most don't meet Hadoop's expectations of atomicity of directory create, rename and delete; rename is used during the commits of MR jobs. We expect a filesystem to "present a consistent single-instance view of a hierarchical filesystem across the cluster", an expectation very much broken by blobstores, which are really "a set of blobs with variable consistency guarantees, onto which we have to mimic directories through bulk blob operations".

            This is why we're trying to spell out our expectations more clearly in HADOOP-9361, under which I'm thinking we need to add an explicit Blobstore interface to indicate, at least to MapReduce, that whatever implements the FileSystem API isn't really a filesystem, and that we need to commit differently. If you can help define and fix the behaviour and verify it works, we'd love your participation.
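
            One shape such an interface might take (purely illustrative; nothing like this exists in the codebase at the time of this discussion):

                // Hypothetical marker interface. MapReduce could test a
                // FileSystem with "fs instanceof Blobstore" and switch to a
                // rename-free commit strategy when it matches.
                public interface Blobstore {
                    // Marker only: implemented by FileSystems (s3n, swift, ...)
                    // that cannot guarantee consistent listings or atomic
                    // directory rename.
                }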

            For now, something we should think about spelling out in more detail is "commit to HDFS then copy to S3", and "if you do output to a blobstore, make sure speculation is turned off".

            Making sure that DistCp works reliably with blobstores should be a first step in blobstore-aware work.
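
            Concretely, that combination might look like this in Hadoop 1.x job setup (paths and class names here are illustrative):

                import org.apache.hadoop.fs.Path;
                import org.apache.hadoop.mapred.FileOutputFormat;
                import org.apache.hadoop.mapred.JobConf;

                public class BlobstoreSafeSetup {
                    public static void configure(JobConf conf) {
                        // Speculative duplicates of a task can race each other on
                        // a blobstore, so switch speculation off for jobs writing
                        // to one.
                        conf.setMapSpeculativeExecution(false);
                        conf.setReduceSpeculativeExecution(false);
                        // Commit to HDFS first, where rename is consistent and atomic...
                        FileOutputFormat.setOutputPath(conf, new Path("hdfs:///user/me/out"));
                        // ...then copy to S3 once the job has succeeded, e.g.:
                        //   hadoop distcp hdfs:///user/me/out s3n://bucket/out
                    }
                }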

            Now:

            1. Was this Apache Hadoop 1.0.3 or the Amazon variant?
            2. Was speculative execution enabled?
            j_caplan Joshua Caplan added a comment -

            To answer your questions:
            1. This was the Amazon variant of 1.0.3 with Multipart Uploading disabled, although looking at the Apache code even as far forward as the 2.0.x branches, I don't see how the logic would be any different.
            2. Speculative execution was certainly enabled. I suppose if I could live without speculative execution for the jobs that write directly to S3, I could use the DirectFileOutputCommitter or some derivative to avoid this issue.

            Your comments about blobstores in general and Hadoop's assumptions about FileSystems are enlightening. I definitely prefer the commit to HDFS and copy later strategy.

            My reason for calling out this particular FileSystem side-effect assumption is that it seems like a gratuitous use for accomplishing the goal of deciding whether or not there's temporary output to commit. My workaround has been for the writers to tell me (in an admittedly clunky way) that they produced output. I can trust them more than I can trust the FileSystem.
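
            The committer side of that workaround might look roughly like this (illustrative, building on the hypothetical TaskOutputRegistry sketched in the description):

                import java.io.IOException;
                import org.apache.hadoop.mapred.FileOutputCommitter;
                import org.apache.hadoop.mapred.TaskAttemptContext;

                public class RegistryAwareCommitter extends FileOutputCommitter {
                    @Override
                    public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
                        // Trust what the writers recorded, not FileSystem.exists().
                        return TaskOutputRegistry.outputProduced(context.getTaskAttemptID());
                    }
                }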

            stevel@apache.org Steve Loughran added a comment -

            I'm going to close this as something we don't currently plan to fix in the Hadoop core codebase, given that Netflix's S3mper and EMR itself both offer a solution: a consistent metadata store backed by Amazon DynamoDB.

            The other way to get guaranteed create consistency is "don't use US East", which has no consistency guarantees; everywhere else offers create consistency, but not update or delete consistency.


            People

              Assignee: Unassigned
              Reporter: Joshua Caplan
              Votes: 1
              Watchers: 10
