Hadoop Common / HADOOP-9577

Actual data loss using s3n (against US Standard region)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.3
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

      The implementation of needsTaskCommit() assumes that the FileSystem used for writing temporary outputs is consistent. That happens not to be the case when using the S3 native filesystem in the US Standard region. It is actually quite common in larger jobs for the exists() call to return false even if the task attempt wrote output minutes earlier, which essentially cancels the commit operation with no error. That's real-life data loss right there, folks.
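
      For reference, the commit decision in question boils down to a single exists() probe. A simplified sketch of needsTaskCommit() as it appears in Hadoop 1.x's FileOutputCommitter (details trimmed, not the exact source):

          // Inside org.apache.hadoop.mapred.FileOutputCommitter (simplified).
          public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
              Path taskOutputPath = getTempTaskOutputPath(context);
              if (taskOutputPath != null) {
                  FileSystem fs = taskOutputPath.getFileSystem(context.getJobConf());
                  // On an eventually consistent store (s3n against US Standard),
                  // exists() can return false even though the attempt wrote its
                  // output, so the commit is silently skipped.
                  return fs.exists(taskOutputPath);
              }
              return false;
          }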

      The saddest part is that the Hadoop APIs do not seem to provide any legitimate means for the various RecordWriters to communicate with the OutputCommitter. In my projects I have created a static map of semaphores keyed by TaskAttemptID, which all my custom RecordWriters have to be aware of. That's pretty lame.
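
      A minimal sketch of that kind of workaround, with illustrative names and plain booleans standing in for the semaphores:

          // Hypothetical registry: RecordWriters record that they produced
          // output, and a custom OutputCommitter consults this map instead
          // of trusting FileSystem.exists() on an eventually consistent store.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import org.apache.hadoop.mapred.TaskAttemptID;

          public final class TaskOutputRegistry {
              private static final Map<TaskAttemptID, Boolean> WROTE_OUTPUT =
                  new ConcurrentHashMap<TaskAttemptID, Boolean>();

              /** Called by each custom RecordWriter once it has written output. */
              public static void markOutputProduced(TaskAttemptID attempt) {
                  WROTE_OUTPUT.put(attempt, Boolean.TRUE);
              }

              /** Called from a custom OutputCommitter.needsTaskCommit() override. */
              public static boolean outputProduced(TaskAttemptID attempt) {
                  return WROTE_OUTPUT.containsKey(attempt);
              }
          }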

           Activity

            qwertymaniac Harsh J added a comment -

            Thanks for the report Joshua! What would you suggest as a general solution for this? Disable use of temporary output locations (i.e. no FOC) for S3 automatically since it can never be consistent (which is the real bother)?

            pdeyhim Parviz Deyhim added a comment -

            Just as a clarification: US-Standard is eventually consistent. Other S3 regions/endpoints have read-after-write consistency: http://docs.aws.amazon.com/AmazonS3/latest/dev/LocationSelection.html

            stevel@apache.org Steve Loughran added a comment -

            Parviz, US-East makes no consistency guarantees; the others say "a PUT of a new blob is immediately visible" but make no stronger guarantees about DELETE and PUT over existing data. It's only slightly less risky.

            stevel@apache.org Steve Loughran added a comment -

            Joshua,

            The problem of blob store consistency is something I became more aware of during HADOOP-8545, where delete's eventual consistency was visible between test cases: some of the tests would get the data of the previous run. This was avoided by not reusing blob names, and by using OpenStack Swift's X-NEWEST header, which so far has met its promise of "returning the newest value".

            Even so, the other issue with blobstores is that most don't meet Hadoop's expectations of atomicity of directory create, rename and delete; rename is used during the commits of MR jobs. We expect a filesystem to "present a consistent single-instance view of a hierarchical filesystem across the cluster", an expectation very much broken by blobstores, which are really "a set of blobs with variable consistency guarantees, onto which we have to mimic directories through bulk blob operations".

            This is why we're trying to spell out our expectations more clearly in HADOOP-9361, under which I'm thinking we need to add an explicit Blobstore interface to indicate, at least to MapReduce, that whatever implements the FileSystem API isn't really a filesystem, and that we need to commit differently. If you can help define and fix the behaviour and verify it works, we'd love your participation.
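
            One shape such an interface might take (purely illustrative; nothing like this exists in the codebase at the time of this discussion):

                // Hypothetical marker interface. MapReduce could test a
                // FileSystem with "fs instanceof Blobstore" and switch to a
                // rename-free commit strategy when it matches.
                public interface Blobstore {
                    // Marker only: implemented by FileSystems (s3n, swift, ...)
                    // that cannot guarantee consistent listings or atomic
                    // directory rename.
                }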

            For now, something we should think about spelling out in more detail is "commit to HDFS then copy to S3", and "if you do output to a blobstore, make sure speculation is turned off".

            Making sure that DistCp works reliably with blobstores should be a first step in blobstore-aware work.
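
            Concretely, that combination might look like this in Hadoop 1.x job setup (paths and class names here are illustrative):

                import org.apache.hadoop.fs.Path;
                import org.apache.hadoop.mapred.FileOutputFormat;
                import org.apache.hadoop.mapred.JobConf;

                public class BlobstoreSafeSetup {
                    public static void configure(JobConf conf) {
                        // Speculative duplicates of a task can race each other on
                        // a blobstore, so switch speculation off for jobs writing
                        // to one.
                        conf.setMapSpeculativeExecution(false);
                        conf.setReduceSpeculativeExecution(false);
                        // Commit to HDFS first, where rename is consistent and atomic...
                        FileOutputFormat.setOutputPath(conf, new Path("hdfs:///user/me/out"));
                        // ...then copy to S3 once the job has succeeded, e.g.:
                        //   hadoop distcp hdfs:///user/me/out s3n://bucket/out
                    }
                }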

            Now:

            1. Was this Apache Hadoop 1.0.3 or the Amazon variant?
            2. Was speculative execution enabled?
            j_caplan Joshua Caplan added a comment -

            To answer your questions:
            1. This was the Amazon variant of 1.0.3 with Multipart Uploading disabled, although looking at the Apache code even as far forward as the 2.0.x branches, I don't see how the logic would be any different.
            2. Speculative execution was certainly enabled. I suppose if I could live without speculative execution for the jobs that write directly to S3, I could use the DirectFileOutputCommitter or some derivative to avoid this issue.

            Your comments about blobstores in general and Hadoop's assumptions about FileSystems are enlightening. I definitely prefer the commit to HDFS and copy later strategy.

            My reason for calling out this particular FileSystem side-effect assumption is that it seems like a gratuitous use for accomplishing the goal of deciding whether or not there's temporary output to commit. My workaround has been for the writers to tell me (in an admittedly clunky way) that they produced output. I can trust them more than I can trust the FileSystem.
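
            The committer side of that workaround might look roughly like this (illustrative, building on the hypothetical TaskOutputRegistry sketched in the description):

                import java.io.IOException;
                import org.apache.hadoop.mapred.FileOutputCommitter;
                import org.apache.hadoop.mapred.TaskAttemptContext;

                public class RegistryAwareCommitter extends FileOutputCommitter {
                    @Override
                    public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
                        // Trust what the writers recorded, not FileSystem.exists().
                        return TaskOutputRegistry.outputProduced(context.getTaskAttemptID());
                    }
                }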

            stevel@apache.org Steve Loughran added a comment -

            I'm going to close this as something we don't currently plan to fix in the Hadoop core codebase, given that Netflix's S3mper and EMR itself both offer a solution: a consistent metadata store backed by Amazon DynamoDB.

            The other way to get guaranteed create consistency is "don't use US East", which has no consistency guarantees; everywhere else offers create consistency, but not update or delete consistency.


            People

              Assignee: Unassigned
              Reporter: Joshua Caplan
              Votes: 1
              Watchers: 10
