Hadoop HDFS / HDFS-1539

Prevent data loss when a cluster suffers a power loss

    Details

    • Hadoop Flags:
      Reviewed

      Description

      We have seen an instance where an external outage caused many datanodes to reboot at around the same time. This resulted in many corrupted blocks. These were recently written blocks; the current implementation of the HDFS datanode does not sync the data of a block file when the block is closed.

      1. Have a cluster-wide config setting that causes the datanode to sync a block file when a block is finalized.
      2. Introduce a new parameter to FileSystem.create() to trigger the new behaviour, i.e. cause the datanode to sync a block file when it is finalized.
      3. Implement FSDataOutputStream.hsync() to cause all data written to the specified file to be written to stable storage.
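The sync in the options above ultimately comes down to flushing user-space buffers and forcing the OS to push the file (and its metadata) to stable storage before the block is reported finalized. A minimal JDK-only sketch of that sequence (file names here are hypothetical; the actual patch does the equivalent inside the datanode):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SyncOnCloseSketch {
    // Write a block file and force it to stable storage before "finalizing",
    // mirroring what a sync-on-close setting asks the datanode to do.
    static void writeAndSync(Path blockFile, byte[] data, boolean syncOnClose)
            throws IOException {
        try (FileOutputStream out = new FileOutputStream(blockFile.toFile())) {
            out.write(data);
            out.flush();                      // drain user-space buffers
            if (syncOnClose) {
                // force(true) also syncs file metadata, like fsync(2)
                out.getChannel().force(true);
            }
        } // close() alone does NOT guarantee the data survives a power loss
    }

    public static void main(String[] args) throws IOException {
        Path blk = Files.createTempFile("blk_", ".data");
        writeAndSync(blk, "hello".getBytes(StandardCharsets.UTF_8), true);
        System.out.println(Files.size(blk)); // prints 5
        Files.delete(blk);
    }
}
```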

      Attachments

      1. syncOnClose1.txt (6 kB) - dhruba borthakur
      2. syncOnClose2_b-1.txt (6 kB) - Tsz Wo Nicholas Sze
      3. syncOnClose2.txt (6 kB) - dhruba borthakur

        Issue Links

          Activity

          dhruba borthakur added a comment -

          We have seen this problem on a cluster that is purely used for archival purposes. I propose that we implement Option 1 listed above.

          Allen Wittenauer added a comment -

          Is there a reason why the datanode just shouldn't sync anyway? [i.e., is it really worth it to make this configurable?]

          Todd Lipcon added a comment -

          @Allen: some file systems, if you sync() one file, will end up syncing all files, essentially. So it could be a moderately big performance hit, though it would be worth benchmarking terasort with/without - it should be fairly obvious if it's a killer.

          dhruba borthakur added a comment -

          Here is a patch that makes the datanode flush and sync all data and metadata of a block file to disk when the block is closed. This occurs only if dfs.datanode.synconclose is set to true. The default value of dfs.datanode.synconclose is false.

          If the admin does not set any value for the new config parameter, then the behaviour of the datanode stays the same as it was prior to this patch.
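For reference, opting in would then be a one-property change in hdfs-site.xml (property name taken from the comment above; the default shown matches the described behaviour):

```xml
<!-- hdfs-site.xml: ask datanodes to flush and sync block data and metadata
     when a block is finalized. Defaults to false (pre-patch behaviour). -->
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>
```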

          dhruba borthakur added a comment -

          @Allen: Thanks for your comments. I have kept the default behaviour as it is now, especially because I do not want any existing installations to see bad performance behaviour when they run with this patch. (On some customer sites, it is possible that they have enough redundant power supplies that they never have to configure this patch to be turned on.)

          Todd Lipcon added a comment -

          dhruba: do you plan to run this on your warehouse cluster or just scribe tiers? If so it would be very interesting to find out whether it affects throughput. If there is no noticeable hit I would argue to make it the default.

          dhruba borthakur added a comment -

          I could make it the default, but I would like to hear the opinion of many people who are running Hadoop clusters. Also, performance numbers could vary a lot based on the operating system and file system (CentOS, Red Hat, Windows, ext4, xfs), so it would be difficult to get it right based solely on performance. On the other hand, if the entire community thinks that it is better to have a default that prevents data loss at all costs, then this could be the default. If the debate on either side is fierce, then I would like to get this in first and then open another JIRA to debate the default settings.

          We are definitely going to deploy this first on our "archival" cluster. This is a cluster that is used purely to back up and restore data from MySQL databases.

          Todd Lipcon added a comment -

          Yep, I certainly didn't intend to block this JIRA. What you've done here is definitely prudent, and we can debate/benchmark turning it on by default in another JIRA.

          M. C. Srivas added a comment -

          Dhruba, so if there's a file with 20 blocks on 20 different servers, with 3 replicas each, we might potentially end up sync'ing 41 servers (= 1 primary + 20*2 replicas) when closing the file, correct?

          dhruba borthakur added a comment -

          If there is a file with 20 blocks and each block has three replicas, then there will be a total of 60 fflush calls; this does not depend on the number of servers.

          Hairong Kuang added a comment -

          BlockReceiver#cout should set to be streams.checksumOut, right?

          dhruba borthakur added a comment -

          The first change in BlockReceiver.java is

          
          -        this.checksumOut = new DataOutputStream(new BufferedOutputStream(
          -                                                  streams.checksumOut,
          -                                                  SMALL_BUFFER_SIZE));
          +        this.cout = new BufferedOutputStream(streams.checksumOut,
          +                                                  SMALL_BUFFER_SIZE);
          +        this.checksumOut = new DataOutputStream(this.cout);

          is this what you meant?

          Hairong Kuang added a comment -

          Yes, should
          + this.cout = new BufferedOutputStream(streams.checksumOut,
          + SMALL_BUFFER_SIZE);
          be this.cout = streams.checksumOut?
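Read together, the suggestion is to keep cout as the raw, unbuffered checksum stream (so sync-on-close can reach its file descriptor) and leave the buffering inside the DataOutputStream wrapper only. A self-contained sketch of that stream layering, using a ByteArrayOutputStream as a stand-in for the datanode's real FileOutputStream (the names SMALL_BUFFER_SIZE, cout, and checksumOut mirror the snippet above; this is not the committed patch):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CoutSketch {
    public static void main(String[] args) throws IOException {
        final int SMALL_BUFFER_SIZE = 512;
        // stand-in for streams.checksumOut (a FileOutputStream in the datanode)
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        // keep the unbuffered stream so a later sync can reach the device
        OutputStream cout = raw;
        // writers still go through a buffer for small checksum writes
        DataOutputStream checksumOut =
            new DataOutputStream(new BufferedOutputStream(cout, SMALL_BUFFER_SIZE));
        checksumOut.writeInt(42);
        checksumOut.flush(); // drain the buffer into cout before any sync/close
        System.out.println(raw.size()); // prints 4
    }
}
```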

          dhruba borthakur added a comment -

          Incorporated Hairong's comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12467019/syncOnClose2.txt
          against trunk revision 1053203.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/49//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/49//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/49//console

          This message is automatically generated.

          Hairong Kuang added a comment -

          +1. The patch looks good.

          A minor comment is that I do not think the unit test is of much use, because the bug occurs when a machine is powered off and that is hard to simulate.

          dhruba borthakur added a comment -

          I just committed this.

          stack added a comment -

          Should we pull this into 1.0.3? Or 1.1.0?

          Tsz Wo Nicholas Sze added a comment -

          Sure, let's backport this to branch-1.

          Tsz Wo Nicholas Sze added a comment -

          syncOnClose2_b-1.txt: for branch-1.

          Suresh Srinivas added a comment -

          Nicholas, I compared the backported patch with the original. It looks good. +1 for the patch.

          We should get this into 1.1.1.

          Tsz Wo Nicholas Sze added a comment -

          Interestingly, TestFileCreation fails in branch-1 (with and without the patch) but not branch-1.1. I will file a JIRA for it.

          Tsz Wo Nicholas Sze added a comment -

          I have committed this to branch-1 and branch-1.1.

          Matt Foley added a comment -

          Closed upon release of 1.1.1.

          Dave Latham added a comment -

          Does anyone have any performance numbers for enabling this? Or, does anyone just have some experience running this on significant workloads in production? (Especially HBase?)


            People

            • Assignee: dhruba borthakur
            • Reporter: dhruba borthakur
            • Votes: 0
            • Watchers: 22
