Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1158

HDFS-457 increases the chances of losing blocks

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 0.21.0
    • Fix Version/s: None
    • Component/s: datanode
    • Labels:
      None

      Description

      Whenever we restart a cluster, there's a chance of losing some blocks if more than three datanodes don't come up.
      HDFS-457 increases this chance by keeping the datanodes up even when

      1. /tmp disk goes read-only
      2. /disk0 that is used for storing PID goes read-only
        and probably more.

      In our environment, /tmp and /disk0 are from the same device.

      When trying to restart a datanode, it would fail with
      1)

      2010-05-15 05:45:45,575 WARN org.mortbay.log: tmpdir
      java.io.IOException: Read-only file system
              at java.io.UnixFileSystem.createFileExclusively(Native Method)
              at java.io.File.checkAndCreate(File.java:1704)
              at java.io.File.createTempFile(File.java:1792)
              at java.io.File.createTempFile(File.java:1828)
              at org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
      

      or
      2)

      hadoop-daemon.sh: line 117: /disk/0/hadoop-datanode....com.out: Read-only file system
      hadoop-daemon.sh: line 118: /disk/0/hadoop-datanode.pid: Read-only file system
      

      I can recover the missing blocks but it takes some time.

      Also, we are losing track of block movements since log directory can also go to read-only but datanode would continue running.

      For 0.21 release, can we revert HDFS-457 or make it configurable?

      1. rev-HDFS-457.patch
        15 kB
        Konstantin Shvachko

        Issue Links

          Activity

          Hide
          Eli Collins added a comment -

          I think HDFS-1161 should be sufficient, apologies for filing what is probably a dupe of your jira.

          Show
          Eli Collins added a comment - I think HDFS-1161 should be sufficient, apologies for filing what is probably a dupe of your jira.
          Hide
          dhruba borthakur added a comment -

          as a person who is closely administering a large cluster, I completely agree with Koji. If we can do something better to handle these disks going read-only, that will be an immense help.

          Show
          dhruba borthakur added a comment - as a person who is closely administering a large cluster, I completely agree with Koji. If we can do something better to handle these disks going read-only, that will be an immense help.
          Hide
          Eli Collins added a comment -

          How about users who want the old behavior use dfs.datanode.min.valid.volumes from HDFS-1161 to configure the minimum to the total number of volumes (and we could make that the default) - which effectively restores the old behavior that the DN shuts itself down on volume failure - and we independently in this jira handle local-readonly file systems for tmp, pid, etc dirs which are independent since they may or may not be located on DN volumes?

          Show
          Eli Collins added a comment - How about users who want the old behavior use dfs.datanode.min.valid.volumes from HDFS-1161 to configure the minimum to the total number of volumes (and we could make that the default) - which effectively restores the old behavior that the DN shuts itself down on volume failure - and we independently in this jira handle local-readonly file systems for tmp, pid, etc dirs which are independent since they may or may not be located on DN volumes?
          Hide
          Koji Noguchi added a comment -

          Lowering priority. As long as HDFS-1161 makes it to 0.21, this is not a huge issue for me.
          We can keep this Jira open for further discussion or close it as duplicate of HDFS-1161.

          In addition to the question of how we should handle /tmp, pid, volume dir errors,
          maybe additional feature for the datanode to decommission when it decides to kill itself?

          Show
          Koji Noguchi added a comment - Lowering priority. As long as HDFS-1161 makes it to 0.21, this is not a huge issue for me. We can keep this Jira open for further discussion or close it as duplicate of HDFS-1161 . In addition to the question of how we should handle /tmp, pid, volume dir errors, maybe additional feature for the datanode to decommission when it decides to kill itself?
          Hide
          Konstantin Shvachko added a comment -

          This patch reverses HDFS-457 for y-0.20.

          Show
          Konstantin Shvachko added a comment - This patch reverses HDFS-457 for y-0.20.
          Hide
          dhruba borthakur added a comment -

          One idea that came up could be a policy that makes the datanode automatically start to decommission itself when it finds a bad disk. This is what we do with our cluster nodes. This had caused our block-loss problems to go away.

          what do people think of this idea? Of course it wil be a configurable policy over-and-above the protection provided via HDFS-1161?

          Show
          dhruba borthakur added a comment - One idea that came up could be a policy that makes the datanode automatically start to decommission itself when it finds a bad disk. This is what we do with our cluster nodes. This had caused our block-loss problems to go away. what do people think of this idea? Of course it wil be a configurable policy over-and-above the protection provided via HDFS-1161 ?
          Hide
          Eli Collins added a comment -

          If we make the DN automatically decommission itself when it finds a bad disk then we might as well revert HDFS-457 no?

          Seems like there are two separate issues here:

          1. DNs don't fail gracefully if the system disk (eg where /tmp, logs etc are stored) fails
          2. A DN that has lost some volumes refuses to come up when it's restarted. eg a DN that is running with 8 disks, and suffers two disks failures will continue to run with 6 disks, however if you restart the DN w/o first updating the configuration to remove the bad disks from dfs.data.dir then the DN will refuse to come up (because it checks for failed volumes on startup) and now the 6 valid disks that were just available are now no longer available. If you hit this scenario with three DNs any blocks that only resides on these DNs is no longer available.

          The solution for the first issue seems straight-forward, DNs should handle system disk failure gracefully by decommissioning themselves.

          For the second issue, it seems like the desired behavior (assuming we want DNs to tolerate disk failures) is for the DN to restart successfully as long as there are no more than dfs.datanode.failed.volumes.tolerated failed disks (otherwise it should decommissions itself). This way when the DN above restarts it continues to offer service for the 6 good disks even though there are 2 bad disks (which were presumably re-replicated when they failed originally).

          Show
          Eli Collins added a comment - If we make the DN automatically decommission itself when it finds a bad disk then we might as well revert HDFS-457 no? Seems like there are two separate issues here: DNs don't fail gracefully if the system disk (eg where /tmp, logs etc are stored) fails A DN that has lost some volumes refuses to come up when it's restarted. eg a DN that is running with 8 disks, and suffers two disks failures will continue to run with 6 disks, however if you restart the DN w/o first updating the configuration to remove the bad disks from dfs.data.dir then the DN will refuse to come up (because it checks for failed volumes on startup) and now the 6 valid disks that were just available are now no longer available. If you hit this scenario with three DNs any blocks that only resides on these DNs is no longer available. The solution for the first issue seems straight-forward, DNs should handle system disk failure gracefully by decommissioning themselves. For the second issue, it seems like the desired behavior (assuming we want DNs to tolerate disk failures) is for the DN to restart successfully as long as there are no more than dfs.datanode.failed.volumes.tolerated failed disks (otherwise it should decommissions itself). This way when the DN above restarts it continues to offer service for the 6 good disks even though there are 2 bad disks (which were presumably re-replicated when they failed originally).
          Hide
          dhruba borthakur added a comment -

          > If we make the DN automatically decommission itself when it finds a bad disk then we might as well revert HDFS-457 no?

          I think there is a major difference. The case when the datanode shuts down when it encounters a disk eror is more likely to result in "missing" blocks compared to the policy of decommissing first before shutting down.

          > The solution for the first issue seems straight-forward, DNs should handle system disk failure gracefully by decommissioning themselves.

          agreed.

          > For the second issue, it seems like the desired behavior (assuming we want DNs to tolerate disk failures) is for the DN to restart successfully

          This sounds good as long as there is a way for the administrator to figure out that a disk has gone bad, so that he/she can schedule a repair.

          Show
          dhruba borthakur added a comment - > If we make the DN automatically decommission itself when it finds a bad disk then we might as well revert HDFS-457 no? I think there is a major difference. The case when the datanode shuts down when it encounters a disk eror is more likely to result in "missing" blocks compared to the policy of decommissing first before shutting down. > The solution for the first issue seems straight-forward, DNs should handle system disk failure gracefully by decommissioning themselves. agreed. > For the second issue, it seems like the desired behavior (assuming we want DNs to tolerate disk failures) is for the DN to restart successfully This sounds good as long as there is a way for the administrator to figure out that a disk has gone bad, so that he/she can schedule a repair.
          Hide
          Allen Wittenauer added a comment -

          Should this be a blocker for 0.21.1 and 0.22 now that 457 is out in the wild with 0.21?

          Show
          Allen Wittenauer added a comment - Should this be a blocker for 0.21.1 and 0.22 now that 457 is out in the wild with 0.21?
          Hide
          Koji Noguchi added a comment -

          Should this be a blocker for 0.21.1 and 0.22 now that 457 is out in the wild with 0.21?

          Allen, sorry for the confusing Jira. After I created this Jira, Eli created HDFS-1161 to take care of my main concern. With HDFS-1161 which also went in 0.21, behavior is same as before. Datanode would kill itself when seeing any single error on disk.

          Taking out 0.21/0.22 target for now.

          Show
          Koji Noguchi added a comment - Should this be a blocker for 0.21.1 and 0.22 now that 457 is out in the wild with 0.21? Allen, sorry for the confusing Jira. After I created this Jira, Eli created HDFS-1161 to take care of my main concern. With HDFS-1161 which also went in 0.21, behavior is same as before. Datanode would kill itself when seeing any single error on disk. Taking out 0.21/0.22 target for now.
          Hide
          Allen Wittenauer added a comment -

          So this should just be closed then, right?

          Show
          Allen Wittenauer added a comment - So this should just be closed then, right?
          Hide
          Koji Noguchi added a comment -

          So this should just be closed then, right?

          I'm ok either way. As I wrote in the previous message,
          "We can keep this Jira open for further discussion or close it as duplicate of HDFS-1161. "

          We can ether change the summary or close this one as a duplicate and open a new Jira for continuing discussion.

          Dhruba, Eli ?

          Show
          Koji Noguchi added a comment - So this should just be closed then, right? I'm ok either way. As I wrote in the previous message, "We can keep this Jira open for further discussion or close it as duplicate of HDFS-1161 . " We can ether change the summary or close this one as a duplicate and open a new Jira for continuing discussion. Dhruba, Eli ?
          Hide
          Eli Collins added a comment -

          Let's leave this open, per the comment above I think there are still two issues we need to address.

          Show
          Eli Collins added a comment - Let's leave this open, per the comment above I think there are still two issues we need to address.
          Hide
          Eli Collins added a comment -

          If we make the DN automatically decommission itself when it finds a bad disk then we might as well revert HDFS-457 no?

          I think there is a major difference. The case when the datanode shuts down when it encounters a disk eror is more likely to result in "missing" blocks compared to the policy of decommissing first before shutting down.

          But if the datanode shuts down when it encounters a disk error then the datanode does not tolerate disk failures (the point for HDFS-457). Or are you saying that we should make the datanode decomission itself rather than shutdown whenever it reaches the threshold of disk failures it is configured to tolerate? I agree, we should do that.

          Show
          Eli Collins added a comment - If we make the DN automatically decommission itself when it finds a bad disk then we might as well revert HDFS-457 no? I think there is a major difference. The case when the datanode shuts down when it encounters a disk eror is more likely to result in "missing" blocks compared to the policy of decommissing first before shutting down. But if the datanode shuts down when it encounters a disk error then the datanode does not tolerate disk failures (the point for HDFS-457 ). Or are you saying that we should make the datanode decomission itself rather than shutdown whenever it reaches the threshold of disk failures it is configured to tolerate? I agree, we should do that.
          Hide
          Owen O'Malley added a comment -

          I would suggest that although you can work around the problem via HDFS-1161, which seems to effectively make HDFS-457 configurable that it would make sense to treat the primary partition as a special case.

          One way to do that would be to modify HDFS-1161 to specify a list of critical volumes instead of just a minimum number. It seems like you do want to fail the DN if the logs or pid directories aren't writable. On the other hand, if two of the "extra" volumes go down it is fine.

          Show
          Owen O'Malley added a comment - I would suggest that although you can work around the problem via HDFS-1161 , which seems to effectively make HDFS-457 configurable that it would make sense to treat the primary partition as a special case. One way to do that would be to modify HDFS-1161 to specify a list of critical volumes instead of just a minimum number. It seems like you do want to fail the DN if the logs or pid directories aren't writable. On the other hand, if two of the "extra" volumes go down it is fine.
          Hide
          Eli Collins added a comment -

          Good suggestion Owen. How about the following?

          • A dn should decommission itself rather than shutdown whenever (a) the configured threshold of disk failures has been reached or (b) when a critical volume (specified in the config, eg the volume(s) that host the logs, pid, tmp etc) has failed. In practice an admin would specify a number of volume failures should be tolerated and specify the root volume as critical.
          • The configured failed.volumes.tolerated should be respected on startup. The datanode should only refuse to startup if more than failed.volumes.tolerated are failed, or if a configured critical volume has failed (which is probably not an issue in practice since dn startup probably fails eg if the root volume has gone readonly).
          Show
          Eli Collins added a comment - Good suggestion Owen. How about the following? A dn should decommission itself rather than shutdown whenever (a) the configured threshold of disk failures has been reached or (b) when a critical volume (specified in the config, eg the volume(s) that host the logs, pid, tmp etc) has failed. In practice an admin would specify a number of volume failures should be tolerated and specify the root volume as critical. The configured failed.volumes.tolerated should be respected on startup. The datanode should only refuse to startup if more than failed.volumes.tolerated are failed, or if a configured critical volume has failed (which is probably not an issue in practice since dn startup probably fails eg if the root volume has gone readonly).
          Hide
          Eli Collins added a comment -

          I've created separate issues for the above suggestions. Linking those in and closing this issue out.

          Show
          Eli Collins added a comment - I've created separate issues for the above suggestions. Linking those in and closing this issue out.
          Hide
          Eli Collins added a comment -

          This issue is covered by HDFS-1161. Filed separate issues for the suggested enhancements.

          Show
          Eli Collins added a comment - This issue is covered by HDFS-1161 . Filed separate issues for the suggested enhancements.

            People

            • Assignee:
              Unassigned
              Reporter:
              Koji Noguchi
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development