Hadoop HDFS / HDFS-1848

Datanodes should shutdown when a critical volume fails

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode
    • Labels: None

      Description

      A DN should shutdown when a critical volume (eg the volume that hosts the OS, logs, pid, tmp dir etc.) fails. The admin should be able to specify which volumes are critical, eg they might specify the volume that lives on the boot disk. A failure in one of these volumes would not be subject to the threshold (HDFS-1161) or result in host decommissioning (HDFS-1847) as the decommissioning process would likely fail.


          Activity

          Bharath Mundlapudi added a comment -

          I am wondering if this is necessary? Typically, the critical volume (eg the volume that hosts the OS, logs, pid, tmp dir, etc.) is RAID-1, and if it goes down we can safely assume the Datanode is down. I am just curious to understand the use case. Please refer to the Disk Fail Inplace Jira.

          https://issues.apache.org/jira/browse/HADOOP-7123

          In our tests with disk failures, we have verified that if the root/critical volume fails, the Datanode can't even start.

          dhruba borthakur added a comment -

          I too am not clear why the datanode process has to watch over "critical" disks. It would be nice if the datanode considers all disks the same.

          Koji Noguchi added a comment -

          > In our tests with disk failures, we have verified that if the root/critical volume fails, the Datanode can't even start.

          That is the problem. Even when these critical volumes go bad, datanodes keep running but fail at restart. With clusters running for months, we can have multiple datanodes refusing to restart, leading to data loss.

          Bharath Mundlapudi added a comment -

          That was the problem earlier, Koji. With the fixes that went in for Disk Fail Inplace, we can restart the datanode with failed disks until the tolerated number of volume failures is reached.

          But when a root or critical partition fails, it's fair to assume the datanode shouldn't be restarted, because that is where the system logs and confs live and the disk is unusable.

          Koji Noguchi added a comment -

          > That was the problem earlier, Koji. With the fixes that went in for Disk Fail Inplace, we can restart the datanode with failed disks until the tolerated number of volume failures is reached.

          Bharath, you're not getting my point. This problem still exists even after the disk fail inplace feature that you're working on. The only reason I didn't raise it internally was that our ops team is going to RAID the critical volumes.

          Eli Collins added a comment -

          > I am wondering if this is necessary? Typically, the critical volume (eg the volume that hosts the OS, logs, pid, tmp dir, etc.) is RAID-1, and if it goes down we can safely assume the Datanode is down.

          I don't think we should require that datanodes use RAID-1. Raiding the boot disk (OS, logs, pids, etc.) on every datanode wastes an extra disk per datanode in the cluster and requires that datanodes have a HW RAID controller or use SW RAID. Moreover, RAID only lowers the probability of this volume failing; we still have to deal with the failure, and as you point out a datanode cannot survive the failure of the boot disk.

          > I too am not clear why the datanode process has to watch over "critical" disks. It would be nice if the datanode considers all disks the same.

          The idea is that the datanode can gracefully handle some types of volume failures but not others. For example, the datanode should be able to survive the failure of a disk that just hosts blocks, but it cannot survive the failure of a volume that resides on the boot disk.

          Therefore, if the volume that resides on the boot disk fails, the datanode should fail-stop and fail-fast (because it cannot tolerate this failure), but if a volume that lives on one of the data disks fails, it should continue operating (or decommission itself if the threshold of volume failures has been reached). If the datanode considers all disks the same, then it doesn't know whether it should fail itself or tolerate the failure. Make sense?

          dhruba borthakur added a comment -

          Thanks Eli/Koji for the explanation. Are you saying that the datanode will watch over the root disk at "/" even though the datanode does not store any HDFS data blocks on the root disk? Or will the datanode watch over the root disk only if there is a data directory configured inside it?

          Eli Collins added a comment -

          The DN volume checking code tests all directories specified by dfs.data.dir. Currently, up to a configurable number of volumes (dfs.datanode.failed.volumes.tolerated) may fail and the DN stays online.

          This jira could have two parts:

          1. Allow an administrator to designate a subset of the dfs.data.dir volumes as critical, so the DN will fail-stop rather than tolerate a failure of one of them (see the sketch below). If an admin puts a data dir on the boot disk, they could use this option to indicate that a failure of that particular dfs.data.dir should not be tolerated. Eg dfs.data.dir might be "/data0, /data1, /data2" and fs.data.dir.critical could be "/data0". So the DN has three data volumes but will only tolerate the failure of /data1 and /data2; if /data0 fails the DN should fail.
          2. The DN should in general fail-stop if the root disk fails (eg if the failure prevents it from writing a tmp file). This is separate from the dfs.data.dir volume checking, as the root disk might not be listed as a dfs.data.dir volume. The mechanism could be the same though: eg a mount could be specified in fs.data.dir.critical that is not a dfs.data.dir, but it would still be checked via the same mechanism.
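          As a rough sketch of how part 1 could hook into the existing volume-failure handling: the property fs.data.dir.critical is only proposed in this jira, and the class and method names below are illustrative assumptions, not existing DataNode code.

          import java.io.File;
          import java.util.HashSet;
          import java.util.Set;

          import org.apache.hadoop.conf.Configuration;

          // Hypothetical policy object; fs.data.dir.critical is the property
          // proposed in this jira and does not exist in released Hadoop.
          public class CriticalVolumePolicy {

            private final Set<File> criticalVolumes = new HashSet<File>();

            public CriticalVolumePolicy(Configuration conf) {
              // Comma-separated list of data dirs whose failure must not be
              // tolerated, eg "/data0". Empty means fall back to the existing
              // dfs.datanode.failed.volumes.tolerated behavior.
              String[] dirs = conf.getStrings("fs.data.dir.critical");
              if (dirs != null) {
                for (String dir : dirs) {
                  criticalVolumes.add(new File(dir.trim()).getAbsoluteFile());
                }
              }
            }

            /** True if losing this volume should fail-stop the datanode. */
            public boolean isCritical(File failedVolume) {
              return criticalVolumes.contains(failedVolume.getAbsoluteFile());
            }

            /**
             * Called from the volume-checking path when a volume fails: fail-stop
             * on a critical volume, otherwise defer to the tolerated-failures
             * threshold (HDFS-1161).
             */
            public void handleVolumeFailure(File failedVolume, int failedSoFar,
                int tolerated) {
              if (isCritical(failedVolume)) {
                throw new RuntimeException("Critical volume failed: " + failedVolume);
              }
              if (failedSoFar > tolerated) {
                throw new RuntimeException("Too many failed volumes: " + failedSoFar);
              }
              // Otherwise keep serving blocks from the remaining volumes.
            }
          }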
          Bharath Mundlapudi added a comment -

          Thanks Eli for explaining the use case. I briefly talked to Koji about this Jira.

          Some more thoughts on this.

          1. If fs.data.dir.critical is not defined, then the implementation should fall back to the existing tolerate-a-volume-failure behavior.

          2. If fs.data.dir.critical is defined, then fail-fast and fail-stop as you described.

          Case 2 you mentioned is interesting too. Today, the datanode is not aware of this case since the root disk may not be part of the dfs.data.dir config.

          I see that the key benefit of having this Jira is fail-fast. Meaning, if any of the critical volume(s) fail, we let the namenode know immediately and the datanode will exit. Re-replication will then be taken care of, and cluster/datanode restarts might see fewer issues with missing blocks.

          W.r.t. case 2 you mentioned, these are the possible failures, right?

          1. Data is stored on the root partition disk, say /root/hadoop (binaries, conf, logs) and /root/data0.
          Failures: /root read-only filesystem or failure, /root/data0 read-only filesystem or failure, complete disk0 failure.

          2. Data NOT stored on the root partition disk: /root (disk1), /data0 (disk2).
          Failures: /root read-only filesystem or failure, /data0 (disk2) read-only filesystem or failure.

          3. Swap partition failure
          How will this be detected?

          I am wondering whether the datanode itself should worry about all of these health issues, or whether a health check script configuration like the TaskTracker's, which would let the datanode know about disk issues, network issues, etc., is the better option.

          Eli Collins added a comment -

          Good points Bharath.

          I think the DN should explicitly check its volumes for health as it does today and either fail-fast or tolerate failures appropriately based on the volume that failed. This may require help from an admin in the form of specifying critical volumes, or maybe we could detect these automatically.

          In general, the DN and TT need to fail-fast when they face unrecoverable failures, eg if you turn off volume checking and make the root disk read-only, the DN and TT should not try to soldier on. Ie some exception handling situations should result in termination of service and, if possible, a clean shutdown.

          dhruba borthakur added a comment -

          Thanks Eli for the explanation. It seems fine to make the datanode check those directories where it can store data blocks, but making it check other directories (basically using the datanode code to implement a poor man's general-purpose disk check, especially for disks that it does not use to store data blocks) seems kind of disturbing to me.

          Suppose I have a machine that is running the tasktracker and not the datanode. Who is going to check the root disk in this case? What if I am running neither the tasktracker nor the datanode, but instead the backup node; then who checks the health of the root disk on that machine?

          I would rather make the datanode only check the validity of the directories where it is configured to store data.

          Eli Collins added a comment - edited

          I agree the datanode should only check the validity of all the directories where it is configured to store data.

          Point #1 is limited to allowing an admin to specify that not all of these configured directories should necessarily be treated equally wrt the policy for tolerating failures. Ie the idea is not to use dfs.data.dir for general datanode health monitoring; there are already tools that monitor disk health, etc.

          Point #2 is that - in general - if the datanode experiences some types of failures (eg those caused by a failed root disk) it should fail-stop.

          Another way to put this is that the datanode should be proactive about checking for failures in its data volumes and reactive about other disk failures (eg of the root disk). Agree?

          dhruba borthakur added a comment -

          > Point #1 is limited to allowing an admin to specify that not all of these configured directories should necessarily be treated

          Fully agree. Makes complete sense to me.

          > Point #2 is if the datanode experiences some types of failures (eg those caused by a failed root disk) it should fail-stop.

          I am on the fence on this one. It seems like a good idea for a cluster administrator to have this feature.

          Koji Noguchi added a comment -

          > It seems like a good idea for a cluster administrator to have this feature.

          Would something like mapreduce's healthChecker work?

          dhruba borthakur added a comment -

          > Would something like mapreduce's healthChecker work

          It could. I am surprised that this problem is not solved for non-hadoop related workloads at your company. If the root disk is bad, most apps would benefit if the machine is power-cycled. You must have some company-wide diagnostics software that runs on every machine (hadoop and non-hadoop) and determines whether the machine is in good health, don't you?

          Bharath Mundlapudi added a comment -

          I think Koji's point is: should we have something like the healthchecker in the Datanode, similar to MapReduce? If so, the Datanode would periodically launch this healthcheck to determine its health with respect to disks, NICs, etc. This was the comment I made earlier. This will help admins. It is not sufficient to just have diagnostic software on every machine; we need a mechanism to communicate this information back to the Datanode, right? This is required to fail-fast and then fail-stop safely. This way, the Datanode can look after the disks it cares about like today, and this external entity will report other diagnostic information back to the Datanode. Agree?
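          As a rough illustration of the periodic, script-based health check being discussed, here is a minimal sketch modeled loosely on the TaskTracker's health check script convention (an output line starting with "ERROR" marks the node unhealthy). The class name, constructor parameters, and shutdown callback are assumptions for this sketch, not an existing DataNode API.

          import java.io.BufferedReader;
          import java.io.InputStreamReader;

          // Illustrative only: periodically run an external health check script
          // and fail-stop the datanode if the script reports an ERROR line.
          public class DataNodeHealthCheck implements Runnable {

            private final String scriptPath;   // eg /usr/local/bin/dn-health.sh (assumed)
            private final long intervalMs;
            private final Runnable failStop;   // action to take on an unrecoverable failure

            public DataNodeHealthCheck(String scriptPath, long intervalMs, Runnable failStop) {
              this.scriptPath = scriptPath;
              this.intervalMs = intervalMs;
              this.failStop = failStop;
            }

            @Override
            public void run() {
              while (!Thread.currentThread().isInterrupted()) {
                try {
                  Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
                  boolean unhealthy = false;
                  try (BufferedReader r = new BufferedReader(
                      new InputStreamReader(p.getInputStream()))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                      // Same convention as the TaskTracker health check script:
                      // a line starting with "ERROR" means the node is unhealthy.
                      if (line.startsWith("ERROR")) {
                        unhealthy = true;
                      }
                    }
                  }
                  p.waitFor();
                  if (unhealthy) {
                    failStop.run();   // fail-stop the datanode
                    return;
                  }
                  Thread.sleep(intervalMs);
                } catch (InterruptedException ie) {
                  Thread.currentThread().interrupt();
                } catch (Exception e) {
                  // A failing or hung script is itself suspicious; a real
                  // implementation would add a timeout and retry policy here.
                }
              }
            }
          }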

          Koji Noguchi added a comment -

          > I am surprised that this problem is not solved for non-hadoop related workloads at your company

          Dhruba, it used to be your company as well.
          Anyways, as I stated in HDFS-457, there are cases where datanodes would not come up when a non-root disk goes into a bad state. Two that I found were the pid dir and the tmp dir.

          Steve Loughran added a comment -

          +1 for more healthchecking, with easy ways to specify what you want to check (presumably a script to exec is the option of choice, or some java class to call)

          Some standard checks for hdds (you see them in ant -diagnostics), sketched in code below:
          - can you write to a dir
          - can you get back what you wrote
          - is the timestamp of the file roughly in sync with your clock (on network drives it may not be)

          If you are aggressive you could try to create a large file and see what happens, though if the health check hangs, something else will need to detect that and report it as a failure.

          Log drives cause problems when they aren't there or are full too.
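          A small sketch of those three standard directory checks, written as plain Java for illustration; the class name and the 60-second clock-skew threshold are assumptions, not something prescribed by this jira.

          import java.io.File;
          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.util.Arrays;

          // Illustrative directory health checks: write a probe file, read it
          // back, and compare its timestamp to the local clock.
          public class DirHealthCheck {

            private static final long MAX_CLOCK_SKEW_MS = 60 * 1000L;  // assumed threshold

            /** Returns true if the directory passes all three checks. */
            public static boolean check(File dir) {
              try {
                Path probe = Files.createTempFile(dir.toPath(), "health", ".tmp");
                try {
                  // 1. Can you write to the dir?
                  byte[] payload = "health-check".getBytes("UTF-8");
                  Files.write(probe, payload);

                  // 2. Can you get back what you wrote?
                  if (!Arrays.equals(payload, Files.readAllBytes(probe))) {
                    return false;
                  }

                  // 3. Is the file timestamp roughly in sync with the local clock?
                  long skew = Math.abs(Files.getLastModifiedTime(probe).toMillis()
                      - System.currentTimeMillis());
                  return skew <= MAX_CLOCK_SKEW_MS;
                } finally {
                  Files.deleteIfExists(probe);
                }
              } catch (IOException e) {
                // Read-only filesystem, missing mount, or full disk all fail the check.
                return false;
              }
            }
          }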

          dhruba borthakur added a comment -

          Hi Koji, I am fine with datanodes doing more health-checking, as long as it is configurable.

          Tony Valderrama added a comment -

          Just as an example, this is how we implemented a feature similar to Eli's suggested part #1 on our internal 0.20-based branch. This patch won't apply to 0.20.3, though, since it doesn't include HDFS-457 et al.


            People

            • Assignee: Unassigned
            • Reporter: Eli Collins
            • Votes: 0
            • Watchers: 11
