Hadoop HDFS / HDFS-138

data node process should not die if one dir goes bad

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When multiple directories are configured for the data node process to use to store blocks, it currently exits when one of them is not writable. Instead, it should either completely ignore that directory or attempt to continue reading and then marking it unusable if reads fail.
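
      As a rough sketch of the first option, the datanode could run each configured directory through Hadoop's DiskChecker utility at startup and keep only the usable ones instead of exiting. The property name dfs.data.dir matches the configuration of that era; the class and method names below are illustrative only, not actual DataNode code.

      import java.io.File;
      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.util.DiskChecker;
      import org.apache.hadoop.util.DiskChecker.DiskErrorException;

      // Illustrative sketch of "ignore the bad directory": check every configured
      // data directory and continue with whichever ones pass the disk check,
      // rather than shutting the datanode down on the first failure.
      public class UsableDataDirs {
        public static List<File> filter(Configuration conf) {
          List<File> good = new ArrayList<File>();
          String[] dirs = conf.getStrings("dfs.data.dir");
          if (dirs == null) {
            return good;
          }
          for (String dir : dirs) {
            File d = new File(dir);
            try {
              DiskChecker.checkDir(d);   // throws if the dir is missing or not readable/writable
              good.add(d);
            } catch (DiskErrorException e) {
              // Proposed behavior: log and skip (or mark unusable) instead of exiting.
              System.err.println("Ignoring bad data dir " + dir + ": " + e.getMessage());
            }
          }
          return good;
        }
      }

      A datanode left with an empty list would still have to exit, since it has nowhere to store blocks.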

        Issue Links

          Activity

          dhruba borthakur added a comment -

           If you really want to do this, then we will need better reporting to the administrator so that he/she can tell that this datanode needs tending to.

           In the existing scheme of things, if a partition becomes read-only, the datanode shuts down and the administrator can see it listed as "dead" in the HDFS UI.

          Konstantin Shvachko added a comment -

          Does it make more sense for a node with a bad drive to go to decommissioning mode?
          The web UI can be modified to have 3 sections (live, dead and decommissioning nodes) rather than just 2.
           And the node can still be used for replication and read-only purposes.

          Allen Wittenauer added a comment -

           Dhruba makes an excellent point. The admin definitely needs to know more about the status of the data nodes in this sort of design.

          For smaller clusters, it seems like a bad thing to decommission an entire node when you only have one bad disk. It would be better for the data node to just start decomm'ing that dir and/or stop giving block reports for that dir.

          [For equal sized disks, RAID may be an alternative. But if you have un-equal sized disks, RAID isn't an option, as you'll be throwing storage away.]

          dhruba borthakur added a comment -

          @Konstantin: At the end of a decommission process, a datanode actually shuts down, doesn't it? If so, then using decommissioning might not work in the current scenario.

           @Allen: HDFS currently assumes that the entire disk space on the cluster is readable/writable. This keeps the accounting of used space, free space, etc. pretty simple. If there is available disk space, then it can be used to store new files in HDFS. The entire space available in the data directories is free disk space. If we allow the datanode to remember read-only disks, then there will be some changes in accounting. Similarly, there will be some changes needed for reporting and alerting administrators. This means that the "administration" of the cluster becomes slightly more complex. The advantage is that the last bit of disk space gets to be used.

          So, my question is: are you seeing this to be a real problem on production clusters?
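
           As a rough illustration of the accounting change described above (the Volume interface and method names here are hypothetical, not the actual FSDataset code), the datanode would have to report space available for new blocks separately from total capacity once read-only disks are allowed to stick around:

           import java.util.List;

           // Hypothetical sketch: if read-only volumes are kept, "space available
           // for new blocks" and "total capacity" are no longer the same number.
           interface Volume {
             boolean isWritable();
             long getCapacity();
             long getRemaining();
           }

           class VolumeAccounting {
             private final List<Volume> volumes;

             VolumeAccounting(List<Volume> volumes) {
               this.volumes = volumes;
             }

             /** Space the datanode can offer for new blocks: writable volumes only. */
             long writableRemaining() {
               long sum = 0;
               for (Volume v : volumes) {
                 if (v.isWritable()) {
                   sum += v.getRemaining();
                 }
               }
               return sum;
             }

             /** Total capacity, including read-only volumes that can still serve reads. */
             long totalCapacity() {
               long sum = 0;
               for (Volume v : volumes) {
                 sum += v.getCapacity();
               }
               return sum;
             }
           }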

          Allen Wittenauer added a comment -

           Yes, we're seeing this as a real problem. We have file systems go read-only all the time.

           But if it makes things easier, then go for the first of the two options I mentioned in the description: if a directory can't be written to, drop it completely.

          dhruba borthakur added a comment -

           I think a solution to this problem is needed, especially since you are seeing it occur quite often. Is it caused by bad disks or buggy ext3 software? Any ideas on whether XFS on Linux avoids this problem of disks going read-only?

          Runping Qi added a comment -

           I think the map/reduce framework has to handle similar problems.
           If a drive on a machine goes bad, the tasks on that machine tend to become stragglers.
           The overall performance will be impacted.
           Overall, Hadoop is much better at handling total failure of a node than partial failure, so I think it is better to decommission a node on a drive failure.
           The admin may later choose to remove the drive from the configuration file and restart the node, if he does not want to take the node away for repair.

          dhruba borthakur added a comment -

           Handling total failures is easier than handling partial failures. I agree with Runping on this one.

          Allen Wittenauer added a comment -

          I specifically targeted the HDFS framework in this bug primarily because the MR framework issues are actually worse. There is a very good chance that if you have multiple disks, you have swap spread across those disks. In the case of drive failure, this means you lose a chunk of swap. Loss of swap==less memory for streaming jobs==job failure in many, many instances. So let's not get distracted with the issues around MR, job failure, job speed, etc.

           What I'm seeing is that at any given time we have 10-20% of our nodes down. The vast majority have a single failed disk. This means we're leaving capacity on the floor, waiting for a drive replacement. Why can't these machines just stay up, providing blocks and providing space on the good drives? For large clusters this might be a minor inconvenience, but for small clusters it could be deadly.

           The current fix is done with wetware, a source of additional strain on traditionally overloaded operations teams. Random failure times vs. letting the ops team decide when a data node goes down? This seems like a no-brainer from a practicality perspective. Yes, this is clearly more difficult than just killing the node completely. But over the long haul, it is going to be cheaper in human labor to fix this in Hadoop than to throw more admins at it.

          Robert Chansler added a comment -

          HDFS-457 is a close approximation.


             People

             • Assignee: Unassigned
             • Reporter: Allen Wittenauer
             • Votes: 1
             • Watchers: 4
