Hadoop HDFS / HDFS-378

DFSClient should track failures by datanode across all streams

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Original description: Rather than tracking the total number of times DFSInputStream failed to talk to a datanode for a particular block, such failures and the list of datanodes involved should be scoped to individual blocks. In particular, the "deadnode" list should be a map of blocks to a list of failed nodes, the latter reset and the nodes retried per the existing semantics.

      [see comment below for new thinking, left this comment to give context to discussion]


          Activity

          Chris Douglas added a comment -

          The fix in HADOOP-1911 stopped the infinite loop, but did not address the possibility of false negatives that this issue should resolve.

          Todd Lipcon added a comment -

          Been thinking about this a bit tonight. It seems to me we have the following classes of errors to deal with:

          1. A DN has died but the NN does not yet know about it. Thus, the client fails entirely to connect to the DN. Ideally, the client shouldn't reconnect to this for quite some time.
          2. A DN is heavily loaded (above its max.xcievers value) and thus the client is rejected. But, we'd like to retry it reasonably often, and ideally don't want to fail a read completely, even if all replicas are in this state for a short period of time.
          3. A particular replica is corrupt or missing on a DN. Here, we just want to avoid reading this particular block from this DN until the block has been rereplicated from a healthy copy.

          Case #3 above is actually handled implicitly with no long-term/inter-operation tracking on the client, since the client will report the bad block to the NN immediately upon discovering it. Then, on the next getBlockLocations call for the same block, it will automatically be filtered out of the LocatedBlocks result by the NN. When it's been fixed up, the new valid location will end up in the LocatedBlocks result (whether on the same DN or a different one).
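          To make that flow concrete, here is a rough sketch of the client side of case #3. It is not the actual DFSInputStream code, and the CorruptReplicaFlow class and handleChecksumFailure method are hypothetical, but reportBadBlocks and getBlockLocations are the existing ClientProtocol calls:

{code:java}
import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

/** Hypothetical helper illustrating the report-and-refetch flow for a corrupt replica. */
class CorruptReplicaFlow {
  private final ClientProtocol namenode;
  private final String src;  // path of the file being read

  CorruptReplicaFlow(ClientProtocol namenode, String src) {
    this.namenode = namenode;
    this.src = src;
  }

  /** On a checksum failure, report the bad replica and refresh the block locations. */
  LocatedBlocks handleChecksumFailure(LocatedBlock bad, long offset, long length)
      throws IOException {
    // The NameNode marks this replica corrupt and schedules re-replication.
    namenode.reportBadBlocks(new LocatedBlock[] { bad });
    // A fresh lookup no longer returns the corrupt replica, so no client-side
    // per-block state is needed.
    return namenode.getBlockLocations(src, offset, length);
  }
}
{code}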

          Given this, I disagree with the Description of this issue - I don't think the client needs to track failures by block, just by datanode, as long as checksum failures are handled differently than connection or timeout failures.

          The remaining question is how to handle both case 1 and case 2 above in a convenient manner. Here's one idea:

          Whenever the client fails to read from a datanode, the timestamp of the failure is recorded in a map keyed by node. When a block is to be read, the list of locations is sorted by ascending timestamp of last failure - thus the nodes that have had problems least recently are retried first. Any node whose last failure is more than some threshold in the past (e.g. 5 minutes) is considered to have never failed and is removed from the map. Any node that has no recorded failure info should be prioritized above any node that does have failure info.

          This should be fairly simple to implement without any protocol changes, and also easy to reason about. The map would ideally be DFSClient-wide so applications that use a lot of separate InputStreams won't use a lot of extra memory, and can share their view of the DN health.
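          A minimal sketch of such a client-wide tracker, assuming datanodes are keyed by a simple host:port string (the DatanodeFailureTracker class and its method names are hypothetical, not existing DFSClient code):

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical client-wide tracker of recent per-datanode read failures. */
public class DatanodeFailureTracker {
  // Assumed expiry window; the proposal above suggests something like 5 minutes.
  private static final long EXPIRY_MS = 5 * 60 * 1000;

  // Keyed by datanode (e.g. "host:port"); value is the wall-clock time of the
  // most recent failure. Shared by all input streams of one client.
  private final ConcurrentHashMap<String, Long> lastFailure = new ConcurrentHashMap<>();

  /** Record that a read from this datanode just failed. */
  public void recordFailure(String datanode) {
    lastFailure.put(datanode, System.currentTimeMillis());
  }

  /**
   * Order candidate replica locations for a block: nodes with no recorded
   * failure come first, then nodes whose last failure is oldest. Entries
   * older than EXPIRY_MS are dropped and treated as never having failed.
   */
  public List<String> orderLocations(List<String> locations) {
    final long now = System.currentTimeMillis();
    lastFailure.values().removeIf(t -> now - t > EXPIRY_MS);

    List<String> ordered = new ArrayList<>(locations);
    ordered.sort(Comparator.comparingLong(
        (String dn) -> lastFailure.getOrDefault(dn, Long.MIN_VALUE)));
    return ordered;
  }
}
{code}

          In this sketch, DFSInputStream would call recordFailure() after a connect or timeout failure and orderLocations() before picking a replica, while checksum failures would keep going through the existing reportBadBlocks path.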

          One possible improvement on the above is to use datanode heartbeat times to distinguish between case 1 and case 2. Specifically, a "relativeLastHeartbeat" field could be added to LocatedBlocks for each datanode. The client can then use this information to remove failure info for any DN whose failures were recorded before the last heartbeat. Thus, it will retry heavily loaded nodes once per heartbeat interval, but won't retry nodes that have actually failed. The downside is that this would require a protocol change, and it would be harder to reason about for cases like network partitions where a DN is heartbeating fine but some set of clients can't connect to it.
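          On top of the tracker sketch above, and assuming a hypothetical relativeLastHeartbeat value (milliseconds since the DN's last heartbeat, as reported by the NN) were available per location, the pruning might look roughly like this:

{code:java}
// Would live inside the DatanodeFailureTracker sketch above. Drops a node's
// failure record if the failure happened before the datanode's most recent
// heartbeat, so loaded-but-alive nodes get retried once per heartbeat
// interval while truly dead nodes stay deprioritized.
public void pruneByHeartbeat(String datanode, long relativeLastHeartbeatMs) {
  long lastHeartbeat = System.currentTimeMillis() - relativeLastHeartbeatMs;
  Long failedAt = lastFailure.get(datanode);
  if (failedAt != null && failedAt < lastHeartbeat) {
    // remove(key, value) so a newer concurrent failure is not lost
    lastFailure.remove(datanode, failedAt);
  }
}
{code}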

          Looking forward to hearing people's thoughts.

          stack added a comment -

          Tracking failures by DN - rather than by block, or as is currently done, without regard for DN or block - makes sense given the exposition above. Keeping track by DN rather than by block should also be a good deal easier. The map of DNs to past failures sounds great, especially the bit where DNs that are not in the failure map are prioritized. I like it.

          The 'improvement' sounds grand too but something to do later?


            People

            • Assignee: Unassigned
            • Reporter: Chris Douglas
            • Votes: 1
            • Watchers: 9
