Hadoop Map/Reduce / MAPREDUCE-1800

using map output fetch failures to blacklist nodes is problematic

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      If a mapper and a reducer cannot communicate, then either party could be at fault. The current Hadoop protocol allows reducers to declare the node running the mapper as being at fault. When a sufficient number of reducers do so, the map node can be blacklisted.

      In cases where networking problems cause substantial degradation in communication across sets of nodes, a large number of nodes can become blacklisted as a result of this protocol. The blacklisting is often wrong (reducers on the smaller side of the network partition can collectively cause nodes on the larger side of the partition to be blacklisted) and counterproductive (rerunning maps puts further load on the already maxed-out network links).

      We should revisit how we can better identify nodes with genuine network problems (and what role, if any, map-output fetch failures have in this).
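
      To make the protocol concrete, here is a minimal sketch of the kind of threshold-based counting described above. The class, method names, and threshold are illustrative only, not the actual JobTracker code.

      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.Map;
      import java.util.Set;

      // Hypothetical sketch of reducer-reported fetch-failure counting.
      // Names and the threshold are illustrative, not the real implementation.
      public class FetchFailureTracker {
          // map host -> set of reducer hosts that reported failed fetches from it
          private final Map<String, Set<String>> reportsByMapHost = new HashMap<>();
          private final int blacklistThreshold;

          public FetchFailureTracker(int blacklistThreshold) {
              this.blacklistThreshold = blacklistThreshold;
          }

          /** A reducer on reducerHost failed to fetch map output from mapHost. */
          public boolean shouldBlacklist(String mapHost, String reducerHost) {
              reportsByMapHost
                  .computeIfAbsent(mapHost, h -> new HashSet<>())
                  .add(reducerHost);
              // Once enough distinct reducers blame the same map host, it is
              // blacklisted, even if the reducers' side of the network is the
              // real problem, which is exactly the failure mode described above.
              return reportsByMapHost.get(mapHost).size() >= blacklistThreshold;
          }
      }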

        Issue Links

          Activity

          tlipcon Todd Lipcon added a comment -

          Hey Joydeep. Do you often have cases where sets of TT nodes can't talk to each other but both sides can still talk to the JT? This is interesting, as it seems like an unusual network architecture.

          jsensarma Joydeep Sen Sarma added a comment -

          If there is a total network partition, then we don't have a problem: either the cluster will fail outright (say the JT and NN end up on different sides of the partition), or one partition (the one that has the JT/NN) will exclude nodes from the other. (I say we don't have a problem in the sense that the response of Hadoop to such an event is more or less correct.)

          The problem is that we have had occurrences of slow networks that are not quite partitioned. For example, the uplink from one rack switch to the core switch can be flaky/degraded. In this case, control traffic from the JT to the TTs may be going through, but data traffic from mappers and reducers on the degraded racks can be really hurt. If there are problems in the core switch itself (it's underprovisioned), then the whole cluster is having network problems. The description applies to such scenarios.

          In such a case, the appropriate response of the software should be, at worst, degraded performance (in keeping with the degraded nature of the underlying hardware) or, at best, correctly identifying the slow node(s) and not using them, or using them less (this would apply to the flaky rack uplink scenario). The current response of Hadoop is neither. It makes a bad situation worse by misassigning blame (when map nodes on good racks are blamed by a sufficiently large number of reducers running on bad racks). We potentially lose nodes from good racks, and the resultant retry of tasks puts further stress on the strained network resource.

          A couple of things seem desirable:
          1. For enterprise data center environments that (may) have a high degree of control and monitoring around their networking elements: the ability to selectively turn off the functionality in Hadoop that tries to detect and correct for network problems. Diagnostics stand a much better chance of catching/identifying networking problems and fixing them.
          2. In environments with less control (say Amazon EC2, or Hadoop running on a bunch of PCs across a company) that are more akin to a p2p network: Hadoop's network fault diagnosis algorithms need improvement. A comparison to BitTorrent is fair - over there every node advertises its upload/download throughput, and a node can come across as slow only in comparison to the collective stats published by all peers (and not just based on communication with a small set of other peers).
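
          A minimal sketch of the "compare against collective stats" idea in point 2; the class, method, and cutoff fraction are made up for illustration and are not an existing Hadoop API:

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;
          import java.util.Map;

          // Illustrative sketch: a node is only considered slow relative to the
          // throughput advertised by all peers, not on the word of a few reducers.
          public class RelativeSlownessDetector {

              /** Returns hosts whose advertised throughput is far below the cluster median. */
              public static List<String> findSlowHosts(Map<String, Double> throughputByHost,
                                                       double fractionOfMedian) {
                  List<Double> all = new ArrayList<>(throughputByHost.values());
                  Collections.sort(all);
                  double median = all.get(all.size() / 2);

                  List<String> slow = new ArrayList<>();
                  for (Map.Entry<String, Double> e : throughputByHost.entrySet()) {
                      // Flag a host only if it is well below what the collective reports.
                      if (e.getValue() < median * fractionOfMedian) {
                          slow.add(e.getKey());
                      }
                  }
                  return slow;
              }
          }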

          tlipcon Todd Lipcon added a comment -

          Hey Joydeep. Thanks for the further explanation - I agree we could do better here. There's an old JIRA where we threw around some ideas similar to this maybe last August or so, but I can't seem to find it at the moment. Anyone remember the one I mean?

          tlipcon Todd Lipcon added a comment -

          Found the one I was thinking of - MAPREDUCE-562. It's a bit different, but worth referring back to that conversation.

          acmurthy Arun C Murthy added a comment -

          FWIW the current heuristics protect reduces against the common case of a single bad node (the one on which the map ran), and work reasonably well.

          What I'm reading here is that we need better overall metrics/monitoring of the cluster and enhancements to the masters (JobTracker/NameNode) to take advantage of the metrics/monitoring stats. Is that reasonable?

          jsensarma Joydeep Sen Sarma added a comment -

          The problem is that the current heuristics also cause bad behavior when uplinks/core switches degrade.

          I agree that the case of a single node that is not able to send map outputs is something that Hadoop should detect/correct automatically - but I don't think the current heuristic (by itself) is a good one, because of the previous point.

          I don't have good alternative solutions/proposals. A few thoughts pop to mind:

          • Separate the blacklisting of TTs due to map/reduce task failures from blacklisting due to map-output fetch failures. The thresholds and policies required seem different.
          • If the scope of the fault is NIC/port/process/OS problems affecting a 'single' node, then we should only take into account map-fetch failures that happen within the same rack (i.e. assign blame to a TT only if other TTs within the same rack cannot communicate with it).
          • Blame should be laid by a multitude of different hosts. It's no good if 4 reducers on TT1 cannot get map outputs from TT2 and this results in the blacklisting of TT2. It's possible that TT1 itself has a bad port/NIC.

          (Just thinking aloud; I don't have a careful understanding of the code beyond what's been relayed to me by others.)
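
          As a rough illustration of the distinct-host/same-rack ideas above (the rack map, names, and threshold are hypothetical, not Hadoop code):

          import java.util.HashMap;
          import java.util.HashSet;
          import java.util.Map;
          import java.util.Set;

          // Sketch: blame must come from several distinct hosts, and only from
          // hosts in the same rack as the accused TT, so a degraded uplink on
          // the accuser's rack cannot blacklist healthy remote nodes.
          public class RackScopedBlame {
              private final Map<String, String> rackByHost;            // host -> rack
              private final Map<String, Set<String>> accusersByHost = new HashMap<>();
              private final int minDistinctAccusers;

              public RackScopedBlame(Map<String, String> rackByHost, int minDistinctAccusers) {
                  this.rackByHost = rackByHost;
                  this.minDistinctAccusers = minDistinctAccusers;
              }

              /** reducerHost could not fetch from mapHost; returns true if mapHost looks faulty. */
              public boolean accuse(String mapHost, String reducerHost) {
                  if (!rackByHost.getOrDefault(reducerHost, "?").equals(rackByHost.get(mapHost))) {
                      return false;   // ignore cross-rack accusations
                  }
                  Set<String> accusers =
                      accusersByHost.computeIfAbsent(mapHost, h -> new HashSet<>());
                  accusers.add(reducerHost);
                  return accusers.size() >= minDistinctAccusers;
              }
          }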

          rvadali Ramkumar Vadali added a comment -

          From my understanding, a map output fetch is an HTTP GET. I agree that TCP-level network errors are not good indicators to use for blacklisting, since it is not possible to distinguish between a server-side error and a network error. But HTTP-level errors, especially HTTP 5xx errors (used for server-side errors), should be used for blacklisting. Disk failures that prevent the HTTP server from reading a file would fall in this category. It is possible that such errors could be detected via map failures, but an HTTP 5xx error is a fairly reliable indicator of error.

          So in my opinion we should retain the blacklisting, but make it smarter to use HTTP-level error information.
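
          A minimal sketch of that distinction, assuming a plain HttpURLConnection fetch of the map output URL; the classifier and its names are illustrative, not the actual shuffle code:

          import java.io.IOException;
          import java.net.HttpURLConnection;
          import java.net.URL;

          // Only failures where the server actually answered with a 5xx status are
          // treated as evidence against the serving node; connection-level (TCP)
          // failures stay ambiguous.
          public class FetchFailureClassifier {

              enum Verdict { SERVER_FAULT, AMBIGUOUS_NETWORK, OK }

              public static Verdict classify(URL mapOutputUrl) {
                  try {
                      HttpURLConnection conn = (HttpURLConnection) mapOutputUrl.openConnection();
                      conn.setConnectTimeout(30_000);
                      conn.setReadTimeout(30_000);
                      int status = conn.getResponseCode();
                      if (status >= 500) {
                          return Verdict.SERVER_FAULT;   // e.g. disk failure behind the HTTP server
                      }
                      return Verdict.OK;
                  } catch (IOException e) {
                      // TCP-level error: could be either side, or the network in between.
                      return Verdict.AMBIGUOUS_NETWORK;
                  }
              }
          }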

          jsensarma Joydeep Sen Sarma added a comment -

          That makes a lot of sense. I think most of the failures we see are TCP errors - so yeah, using just HTTP application-level error codes would help. Note that a NIC/port failure will not produce HTTP 5xx errors, so in order to cover that case (which may be common as well) we would have to incorporate network failures into fault diagnosis as well.

          acmurthy Arun C Murthy added a comment -

          "So in my opinion we should retain the blacklisting, but make it smarter to use HTTP-level error information."

          +1

          That is precisely what I had in mind when I said: "better overall metrics/monitoring of the cluster and enhancements to the masters (JobTracker/NameNode) to take advantage of the metrics/monitoring stats".

          jsensarma Joydeep Sen Sarma added a comment -

          How would we detect a slow port/NIC (something that map-fetch failures at the network level currently end up catching)?

          jsensarma Joydeep Sen Sarma added a comment -

          BTW, we don't need distributed failure detection to cover the case of application errors while fetching map outputs. If the map-side task tracker encounters enough failures while serving map outputs, it can either commit suicide or report this fact directly to the JT (instead of relying on the reducer to do so). In that sense, such application errors are no different from errors while trying to execute map/reduce tasks.

          It seems that the only non-trivial cases that the reducer needs to report are network error cases, which are inherently symmetric in nature. The onus then shifts to the JT to infer which party is to blame (if any) by looking at the collective set of errors being reported in the system.
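
          A small sketch of the self-reporting idea, assuming a hypothetical failure counter on the map-side TT; reportUnhealthyToJobTracker() is a placeholder, not a real Hadoop API:

          import java.util.concurrent.atomic.AtomicInteger;

          // The map-side TaskTracker counts its own failures while serving map
          // output and reports itself to the JobTracker once a threshold is
          // crossed, instead of waiting for reducers to assign blame.
          public class ServeFailureSelfReport {
              private final AtomicInteger localServeFailures = new AtomicInteger();
              private final int threshold;

              public ServeFailureSelfReport(int threshold) {
                  this.threshold = threshold;
              }

              /** Called whenever this TT fails to read/serve a local map output file. */
              public void onLocalServeFailure(Exception cause) {
                  if (localServeFailures.incrementAndGet() >= threshold) {
                      reportUnhealthyToJobTracker(cause);
                  }
              }

              private void reportUnhealthyToJobTracker(Exception cause) {
                  // Placeholder: in a real system this would go over the TT -> JT heartbeat.
                  System.err.println("Reporting self as unhealthy: " + cause);
              }
          }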


            People

            • Assignee: Unassigned
            • Reporter: jsensarma Joydeep Sen Sarma
            • Votes: 0
            • Watchers: 12
