Hadoop Map/Reduce
MAPREDUCE-211

Provide a node health check script and run it periodically to check the node health status

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Provides ability to run a health check script on the tasktracker nodes and blacklist nodes if they are unhealthy.

      Description

      Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and, if there are any errors, it should blacklist the node. This will be really helpful when we run static mapred clusters. Otherwise we may have to run some scripts/daemons periodically to find the node status and take it offline manually.

      1. active.png
        26 kB
        Sreekanth Ramakrishnan
      2. blacklist1.png
        40 kB
        Sreekanth Ramakrishnan
      3. blacklist2.png
        49 kB
        Sreekanth Ramakrishnan
      4. cluster_setup.pdf
        54 kB
        Sreekanth Ramakrishnan
      5. hadoop-5478-1.patch
        22 kB
        Sreekanth Ramakrishnan
      6. hadoop-5478-2.patch
        27 kB
        Sreekanth Ramakrishnan
      7. hadoop-5478-3.patch
        39 kB
        Sreekanth Ramakrishnan
      8. hadoop-5478-4.patch
        22 kB
        Sreekanth Ramakrishnan
      9. hadoop-5478-5.patch
        39 kB
        Sreekanth Ramakrishnan
      10. hadoop-5478-6.patch
        63 kB
        Sreekanth Ramakrishnan
      11. mapred-211-common-3.patch
        9 kB
        Sreekanth Ramakrishnan
      12. mapred-211-core-1.patch
        9 kB
        Sreekanth Ramakrishnan
      13. mapred-211-internal.patch
        56 kB
        Sreekanth Ramakrishnan
      14. mapred-211-mapred-1.patch
        60 kB
        Sreekanth Ramakrishnan
      15. mapred-211-mapred-2.patch
        60 kB
        Sreekanth Ramakrishnan
      16. mapred-211-mapred-3.patch
        50 kB
        Sreekanth Ramakrishnan
      17. mapred-211-mapred-4.patch
        52 kB
        Sreekanth Ramakrishnan
      18. mapred-211-mapred-5.patch
        50 kB
        Sreekanth Ramakrishnan
      19. mapred-211-mapred-7.patch
        55 kB
        Sreekanth Ramakrishnan
      20. mapred-211-mapred-8.patch
        55 kB
        Sreekanth Ramakrishnan
      21. mapred-211-mapred-9.patch
        56 kB
        Sreekanth Ramakrishnan
      22. MAPREDUCE-211-forrest.patch
        3 kB
        Hemanth Yamijala

          Activity

          Richard Theige added a comment -

          For consideration:

          • check_network_speed
          • check_filesystems
            o check for read-only filesystems
          • check_ylock
            o required, if not installed some jobs which are using ycore libs will fail
          • check_symlinks
            o checks for hadoop versions
          • check_diskspace
            o checks if there is sufficient mapred tmp space
            o TT should handle this
          Allen Wittenauer added a comment -

          sigh

          ylock is internal yahoo junk.

          The real solution here is that there needs to be a script call-out to provide user-defined functionality. Specific checks should not come out of the box by default.

          Allen Wittenauer added a comment -

          We were chatting a bit about this at staff today. One of the topics that came up was what should happen if the TaskTracker is found to be in a bad state. The general agreement was that it should be blacklisted with the option of coming back to life. This helps prevent transient problems from becoming permanent problems.

          Hemanth Yamijala added a comment -

          We might start working on this soon. I thought it might be a good idea to share current thinking about the specs for this feature, and start a discussion.

          In a brief discussion with the team, Eric and Owen, we came up with the following points:

          • Provide an ability to the administrator to give a path to a script file that will be periodically run on the tasktracker. The interval of running can be configured.
          • The tasktracker would run this in a separate thread and look for the exit code. If the script exits with a non-zero code, this will be reported to the JobTracker.
          • Any output from the script (upon error) will be logged and if possible, also displayed on the web UI of the tasktracker.
          • The jobtracker will blacklist this node when such a condition is reported.
          • It will be a good idea to display the 'unhealthy' nodes on the UI.
          • The tasktracker will need to continue to run the script so that if the condition is corrected, it will be reported again to the jobtracker for becoming available.
          • A point came up about whether this mechanism can lead to the state of the node toggling. Maybe we can do some hysteresis, but as a follow-up to this jira.
          • However, it may also be a good idea to show some stats, like how often this node was blacklisted in the last 24 hours, and the current status, by going to the tasktracker page on the web UI. This might help us decide if it's worthwhile introducing hysteresis.

          Does this sound good ? Any other thoughts ?
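
          To make the proposed loop concrete, below is a minimal TT-side sketch of the spec above: run a configured script at a fixed interval, treat a non-zero exit code as unhealthy, and keep the last output for reporting to the JobTracker. All class and member names here are hypothetical, not taken from the attached patches.

              import java.io.BufferedReader;
              import java.io.InputStreamReader;

              public class NodeHealthMonitorSketch implements Runnable {
                private final String scriptPath;   // administrator-supplied script
                private final long intervalMs;     // configurable check interval
                private volatile boolean healthy = true;
                private volatile String lastReport = "";

                public NodeHealthMonitorSketch(String scriptPath, long intervalMs) {
                  this.scriptPath = scriptPath;
                  this.intervalMs = intervalMs;
                }

                public boolean isHealthy() { return healthy; }       // read by the heartbeat code
                public String getLastReport() { return lastReport; }

                @Override
                public void run() {
                  while (!Thread.currentThread().isInterrupted()) {
                    try {
                      Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
                      StringBuilder out = new StringBuilder();
                      try (BufferedReader r =
                          new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                        String line;
                        while ((line = r.readLine()) != null) {
                          out.append(line).append('\n');
                        }
                      }
                      int exit = p.waitFor();
                      healthy = (exit == 0);        // per the spec: non-zero exit means unhealthy
                      lastReport = out.toString();  // logged and sent to the JobTracker on error
                      Thread.sleep(intervalMs);
                    } catch (InterruptedException ie) {
                      Thread.currentThread().interrupt();
                    } catch (Exception e) {
                      healthy = false;              // failing to run the script also counts as unhealthy
                      lastReport = "ERROR: could not run health script: " + e;
                    }
                  }
                }
              }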

          steve_l added a comment -

          This fits in well with the ping/liveness stuff I've been doing. I already do this with things that GET the various JSP pages of a node, checking the health of the jetty endpoints. When those pages return a non-200 code, I return an error that includes the entire HTTP page sent back, as that is often useful.

          1. It may be handy to have this stuff independent of the TT itself, so you can run a node-health checker on anything, and even if the TT refuses to play, you could do some checking of the node.
          2. Also, could it be a bit of JavaScript instead of a shell script?
          3. A scenario to worry about is what if something bad happens (e.g. a bit of NFS goes away) that causes all health checks in a big cluster to fail simultaneously. Would this overload the JT?
          4. Incidentally, I could imagine some scripts being slow, so I wouldn't have my ping code run the .sh every call, but instead run it on a regular frequency; a ping() would return the latest results.
          steve_l added a comment -

          Some related issues: HADOOP-3323, HADOOP-3767, HADOOP-3893, HADOOP-5622

          Allen Wittenauer added a comment -

          Torque specifically looks for a line that begins with ERROR on stdout, but reports the whole line in the node status. So running pbsnodes will show the full status message and provides an easy way to audit all nodes on a given Torque server. We really need the equivalent of dfsadmin -report for the JobTracker to provide this same level of output.

          Additionally, Torque ignores the exit status. In the vast majority of cases, the node is going to be good. So the approach they take is that if a script has a syntax error (and would therefore have a 'fail' as an exit code), the node should be considered good anyway.

          Hemanth Yamijala added a comment -

          this fits in well with the ping/liveness stuff I've been doing

          This coupled with your comment on ping() returning the latest results seems to indicate that we have a thread that periodically executes and stores the results. In that sense, maybe we could build this solution now, and when HADOOP-3628 is committed to trunk, we could integrate the solution and results to be returned as part of ping(). Does that make sense ?

          It may be handy to have this stuff independent of the TT itself, so you can run a node-health checker on anything, and even if the TT refuses to play, you could do some checking of the node.

          The health check script itself is definitely external and could be anything. All the TT would provide is the ability to run it periodically. So, I can imagine this being run standalone, or integrated with another daemon that provides a similar interface.

          Also, could it be a bit of JavaScript instead of a shell script?

          Umm. Can we execute this from the TT directly ? AFAIK, this is not possible, right ? As of now, there is no plan to support anything other than a shell script.

          A scenario to worry about is what if something bad happens (e.g. a bit of NFS goes away) that causes all health checks in a big cluster to fail simultaneously. Would this overload the JT?

          Since the plan is to send the information using the heartbeats itself, handling the load of requests should not be a problem. I am not sure how costly blacklist processing itself is on the JT, but hopefully not bad. We'll keep this in mind though.

          Hemanth Yamijala added a comment -

          We really need the equivalent of dfsadmin -report for the JobTracker to provide this same level of output.

          Allen, I had command line reporting in mind when writing the specs. Worst case, if the effort seems large to cover we will do this as a follow-up JIRA, restricting the information to the web UI for now.

          Additionally, torque ignores the exit status

          If this is acceptable to all, we can do the same as well.

          steve_l added a comment -

          Hemanth,

          I could certainly merge this in with HADOOP-3628; I'm busy dealing with svn merge issues right now, and don't want this held up. I think it would be handy for me if we could run this on startup and then have ping query the latest state when checked.

          >> Also, could it be a bit of JavaScript instead of a shell script?

          >Umm. Can we execute this from the TT directly ? AFAIK, this is not possible, right ? As of now, there is no plan to support anything other than a shell script.

          You can run JS directly from a Java 6 JVM; the script engine is in the box. I've been wondering what it would take for JS support in MR jobs, but it could be handy for system health checks too, though native scripts give you the edge for low-level system health.

          eric baldeschwieler added a comment -

          looks good.

          I think we should track some stats per node on the JT. Just the total success & failures reported over the current and last hour and day long windows. Showing this and current health and the error line (as allen suggests) on the JT console will let an operator quickly determine if any nodes are ill.

          Do we currently track such success/failure ratios for tasks on a node? That would also be great to display on the same console page.

          Devaraj Das added a comment -

          TaskTrackers do report the number of failures, and it should be straightforward to maintain and report the number of successes as well.

          Sreekanth Ramakrishnan added a comment -

          Attaching a first-cut patch to address the issue:

          The patch does the following:

          • The patch requires two configuration items to be present on TaskTracker nodes, mapred.tasktracker.health_check_script and mapred.tasktracker.health_check_interval. mapred.tasktracker.health_check_script needs to be an absolute path to the script file. If the file does not exist when the TT starts up, the monitor is turned off.
          • The monitor periodically runs the shell script. It ignores the exit code of the shell script, gets the output from the script, and searches for the pattern "ERROR" in the output.
          • If ERROR is present in the output, the monitor sets the health of the node to unhealthy and sends the entire output as the status reported to the JT.
          • The JT then, depending on the reported health of the node, decides to blacklist or whitelist the node.
          • Attached a test case which tests blacklisting and whitelisting based on the output of the script.
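
          A minimal sketch of the startup check described in the first bullet above: read the two first-cut properties and leave the monitor off unless the configured script actually exists and is executable. Configuration is the standard Hadoop class; the method name and default interval are illustrative assumptions, not the patch's code.

              import java.io.File;
              import org.apache.hadoop.conf.Configuration;

              class HealthCheckSetupSketch {
                /** Returns the script path if the monitor should run, or null to leave it off. */
                static String healthScriptOrNull(Configuration conf) {
                  String script = conf.get("mapred.tasktracker.health_check_script");
                  long intervalMs = conf.getLong("mapred.tasktracker.health_check_interval", 60000L);
                  if (script == null) {
                    return null;                         // property not set: feature disabled
                  }
                  File f = new File(script);
                  if (!f.isAbsolute() || !f.isFile() || !f.canExecute()) {
                    return null;                         // first-cut behaviour: turn the monitor off
                  }
                  System.out.println("Health check every " + intervalMs + " ms using " + script);
                  return script;
                }
              }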
          Owen O'Malley added a comment -

          One thing that bothers me is that it requires a fork, and we've already seen that forks are expensive for our servers. We'll also need a timer so that if the fork locks up, it counts as a failure.

          Hong Tang added a comment -
          • The health checking script may end up running more frequently than desired if the thread gets interrupted.
          • The health checking script may also run at a lower frequency than desired because the code does not account for the actual time spent running the script. We should have a variable remember the last launch time, and at the beginning of the loop, get the current time and either launch the script or sleep for the difference between the check interval and the elapsed time since the last launch.
          • Like Owen said, we probably need to have a timeout when executing the script (and if it indeed happens, count it as failure).
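
          A small sketch of the scheduling fix suggested in the second point: remember the last launch time and sleep only for whatever remains of the check interval, so the effective frequency does not drift with the script's own running time. The class and method names are hypothetical.

              final class FixedRateRunnerSketch {
                private final long intervalMs;

                FixedRateRunnerSketch(long intervalMs) { this.intervalMs = intervalMs; }

                void loop(Runnable checkOnce) throws InterruptedException {
                  long lastLaunch = 0L;
                  while (!Thread.currentThread().isInterrupted()) {
                    long elapsed = System.currentTimeMillis() - lastLaunch;
                    if (elapsed >= intervalMs) {
                      lastLaunch = System.currentTimeMillis();
                      checkOnce.run();                    // run the health script once
                    } else {
                      Thread.sleep(intervalMs - elapsed); // sleep only the remaining part of the interval
                    }
                  }
                }
              }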
          steve_l added a comment -

          One thing to consider here is do you really want to run the health check in-VM, or do you want to bring up a second monitoring VM that can do the health checks, keep an eye on the VM running the node, and not be a threat to the stability of the main VM itself.

          Hong Tang added a comment -

          +1. Might it be even easier by just hooking the script with cron?

          steve_l added a comment -

          Cron? No, because you want to be able to force a health check with a ping() over the network, have a URL you can hit for the load balancer.

          Sreekanth Ramakrishnan added a comment -

          Attaching a patch which addresses all the other concerns except forking a shell every check interval.

          Owen O'Malley added a comment -

          I'd propose that we fork a small process off that runs the script for us every N minutes and reports status to the TT. That way we minimize the danger of causing the TT to fail. Furthermore, since it can have a tiny heap and few threads, the fork will be much cheaper. Thoughts?

          Hemanth Yamijala added a comment -

          To summarize some of the comments above, the issue we are discussing is whether the node health checker script should be launched as a separate process from the tasktracker (TT) itself, rather than as a thread in the TT, as done in the patch currently. There are some motivations for doing so:

          • A periodic process launch from a java service like the TT has caused problems in the past - for e.g. look at HADOOP-5059.
          • Owen also mentioned instances where they'd seen the service itself lock up due to the process launch (and the underlying fork()/exec()) failing.

          So, the proposal is to solve this problem by having the node health checker script as a separate process. This process can be configured with the following:

          • Path to a script
          • An interval
          • TT's address for communication.

          The process would periodically run the script (as done in the patch today) and report the status to the TT using RPC. To keep management simple, we can, in the first cut, launch this process from the TT itself and stop it when the TT is going down. In future, it should be possible to decouple this even more and have them run independently. The simplicity we buy in the first iteration is that administrators need not worry about managing this independently for the time being - until we gain some experience with how the health check script is running.

          Does this sound fine ?
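
          As an illustration of the proposal, the interface below shows the kind of RPC contract such a separate health-checker process could use to report back to the TT. Hadoop RPC protocols of this era extend VersionedProtocol; everything else here (the interface name, method and parameters) is a hypothetical sketch, not the interface from the patches.

              import java.io.IOException;
              import org.apache.hadoop.ipc.VersionedProtocol;

              public interface NodeHealthProtocolSketch extends VersionedProtocol {
                long versionID = 1L;

                /**
                 * Periodic report from the external checker process.
                 * @param healthy     whether the last run of the script found the node healthy
                 * @param report      the script output (or an error line) to surface on the UI
                 * @param timestampMs when the script was last run, so the TT can spot a stalled checker
                 */
                void reportHealthStatus(boolean healthy, String report, long timestampMs)
                    throws IOException;
              }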

          Hong Tang added a comment - - edited

          @owen: Steve's comments seem to imply that we need the capability of launching the health checking script in response to external commands.

          @Sreekanth: If I understand correctly, the new patch introduces another thread to check for command execution timeout. Is there a way to eliminate it (possibly by modifying shell execution API to accept timeouts)?

          Vinod Kumar Vavilapalli added a comment -

          The process would periodically run the script (as done in the patch today) and report the status to the TT using RPC.

          Can't the status reporting be simply done via stdout/stderr with START and END markers?

          Hemanth Yamijala added a comment -

          Steve's comments seem to imply that we need the capability of launch the health checking script in response to external commands.

          Hong, while this is true, we can make it work as long as we are using an RPC to communicate with the health checking process. One command in the RPC is certainly the heartbeat that will both report the status as well as let the TT know the health checker is running. The other could be a 'run script now' command. This can be added as an extension on top of the basic framework - for e.g. when the ping is added to the TT. Would that work ?

          Can't the status reporting be simply done via stdout/stderr with START and END markers?

          I suppose so. We've seen a couple of issues in the past when we're not careful interacting with the I/O streams of subprocesses. Maybe a lot of these are fixed in Hadoop, and so not really an issue. But RPC seems simpler and cleaner. Would be glad to hear what others think as well.

          steve_l added a comment -

          I do a fair amount of health monitoring, and there is a lot to be said for something that runs as a separate process from any of the Hadoop services to do the checking.

          1. it could be its own service, fairly lightweight
          2. gives you the option of monitoring (and if need be killing) the TT process itself.
          3. stops you accidentally stamping on bits of the JVM. As an example, I'd deployed something that checked the health of bits of HDFS by checking that files were there, but that code closed the handle after use, killing any TT in the same process.
          Sreekanth Ramakrishnan added a comment -

          Adding a little more to the discussion, the following is the approach I am taking for the new patch:

          • Introduce a new health monitor service which is spawned off by task tracker when it starts.
          • The service periodically reports the status of the node to the task tracker.
          • The protocol is modeled on TaskUmbilicalProtocol.
          • The service would receive the host address and port as the command line arguments while starting up.
          • The service then periodically sends the status update to task tracker based on the host and port specified to the service.
          • When the TaskTracker is shut down, the NodeHealthChecker will no longer be able to contact it and will shut itself down. The reason this is done is that the task tracker's shutdown() or close() is not called when we do a stop-mapred.sh, and the task tracker can also be killed with a direct kill -9 ttpid; in these cases the TT might not be able to inform all the clients that contact it to report status.
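
          A brief sketch of the shutdown behaviour in the last bullet: if a status update cannot be delivered to the TT, the checker assumes the TT is gone and exits, much like a child task that loses its umbilical connection. Reporter here is a stand-in for the RPC proxy; none of these names come from the patch.

              import java.io.IOException;

              public class HealthCheckerShutdownSketch {
                interface Reporter {
                  void report(boolean healthy, String msg) throws IOException;
                }

                static void reportOrDie(Reporter reporter, boolean healthy, String msg) {
                  try {
                    reporter.report(healthy, msg);
                  } catch (IOException e) {
                    // TT unreachable (stop-mapred.sh, kill -9, ...): shut the checker down too.
                    System.err.println("TaskTracker unreachable, exiting health checker: " + e);
                    System.exit(1);
                  }
                }
              }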
          Sreekanth Ramakrishnan added a comment -

          Attaching a patch which does the following:

          • Added a new protocol which is used for node health reporting.
          • Added a new Class which implements this Interface.
          • The new class spawns off a new JVM, which does the health monitoring.
          • Moved the timeout logic into Shell
          • Retained the old test case which does
            1. Black listing due to node script reporting error
            2. White-listing of same tracker when script returns no error
            3. Blacklisting when the script times out.
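
          To illustrate the "timeout logic in Shell" item, here is a minimal sketch of running a script with a time limit: a Timer destroys the child process if it runs too long, and a timed-out run counts as a failed check. It deliberately skips output handling, the names are hypothetical, and it is not the code moved into Shell by the patch.

              import java.util.Timer;
              import java.util.TimerTask;
              import java.util.concurrent.atomic.AtomicBoolean;

              class TimedScriptRunSketch {
                /** Runs the script and returns true only if it finished within timeoutMs. */
                static boolean runWithinTimeout(String script, long timeoutMs) throws Exception {
                  final Process p = new ProcessBuilder(script).redirectErrorStream(true).start();
                  final AtomicBoolean timedOut = new AtomicBoolean(false);
                  Timer timer = new Timer("health-script-timeout", true);  // daemon timer thread
                  timer.schedule(new TimerTask() {
                    @Override public void run() {
                      timedOut.set(true);
                      p.destroy();              // kill the script once the timeout expires
                    }
                  }, timeoutMs);
                  try {
                    p.waitFor();                // returns when the script exits or is destroyed
                  } finally {
                    timer.cancel();
                  }
                  return !timedOut.get();       // a timed-out script counts as a health failure
                }
              }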
          Hemanth Yamijala added a comment -

          To add to Sreekanth's comments:

          • We are using a new port number for the TT to bind to for the health checker script to send updates. The other option was to use the same port as that used for the TaskUmbilicalProtocol. We thought the health service should not mix with child tasks reporting status and hence kept it different.
          • The other important point is about how the health checker stops. Currently, the model is similar to how a child stops, in that if it can't report status to the TT, it kills itself. This is anyway required because it has to handle the case of the TT dying unexpectedly. However this is the extreme case. When the TT is stopped normally there are better options to stop the health check script. For e.g. we could add a shutdown hook to TT and send a signal to the health checker. We could make the health checker a separate daemon as well so that stop-mapred could stop it. Any of these options can be easily implemented as a follow-up once the basic structure is in place.

          Please let us know if these points make sense.

          Hong Tang added a comment -

          Both points make sense to me. One suggestion: when the health checker fails to report status to the TT, try to do a "kill -9" of the TT, in case there is a bug in the reporting sub-system that prevents the health checker from reporting state while the other parts of the TT are still working fine.

          Hemanth Yamijala added a comment -

          when the health check fails to report status to TT, try to do a "kill -9" for the TT ...

          Hong, is this to check if the TT is alive ? In which case, did you mean another signal, like -0 or kill -3. -9 is SIGKILL and would kill the TT. Also, in that case are you suggesting that we could keep the health checker around and continue trying to report after a while ?

          Sreekanth Ramakrishnan added a comment -

          Attaching patch merged with latest trunk.

          steve_l added a comment -

          1. The timeouts in Shell would seem useful on their own; every shell operation ought to have timeouts for extra robustness.

          2. This would fit fairly well under the HADOOP-3628 stuff, where the monitor would start and stop with the TT lifecycle; we'd have to think about how to integrate it with the ping operation -I think returning the most recent status would be good.

          3. At some point in the future, it would be good for the policy of acting on TT failure to be moved out of the JT. In infrastructure where the response to failure is to terminate that (virtual) host and ask for a new one, you react very differently to failure. It's not something the JT needs to handle, other than pass up bad news.

          4. I'm not sure about all the kill -9 and shutdown hook stuff; it's getting into fragile waters. Hard to test, hard to debug, and it creates complex situations, especially in test runs or stuff hosted in different runtimes.

          • this helper script stuff must be optional; I would turn it off on my systems as I test health in different ways.
          • kill handlers are best designed to do very little and be robust against odd system states -and not assume any other parts of the cluster are live.

          For the curious, the way SmartFrog manages its health is that every component tracks the last time it was asked by its parent for its health; if that time ever exceeds a (programmed) limit then it terminates itself. Every process pings the root component; it's up to that to ping its children and act on failures - and to recognise and act on timeouts. This works OK for single-host work; in a cluster you don't want any SPOFs and tend to take an aggregate view: there has to be one Namenode, one JT, "enough" workers. I have a component to check the health of a file in the filesystem; every time its health is checked, it looks for the file it was bound to and checks that it is present and within a specified size range. This is handy for checking that files you value are there, and that the FS is visible across the network (very important on virtual servers with odd networking). I don't have anything similar for checking that TTs are good; the best check would be test work.

          Hemanth Yamijala added a comment -

          I've started reviewing the patch. I think the last patch missed adding the new files. Please correct it.

          Sreekanth Ramakrishnan added a comment -

          Patch which includes the missed out file.

          Hemanth Yamijala added a comment -

          This is a reasonably complex patch, and in the current state, it has already come a long way. I do have the following comments:

          Comments on the spec:

          • It may be useful to allow arguments to be provided to the health check script.

          Shell:

          • It seems like it will be easier to use a Timer to do the interrupting rather than implementing the logic ourselves.
          • The timeout thread seems to be being interrupted and joined twice.
          • Do we need the setTimeoutInterval ? What happens if it is set after the command has started. I think it is better to have it set only once.

          Configuration:

          • Default value for mapred.healthChecker.interval can be a little larger, like 60000.
          • Let's rename mapred.healthChecker.failure_interval to mapred.healthChecker.script.timeout. Likewise, rename the variable failureintervalTime in NodeHealthChecker to scriptTimeout or some such name.

          NodeHealthCheckerService:

          • Instead of setting the health status in the TaskTracker, I think the variable can simply be stored as a local variable by NodeHealthCheckerService, and TT can call it using an appropriately synchronized accessor method.
          • If the thread launching the health checker VM is a daemon, I don't think there's a need to join it in close.
          • Does NodeHealthCheckerService need to be a public class ?
          • Please give a name to the thread launching the health checker VM

          NodeHealthChecker:

          • Typo: HEALTH_CHECK_INTERVAL_PREOPERTY should be HEALTH_CHECK_INTERVAL_PROPERTY
          • If there is an exception in executing the script - should we treat this as a failure ? As a follow-up, should we ignore the output in such a case, as it may not be meaningful to rely upon ?
          • A related question is what to do if the output is null. Currently we don't seem to be reporting anything at all. This clearly seems like a bug. I think if it did not timeout, this should be treated as a success.
          • Also, should we be more careful in where we check for ERROR. I think the output should begin with the string ERROR, rather than ERROR appearing anywhere in the string.
          • I think we should check whether the script timed out before we check the output. It reads more logically that way.
          • I would also suggest we move the code in the finally block to a separate method for making it more obvious. Among other things it would help us unit test this very important logic easily.
          • I think the check for the health check script should be done in NodeHealthCheckerService. This will avoid us the cost of launching the VM and then failing. However, basic checks for the script can still be there in NodeHealthChecker so it can be used as an independent unit.
          • Where do the log messages of the NodeHealthChecker go ? I think they should go to a separate log file.
          • Give a name to the Timer object created for running the health check script.

          JobTracker:

          • unBlacklistTracker may not un-blacklist a tracker if the reason for blacklisting was different. Shouldn't this be checked whenever a call is made, and the subsequent actions be done only if unBlacklistTracker returns success?
          • When tasktracker status is updated because the node becomes healthy, shouldn't isBlackListed be set to true ?

          TaskTracker:

          • shutdown() seems to have got an extraneous log message by mistake.

          Test cases:

          • In the testcase, can we also verify the healthReport value for unhealthy trackers ?
          • I think we also need a testcase that combines task failures with health script failing and make sure we don't unblacklist a tracker if it was blacklisted due to another reason.

          Documentation:

          • New classes (like NodeHealthCheckerService) need Javadoc. Indeed, please document all new methods.
          • We also need Forrest documentation on this super cool feature. smile
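
          To make the review's output-handling suggestions concrete, here is a tiny sketch of the verdict they describe: a timed-out script is always a failure, empty output from a script that completed is treated as healthy, and only a line that begins with "ERROR" marks the node unhealthy. This mirrors the review comments, not the committed code, and the names are hypothetical.

              class HealthVerdictSketch {
                static final String ERROR_MARKER = "ERROR";

                static boolean isNodeHealthy(String scriptOutput, boolean scriptTimedOut) {
                  if (scriptTimedOut) {
                    return false;                         // check the timeout first
                  }
                  if (scriptOutput == null || scriptOutput.isEmpty()) {
                    return true;                          // completed with no output: treat as healthy
                  }
                  for (String line : scriptOutput.split("\n")) {
                    if (line.startsWith(ERROR_MARKER)) {
                      return false;                       // a line beginning with ERROR blacklists the node
                    }
                  }
                  return true;
                }
              }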
          Hong Tang added a comment - - edited

          Hong, is this to check if the TT is alive ? In which case, did you mean another signal, like -0 or kill -3. -9 is SIGKILL and would kill the TT. Also, in that case are you suggesting that we could keep the health checker around and continue trying to report after a while ?

          @hemanth sorry for not being clear. I gave a bit more thought to the problem, and I think the following logic may be simpler and more robust (1 and 2 are the current logic, 3 is my suggestion):

          (1) periodically launch the health checking script;
          (2) report that status back to the TT (both good and bad);
          (3) if it fails to receive response from TT, wait for X seconds, do an extra kill (to ensure TT is dead), and quit itself.

          I scanned through the code, it seems that NodeHealthChecker.stop() would be a good place to perform step (3).

          Allen Wittenauer added a comment -

          We are using a new port number for the TT to bind to for the health checker script to send updates. The other option was to use the same port as that used for the TaskUmbilicalProtocol. We thought the health service should not mix with child tasks reporting status and hence kept it different.

          This is disappointing. Hadoop has enough ports open that I think it qualifies as a cheese.

          Hemanth Yamijala added a comment -

          This is disappointing. Hadoop has enough ports open that I think it qualifies as a cheese.

          I was counting on you to object, Allen smile.

          What do others feel ? I guess we could piggyback on the port used for the TaskUmbilicalProtocol - need Sreekanth to confirm this though. The main concern is if it will interfere with the processing of the tasks reporting their status to TT. To be clear though, the amount of work done in the RPC for the health reporter call is minimal - it just sets two variables.

          Hemanth Yamijala added a comment -

          if it fails to receive a response from the TT, wait for X seconds, do an extra kill (to ensure the TT is dead), and quit itself.

          Hong, I am not certain about this. I am essentially viewing the TT as the master still, and the health monitor as just a helper service that monitors the health of the node, not the health of the TT itself. It seems wrong that this service could kill the master. I can conceive that in the future we extend this to monitor the health of the TT as well, and take corrective actions in case something is wrong with the TT. But I think that should be the topic of a different JIRA, or at a minimum an extension to this one. I would still like the scope of this to be restricted to providing a plug-in for checking the health of a node.

          Owen O'Malley added a comment -

          I'd vote for using the same RPC port. I don't think it buys us enough to be worth the hassle of dealing with the extra port.

          Hong Tang added a comment -

          @hemanth The problem is that if there is something wrong that prevents the health checker from communicating with the TT, the health checker would quit voluntarily without the TT's knowledge. So the value of this health checker service could be discounted. But on second thought, I think maybe we should stop worrying about that for now and defer more complex logic to a later time.

          Hemanth Yamijala added a comment -

          The problem is that if there is something wrong that prevents the health checker from communicating with the TT, the health checker would quit voluntarily without the TT's knowledge

          That does sound like an issue. Maybe one simple solution is to send a timestamp with the TaskTrackerStatus report about when the health checker was last run. I am of course borrowing the idea from the information we have about when the last heartbeat was received from a TT. We could use that information to find trackers that haven't updated their health for longer than a certain interval. Would that work?
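
          To make the idea concrete, a minimal sketch of the kind of staleness check this would allow on the JobTracker side; the class, field, and threshold names here are illustrative, not the actual patch:

            // Hypothetical staleness check based on the last health-report timestamp
            // carried in the tracker status; names are illustrative only.
            class HealthReportStaleness {
              private final long maxReportAgeMillis;

              HealthReportStaleness(long maxReportAgeMillis) {
                this.maxReportAgeMillis = maxReportAgeMillis;
              }

              /** True if the tracker has not refreshed its health status recently enough. */
              boolean isStale(long lastHealthReportTime, long now) {
                // A value of 0 would mean the health checker never reported at all.
                if (lastHealthReportTime == 0) {
                  return true;
                }
                return (now - lastHealthReportTime) > maxReportAgeMillis;
              }
            }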

          Hong Tang added a comment -

          Maybe one simple solution is to send a timestamp with the TaskTrackerStatus report about when the health checker was last run. I am of course borrowing the idea from the information we have about when the last heartbeat was received from a TT. We could use that information to find trackers that haven't updated their health for longer than a certain interval. Would that work?

          I like it.

          Hong Tang added a comment -

          @owen Is there any chance of the main RPC being overwhelmed and thus not being able to respond promptly to health status information?

          steve_l added a comment -
          1. Timestamps would help; they would push more analysis up to whoever is asking the TT.
          2. If the main RPC is so overwhelmed that it can't answer a health query, that's a sign of a problem anyway. Your TT is no longer live.

          The health checking would fit in with the notion of a Ping operation, as raised in HADOOP-5622. Every service should have a way of saying "are you up". The failure to answer the query: trouble. If the call returns with an error: trouble. If the call returns saying it is well, then all you know is that the service thinks it is well, but still may not be capable of useful work.

          What this operation does do is set more requirements on what gets returned - you probably want to return something machine-readable that can be extended by different services, depending on their view of the world. A hashtable containing writable stuff, perhaps.
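
          For example, an extensible, machine-readable payload could be as simple as a Writable map; the key names below are made up, only the Writable classes are real Hadoop types:

            import org.apache.hadoop.io.BooleanWritable;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.MapWritable;
            import org.apache.hadoop.io.Text;

            // Hypothetical ping payload; each service could add its own keys.
            public class PingPayloadSketch {
              public static MapWritable buildHealthPayload(boolean healthy, String detail) {
                MapWritable payload = new MapWritable();
                payload.put(new Text("healthy"), new BooleanWritable(healthy));
                payload.put(new Text("lastChecked"), new LongWritable(System.currentTimeMillis()));
                payload.put(new Text("detail"), new Text(detail));
                return payload;
              }
            }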

          Hong Tang added a comment -

          The health checking would fit in with the notion of a Ping operation, as raised in HADOOP-5622. Every service should have a way of saying "are you up". The failure to answer the query: trouble. If the call returns with an error: trouble. If the call returns saying it is well, then all you know is that the service thinks it is well, but still may not be capable of useful work.

          Though, in both cases, the only one who knows about the trouble is the health checker and not the rest of the world.

          steve_l added a comment -

          > Though, in both cases, the only one who knows about the trouble is the health checker and not the rest of the world.

          which is why your management tools

          1. need to be HA toys themselves
          2. need to be able to ask the apps for their health
          3. may need to be able to do test jobs to probe system health
          4. may need the ability to react to failure according to the infrastructure in which HDFS is running, and your policy.

          If HDFS is running in anything that supports the EC2 APIs and a TT is playing up, I'd start by rebooting that node; if it still doesn't come up, decommission the namenode, terminate the VM and ask for a new one. That's a very different policy from a physical cluster, where you may want to blacklist the TT while its datanode services stay live.

          Hemanth Yamijala added a comment -

          Folks, I still maintain that the focus of this JIRA is just checking the health of the node as determined by an administrator-supplied script. The last few comments are focusing more on the health of a TT. For the purpose of making incremental progress, let us stick to the original scope and defer discussions of checking the health of the TT, and corrective actions thereof, to a separate JIRA.

          So, to summarize, the health checker kills itself if it cannot communicate with the TT (similar to the child JVM). If this happens because the TT is down, well and good. The 'lost tasktracker' logic of the jobtracker would ensure this status is captured. If this happens because the TT was overwhelmed, well, maybe the TT is not 'healthy' any more. But the fact that we are reporting timestamps of the last health status gives the administrators an opportunity to know that something is amiss on this node, because its health has not been updated for a while. Either way we can alert ourselves to problems. So, the purpose is still served. Of course, there are better, more automated ways to do it. That would qualify for a next increment.

          Hope this makes sense.

          Sreekanth Ramakrishnan added a comment -

          Attaching documentation pdf incorporating documentation change.

          Attaching screenshot listing UI changes.

          Sreekanth Ramakrishnan added a comment -

          Attaching latest patch incorporating Hemanth's comments:

          Major Changes:

          • Now NodeHealthChecker reports health status on the same port which tasks use to report status.
          • NodeHealthChecker can now be run in a separate VM or as a thread; the thread-based startup can be used in MiniMRCluster.
          • Changed the testcase to also test blacklisting across jobs and to verify cluster capacity after we blacklist a tracker.
          • Also added a new configuration entry which takes the node health script's arguments.
          Hong Tang added a comment -

          +1 to Hemanth's comments.

          Allen Wittenauer added a comment -

          But the fact that we are reporting timestamps of the last health status gives the administrators an opportunity to know that something is amiss on this node, because its health has not been updated for a while.

          Hmm. What interface do admins have that makes this obvious? If a cluster has 2500 TTs, it isn't going to be obvious in a web UI that any given TT is sick.

          Hemanth Yamijala added a comment -

          What interface do admins have that makes this obvious? If a cluster has 2500 TTs, it isn't going to be obvious in a web UI that any given TT is sick.

          Allen, just to be clear, nodes detected as unhealthy by the health check script are blacklisted, as was discussed and as you commented in one of your earliest comments on this JIRA. Therefore 'mapred job -list-blacklisted-trackers' will easily point these out.

          We are only discussing a specific case where the health checker VM exits on a node, say because of some transient problem on the TT, but the TT is still up, and possibly unhealthy. It was only for detecting these TTs that I was talking about the timestamps. I agree with you that it will be difficult to look this information up on a web UI. But the fact that we have this information centrally will help us build or enhance existing tools.

          I believe even with the current scope this feature is worth a try on live clusters. Given we have no way of detecting unhealthy nodes now, this is an improvement. If indeed we find many instances where the health checker is going down while TTs are up, we can easily provide additional tools based on the information currently being reported. Hope that makes sense.

          Hemanth Yamijala added a comment -

          Some comments on the second iteration. I still need to look at the Forrest documentation; I will do so after the code changes are fine:

          Shell.java:

          • Refactor the constructors in ShellCommandExecutor so that they all reach one constructor that takes the timeout interval, instead of the current implementation, which adds a timeout parameter to all the constructors.
          • Also the static API should be overloaded to provide the timeout parameter.
          • If the timer fires after the process exits but before the completed boolean is set to true, process.destroy is called twice and the command is also marked as timed out. Maybe we should check if the process's exit value can be successfully retrieved to see if the process has terminated (a rough sketch of this guard follows this list).
          • I think we should set completed to true in the same place as before, which is after the error thread is joined. This will keep the behavior similar.
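
          To sketch the guard suggested above - using Process.exitValue() as an "has it already finished?" probe before destroying the process - here is an illustrative TimerTask; this is not the patch's Shell.java code:

            import java.util.TimerTask;

            // Illustrative timeout guard for a forked process, roughly the pattern being
            // discussed above; not the actual Shell.java change.
            class TimedProcessKiller extends TimerTask {
              private final Process process;
              private volatile boolean timedOut = false;

              TimedProcessKiller(Process process) {
                this.process = process;
              }

              @Override
              public void run() {
                try {
                  // exitValue() throws IllegalThreadStateException while the process is
                  // still running, so it doubles as an "already finished?" check and
                  // prevents destroying an exited process a second time.
                  process.exitValue();
                } catch (IllegalThreadStateException stillRunning) {
                  timedOut = true;
                  process.destroy();
                }
              }

              boolean isTimedOut() {
                return timedOut;
              }
            }

            // Usage sketch (hypothetical script path and timeout):
            //   Process p = new ProcessBuilder("/path/to/health_check.sh").start();
            //   new java.util.Timer("HealthScriptTimeout", true)
            //       .schedule(new TimedProcessKiller(p), timeoutMillis);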

          NodeHealthChecker:

          • ignoreOutput is set to true if the script gets an exception. In this case, we are reporting the node as healthy, whereas in reality we don't know that. I don't think we should report true in such a case. Choices could be to not report anything, or to report it as unhealthy, with the exception string as the message - so it will show up to the administrators. Needs some discussion though.
          • If there was a general exception in executing the script, ignoreOutput is set to false, and we are trying to read the output, which may not be valid.
          • Regarding the error string, we should probably check what would be a right model for this. I actually meant that we check only if the entire output begins with ERROR. But what is currently implemented may also be right.
          • Minor nit: let's read the output only if the command has not timed out.
          • No name given to the NodeHealthMonitorTimer.
          • JobConf is being loaded multiple times in this class. This is a costly operation. I don't think it needs to be loaded that often. We can cache the value of JobConf.
          • We are attempting to set the execute permissions on the script. That implicitly assumes the file will be owned by the user running the mapred script. This may not be right. I think it is easier to assume that the script has execute permissions set, and we'll fail if it doesn't. We could check this in shouldRun.
          • NodeHealthMonitorTimer is not a timer; call it NodeHealthMonitorExecutor.
          • We don't need the timedout variable in the NodeHealthMonitorTimer instance. It can always be obtained from the ShellCommandExecutor.
          • I am thinking that 'lastReported' time can actually be filled in by the NodeHealthCheckerService itself.

          JobTracker:

          • It looks like the blacklisting and whitelisting of nodes is not exactly correct. The following invariants must hold true for a node.
            • A node can be blacklisted if it is unhealthy, or if the number of failures exceeds the configured limits. It can fail due to both reasons, in which case both of these must be captured in the fault info.
            • A node can be whitelisted iff it becomes healthy AND the number of failures falls below the limits. For example, this means that shouldAssignTasksToTracker should check the health of the node as well.
            • One of the two conditions alone can change - that is, an unhealthy node can become healthy, or number of failures alone could fall below the limits. In such cases, the node will still be blacklisted, but the reason for blacklisting would change.
          • It seems like thinking about this problem in terms of state transitions will help the coding (a rough sketch of the blacklist/whitelist check follows this list).
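
          A rough sketch of the invariants above; the class and method names are hypothetical, not the patch's FaultInfo:

            // Hypothetical fault-tracking record capturing the two independent reasons a
            // tracker can be blacklisted; names are illustrative, not the patch code.
            class TrackerFaultInfo {
              private int faultCount;
              private boolean unhealthy;
              private final int maxAllowedFaults;

              TrackerFaultInfo(int maxAllowedFaults) {
                this.maxAllowedFaults = maxAllowedFaults;
              }

              void setFaultCount(int faultCount) { this.faultCount = faultCount; }
              void setUnhealthy(boolean unhealthy) { this.unhealthy = unhealthy; }

              /** Blacklisted if unhealthy OR too many task failures (either reason suffices). */
              boolean shouldBlacklist() {
                return unhealthy || faultCount > maxAllowedFaults;
              }

              /** Whitelisted only when healthy AND the failure count is back under the limit. */
              boolean canWhitelist() {
                return !unhealthy && faultCount <= maxAllowedFaults;
              }
            }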

          TaskTracker:

          • The update to the health monitor variables is done atomically, but the read is not, because the read does not lock the TaskTracker.
          • Why do we need to stop the health checker service in start()?

          Documentation:

          • Class documentation is missing in javadocs for the new classes, though methods are documented.

          machines.jsp:

          • Shouldn't the number of columns be 8/9 instead of 9/10?
          Hong Tang added a comment -

          "Node Healthy Status" - can we show "healthy/unhealthy" instead of "true/false"?

          Sreekanth Ramakrishnan added a comment -

          Attaching patches which incorporate Hemanth's comments. Split the changes into two parts: common and mapred.

          Sreekanth Ramakrishnan added a comment -

          Attaching latest patch; some debug statements had accidentally gotten into the mapred-211-mapred-1.patch.

          eric baldeschwieler added a comment -

          -1

          I think this patch has gotten too complex. I don't follow the need for a sub-process to launch the scripts and the attendant complexity.

          1) The TT forks processes for a living. The health check should be happening less frequently than task launch, so we should expect a simple fork to work. If it does not, the TT is not healthy...

          2) I see advantages to having an autonomous process doing the health checking. There are pluses and minuses to that. But a dedicated child of the TT has none of those advantages.

          3) Complexity == bugs == more unhealthy TTs.

          We should strip this down to as simple a patch as possible.

          Hemanth Yamijala added a comment -

          Caught up with Owen and Sreekanth, and tried to drive to a consensus. We think Eric's arguments make sense. For the sake of closing on this patch, we are doing an about-turn back to the original model of having a simple thread that launches the health check script periodically. However, the change from the original patch will address Hong's and Owen's concerns about the fork / script timing out, because the changes to the Shell class which introduce timeouts on launched processes will be retained. This will prevent the TT from being held up. Note there are test cases which actually test a timing-out script.

          At some point in the future, I can imagine a completely separate service for doing the health checking (as was mentioned by Steve in one of the comments above). However, we'll get there when we get there. We will just start off with a simple in-VM process for now.

          Hong Tang added a comment -

          Going through the JIRA history again after I read Eric's and Hemanth's comments, I began to realize that we are deviating from the original intention of having something simple to start with, and instead are adding bits and pieces that could/should go into future iterations or a completely different implementation (e.g., as a separate service). So +1 on the about-turn.

          @hemanth, just to be sure, we will still keep the timestamp for the last successful launch of the health check, right?

          Sreekanth Ramakrishnan added a comment -

          Attaching UI from the latest patch:

          • Made changes according to Hong's comment. Now showing "Healthy/Unhealthy" instead of "True/False".
          • Added a column called "Seconds since node reported healthy".
          Sreekanth Ramakrishnan added a comment -

          Latest patch with the following changes:

          • Changed from a separate VM to running within the TaskTracker VM.
          • Merged with the latest trunk.
          Hemanth Yamijala added a comment -

          Some comments on this issue:

          • NodeHealthCheckerService.initalize should be NodeHealthCheckerService.initialize (typo)
          • reportHealthStatus can be simplified; currently it is somewhat difficult to follow the logic. One option could be to do the following:
             private enum HealthCheckExitStatus {
               SUCCESS,
               TIMED_OUT,
               FAILED_WITH_EXIT_CODE,
               FAILED_WITH_EXCEPTION
             }
            

            Fill this up where the script is run with the appropriate status, then pass it to reportHealthStatus. Then we could just use the single enum value and have an if..else if..else block that reads better (a sketch of this follows at the end of this list). Would that work?

          • Please give a name to the Timer. (I've pointed this out in previous comments also.)
          • When the health checker is timed out, we should not be setting the timestamp.
          • What do we show on the UI if the health checker is not configured on a TT? My suggestion would be to fill in the timestamp as 0 in the TT status and, if so, show the status as N/A with no timestamp or message.
          • We are not getting the values of the health status (status, message and time) atomically in Tasktracker. We should probably lock on the health status service object and get these values when filling up the task tracker status. (Raised this issue earlier)
          • InterTrackerProtocol version should be changed (because TaskTrackerStatus structure has changed)
          • TaskTrackerHealthStatus has a constructor taking numberOfRestarts, which is not required.
          • On JobTracker, it looks like currently we are storing all trackers - even healthy ones in the potentiallyFaultyTrackers data structure. This is unnecessary. If we fix this, we should also ensure that when a node's fault count falls to zero and is healthy, it is removed from this structure.
          • The formatting in machines.jsp doesn't seem right. Can you please check it?
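
          For reference, a sketch of what the enum-based reportHealthStatus suggested above might look like; the exact policy for each case is illustrative and subject to the discussion on this JIRA, not the actual patch code:

            // Sketch of the suggested refactor: classify the script run once, then
            // branch on the enum. Illustrative only; the patch may differ in details.
            enum HealthCheckExitStatus {
              SUCCESS,
              TIMED_OUT,
              FAILED_WITH_EXIT_CODE,
              FAILED_WITH_EXCEPTION
            }

            class HealthReport {
              private boolean healthy;
              private String message;

              void reportHealthStatus(HealthCheckExitStatus status, String scriptOutput) {
                switch (status) {
                  case SUCCESS:
                    // The ERROR-prefix convention is the one discussed earlier on this JIRA.
                    healthy = !scriptOutput.startsWith("ERROR");
                    message = healthy ? "" : scriptOutput;
                    break;
                  case TIMED_OUT:
                    healthy = false;
                    message = "Node health script timed out";
                    break;
                  case FAILED_WITH_EXIT_CODE:
                    // Policy for a plain non-zero exit code was still under discussion;
                    // shown here as "do not blacklist" purely for illustration.
                    healthy = true;
                    message = "";
                    break;
                  case FAILED_WITH_EXCEPTION:
                    healthy = false;
                    message = "Node health script could not be executed";
                    break;
                }
              }
            }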
          Sreekanth Ramakrishnan added a comment -

          NodeHealthCheckerService.initalize should be NodeHealthCheckerService.initialize (typo)

          Done.

          reportHealthStatus can be simplified; currently it is somewhat difficult to follow the logic. One option could be to do the following:

          Done, introduced a new enum HealthCheckExitStatus; the health report is now done based on a switch instead of nested ifs.

          Please give a name to the Timer. (I've pointed this out in previous comments also.)

          Done. The Timer is named NodeHealthMonitor-Timer.

          When the health checker is timed out, we should not be setting the timestamp.

          Done.

          What do we show on the UI if the health checker is not configured on a TT? My suggestion would be to fill in the timestamp as 0 in the TT status and, if so, show the status as N/A with no timestamp or message.

          Done. When the health checker is not configured, we set the last updated time of node health to zero and, based on that, set the health status string to N/A.

          We are not getting the values of the health status (status, message and time) atomically in Tasktracker. We should probably lock on the health status service object and get these values when filling up the task tracker status. (Raised this issue earlier)

          Done, introduced a method org.apache.hadoop.mapred.NodeHealthCheckerService.setHealthStatus(TaskTrackerHealthStatus) which is synchronized at the object level; all sets of status in NodeHealthChecker are synchronized as well.
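
          A minimal illustration of that synchronization pattern; the class and field names are assumptions, not the exact patch code:

            // Writes and the copy used to fill the heartbeat both hold the object lock,
            // so the three fields always travel together as a consistent snapshot.
            class NodeHealthSnapshot {
              private boolean nodeHealthy = true;
              private String healthReport = "";
              private long lastReported;

              /** Called by the health-check thread after each script run. */
              synchronized void update(boolean healthy, String report, long time) {
                this.nodeHealthy = healthy;
                this.healthReport = report;
                this.lastReported = time;
              }

              /** Called while building the tracker status; copies all fields atomically. */
              synchronized HealthSnapshotView copy() {
                return new HealthSnapshotView(nodeHealthy, healthReport, lastReported);
              }

              static final class HealthSnapshotView {
                final boolean nodeHealthy;
                final String healthReport;
                final long lastReported;

                HealthSnapshotView(boolean nodeHealthy, String healthReport, long lastReported) {
                  this.nodeHealthy = nodeHealthy;
                  this.healthReport = healthReport;
                  this.lastReported = lastReported;
                }
              }
            }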

          InterTrackerProtocol version should be changed (because TaskTrackerStatus structure has changed)

          Done.

          TaskTrackerHealthStatus has a constructor taking numberOfRestarts, which is not required.

          Done.

          On JobTracker, it looks like currently we are storing all trackers - even healthy ones in the potentiallyFaultyTrackers data structure. This is unnecessary. If we fix this, we should also ensure that when a node's fault count falls to zero and is healthy, it is removed from this structure.

          Done, we are creating the FaultInfo lazily, and removing it in unBlackList based on the fault count.

          The formatting in machines.jsp doesn't seem right. Can you please check it?

          Done, removed accidental tab characters.

          Sreekanth Ramakrishnan added a comment -

          Attaching latest patch incorporating Hemanth's offline comments:

          • Fixed a bug in exceedsFaults() which previously didn't record whether the tracker was blacklisted before.
          • Refactored the node health status changes into a single common method.
          Hemanth Yamijala added a comment -

          This is looking good now. The changes on the JobTracker make more sense than before.

          I have a few minor nits:

          • The test methods added in JobTracker.java seem to be incorrectly indented.
          • In the JSP, can you please check if the colspan value is correct?
          • Again in the JSP, we are computing the time since health reported success as Math.abs(time reported - current time). It can simply be (current time - time reported).
          • The test case's intent is perfect. Unfortunately, it can time out if something goes wrong with the TTs. I would suggest we avoid this problem.
          • I have a few editorial comments on the Forrest documentation. It is easier to correct the documentation and give you a patch for just that file; I'll do that rather than commenting on the JIRA.
          Sreekanth Ramakrishnan added a comment -

          Incorporating Hemanth's Comments:

          • Changed to the current number of columns in machines.jsp
          • Changed indentation of JobTracker test methods.
          • Removed the old test case and added two new test cases instead: TestTaskTrackerBlacklisting and TestNodeHealthService.
          • TestNodeHealthService tests the functioning of NodeHealthService, i.e., on the TaskTracker side.
          • TestTaskTrackerBlacklisting tests the blacklisting functionality on the JobTracker end.
          Hemanth Yamijala added a comment -

          The recent changes also look fine. There is one small problem, though, that Sreekanth and I noticed while reviewing the tests. If the health check script is configured but we report a heartbeat before it has run, the default values of the node health are set to unhealthy. They should be healthy.

          Also, I have attached here a patch for the cluster_setup.xml file correcting the documentation. Can you please use this text as is?

          Sreekanth Ramakrishnan added a comment -

          Attaching patch with the modification from Hemanth, and setting the initial node health value to true.

          The patch will break the trunk compilation, as it requires changes which are only present in the latest trunk.

          Hemanth Yamijala added a comment -

          +1. I think this is good to go. Can you please run test and test-patch using the latest hadoop-common jar files, and upload the results? I will commit it once that is done. Regarding the break in compilation, this is due to the dependent issue - HADOOP-6106. I updated a comment there about how we are planning to get the dependency into the HDFS and Map/Reduce sub-projects to fix this.

          Sreekanth Ramakrishnan added a comment -

          Attaching latest patch fixing findbugs and release audit warnings.

          Also fixed a bug found during TT reinit: the health checker null check was not done in TaskTracker.close().

          Sreekanth Ramakrishnan added a comment -

          Output from ant test-patch:

              [exec] +1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 7 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec]
          
          Sreekanth Ramakrishnan added a comment -

          All tests passed locally except:

          TestJobTrackerRestart - MAPREDUCE-683 also failing consistently on trunk
          TestJobTrackerRestartWithLostTracker - MAPREDUCE-171 also failing on trunk.

          Sreekanth Ramakrishnan added a comment -

          Internal Yahoo! patch for the issue.

          Sreekanth Ramakrishnan added a comment -

          All contrib tests passed except TestQueueCapacities, mentioned in MAPREDUCE-522.

          Hemanth Yamijala added a comment -

          Since the test cases were failing on trunk as well, I committed this. Thanks, Sreekanth!

          Allen Wittenauer added a comment -

          > Internal Yahoo! patch for the issue.

          Posting to the Internet sort of makes it not internal anymore.

          Hemanth Yamijala added a comment -

          Allen, this is as per the new rules to have all patches that we are running at Yahoo! publicly available, following the announcement at the Hadoop Summit. But I agree, it is not internal anymore. I would just change the emoticon, because it seems like a good thing.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #15 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/15/)


            People

            • Assignee:
              Sreekanth Ramakrishnan
            • Reporter:
              Aroop Maliakkal
            • Votes:
              0
            • Watchers:
              19
