Hadoop Common
HADOOP-491

streaming jobs should allow programs that don't do any IO for a long time

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

      Description

      The jobtracker relies on tasks to send heartbeats to know they are still alive.
      There is a preset timeout of 600 seconds.
      Hadoop streaming also uses input to or output from the program it spawns to indicate progress, sending appropriate heartbeats.
      Some spawned programs spend longer than 600 seconds without any output while being perfectly healthy.

      It would be good to enhance the interface between Hadoop streaming and the programs it spawns so that a healthy program can be tracked in the absence of output.

      There are certain dangers with such a protocol: e.g. a task could run a separate thread that does nothing but send "I'm alive" messages. Abusing the API in that way would be a user bug.
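Hadoop streaming eventually settled on a stderr convention for exactly this: a spawned program can write lines of the form `reporter:status:<message>` to stderr to update its task status without producing any output. A minimal sketch of a mapper that heartbeats this way (the helper names `status_line` and `slow_mapper` are illustrative, not part of any Hadoop API):

```python
import sys

def status_line(message):
    """Format a status heartbeat in Hadoop streaming's stderr convention."""
    return "reporter:status:%s" % message

def slow_mapper(lines, report=lambda s: print(s, file=sys.stderr)):
    """Identity mapper that signals liveness every 1000 records, even
    during long stretches where it writes nothing to stdout."""
    out = []
    for i, line in enumerate(lines):
        if i % 1000 == 0:
            report(status_line("processed %d records" % i))
        out.append(line)  # real per-record work would go here
    return out
```

In a real streaming job the `report` callback would simply be stderr; it is injectable here only so the sketch can be exercised standalone.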

      1. HADOOP-491_20070212_3.patch
        4 kB
        Arun C Murthy
      2. HADOOP-491_20070206_2.patch
        3 kB
        Arun C Murthy
      3. HADOOP-491_20070205_1.patch
        4 kB
        Arun C Murthy

        Activity

        Doug Cutting added a comment -

        I just committed this. Thanks Arun.

        Hadoop QA added a comment -

        +1, because http://issues.apache.org/jira/secure/attachment/12350901/HADOOP-491_20070212_3.patch applied and successfully tested against trunk revision r505557.

        Arun C Murthy added a comment -

        My bad Doug; fixed now...

        Doug Cutting added a comment -

        The call to Configuration#getLong() should be outside the loop over running tasks. Optimally, we'd make this once per job, but, at a minimum, it should be outside the loop.
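Doug's review point, reading the timeout from the configuration once rather than on every pass over the running tasks, is language-agnostic. A hypothetical Python analogue (the names `check_tasks`, `last_progress`, and the dict-based `conf` are illustrative, not the actual TaskTracker code):

```python
def check_tasks(tasks, conf, now_ms):
    """Return tasks whose last progress report is older than the timeout.

    The timeout is read from the configuration once, outside the loop
    over running tasks, rather than once per task."""
    timeout_ms = int(conf.get("mapred.task.timeout", 600000))
    stale = []
    for task in tasks:
        if now_ms - task["last_progress"] > timeout_ms:
            stale.append(task)
    return stale
```

Hoisting the lookup matters because the configuration read can be comparatively expensive and its value cannot change mid-scan anyway.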

        Owen O'Malley added a comment -

        This looks good +1

        Owen O'Malley added a comment -

        In Java jobs, a task is considered to be making progress if it either reads input, writes output, or explicitly calls the Reporter object. In streaming it is a little more complicated, because everything is asynchronous. In theory the same rules apply, but the details are more subtle.

        Raghu Angadi added a comment -

        If a streaming job reads from its input but does not write any output for a long time, is that considered progress?

        Owen O'Malley added a comment -

        That did happen and it helped. Because streaming can use binaries that the user doesn't control, they don't always have the option to add printing bytes to stderr. Furthermore, as we start sending more of the output back to the user's console, it is less clear that having the application send data to stderr is a good idea.

        Doug Cutting added a comment -

        Didn't we talk at some point about using stderr for progress from streaming? Did that ever happen? Would that address this?

        Owen O'Malley added a comment -

        This looks good so far, it just needs the default for streaming set. The logic here is that streaming applications don't have an easy way of reporting to the framework that they are making progress. So it is far too easy for them to be killed as being dead when they are just working away. Clearly, disabling the timeout makes it easy to write applications that just get stuck, but it seems better to allow applications to get stuck rather than killing productive tasks.

        Arun C Murthy added a comment -

        Another patch without the default for streaming...

        Doug Cutting added a comment -

        So streaming jobs should have no timeout by default? I can sort of see adding the feature of disabling task timeouts, and also of facilitating this from streaming, but do streaming applications really never hang? Should we change the default for all applications, not just streaming? I'm trying to understand the logic here.

        Also, as a new feature, shouldn't this be targeted for 0.12.0?

        Arun C Murthy added a comment -

        Thanks for the feedback Doug (actually the old patch was horribly whacked); here is another patch: it uses 'mapred.task.timeout', which is now per-job rather than per-tracker (hence I've removed TaskTracker.taskTimeout), and it has the necessary 'zero-means-never' semantics.
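The 'zero-means-never' semantics reduce to a small predicate. A hypothetical sketch, not the actual patch (the name `has_timed_out` is illustrative):

```python
def has_timed_out(last_progress_ms, now_ms, timeout_ms):
    """True if the task should be killed for inactivity.

    A configured timeout of 0 means the task never times out, which is
    what streaming jobs that cannot report progress would rely on."""
    if timeout_ms == 0:
        return False
    return now_ms - last_progress_ms > timeout_ms
```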

        Doug Cutting added a comment -

        I think instead of adding a new configuration option, this should simply use mapred.task.timeout, perhaps changing that to be per-job rather than per-tracker, and implementing zero-means-never.

        Hadoop QA added a comment -

        +1, because http://issues.apache.org/jira/secure/attachment/12350017/HADOOP-491_20070131_1.patch applied and successfully tested against trunk revision r502402.

        Arun C Murthy added a comment -

        Here is a straightforward patch which adds a per-job configuration knob for the task-launch timeout; it is set to '0' for streaming jobs...

        Owen O'Malley added a comment -

        I think the right way to address this is to support timeouts of "0" that mean there should be no task timeouts. The default in streaming can be set to 0, since it is impossible for the streaming process to call reporter.progress().
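In configuration terms, a timeout of "0" meaning "no timeout" would amount to a per-job setting along these lines (a sketch; the description text is illustrative, not quoted from mapred-default.xml):

```xml
<property>
  <name>mapred.task.timeout</name>
  <value>0</value>
  <description>Milliseconds before a task is killed for reporting no
  progress. A value of 0 disables the timeout entirely, which streaming
  jobs can use when the spawned program cannot report progress.</description>
</property>
```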

        Doug Cutting added a comment -

        There already is an API for tasks to say they're alive: Reporter.setStatus() and progress().

        This may not be well documented, so perhaps this is a documentation bug?


          People

          • Assignee:
            Arun C Murthy
          • Reporter:
            arkady borkovsky
