Hadoop Common
  1. Hadoop Common
  2. HADOOP-263

task status should include timestamps for when a job transitions

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.1
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None

      Description

      It would help users to understand what happened if the task status included information about when the task transitioned:

      Map:
      started
      finished

      Reduce:
      started
      shuffle finished
      sort finished
      finished

      1. patch.txt
        12 kB
        Sanjay Dahiya
      2. patch.txt
        12 kB
        Sanjay Dahiya
      3. patch.txt
        12 kB
        Sanjay Dahiya
      4. patch.txt
        15 kB
        Sanjay Dahiya
      5. Hadoop-263.patch
        16 kB
        Sanjay Dahiya
      6. Hadoop-263_1.patch
        17 kB
        Sanjay Dahiya
      7. Hadoop-263_2.patch
        19 kB
        Sanjay Dahiya
      8. Hadoop-263_3.patch
        19 kB
        Sanjay Dahiya
      9. Hadoop-263_4.patch
        30 kB
        Sanjay Dahiya
      10. Hadoop-263_5.patch
        31 kB
        Sanjay Dahiya

        Issue Links

          Activity

          Hide
          Doug Cutting added a comment -

          I just committed this. Thanks, Sanjay!

          Show
          Doug Cutting added a comment - I just committed this. Thanks, Sanjay!
          Hide
          Sanjay Dahiya added a comment -

          Here is an updated patch.

          Show
          Sanjay Dahiya added a comment - Here is an updated patch.
          Hide
          Sanjay Dahiya added a comment -

          Ok, I will add to WritableUtils, instead of enum.ordinal() I will use string value of enum in DataOutput ( unless someone objects to it)

          Show
          Sanjay Dahiya added a comment - Ok, I will add to WritableUtils, instead of enum.ordinal() I will use string value of enum in DataOutput ( unless someone objects to it)
          Hide
          Owen O'Malley added a comment -

          As of HADOOP-306, we can put enum types into RPC and your change would be much cleaner if the TaskUmbilicalProtocol.progress' phase parameter was really passed as an enum.

          Furthermore, we should probably make methods in WritableUtils for reading and writing enums:

          public static <T extends Enum<T>> T readEnum(DataInput in, Class<T> enumType
          ) throws IOException;
          public static void writeEnum(DataOutput out, Enum enum) throws IOException;

          The Phase field of Task should be private.

          You can pull the replicated code for setting the phase out of the constructors and do:

          class ReduceTask ...

          { setPhase(Phase.SHUFFLE); }

          public ReduceTask()

          { ... }

          public ReduceTask(...) { ... }

          the same can be done for MapTask.

          Show
          Owen O'Malley added a comment - As of HADOOP-306 , we can put enum types into RPC and your change would be much cleaner if the TaskUmbilicalProtocol.progress' phase parameter was really passed as an enum. Furthermore, we should probably make methods in WritableUtils for reading and writing enums: public static <T extends Enum<T>> T readEnum(DataInput in, Class<T> enumType ) throws IOException; public static void writeEnum(DataOutput out, Enum enum) throws IOException; The Phase field of Task should be private. You can pull the replicated code for setting the phase out of the constructors and do: class ReduceTask ... { setPhase(Phase.SHUFFLE); } public ReduceTask() { ... } public ReduceTask(...) { ... } the same can be done for MapTask.
          Hide
          Sanjay Dahiya added a comment -

          Here is an updated patch, as doug suggested I have added an extra argument to TaskUmbilicalProtocol.progress(String taskid, float progress, String state, int phase). phase is Phase.ordinal().
          All classes deriving from TaskUmbilicalProtocol are updated with this argument, but its value is ignored in IsolationRunner and LocalJobRunner.
          In JobTracker.ExpireLaunchingTasks.run() the phase is not available if its a reduce task, in such case it uses Phase.UNKNOWN.
          Also both map and reduce tasks use TaskUmbilicalProtocol.progress, so Map tasks return Phase.MAP and reduce tasks return Phase.SHUFFLE, PHASE.SORT or Phase.REDUCE.

          Show
          Sanjay Dahiya added a comment - Here is an updated patch, as doug suggested I have added an extra argument to TaskUmbilicalProtocol.progress(String taskid, float progress, String state, int phase). phase is Phase.ordinal(). All classes deriving from TaskUmbilicalProtocol are updated with this argument, but its value is ignored in IsolationRunner and LocalJobRunner. In JobTracker.ExpireLaunchingTasks.run() the phase is not available if its a reduce task, in such case it uses Phase.UNKNOWN. Also both map and reduce tasks use TaskUmbilicalProtocol.progress, so Map tasks return Phase.MAP and reduce tasks return Phase.SHUFFLE, PHASE.SORT or Phase.REDUCE.
          Hide
          Doug Cutting added a comment -

          I am uncomfortable using progress status strings to determine which phase a task is in. Rather we should have a proper API for this. (This is similar to HADOOP-544, but different.)

          Perhaps we should have an enum in ReduceTask named Phase. TaskUmbilicalProtocol.progress() could get a new 'int phase' parameter that would include the phase.ordinal() value. Would that work?

          Show
          Doug Cutting added a comment - I am uncomfortable using progress status strings to determine which phase a task is in. Rather we should have a proper API for this. (This is similar to HADOOP-544 , but different.) Perhaps we should have an enum in ReduceTask named Phase. TaskUmbilicalProtocol.progress() could get a new 'int phase' parameter that would include the phase.ordinal() value. Would that work?
          Hide
          Sanjay Dahiya added a comment -

          The patch no longer works with current trunk. Here is an updated patch.

          Show
          Sanjay Dahiya added a comment - The patch no longer works with current trunk. Here is an updated patch.
          Hide
          Sanjay Dahiya added a comment -

          Here is an updated patch.
          Task finish time is now set to failed time in case task fails. If the task fails at task tracker then failed time is set there and sent over to jobtracker. if the task tracker itself goes down then job tracker sets the failed time to when it discovers task failed.
          A side effect is that in case a task starts and before it could send its first status, task tracker dies, in that case the failed time will be recorded but task start time will not be recorded as task tracker went down before it could report that. We rely on task tracker to report the correct task launch time. Start/finish times for TIPs will be available in any case.

          Show
          Sanjay Dahiya added a comment - Here is an updated patch. Task finish time is now set to failed time in case task fails. If the task fails at task tracker then failed time is set there and sent over to jobtracker. if the task tracker itself goes down then job tracker sets the failed time to when it discovers task failed. A side effect is that in case a task starts and before it could send its first status, task tracker dies, in that case the failed time will be recorded but task start time will not be recorded as task tracker went down before it could report that. We rely on task tracker to report the correct task launch time. Start/finish times for TIPs will be available in any case.
          Hide
          Sanjay Dahiya added a comment -

          I left plain empty get/set methods without comments. Since we are putting javadoc for all public methods for new code, i'll add these now.

          Also for failed task time, I will set finish time to the time failure was reported.

          Show
          Sanjay Dahiya added a comment - I left plain empty get/set methods without comments. Since we are putting javadoc for all public methods for new code, i'll add these now. Also for failed task time, I will set finish time to the time failure was reported.
          Hide
          Owen O'Malley added a comment -

          There are still a lot (8?) of public procedures without java doc in this patch.

          The set ? time methods could all be package local rather than public, couldn't they? Restricting their visibility is a good thing.

          Show
          Owen O'Malley added a comment - There are still a lot (8?) of public procedures without java doc in this patch. The set ? time methods could all be package local rather than public, couldn't they? Restricting their visibility is a good thing.
          Hide
          Mahadev konar added a comment -

          Could we include the time taken for task even if it fails? This would be quite handy in calcutating the total machine hours each job takes.

          Show
          Mahadev konar added a comment - Could we include the time taken for task even if it fails? This would be quite handy in calcutating the total machine hours each job takes.
          Hide
          Sanjay Dahiya added a comment -

          Thanks Owen for the comments.

          Here is an updated patch, I also added time difference for start-shuffle-sort phases along with finish time stamps on JSPs.

          Show
          Sanjay Dahiya added a comment - Thanks Owen for the comments. Here is an updated patch, I also added time difference for start-shuffle-sort phases along with finish time stamps on JSPs.
          Hide
          Owen O'Malley added a comment -

          A couple more things:
          1. You don't have javadoc on the public methods in TaskStatus.
          2. For the task details you should always have at least one result, so just test for that case explicitly and put a message about "no task attempts found". Please remove the new type parameter.
          3. Can we factor some more of the duplicated code in the jsp into a single instance?

          Show
          Owen O'Malley added a comment - A couple more things: 1. You don't have javadoc on the public methods in TaskStatus. 2. For the task details you should always have at least one result, so just test for that case explicitly and put a message about "no task attempts found". Please remove the new type parameter. 3. Can we factor some more of the duplicated code in the jsp into a single instance?
          Hide
          Sanjay Dahiya added a comment -

          Thanks for the comments, Owen

          Here is an updated patch.

          Show
          Sanjay Dahiya added a comment - Thanks for the comments, Owen Here is an updated patch.
          Hide
          Sanjay Dahiya added a comment -

          4. You don't need to add the type parameter to the taskdetails.jsp because it is available in TaskStatus.getIsMap()

          I added this to simplify the html generation, taskdetails.jsp is used for both Map and Reduce tasks, we get TaskStatuses from tracker.getTaskStatuses(). The header is generated even if there is no task status. using an extra parameter simplifies this.

          I agree with rest of the comments.

          Show
          Sanjay Dahiya added a comment - 4. You don't need to add the type parameter to the taskdetails.jsp because it is available in TaskStatus.getIsMap() I added this to simplify the html generation, taskdetails.jsp is used for both Map and Reduce tasks, we get TaskStatuses from tracker.getTaskStatuses(). The header is generated even if there is no task status. using an extra parameter simplifies this. I agree with rest of the comments.
          Hide
          Owen O'Malley added a comment -

          Ok, looking at the current patch, I have a couple of comments:
          1. The shuffle & sort times should be set to the current time when a later stage is finished. (This would happen when the shuffle or sort finishes in less than a single reporting cycle.) I'd probably put the relevant check/update in the setters for the times in TaskStatus.
          2. You need to pull the replicated code for formatting delta times into StringUtils. And if a field (and all of the larger fields) are 0, you shouldn't print the label except for the last one. (eg. 0 time => "0 secs")
          3. Reporting time to milliseconds when the real granularity is 10 seconds is overkill. Just report to seconds.
          4. You don't need to add the type parameter to the taskdetails.jsp because it is available in TaskStatus.getIsMap()
          5. You should probably add the times to the TaskReport constructor rather than creating the object and setting it to be consistent with the other fields.

          Show
          Owen O'Malley added a comment - Ok, looking at the current patch, I have a couple of comments: 1. The shuffle & sort times should be set to the current time when a later stage is finished. (This would happen when the shuffle or sort finishes in less than a single reporting cycle.) I'd probably put the relevant check/update in the setters for the times in TaskStatus. 2. You need to pull the replicated code for formatting delta times into StringUtils. And if a field (and all of the larger fields) are 0, you shouldn't print the label except for the last one. (eg. 0 time => "0 secs") 3. Reporting time to milliseconds when the real granularity is 10 seconds is overkill. Just report to seconds. 4. You don't need to add the type parameter to the taskdetails.jsp because it is available in TaskStatus.getIsMap() 5. You should probably add the times to the TaskReport constructor rather than creating the object and setting it to be consistent with the other fields.
          Hide
          Sanjay Dahiya added a comment -

          Thanks Owen for the review, here is an updated patch.

          The earlier patch failed with latest trunk due to some changes in TaskInProgress, so I redid some part of it. Attached patch works with the latest trunk.

          Show
          Sanjay Dahiya added a comment - Thanks Owen for the review, here is an updated patch. The earlier patch failed with latest trunk due to some changes in TaskInProgress, so I redid some part of it. Attached patch works with the latest trunk.
          Hide
          Owen O'Malley added a comment -

          A couple of points:
          1. I'd express the start times as absolute times and the others as relative times in hours, minutes and seconds.
          2. the web ui shouldn't include shuffle & sort time columns for map tasks.
          3. "blank" times should be written as " " in the html so that the browsers make the box for the cell.
          4. usually, when you are creating format objects, you keep them in static fields rather than creating them on each call.
          5. the start times should be when the task starts running not when the job is submitted
          6. the shuffle time is never being set

          Show
          Owen O'Malley added a comment - A couple of points: 1. I'd express the start times as absolute times and the others as relative times in hours, minutes and seconds. 2. the web ui shouldn't include shuffle & sort time columns for map tasks. 3. "blank" times should be written as " " in the html so that the browsers make the box for the cell. 4. usually, when you are creating format objects, you keep them in static fields rather than creating them on each call. 5. the start times should be when the task starts running not when the job is submitted 6. the shuffle time is never being set
          Hide
          Sanjay Dahiya added a comment -

          Updated the display. now showing start/finish times in separate columns.

          Show
          Sanjay Dahiya added a comment - Updated the display. now showing start/finish times in separate columns.
          Hide
          Sanjay Dahiya added a comment -

          A new patch, now shows start, finish times for tasks and shuffle/sort finished time for reduce tasks.
          few points, which should probably be separate issues.

          • TaskStatus, TeskReport replicate some status data which is present as member variables in TaskTracker.TaskInProgress, TaskInProgress. We can reduce the duplication of data by maintaining objects of TaskStatus and get/set from that directly, using simple composition.
          • The status string passed around needs to be a public static final String defined in TaskStatus and used everywhere else, so it can be compared against reliably wrt future changes. Currently its reduce > copy (...), reduce > sort, reduce > reduce.
          Show
          Sanjay Dahiya added a comment - A new patch, now shows start, finish times for tasks and shuffle/sort finished time for reduce tasks. few points, which should probably be separate issues. TaskStatus, TeskReport replicate some status data which is present as member variables in TaskTracker.TaskInProgress, TaskInProgress. We can reduce the duplication of data by maintaining objects of TaskStatus and get/set from that directly, using simple composition. The status string passed around needs to be a public static final String defined in TaskStatus and used everywhere else, so it can be compared against reliably wrt future changes. Currently its reduce > copy (...), reduce > sort, reduce > reduce.
          Hide
          Owen O'Malley added a comment -

          A couple of points need work:
          1. It only catches the finish time of the task.
          2. It includes a couple of comment/whitespace changes that are accidental.
          3. It calls the added field "timestamp", which doesn't say what the time represents.

          I was expecting to see:
          2 new fields in TaskStatus:
          private long timeStarted;
          private long timeStopped;
          a new class derived from TaskStatus named ReduceTaskStatus that includes:
          private long timeShuffleFinished;
          private long timeSortFinished;

          This patch does point out the high simularity of the TaskStatus and TaskReport. They are basically the same type just slightly filtered by which protocol they are being used on. Why don't you add timeStarted and timeStopped to TaskReport and we can drop TaskReport in a later patch.

          Show
          Owen O'Malley added a comment - A couple of points need work: 1. It only catches the finish time of the task. 2. It includes a couple of comment/whitespace changes that are accidental. 3. It calls the added field "timestamp", which doesn't say what the time represents. I was expecting to see: 2 new fields in TaskStatus: private long timeStarted; private long timeStopped; a new class derived from TaskStatus named ReduceTaskStatus that includes: private long timeShuffleFinished; private long timeSortFinished; This patch does point out the high simularity of the TaskStatus and TaskReport. They are basically the same type just slightly filtered by which protocol they are being used on. Why don't you add timeStarted and timeStopped to TaskReport and we can drop TaskReport in a later patch.
          Hide
          Sanjay Dahiya added a comment -

          This patch displays a timestamp with task status on the webpage. The timestamp is of time when task tracker receives a state change update. Time stamps are also displayed for individual task attempts in addition to state changes.

          Show
          Sanjay Dahiya added a comment - This patch displays a timestamp with task status on the webpage. The timestamp is of time when task tracker receives a state change update. Time stamps are also displayed for individual task attempts in addition to state changes.
          Hide
          Sanjay Dahiya added a comment -

          Here is what I am thinking for this.

          We add an extra field (timestamp) in org.apache.hadoop.mapred.TaskStatus. The timestamp gets TaskTracker.TaskInprogress.reportProgress(), when a task reports progress. As part of Task the timestamp is updated in Progress.set(), Progress.setStatus() and complete(). The timestamp is sent over to TaskTracker over RPC along with status.

          Another option is to update the timestamp when TaskTracker gets the progress report, without making any change to the Task. But sending timestamp over RPC gives a more accurate timestamp of the last progress update.

          Show
          Sanjay Dahiya added a comment - Here is what I am thinking for this. We add an extra field (timestamp) in org.apache.hadoop.mapred.TaskStatus. The timestamp gets TaskTracker.TaskInprogress.reportProgress(), when a task reports progress. As part of Task the timestamp is updated in Progress.set(), Progress.setStatus() and complete(). The timestamp is sent over to TaskTracker over RPC along with status. Another option is to update the timestamp when TaskTracker gets the progress report, without making any change to the Task. But sending timestamp over RPC gives a more accurate timestamp of the last progress update.

            People

            • Assignee:
              Sanjay Dahiya
              Reporter:
              Owen O'Malley
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development