Issue Details (XML | Word | Printable)

Key: HADOOP-4445
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Sreekanth Ramakrishnan
Reporter: Karam Singh
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Wrong number of running map/reduce tasks are displayed in queue information.

Created: 17/Oct/08 01:06 PM   Updated: 08/Jul/09 04:40 PM
Return to search
Component/s: None
Affects Version/s: 0.19.0
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-4445-1.patch 2008-12-05 05:59 AM Sreekanth Ramakrishnan 11 kB
Text File Licensed for inclusion in ASF works HADOOP-4445-2.patch 2008-12-05 09:07 AM Sreekanth Ramakrishnan 12 kB
Text File Licensed for inclusion in ASF works HADOOP-4445-3.patch 2008-12-05 10:35 AM Sreekanth Ramakrishnan 13 kB
Text File Licensed for inclusion in ASF works HADOOP-4445-4.patch 2008-12-05 01:57 PM Sreekanth Ramakrishnan 15 kB
Image Attachments:

1. runningjobs.png
(55 kB)
Environment: Hadoop r705159, Queue=default, GC=100% MapCapacity=ReduceCapacity=212
Issue Links:
Blocker
 

Hadoop Flags: Reviewed, Incompatible change
Release Note: Changed JobTracker UI to better present the number of active tasks.
Resolution Date: 05/Dec/08 04:53 PM


 Description  « Hide
Wrong number of running map/reduce tasks are displayed in queue information.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Karam Singh added a comment - 17/Oct/08 01:07 PM
Submitted a job with map and reduce tasks = 212.
When job starts getting slots, JobTracker UI displayed following information about "Cluster Summary" and "Scheduling Information" -:

Maps=9, Reduces=0, Total Submissions=5, Nodes =106, Map Task Capacity=212 Reduce Task Capacity =212, Avg. Tasks/Node=4.00

Scheduling Information shows -:
Queue Name=default
Guaranteed Capacity Maps (%) :100.0
Guaranteed Capacity Reduces (%) :100.0
Current Capacity Maps : 212
Current Capacity Reduces : 212
User Limit : 25
Reclaim Time limit : 300
Number of Running Maps : 105
Number of Running Reduces : 8
Number of Waiting Maps : 107
Number of Waiting Reduces : 204
Priority Supported : YES

This type of information is displayed only when job is started getting slots.


Sreekanth Ramakrishnan added a comment - 22/Oct/08 06:39 AM
The information disparity comes in because, when a task is scheduled to be executed the Capacity Scheduler increases the number of the tasks running in a queue and then it passes the actual list of the tasks to the task tracker. So there is a disparity between what is actually running and what is scheduled to run on the task tracker.

So I think we should be changing the Number of Running <task> to represent Running <task> Scheduled and <Task> To be scheduled.

Would that make sense to the end user?


Hemanth Yamijala added a comment - 22/Oct/08 06:52 AM
Once the number of tasks is incremented in the scheduler, we have assigned it to the TaskTracker. So, when it comes back in the next heartbeat, this is going to be incremented on the JT. Is this understanding correct ?

If yes, when the TTs come in for the next heartbeat, the difference in maps should not be as high as shown in the bug comment. Is this happening ?


Sreekanth Ramakrishnan added a comment - 23/Oct/08 07:37 AM
I had an offline discussion with Karam about this issue, as mentioned in his comment, this indeed happens for a short duration between heartbeats when the task slots are actually getting assigned for a particular job. And the cluster information and scheduling information get synched up in the next heartbeat.

Hemanth Yamijala added a comment - 23/Oct/08 07:53 AM
In that case, I am thinking if it makes sense to not show this information at all in the scheduler's UI. Essentially it is a measure of how many tasks are scheduled, which mostly would become equal to number of tasks run in the next heartbeat. By removing this information, the user does not get to see how much is scheduled for a while. I think that is ok. So I would vote for removing this info from the scheduler UI. Others ?

Vinod K V added a comment - 23/Oct/08 08:03 AM
+1 for removing it from the scheduler information displayed, given that there is nothing else we can do. But I think this should be done only after fixing HADOOP-4444, i.e after we make sure that at steady state they are indeed equal.

Sreekanth Ramakrishnan added a comment - 23/Oct/08 08:05 AM
I think, we should be remove the redundant information which is being displayed by the task scheduler as, we are already displaying the number of maps and reduces running cluster wide and job wise split in the job details table.

Hemanth Yamijala added a comment - 29/Oct/08 09:17 AM
I am marking HADOOP-4444 as a blocker for this bug, as we would not be able to verify the fix for the patch, if this one gets fixed. I spoke to Vinod and found that's the same reason why he too wants HADOOP-4444 fixed first.

Sreekanth Ramakrishnan added a comment - 30/Oct/08 11:16 AM
I am marking HADOOP-4446 as blocker for this issue because, there is change in the scheduling information format and this issues also requires a change in the same area, so the patches would conflict, otherwise we should have a single patch for both issues put together.

Vinod K V added a comment - 30/Oct/08 12:28 PM
The number of Waiting {Maps|Reduces} reduces is derived from the running counts which are not synchronized with the exact counts all the time. So, they won't be correct all the time, even though at the point of their actual usage in the scheduler we make sure that they are correct. Hence, the waiting counts should be removed from the scheduler information.

The correct waiting counts can be maintained by JT(just like running counts) and stored in the cluster-summary; will file a JIRA to get that information there.


Vivek Ratan added a comment - 03/Nov/08 06:00 AM
Hemanth and I looked at what's going on here. Essentially, there are two sources of truth, regarding the number of running tasks in the system. Each JobInProgress object maintains counts of running map and reduce tasks. These counts are incremented when a task is assigned to a TT (in obtainNewMapTask() or obtainNewReduceTask()). These counts are used the by the CapacityScheduler . The cluster summary, represented by the ClusterStatus object, also contains counts of the total number of maps and reduce tasks. These are incremented by the JT using the TT status. The counts maintained by the JobInProgress objects and the ClusterStatus object, are off by a heartbeat. The former increments its counts when a task is assigned. Once the task runs on a TT, its running status is conveyed to the JT in the TT's next heartbeat. During startup, a lot of TTs approach the JT for tasks to run. As a result, the counts of running tasks across all JobInProgress objects are much higher than the cluster count, since the cluster count is updated only when the TTs report their status in their next hearbeat. That explains the discrepancy reported in this Jira. In steady state, these two counts are mostly identical, or off by a little bit, as TTs finish their tasks at different times.

This is not really a bug, as it's not clear which count is 'correct'. We're reporting from two different sources: the cluster summary and the Scheduler (which gets it info from the JobInProgress objects). But different numbers do get reflected in the UI. So the best fix is to probably indicate in the Scheduler part of the UI that its computation is off from the cluster summary by a heartbeat. Maybe a little explanation in the bottom that says something like: "This info varies from that of the cluster summary by a heartbeat".

I don't think we should change anything in the scheduler or the cluster summary. They're both doing the right thing their own way. An alternate solution is to have the cluster summary use the counts from the JobInProgress objects, but this is performance-intensive, and was presumably the reason why the cluster summary maintains its own count.

You do want to the leave the rest of the UI as is. The cluster summary is useful, as is the per-queue information of running tasks (reported by the Scheduler) as it lets users know whether the queue is running above/at/below its guaranteed capacity.

Hence, the waiting counts should be removed from the scheduler information.

The scheduler maintains a partial waiting count of map/reduce tasks. It doesn't need to know the total number of pending tasks if this total is larger than the cluster capacity. So, for performance reasons, it only counts up to the cluster capacity. HADOOP-4576 has been opened for this purpose and suggests that we display pending jobs instead of pending tasks, as the former seems more useful to users.


Amar Kamat added a comment - 03/Nov/08 11:41 AM
Vivek's comment makes sense. I guess the confusion is because of the lack of information provided along with the counts. If we fix on the naming and say
running task : The task that is currenly getting executed on some tracker
scheduled task : The task that is scheduled by the scheduler/jt for running and which might not be actually running

then we can show running-tasks info in cluster-summary and scheduled-tasks info in the scheduler's info and get away with the issue. Thoughts?


Vivek Ratan added a comment - 10/Nov/08 04:58 AM

then we can show running-tasks info in cluster-summary and scheduled-tasks info in the scheduler's info and get away with the issue.

I personally don't think you want to introduce such fine-grained differences. Having the user think of 'running' tasks and 'scheduled' tasks separately seems unnecessary as it doesn't buy them much. If the numbers are off by a bit, that doesn't seem so bad, as long as you can maybe add a little disclaimer, or provide an explanation in a FAQ. My guess is, users will really be concerned about the stats for their queue. As long as they can see whether a queue is fully occupied or not, and where they are in the waiting list, they should be fine.


Amar Kamat added a comment - 10/Nov/08 06:39 AM

If the numbers are off by a bit

At the max the numbers can be off by # nodes in the cluster.


Hemanth Yamijala added a comment - 10/Nov/08 03:50 PM
There are a few more issues with this information display. I had an offline discussion with Vivek, and we came up with a few observations and ideas.
  • The information is accessed by the UI update thread and the updateQSI method without proper synchronization.This should be addressed. Currently, the QueueSchedulingInfo object is a simple data object, and the information in a given instance of this should be accessed together. Currently, the access to this object is done synchronized via the TaskSchedulingMgr object. Maybe then, instead of accessing the QSI fields directly, it should access it via the TaskSchedulingMgr.
  • The capacity scheduler also updates only the reduce scheduler or the map scheduler in a given heartbeat. So in a scenario where reduce tasks are finishing along with map tasks, since we update the reduce scheduler in preference to the map scheduler, the information for the map tasks could be off by more than a heartbeat. However, in a steady state, this may not be that big an issue.

There are some options to address this:

  • We could make it explicit that the information is not synch'ed with the cluster summary (as mentioned by Vivek above, though the information should probably not be treated as off by only a heartbeat)
  • We could ensure that the information of either the map and reduce scheduler is updated at least once every so often. For e.g. we could update it once every 3 heartbeats or so.
  • We could also have an updater thread that runs periodically and updates the numbers every time it runs. We could use the same code for updates as the updateQSI method itself, thought it could maintain a separate copy of the data, so as to not introduce synchronization constraints on the scheduler.

The advantage with the last two approaches is that we could deterministically say how far off the scheduling info would be, as compared to the cluster status. For e.g. if the updater thread runs once every 30 seconds, we could say the information would be off by 30 seconds.

Since in any case it appears that the information cannot be completely in sync, maybe we should leave it simple for now, mark that the information is not synchronized with the cluster status, and see if in steady state the information is way off. If that happens, we could fix it using one of the methods I've stated above. Thoughts ?


Sreekanth Ramakrishnan added a comment - 01/Dec/08 05:42 AM
Instead of relying on freshness of the queue information objects which are used by scheduler to maintain the scheduling information, we can compute the number of running tasks when the information is requested for by the user.

Following can be done for the same:

  • Call JobQueueManager.getRunningJobs(queue).
  • Iterate thro' list of jobs and keep adding to global running tasks count.

We also should be fixing HADOOP-4576 along with this so that we don't iterate thro' all the waiting jobs in order to compute the pending total tasks in a queue.

Only downside of this approach would be that, whenever there is a UI request we would be iterating thro' running jobs in order to compute the number of running task, which might be a performance problem.

Any comments on the approach described above?


Hemanth Yamijala added a comment - 01/Dec/08 02:13 PM
We can probably check what happens with a reasonable size of running jobs per queue - say 50 jobs. If the performance isn't a concern, this may be a reasonable approach.

Hemanth Yamijala added a comment - 01/Dec/08 02:15 PM
We should also check that if this info is periodically checked, say once every 30 seconds or so (equal to a refresh interval on the web page), the performance is still fine.

In case this is turning out to be an issue, caching the values maybe an option. We can update it periodically like I suggested above.


Sreekanth Ramakrishnan added a comment - 02/Dec/08 04:20 AM
Checked with 80 jobs in a queue, the maximum time taken for getting running and pending tasks was 2 milliseconds(inclusive of scheduler.getJobs()) by refreshing the job tracker and queue details page every 15 seconds.

So if we have 10 queues, we would have a maximum delay of 20 milliseconds when user requests for information. This can be further reduced if we get in HADOOP-4576 as then we would count tasks in running job queue and then subtract the running job number from total job number to get waiting job count.


Sreekanth Ramakrishnan added a comment - 02/Dec/08 05:13 AM
Marking HADOOP-4576 as a blocker for this issue as, we are optimizing the calculation of number of waiting job count in HADOOP-4576 and then fixing running job count in this JIRA.

Sreekanth Ramakrishnan added a comment - 05/Dec/08 05:59 AM
Attaching a patch fixing this issue:

The patch does the following:

1. in SchedulingInfo.toString() it does JobQueueManager.getRunningJobs(queue) and iterates thro' running job list and keeps adding number of running tasks in the queue. We need not worry about the jobs which are initialized but not scheduled as those jobs would have running maps and reduces as zero.
2. Modified testSchedulingInfo() by assigning a map and reduce task to the tracker and check the number of running tasks. Then finished the job and checked the same. Then check for case where running job fails.

I have moved the comment from within testSchedulingInfo() to outside. I have removed the FakeJobInProgress.fail() and I am using taskTrackerManager.finalizeJob api instead as the latter raises events which is required for cleaning up the running job queues.


Amar Kamat added a comment - 05/Dec/08 08:18 AM
Few comments :
TestCapacityScheduler
1)
-    String[] infoStrings = schedulingInfo.split("\n");
-    
+    String[] infoStrings = schedulingInfo.split("\n");

Unnecessary diff.

2) I feel we should break testSchedulingInformation() into 3 sub testcases namely

  • testQueueSchedulingInformation()
  • testRunningSchedulingInformation()
  • testWaitingSchedulingInformation()
    etc .... or something like that. The reason being any small change might require understanding the complete test case and also making/adjusting changes to support the complete testcase as a whole.

Amar Kamat added a comment - 05/Dec/08 08:33 AM
Had an offline discussion with Sreekanth and it seems that we can hold the test case splitting for now. Also we should add a disclaimer that cluster-status and scheduler-info will be off. I personally dont think its worth mentioning the number by which it will be off.

Sreekanth Ramakrishnan added a comment - 05/Dec/08 09:03 AM
Attaching screen shot so that can show what would be displayed to the user.

Sreekanth Ramakrishnan added a comment - 05/Dec/08 09:07 AM
Incorporating Amar's comment in the patch.

I am not splitting test case in this patch as, I am testing default scheduling information plus running and waiting count side by side. Even if we are splitting the test case then logical division would be running,waiting and default scheduling information. Because running and waiting information can be tested side by side.

The disclaimer which I was testing has been removed as I have changed the disclaimer. Can you please check the same?


Sreekanth Ramakrishnan added a comment - 05/Dec/08 10:29 AM
After offline discussion with Hemanth and Amar, it was decided that we would show used maps and reduce in terms of percentage instead of running maps and reduce. Attaching screenshot with changes made to display percentages.

Sreekanth Ramakrishnan added a comment - 05/Dec/08 10:35 AM
Changes made to display maps and reduces as Used Maps (%) and Used Reduces (%), made appropriate changes in Test case

Sreekanth Ramakrishnan added a comment - 05/Dec/08 11:05 AM
Had an offline discussion with Amar, I have incorporated changes which were suggested by him. The screenshot attached displays the UI which would be presented to the user.

Amar Kamat added a comment - 05/Dec/08 11:25 AM
I feel Used Maps doesnt convey the sense that the maps are scheduled. Scheduled Maps or Running Maps might make more sense. Also there is no clear cut distinction as to what information is config and what is dynamic. I feel we should have 2 sections namely Config/Constants and Runtime Status. Disclaimers might actually confuse the users.

Sreekanth Ramakrishnan added a comment - 05/Dec/08 01:56 PM
Attaching latest screenshot incorporating all changes

Sreekanth Ramakrishnan added a comment - 05/Dec/08 01:57 PM
Attaching patch incorporating comments from above

Amar Kamat added a comment - 05/Dec/08 04:03 PM
+1. Looks good.

Hemanth Yamijala added a comment - 05/Dec/08 04:22 PM
Output of test-patch:

[exec] +1 overall.

[exec] +1 @author. The patch does not contain any @author tags.

[exec] +1 tests included. The patch appears to include 4 new or modified tests.

[exec] +1 javadoc. The javadoc tool did not generate any warning messages.

[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.


Hemanth Yamijala added a comment - 05/Dec/08 04:32 PM
Ant test-contrib passed without errors. The patch only touches capacity scheduler and nothing in core. So, I did not run core tests.

Hemanth Yamijala added a comment - 05/Dec/08 04:53 PM
I just committed this. Thanks, Sreekanth !

Hudson added a comment - 06/Dec/08 02:02 PM

Robert Chansler added a comment - 03/Mar/09 06:27 PM
Edit release note for publication.