Hadoop Common
HADOOP-4444

Scheduling Information for queues shows one map/reduce task fewer than the actual number of running map/reduce tasks.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.19.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:

      Hadoop r705159, Queue=default, GC=100% MapCapacity=ReduceCapacity=212

      Description

      Scheduling Information for queues shows one map/reduce task fewer than the actual number of running map/reduce tasks.

      Attachments

      1. HADOOP-4444-1.patch (0.8 kB) by Sreekanth Ramakrishnan
      2. HADOOP-4444.patch (1 kB) by Sreekanth Ramakrishnan

        Issue Links

        Duplicates HADOOP-4445

          Activity

          Vivek Ratan added a comment -

          Please see HADOOP-4445 for the explanation of why all this is really happening. This Jira still remains a clone of HADOOP-4445.

          Vinod Kumar Vavilapalli added a comment -

          Closing it as duplicate of HADOOP-4445.

          Vinod Kumar Vavilapalli added a comment -

          All these issues seem to be bugs (but are not, as established by now) because the counts come from two different sources: the scheduler's own structures and the cluster summary. Even after doing what is done in the latest patch, we will still find discrepancies, because JIP.runningTasks is updated immediately after giving away new tasks, not when a task is actually running (i.e., when it is removed from the expireLaunchingTasks list).

          After discussing this with Hemanth, I realized that at the place where we actually use these counts (in the scheduler), we do have the correct values. We are anyway going to remove these counts from the scheduler UI (HADOOP-4445), as they are already displayed in the cluster summary. Once we fix HADOOP-4445, these issues will no longer be visible.
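          To make the timing gap concrete, here is a minimal, self-contained Java sketch (not actual JobTracker code; the class and member names are hypothetical stand-ins for JIP.runningTasks and the expireLaunchingTasks list) of how the running count gets ahead of the tasks that have really started:

              class RunningCountSketch {
                  // Stand-in for JobInProgress.runningTasks.
                  int runningTasks = 0;
                  // Stand-in for the launched-but-not-yet-running attempts.
                  java.util.Set<String> expireLaunchingTasks = new java.util.HashSet<>();

                  // Called when the scheduler hands a new task to a TaskTracker.
                  void taskGivenOut(String attemptId) {
                      runningTasks++;                      // counted as running immediately
                      expireLaunchingTasks.add(attemptId); // although it has only been launched
                  }

                  // Called when the TaskTracker reports that the task has actually started.
                  void taskActuallyStarted(String attemptId) {
                      expireLaunchingTasks.remove(attemptId);
                      // runningTasks was already incremented at give-out time, so any count
                      // read between these two events is ahead of what is really running.
                  }
              }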

          Sreekanth Ramakrishnan added a comment -

          Attaching a patch fixing the issue. It updates the scheduling information after the scheduler has obtained tasks to run on the TaskTrackers.

          Vinod Kumar Vavilapalli added a comment -

          [...] not every queue's scheduling information was getting updated in each heartbeat; either the MAP or the REDUCE scheduling information was updated per heartbeat.

          As for this, I also think we need to fix it; it may or may not be in this patch.

          Vinod Kumar Vavilapalli added a comment -

          Found out the actual reason for this behavior. The scheduler obtains the running and pending counts from JobInProgress. It does the following when giving out a new task:
          1) Update queue information. This includes updating the running count from all jobs.
          2) Give out a new task. This includes incrementing the running count for this job.

          The running count incremented for a job in the last step is not reflected in the scheduler information until the TT reaches the scheduler again, i.e., when the scheduler goes to give out another new task to this TT. And when a TT reaches the point where it no longer needs new tasks, the running count incremented in the last assignment never gets reflected in the scheduler information, because the TT never reaches the scheduler again.

          A simple fix would be to update the queue information after task assignment as well.
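          For illustration, a minimal, self-contained Java sketch of the reordering described above. It is not the actual CapacityScheduler code or the attached patch; the names (CapacitySchedulerSketch, QueueInfo, Job, refreshQueueInfo, assignMapTask) are hypothetical stand-ins:

              import java.util.List;

              class CapacitySchedulerSketch {
                  // Stand-in for the per-queue counts shown on the Scheduling Information page.
                  static class QueueInfo { int runningMaps; int waitingMaps; }
                  // Stand-in for the JobInProgress counters.
                  static class Job { int runningMaps; int pendingMaps; }

                  private final QueueInfo queueInfo = new QueueInfo();
                  private final List<Job> jobs;

                  CapacitySchedulerSketch(List<Job> jobs) { this.jobs = jobs; }

                  // Recompute the queue-level counts from every job's own counters.
                  private void refreshQueueInfo() {
                      int running = 0, waiting = 0;
                      for (Job j : jobs) { running += j.runningMaps; waiting += j.pendingMaps; }
                      queueInfo.runningMaps = running;
                      queueInfo.waitingMaps = waiting;
                  }

                  // Called on a TaskTracker heartbeat that asks for a new map task.
                  void assignMapTask(Job job) {
                      refreshQueueInfo();   // step 1: counts refreshed BEFORE the assignment
                      job.runningMaps++;    // step 2: the assignment bumps this job's count
                      job.pendingMaps--;
                      refreshQueueInfo();   // proposed fix: refresh again AFTER the assignment
                  }
              }

          Without the second refreshQueueInfo() call, a TaskTracker that stops asking for new tasks leaves the queue counts one behind, which matches the 211-vs-212 discrepancy reported below.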

          Sreekanth Ramakrishnan added a comment -

          Attaching the ant test-patch output.

               [exec] 
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
               [exec]                         Please justify why no tests are needed for this patch.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
               [exec] 
          

          The patch does not require a test case because it is just a change that updates the scheduling information so it is displayed correctly on the UI.

          Sreekanth Ramakrishnan added a comment -

          Attaching a patch file which should fix this issue. The issue was happening because not every queue's scheduling information was getting updated in each heartbeat; either the MAP or the REDUCE scheduling information was updated per heartbeat.
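          For illustration only, a hedged sketch of the behavior this patch targets, assuming each heartbeat refreshed only the scheduling information for the task type it was handling; the class and method names are hypothetical and not taken from the patch:

              class HeartbeatRefreshSketch {
                  enum TaskType { MAP, REDUCE }

                  // Behavior described above: only the scheduling information for the task
                  // type handled in this heartbeat is refreshed, so the other can go stale.
                  void onHeartbeatBefore(TaskType typeHandled) {
                      updateSchedulingInfo(typeHandled);
                  }

                  // Intent of the patch: refresh both MAP and REDUCE scheduling information
                  // on every heartbeat.
                  void onHeartbeatAfter() {
                      updateSchedulingInfo(TaskType.MAP);
                      updateSchedulingInfo(TaskType.REDUCE);
                  }

                  private void updateSchedulingInfo(TaskType type) {
                      // Recompute running/waiting counts for the given task type (omitted).
                  }
              }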

          Karam Singh added a comment (edited) -

          Submitted a job with map and reduce tasks = 212.
          When all map and reduce tasks of the job start running, the JobTracker UI displays the following "Cluster Summary" and "Scheduling Information":

          Maps=212, Reduces=212, Total Submissions=2, Nodes=106, Map Task Capacity=212, Reduce Task Capacity=212, Avg. Tasks/Node=4.00

          Scheduling Information shows:
          Queue Name: default
          Guaranteed Capacity Maps (%): 100.0
          Guaranteed Capacity Reduces (%): 100.0
          Current Capacity Maps: 212
          Current Capacity Reduces: 212
          User Limit: 25
          Reclaim Time Limit: 300
          Number of Running Maps: 211
          Number of Running Reduces: 211
          Number of Waiting Maps: 1
          Number of Waiting Reduces: 1
          Priority Supported: YES

          The Job Details page displays:
          JobID: job_200810171104_0003 on <jobtracker_hostname>
          Kind=map, Num Tasks=212, Pending=0, Running=0, Complete=0, Killed=0, Failed/Killed Task Attempts=0/0
          Kind=reduce, Num Tasks=212, Pending=0, Running=0, Complete=0, Killed=0, Failed/Killed Task Attempts=0/0

          This shows that 212 map/reduce tasks are actually running, while the queue information displays 211 running and 1 waiting.
          Note: if there is more than one job running, the queue information displays one task fewer than the actual number of running tasks.


            People

            • Assignee: Sreekanth Ramakrishnan
            • Reporter: Karam Singh
            • Votes: 0
            • Watchers: 2
