Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1238

mapred metrics shows negative count of waiting maps and reduces

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.205.0
    • Fix Version/s: 1.0.3
    • Component/s: jobtracker
    • Labels:
      None
    • Target Version/s:

      Description

      Negative waiting_maps and waiting_reduces count is observed in the mapred metrics

      1. MAPREDUCE-1238-branch-1.patch
        2 kB
        Thomas Graves
      2. MAPREDUCE-1238-v0.20-1.patch
        1 kB
        Koji Noguchi

        Activity

        Hide
        Matt Foley added a comment -

        Closed upon release of Hadoop-1.0.3.

        Show
        Matt Foley added a comment - Closed upon release of Hadoop-1.0.3.
        Hide
        Matt Foley added a comment -

        accepted into 1.0.3 at TGraves's request.

        Show
        Matt Foley added a comment - accepted into 1.0.3 at TGraves's request.
        Hide
        Robert Joseph Evans added a comment -

        I just put this into branch-1. I did not put it into trunk as this is a 1.0 specific bug.

        Show
        Robert Joseph Evans added a comment - I just put this into branch-1. I did not put it into trunk as this is a 1.0 specific bug.
        Hide
        Robert Joseph Evans added a comment -

        +1 the patch looks good to me.

        Show
        Robert Joseph Evans added a comment - +1 the patch looks good to me.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12522103/MAPREDUCE-1238-branch-1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2181//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12522103/MAPREDUCE-1238-branch-1.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2181//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        test-patch output. The 8 findbugs warnings are there in branch-1 without my change.

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
        [exec] Please justify why no tests are needed for this patch.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.
        [exec]
        [exec]

        Show
        Thomas Graves added a comment - test-patch output. The 8 findbugs warnings are there in branch-1 without my change. [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings. [exec] [exec]
        Hide
        Thomas Graves added a comment -

        There is no unit test along with this, I don't see any easy way to add one since its a race condition as to when the kill happens. I've manually tested it and made sure the kill comes in before init, during init, and after init and verified the waiting_* metrics correct after each.

        Show
        Thomas Graves added a comment - There is no unit test along with this, I don't see any easy way to add one since its a race condition as to when the kill happens. I've manually tested it and made sure the kill comes in before init, during init, and after init and verified the waiting_* metrics correct after each.
        Hide
        Thomas Graves added a comment -

        I believe this will still miss a case when it is killed while in progress of initializing the job. If it receives a kill during that initialization it waits until init is done and then throws to do the kill. The tasksInited will not have been set to true because the throw happens right before that is set and thus the metrics won't be properly decremented. Working on a fix for that.

        Show
        Thomas Graves added a comment - I believe this will still miss a case when it is killed while in progress of initializing the job. If it receives a kill during that initialization it waits until init is done and then throws to do the kill. The tasksInited will not have been set to true because the throw happens right before that is set and thus the metrics won't be properly decremented. Working on a fix for that.
        Hide
        Koji Noguchi added a comment - - edited

        This is not my patch but was pointed out internally by a dev but nobody followed up. Uploading here to see if this makes sense.

        Copy&Pasting his comment.

        
        

        I tried to reproduce this on a small cluster (60 nodes) with hadoop 0.20.202.

        Steps to reproduce the issue:
        ===============================
        1. Setup file sink for jobtracker, so that we can get waiting_maps counter in a
        separate file
        2. Have queue configs similar to that of production (i took **** config)
        3. Submit a simple sort job with -Dmapred.job.queue.name="search_general". This
        queue should not be present in the cluster. Now, the waiting_maps would get
        into -ve value. Example is given below.

        1298346748441 mapred.jobtracker: context=mapred, sessionId=,
        hostName=, waiting_maps=-120, waiting_reduces=-16,
        jobs_failed=1, jobs_preparing=0

        Problem:
        ========
        1. WaitingMaps are incremented in JobInProgress.initTasks(). If a user gets an
        exception even before tasks are initialized, JobInProgress decrements the
        waiting_maps wrongly in garbageCollect(). This causes -ve values in
        waiting_maps and waiting_reduces.

        Tried the following code change in JobInProgress for fixing:
        ============================================================
        //check if tasks are initialized, and decrement waiting_maps accordingly.
        if (tasksInited)

        { // Let the JobTracker know that a job is complete jobtracker.getInstrumentation().decWaitingMaps(getJobID(), pendingMaps()); jobtracker.getInstrumentation().decWaitingReduces(getJobID(), pendingReduces()); }

        Need to check with dev for reviewing the above logic.

        Show
        Koji Noguchi added a comment - - edited This is not my patch but was pointed out internally by a dev but nobody followed up. Uploading here to see if this makes sense. Copy&Pasting his comment. I tried to reproduce this on a small cluster (60 nodes) with hadoop 0.20.202. Steps to reproduce the issue: =============================== 1. Setup file sink for jobtracker, so that we can get waiting_maps counter in a separate file 2. Have queue configs similar to that of production (i took **** config) 3. Submit a simple sort job with -Dmapred.job.queue.name="search_general". This queue should not be present in the cluster. Now, the waiting_maps would get into -ve value. Example is given below. 1298346748441 mapred.jobtracker: context=mapred, sessionId=, hostName=, waiting_maps=-120, waiting_reduces=-16, jobs_failed=1, jobs_preparing=0 Problem: ======== 1. WaitingMaps are incremented in JobInProgress.initTasks(). If a user gets an exception even before tasks are initialized, JobInProgress decrements the waiting_maps wrongly in garbageCollect(). This causes -ve values in waiting_maps and waiting_reduces. Tried the following code change in JobInProgress for fixing: ============================================================ //check if tasks are initialized, and decrement waiting_maps accordingly. if (tasksInited) { // Let the JobTracker know that a job is complete jobtracker.getInstrumentation().decWaitingMaps(getJobID(), pendingMaps()); jobtracker.getInstrumentation().decWaitingReduces(getJobID(), pendingReduces()); } Need to check with dev for reviewing the above logic.
        Hide
        Koji Noguchi added a comment -

        Is this a duplicate of MAPREDUCE-1233?

        Arun, not sure. I think MAPREDUCE-1233 is already in the version we run and we are still seeing negative metrics for waiting maps/reduces.

        Show
        Koji Noguchi added a comment - Is this a duplicate of MAPREDUCE-1233 ? Arun, not sure. I think MAPREDUCE-1233 is already in the version we run and we are still seeing negative metrics for waiting maps/reduces.
        Hide
        Arun C Murthy added a comment -

        Is this a duplicate of MAPREDUCE-1233?

        Show
        Arun C Murthy added a comment - Is this a duplicate of MAPREDUCE-1233 ?
        Hide
        Ramya Sunil added a comment -

        The steps to replicate this bug are:

        • JT is brought up through default configuration of CapacityTaskScheduler
        • Submit a job and kill it.
        • Negative waiting_maps and waiting_reduces is seen in the metrics logs
        Show
        Ramya Sunil added a comment - The steps to replicate this bug are: JT is brought up through default configuration of CapacityTaskScheduler Submit a job and kill it. Negative waiting_maps and waiting_reduces is seen in the metrics logs

          People

          • Assignee:
            Thomas Graves
            Reporter:
            Ramya Sunil
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development