Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.1
    • Fix Version/s: 0.17.2
    • Component/s: None
    • Labels:
      None

      Description

      On a cluster of about 1700 nodes, when a job with about 100,000 maps and 10,000 reduces completed, the JobTracker, even with 80 handlers, could not keep up with the RPC call load during promotion of the job; because of the discarded heartbeats, it lost nearly all of its TaskTrackers (about 10 were left). Promotion took more than 40 minutes.
      They reconnected and everything recovered, but this might have been just luck.
      Shouldn't there be an adaptive throttling of the rate in heartbeats and TaskCompletionEvents?

      Sample messages:
      2008-07-22 18:21:55,831 WARN org.apache.hadoop.ipc.Server: Call queue overflow discarding oldest call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@115f6b6, false, true, 18137) from xxx
      2008-07-22 18:21:55,834 WARN org.apache.hadoop.ipc.Server: Call queue overflow discarding oldest call getTaskCompletionEvents(job_200807190635_0012, 119567, 50) from yyy
      ...
      2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9020, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@19d32fa, false, true, 18199) from zzz: discarded for being too old (40936)
      2008-07-22 19:02:28,821 WARN org.apache.hadoop.ipc.Server: IPC Server handler 34 on 9020, call getTaskCompletionEvents(job_200807190635_0012, 119567, 50) from uuu: discarded for being too old (40978)

      1. patch-3813-1.txt
        0.9 kB
        Arun C Murthy
      2. patch-3813-0.17.txt
        0.6 kB
        Amareshwari Sriramadasu
      3. patch-3813.txt
        0.6 kB
        Amareshwari Sriramadasu

        Activity

        Arun C Murthy added a comment -

        Christian, is this the first time you noticed this? Is it reproducible? Thanks.
        Christian Kunz added a comment -

        Arun, this was the first time with a large job.

        We just had another job complete with a similar number of maps but only a single reducer, and the JobTracker exhibited this problem for just a couple of seconds (57 discarded heartbeats vs. 130,000 discarded RPC calls in the previous job). So from this point of view, yes, it is reproducible.

        In my first comment I probably should have said 'Removal of completed tasks' rather than 'Promotion', because the first job had many failed and speculatively executed tasks with a lot of temporary output in DFS, making the cleanup operation more intensive.
        Devaraj Das added a comment -

        Assigning to Amareshwari for investigation.
        Amareshwari Sriramadasu added a comment -

        The code in JobInProgress.garbageCollect uses FileUtil.fullyDelete to delete the temporary directory, which takes a long time, during which the JobTracker is locked.

                FileSystem fileSys = tmpDir.getFileSystem(conf);
                if (fileSys.exists(tmpDir)) {
                  FileUtil.fullyDelete(fileSys, tmpDir);
                }

        I performed a delete on a directory containing 1000 directories. It took a few milliseconds with fs.delete(dir, true) and 2 minutes with FileUtil.fullyDelete(FileSystem, dir).

        The Jira HADOOP-3202 to deprecate FileUtil.fullyDelete is still open.
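        To illustrate why the two deletes differ so much, here is a small self-contained Java sketch (hypothetical; it does not use the Hadoop API). It models FileUtil.fullyDelete's client-side recursion, which issues roughly one RPC per directory entry, against fs.delete(dir, true), which is a single RPC resolved server-side. The Node class and the call-counting are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteCost {
    // A toy tree node standing in for a DFS directory entry.
    static class Node {
        final List<Node> children = new ArrayList<>();
        Node(Node... kids) { for (Node k : kids) children.add(k); }
    }

    // Client-side recursion, roughly how FileUtil.fullyDelete behaved:
    // recurse into every child, counting one call per entry deleted.
    static int fullyDelete(Node dir) {
        int calls = 0;
        for (Node child : dir.children) {
            calls += fullyDelete(child);
        }
        return calls + 1; // one call for this entry itself
    }

    // Server-side recursive delete, as with fs.delete(dir, true):
    // a single call regardless of tree size.
    static int recursiveDelete(Node dir) {
        return 1;
    }

    public static void main(String[] args) {
        // 1000 sibling directories, like the benchmark in the comment above.
        Node[] dirs = new Node[1000];
        for (int i = 0; i < 1000; i++) dirs[i] = new Node();
        Node tmp = new Node(dirs);
        System.out.println("fullyDelete calls:     " + fullyDelete(tmp));     // 1001
        System.out.println("recursiveDelete calls: " + recursiveDelete(tmp)); // 1
    }
}
```

        With the JobTracker lock held for the whole operation, the per-entry variant scales its lock-hold time with the number of temporary files, which matches the observed minutes-long stall.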
        Amareshwari Sriramadasu added a comment -

        Here is a patch changing the code in JobInProgress to call FileSystem.delete instead of FileUtil.fullyDelete.
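        For context, the patched block in JobInProgress.garbageCollect presumably ends up looking roughly like the following sketch (this is an assumption based on the description above, not the committed diff):

```
        FileSystem fileSys = tmpDir.getFileSystem(conf);
        if (fileSys.exists(tmpDir)) {
          // One server-side recursive delete instead of client-side recursion
          fileSys.delete(tmpDir, true);
        }
```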
        dhruba borthakur added a comment -

        +1. FileUtil.fullyDelete has a lot of overhead compared to fs.delete(dir, true). Good catch!
        dhruba borthakur added a comment -

        It would be great if we can get this patch into 0.17.
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12386711/patch-3813.txt
        against trunk revision 679202.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2929/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2929/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2929/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2929/console

        This message is automatically generated.
        Amareshwari Sriramadasu added a comment -

        Here is a patch for branch-0.17. The earlier patch applies to both trunk and branch-0.18.
        Amareshwari Sriramadasu added a comment -

        The TestCLI failure is not related to the patch.
        Arun C Murthy added a comment - edited

        +1. This patch looks fine; the question is whether we need to do more to ease Christian's pain.

        Christian - do you think you can use this patch/build and re-run this? If you cannot do it right away, I propose we move it to hadoop-0.19. I'm OK committing this as-is too. Thoughts?

        The TestCLI failure is unrelated to this patch - HADOOP-3809.
        dhruba borthakur added a comment -

        It would be nice if this patch gets into 0.17.2. The patch looks very simple, but its impact could be large on long-lived JobTrackers that serve plenty of jobs.
        Christian Kunz added a comment -

        Unfortunately, I cannot apply the patch right away.
        Arun C Murthy added a comment -

        Same patch, after removing the now-unused import of org.apache.hadoop.fs.FileUtil.
        Arun C Murthy added a comment -

        I just committed this. Thanks, Amareshwari!
        Hudson added a comment -

        Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

          People

          • Assignee:
            Amareshwari Sriramadasu
          • Reporter:
            Christian Kunz
          • Votes:
            0
          • Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:
