Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3343

TaskTracker Out of Memory because of distributed cache

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.205.0
    • Fix Version/s: 1.0.1
    • Component/s: mrv1
    • Labels:
    • Target Version/s:

      Description

      This Out of Memory happens when you run large number of jobs (using the distributed cache) on a TaskTracker.

      Seems the basic issue is with the distributedCacheManager (instance of TrackerDistributedCacheManager in TaskTracker.java), this gets created during TaskTracker.initialize(), and it keeps references to TaskDistributedCacheManager for every submitted job via the jobArchives Map, also references to CacheStatus via cachedArchives map. I am not seeing these cleaned up between jobs, so this can out of memory problems after really large number of jobs are submitted. We have seen this issue in a number of cases.

        Activity

        Hide
        Matt Foley added a comment -

        Closed upon release of 1.0.1.

        Show
        Matt Foley added a comment - Closed upon release of 1.0.1.
        Hide
        Matt Foley added a comment -

        This patch has been tested at user sites and is believed stable. Nathan Roberts requested that I include it in 1.0.1, as its absence is causing ops problems with 1.0.0.

        Show
        Matt Foley added a comment - This patch has been tested at user sites and is believed stable. Nathan Roberts requested that I include it in 1.0.1, as its absence is causing ops problems with 1.0.0.
        Hide
        Matt Foley added a comment -

        Resolving since patch committed to stated version.

        Show
        Matt Foley added a comment - Resolving since patch committed to stated version.
        Hide
        zhaoyunjiong added a comment -

        It's my pleasure. Thanks to all of you.

        Show
        zhaoyunjiong added a comment - It's my pleasure. Thanks to all of you.
        Hide
        Eli Collins added a comment -

        Ah, I missed that. +1 to the latest patch. I'll commit this to branch-20-security for 206. Thanks Zhao!

        Show
        Eli Collins added a comment - Ah, I missed that. +1 to the latest patch. I'll commit this to branch-20-security for 206. Thanks Zhao!
        Hide
        zhaoyunjiong added a comment -

        Unit test already cover the job is removed from the archives.
        If we remove the call to removeTaskDistributedCacheManager in the test, the last line of testRemoveTaskDistributedCacheManager will fail.

        Show
        zhaoyunjiong added a comment - Unit test already cover the job is removed from the archives. If we remove the call to removeTaskDistributedCacheManager in the test, the last line of testRemoveTaskDistributedCacheManager will fail.
        Hide
        Eli Collins added a comment -

        Ping?

        Show
        Eli Collins added a comment - Ping?
        Hide
        Eli Collins added a comment -

        How does the test cover that the job is removed from the archives? Looks like it should pass even if we remove the call to removeTaskDistributedCacheManager in TT and the test.

        Show
        Eli Collins added a comment - How does the test cover that the job is removed from the archives? Looks like it should pass even if we remove the call to removeTaskDistributedCacheManager in TT and the test.
        Hide
        zhaoyunjiong added a comment -

        Eli Collins are right, no need for catch exception in removeTaskDistributedCacheManager.
        Thanks for your comments.
        Also thanks for Ahmed Radwan kindly updated my patch.

        I notice the assignee is me now. What else should I do to commit this patch?

        Show
        zhaoyunjiong added a comment - Eli Collins are right, no need for catch exception in removeTaskDistributedCacheManager. Thanks for your comments. Also thanks for Ahmed Radwan kindly updated my patch. I notice the assignee is me now. What else should I do to commit this patch?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12503477/MAPREDUCE-3343_rev2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1297//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12503477/MAPREDUCE-3343_rev2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1297//console This message is automatically generated.
        Hide
        Ahmed Radwan added a comment -

        Here is zhaoyunjiong's patch incorporating Eli's additional comments.

        Show
        Ahmed Radwan added a comment - Here is zhaoyunjiong's patch incorporating Eli's additional comments.
        Hide
        Eli Collins added a comment -

        Thanks for submitting a patch!

        • Why catch Exception in removeTaskDistributedCacheManager? The key should never be null right?
        • getTaskDistributedCacheManager can be package protection instead of public right?
        • Nit: please use two spaces instead of tabs per http://wiki.apache.org/hadoop/HowToContribute

        Otherwise looks great.

        Show
        Eli Collins added a comment - Thanks for submitting a patch! Why catch Exception in removeTaskDistributedCacheManager? The key should never be null right? getTaskDistributedCacheManager can be package protection instead of public right? Nit: please use two spaces instead of tabs per http://wiki.apache.org/hadoop/HowToContribute Otherwise looks great.
        Hide
        Ahmed Radwan added a comment -

        +1 lgtm. Thanks zhaoyunjiong.

        Show
        Ahmed Radwan added a comment - +1 lgtm. Thanks zhaoyunjiong.
        Hide
        zhaoyunjiong added a comment -

        The result of test-patch on the 0.20.205:

        [exec] BUILD SUCCESSFUL
        [exec] Total time: 4 minutes 59 seconds
        [exec]
        [exec]
        [exec]
        [exec]
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
        [exec]
        [exec]
        [exec]
        [exec]
        [exec] ======================================================================
        [exec] ======================================================================
        [exec] Finished build.
        [exec] ======================================================================
        [exec] ======================================================================

        Show
        zhaoyunjiong added a comment - The result of test-patch on the 0.20.205: [exec] BUILD SUCCESSFUL [exec] Total time: 4 minutes 59 seconds [exec] [exec] [exec] [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] [exec] [exec] [exec] ====================================================================== [exec] ====================================================================== [exec] Finished build. [exec] ====================================================================== [exec] ======================================================================
        Hide
        Robert Joseph Evans added a comment -

        All that is left is to run test-patch on the 0.20.205 line and posting the results here. Here are instructions on how to do that. Look under the section testing your patch. Assuming that you get a +1 from that I am also a +1 (Non-binding)

        Show
        Robert Joseph Evans added a comment - All that is left is to run test-patch on the 0.20.205 line and posting the results here. Here are instructions on how to do that. Look under the section testing your patch. Assuming that you get a +1 from that I am also a +1 (Non-binding)
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12502884/mapreduce-3343-release-0.20.205.0.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1269//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12502884/mapreduce-3343-release-0.20.205.0.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1269//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12502883/mapreduce-3343-release-0.20.205.0.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1268//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12502883/mapreduce-3343-release-0.20.205.0.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1268//console This message is automatically generated.
        Hide
        zhaoyunjiong added a comment -

        Add a unit test.
        This patch works fine in our production cluster.

        Show
        zhaoyunjiong added a comment - Add a unit test. This patch works fine in our production cluster.
        Hide
        Robert Joseph Evans added a comment -

        The patch itself looks good to me, but I would like to see some tests added, or a justification why no tests are needed.

        Show
        Robert Joseph Evans added a comment - The patch itself looks good to me, but I would like to see some tests added, or a justification why no tests are needed.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12502735/bug-fix-avoid-memory-leak-in-TrackerDistributedCacheManager.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1255//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12502735/bug-fix-avoid-memory-leak-in-TrackerDistributedCacheManager.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1255//console This message is automatically generated.
        Hide
        zhaoyunjiong added a comment -

        patch for avoid memory leak.

        Show
        zhaoyunjiong added a comment - patch for avoid memory leak.
        Hide
        zhaoyunjiong added a comment -

        Remove job's TaskDistributedCacheManager from TrackerDistributedCacheManager when job is done to avoid memory leak.

        Show
        zhaoyunjiong added a comment - Remove job's TaskDistributedCacheManager from TrackerDistributedCacheManager when job is done to avoid memory leak.
        Hide
        Robert Joseph Evans added a comment -

        If the analysis is correct then it looks like the issue has been around for a long time. In SVN revision 1077679 the jobArchives map was added along with a releaseJob method that would remove the entries from jobArchives and release the resources held by the TaskDistributedCacheManager. However, this method appears to have never been called. The very next revision to TrackerDistributedCacheManager.java 1077687 removed that method and had TaskTracker.java release the resources for the TaskDistributedCacheManager directly, not removing it from the jobArchives Map.

        It looks like this bug has been in the code since security was introduced.

        Show
        Robert Joseph Evans added a comment - If the analysis is correct then it looks like the issue has been around for a long time. In SVN revision 1077679 the jobArchives map was added along with a releaseJob method that would remove the entries from jobArchives and release the resources held by the TaskDistributedCacheManager. However, this method appears to have never been called. The very next revision to TrackerDistributedCacheManager.java 1077687 removed that method and had TaskTracker.java release the resources for the TaskDistributedCacheManager directly, not removing it from the jobArchives Map. It looks like this bug has been in the code since security was introduced.

          People

          • Assignee:
            zhaoyunjiong
            Reporter:
            Ahmed Radwan
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development