Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5351

JobTracker memory leak caused by CleanupQueue reopening FileSystem

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 1.1.2
    • 1-win, 1.2.1
    • jobtracker
    • None

    Description

      When a job is completed, closeAllForUGI is called to close all the cached FileSystems in the FileSystem cache. However, the CleanupQueue may run after this occurs and call FileSystem.get() to delete the staging directory, adding a FileSystem to the cache that will never be closed.

      People on the user-list have reported this causing their JobTrackers to OOME every two weeks.

      Attachments

        1. MAPREDUCE-5351.patch
          3 kB
          Sandy Ryza
        2. MAPREDUCE-5351-1.patch
          6 kB
          Sandy Ryza
        3. MAPREDUCE-5351-2.patch
          6 kB
          Sandy Ryza
        4. MAPREDUCE-5351-addendum.patch
          0.6 kB
          Sandy Ryza
        5. MAPREDUCE-5351-addendum-1.patch
          2 kB
          Sandy Ryza
        6. JobInProgress_JobHistory.patch
          3 kB
          jandyfish

        Issue Links

          Activity

            sandyr Sandy Ryza added a comment -

            Uploaded a patch that should fix the issue, but haven't yet tried to verify it. I'm not sure how to write a test for this.

            sandyr Sandy Ryza added a comment - Uploaded a patch that should fix the issue, but haven't yet tried to verify it. I'm not sure how to write a test for this.

            The changes in CleanupQueue.java, wouldn't be enough to close the FS instance got there instead doing a closeAll?

            On how to test it, could we temporary (for testing) instrument the FS cache size and monitor it does grow over time?

            tucu00 Alejandro Abdelnur added a comment - The changes in CleanupQueue.java, wouldn't be enough to close the FS instance got there instead doing a closeAll? On how to test it, could we temporary (for testing) instrument the FS cache size and monitor it does grow over time?
            acmurthy Arun Murthy added a comment -

            sandyr Agree with tucu00... can you please help me understand the motivation for changes in JIP?

            It would be nice to put this into 1.2.1, I believe Matt is trying to get it done over the weekend... any chance we can expedite this?

            acmurthy Arun Murthy added a comment - sandyr Agree with tucu00 ... can you please help me understand the motivation for changes in JIP? It would be nice to put this into 1.2.1, I believe Matt is trying to get it done over the weekend... any chance we can expedite this?
            sandyr Sandy Ryza added a comment -

            The motivation for the changes in JIP is that we don't want to close the FileSystem and then immediately reopen it from the CleanupQueue code. Just closing the FS instance (not doing closeAllForUGI) should be fine.

            I'll upload a patch today or tomorrow.

            sandyr Sandy Ryza added a comment - The motivation for the changes in JIP is that we don't want to close the FileSystem and then immediately reopen it from the CleanupQueue code. Just closing the FS instance (not doing closeAllForUGI) should be fine. I'll upload a patch today or tomorrow.
            acmurthy Arun Murthy added a comment -

            Ok, makes sense. Thanks.

            acmurthy Arun Murthy added a comment - Ok, makes sense. Thanks.
            sandyr Sandy Ryza added a comment -

            Uploaded a new patch with a test

            sandyr Sandy Ryza added a comment - Uploaded a new patch with a test
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590103/MAPREDUCE-5351-1.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3811//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590103/MAPREDUCE-5351-1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3811//console This message is automatically generated.
            acmurthy Arun Murthy added a comment -

            sandyr Thanks for the quick turnaround.

            One nit - you should, at least, mark the new api in FS as @Private.

            acmurthy Arun Murthy added a comment - sandyr Thanks for the quick turnaround. One nit - you should, at least, mark the new api in FS as @Private.
            sandyr Sandy Ryza added a comment -

            Thanks for looking it over. Uploading a patch that marks it as private.

            sandyr Sandy Ryza added a comment - Thanks for looking it over. Uploading a patch that marks it as private.
            acmurthy Arun Murthy added a comment -

            +1 lgtm.

            tucu00 You ok with this too?

            acmurthy Arun Murthy added a comment - +1 lgtm. tucu00 You ok with this too?
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590105/MAPREDUCE-5351-2.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3812//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590105/MAPREDUCE-5351-2.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3812//console This message is automatically generated.
            acmurthy Arun Murthy added a comment -

            I just committed this. Thanks sandyr!


            tucu00 - Sorry, I didn't wait longer to unblock Matt for the release. Please file a follow-on if you have any concernse. Thanks, and sorry again!

            acmurthy Arun Murthy added a comment - I just committed this. Thanks sandyr ! tucu00 - Sorry, I didn't wait longer to unblock Matt for the release. Please file a follow-on if you have any concernse. Thanks, and sorry again!
            arpitgupta Arpit Gupta added a comment -

            Reopening as with this fix we are seeing jobs fail with the following exception

            13/07/02 16:06:57 DEBUG mapred.JobClient: Printing tokens for job: job_201307020820_0012
            13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar sending #32
            13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar got value #32
            13/07/02 16:06:57 DEBUG retry.RetryUtils: RETRY 0) policy=TryOnceThenFail, exception=org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                    at java.security.AccessController.doPrivileged(Native Method)
                    at javax.security.auth.Subject.doAs(Subject.java:396)
                    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
                    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
            
            13/07/02 16:06:57 INFO mapred.JobClient: Cleaning up the staging area hdfs://host:8020/user/hortonar/.staging/job_201307020820_0012
            13/07/02 16:06:57 ERROR security.UserGroupInformation: PriviledgedActionException as:hortonar cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                    at java.security.AccessController.doPrivileged(Native Method)
                    at javax.security.auth.Subject.doAs(Subject.java:396)
                    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
                    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
            
            org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                    at java.security.AccessController.doPrivileged(Native Method)
                    at javax.security.auth.Subject.doAs(Subject.java:396)
            
            arpitgupta Arpit Gupta added a comment - Reopening as with this fix we are seeing jobs fail with the following exception 13/07/02 16:06:57 DEBUG mapred.JobClient: Printing tokens for job: job_201307020820_0012 13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar sending #32 13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar got value #32 13/07/02 16:06:57 DEBUG retry.RetryUtils: RETRY 0) policy=TryOnceThenFail, exception=org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438) 13/07/02 16:06:57 INFO mapred.JobClient: Cleaning up the staging area hdfs: //host:8020/user/hortonar/.staging/job_201307020820_0012 13/07/02 16:06:57 ERROR security.UserGroupInformation: PriviledgedActionException as:hortonar cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396)
            sandyr Sandy Ryza added a comment -

            arpitgupta, how often is this occurring for you?

            sandyr Sandy Ryza added a comment - arpitgupta , how often is this occurring for you?
            arpitgupta Arpit Gupta added a comment -

            Its happening very frequently that quite a few mr jobs failed because of this exception. Various example jobs, jobs through pig, oozie etc also failed because of this.

            arpitgupta Arpit Gupta added a comment - Its happening very frequently that quite a few mr jobs failed because of this exception. Various example jobs, jobs through pig, oozie etc also failed because of this.
            sandyr Sandy Ryza added a comment -

            Thanks. I'll look into this.

            sandyr Sandy Ryza added a comment - Thanks. I'll look into this.
            sandyr Sandy Ryza added a comment -

            Attaching an addendum patch that should fix the issue. I was able to reproduce the issue by running pi jobs repeatedly. With the patch, the exception no longer occurs.

            sandyr Sandy Ryza added a comment - Attaching an addendum patch that should fix the issue. I was able to reproduce the issue by running pi jobs repeatedly. With the patch, the exception no longer occurs.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590494/MAPREDUCE-5351-addendum.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3822//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590494/MAPREDUCE-5351-addendum.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3822//console This message is automatically generated.
            acmurthy Arun Murthy added a comment -

            sandyr - JIP.cleanupJob always results in PathDeletionContext with ugi being null... how does this fix the original problem? I'm missing something? Tx.

            acmurthy Arun Murthy added a comment - sandyr - JIP.cleanupJob always results in PathDeletionContext with ugi being null... how does this fix the original problem? I'm missing something? Tx.
            sandyr Sandy Ryza added a comment -

            In JIP.cleanupJob:

                    Path tempDir = jobtracker.getSystemDirectoryForJob(getJobID());
                    CleanupQueue.getInstance().addToQueue(
                        new PathDeletionContext(tempDir, conf));
            
                    // delete the staging area for the job and cancel delegation token
                    String jobTempDir = conf.get("mapreduce.job.dir");
                    if (jobTempDir != null && conf.getKeepTaskFilesPattern() == null &&
                        !conf.getKeepFailedTaskFiles()) {
                      Path jobTempDirPath = new Path(jobTempDir);
                      tempDirFs = jobTempDirPath.getFileSystem(conf);
                      CleanupQueue.getInstance().addToQueue(
                          new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
                    }
            

            The CleanupQueue is used twice, once with the UGI set and once without.

            sandyr Sandy Ryza added a comment - In JIP.cleanupJob: Path tempDir = jobtracker.getSystemDirectoryForJob(getJobID()); CleanupQueue.getInstance().addToQueue( new PathDeletionContext(tempDir, conf)); // delete the staging area for the job and cancel delegation token String jobTempDir = conf.get( "mapreduce.job.dir" ); if (jobTempDir != null && conf.getKeepTaskFilesPattern() == null && !conf.getKeepFailedTaskFiles()) { Path jobTempDirPath = new Path(jobTempDir); tempDirFs = jobTempDirPath.getFileSystem(conf); CleanupQueue.getInstance().addToQueue( new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId)); } The CleanupQueue is used twice, once with the UGI set and once without.
            acmurthy Arun Murthy added a comment -

            Duh, good point.

            Can you please add a comment to the addendum patch explaining the rationale for the check? Also, a test case specifically covering this bug would be good. Thanks!

            acmurthy Arun Murthy added a comment - Duh, good point. Can you please add a comment to the addendum patch explaining the rationale for the check? Also, a test case specifically covering this bug would be good. Thanks!
            sandyr Sandy Ryza added a comment -

            Uploaded a new patch that updates the comments and the test case.

            sandyr Sandy Ryza added a comment - Uploaded a new patch that updates the comments and the test case.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590672/MAPREDUCE-5351-addendum-1.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3826//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590672/MAPREDUCE-5351-addendum-1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3826//console This message is automatically generated.

            The original patch as well as the addendum look good to me. The new test looks good too. +1 for pushing the addendum.

            vinodkv Vinod Kumar Vavilapalli added a comment - The original patch as well as the addendum look good to me. The new test looks good too. +1 for pushing the addendum.
            mattf Matthew Foley added a comment -

            Both patch and addendum patch have been applied by Arun, to both branch-1.2 and branch-1. Resolving as Fixed. Thanks, Sandy!

            mattf Matthew Foley added a comment - Both patch and addendum patch have been applied by Arun, to both branch-1.2 and branch-1. Resolving as Fixed. Thanks, Sandy!
            cnauroth Chris Nauroth added a comment -

            I have also committed this patch branch-1-win.

            cnauroth Chris Nauroth added a comment - I have also committed this patch branch-1-win.
            jandyfish jandyfish added a comment -

            find a bug :
            open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。
            i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.

            jandyfish jandyfish added a comment - find a bug : open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。 i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.
            jandyfish jandyfish added a comment -

            find a bug :
            open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。
            i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.

            jandyfish jandyfish added a comment - find a bug : open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。 i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.
            cnauroth Chris Nauroth added a comment -

            Hi, jandyfish. Thank you for reporting this and posting a patch.

            Have you seen the related issue MAPREDUCE-5508? I believe that fixes the bug you saw. The MAPREDUCE-5508 patch has already been reviewed and committed. We tested it at scale (hundreds of nodes/thousands of job executions) and saw no occurrences of the leak.

            cnauroth Chris Nauroth added a comment - Hi, jandyfish . Thank you for reporting this and posting a patch. Have you seen the related issue MAPREDUCE-5508 ? I believe that fixes the bug you saw. The MAPREDUCE-5508 patch has already been reviewed and committed. We tested it at scale (hundreds of nodes/thousands of job executions) and saw no occurrences of the leak.
            jandyfish jandyfish added a comment -

            Hi,Chris Nauroth.
            When I solve this problem, do not find MAPREDUCE-5508 patch. And then submit patch when not paying attention
            The MAPREDUCE-5508 path is perfect and work well 。Thank you.

            jandyfish jandyfish added a comment - Hi,Chris Nauroth. When I solve this problem, do not find MAPREDUCE-5508 patch. And then submit patch when not paying attention The MAPREDUCE-5508 path is perfect and work well 。Thank you.
            viswaj viswanathan added a comment -

            Hi Chris/Jandy,

            Do I need to update the JobInProgress_JobHistory.patch also to resolve this issue, because I didn't face OOME issue after updating the MAPREDUCE-5508 but Jobtracker hangs after a 7 to 8 days.

            Please help.

            Thanks,
            Viswa.J

            viswaj viswanathan added a comment - Hi Chris/Jandy, Do I need to update the JobInProgress_JobHistory.patch also to resolve this issue, because I didn't face OOME issue after updating the MAPREDUCE-5508 but Jobtracker hangs after a 7 to 8 days. Please help. Thanks, Viswa.J
            cnauroth Chris Nauroth added a comment -

            No, AFAIK the JobInProgress_JobHistory.patch is not required, and this leak has been fixed by the patches already committed to svn for MAPREDUCE-5351 and MAPREDUCE-5508. If you're seeing a hang, but not a memory leak, then this sounds like a different issue. For example, MAPREDUCE-5606 is another known bug that can cause a JT to hang.

            cnauroth Chris Nauroth added a comment - No, AFAIK the JobInProgress_JobHistory.patch is not required, and this leak has been fixed by the patches already committed to svn for MAPREDUCE-5351 and MAPREDUCE-5508 . If you're seeing a hang, but not a memory leak, then this sounds like a different issue. For example, MAPREDUCE-5606 is another known bug that can cause a JT to hang.
            viswaj viswanathan added a comment -

            Hi Chris,

            For that JT hangs, you have asked to disable the hadoop user history
            location. So doing this will solve the problem. What exactly happens if I
            set this value to NONE. By setting this value, JT heap memory will not
            increase as much like now.

            Also I'm using the version 1.2.1, when the JT memory reaches 6.68/8.89 GB
            it start hangs every ten days. Not able to submit the job and UI is not
            loading at all.

            Please help.

            viswaj viswanathan added a comment - Hi Chris, For that JT hangs, you have asked to disable the hadoop user history location. So doing this will solve the problem. What exactly happens if I set this value to NONE. By setting this value, JT heap memory will not increase as much like now. Also I'm using the version 1.2.1, when the JT memory reaches 6.68/8.89 GB it start hangs every ten days. Not able to submit the job and UI is not loading at all. Please help.
            viswaj viswanathan added a comment -

            Hi Chris,

            JT memory reaches 6.68/8.89 GB and not able to submit the job and UI is not
            loading at all. But didn't see any JT OOM exceptions.

            Have taken the thread dump of Jobtracker, and the JT thread dump as follows:

            Deadlock Detection:

            Can't print deadlocks:null
            Thread 25817: (state = BLOCKED)

            • java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=274, line=397 (Compiled frame)
            • org.apache.hadoop.hdfs.LeaseRenewer.access$600(org.apache.hadoop.hdfs.LeaseRenewer, int) @bci=2, line=69 (Interpreted frame)
            • org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=8, line=273 (Interpreted frame)
            • java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

            Locked ownable synchronizers:

            • None

            Thread 25815: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run() @bci=245, line=3000 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25813: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25812: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25790: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)

            Thread 25788: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25786: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25761: (state = BLOCKED)

            • sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
            • sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
            • sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
            • sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
            • sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Interpreted frame)
            • org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=332 (Interpreted frame)
            • org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Compiled frame)
            • org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=155 (Compiled frame)
            • org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, line=128 (Compiled frame)
            • java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=116 (Interpreted frame)
            • org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, int) @bci=4, line=364 (Interpreted frame)
            • java.io.BufferedInputStream.fill() @bci=175, line=218 (Compiled frame)
            • java.io.BufferedInputStream.read() @bci=12, line=237 (Compiled frame)
            • java.io.DataInputStream.readInt() @bci=4, line=370 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.receiveResponse() @bci=19, line=845 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=790 (Compiled frame)

            -------------------------------------------------------------------------------------------------------------------------------

            And Jobtracker heap summary as follows:

            using thread-local object allocation.
            Parallel GC with 10 thread(s)

            Heap Configuration:
            MinHeapFreeRatio = 40
            MaxHeapFreeRatio = 70
            MaxHeapSize = 10737418240 (10240.0MB)
            NewSize = 1310720 (1.25MB)
            MaxNewSize = 17592186044415 MB
            OldSize = 5439488 (5.1875MB)
            NewRatio = 2
            SurvivorRatio = 8
            PermSize = 21757952 (20.75MB)
            MaxPermSize = 85983232 (82.0MB)

            Heap Usage:
            PS Young Generation
            Eden Space:
            capacity = 6488064 (6.1875MB)
            used = 6488064 (6.1875MB)
            free = 0 (0.0MB)
            100.0% used
            From Space:
            capacity = 9764864 (9.3125MB)
            used = 0 (0.0MB)
            free = 9764864 (9.3125MB)
            0.0% used
            To Space:
            capacity = 9764864 (9.3125MB)
            used = 0 (0.0MB)
            free = 9764864 (9.3125MB)
            0.0% used
            PS Old Generation
            capacity = 7158300672 (6826.6875MB)
            used = 7158240200 (6826.629829406738MB)
            free = 60472 (0.05767059326171875MB)
            99.99915521849708% used
            PS Perm Generation
            capacity = 26738688 (25.5MB)
            used = 26428648 (25.204322814941406MB)
            free = 310040 (0.29567718505859375MB)
            98.8404816272212% used

            Please help. It affects our production system.

            Thanks,
            Viswa

            viswaj viswanathan added a comment - Hi Chris, JT memory reaches 6.68/8.89 GB and not able to submit the job and UI is not loading at all. But didn't see any JT OOM exceptions. Have taken the thread dump of Jobtracker, and the JT thread dump as follows: Deadlock Detection: Can't print deadlocks:null Thread 25817: (state = BLOCKED) java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=274, line=397 (Compiled frame) org.apache.hadoop.hdfs.LeaseRenewer.access$600(org.apache.hadoop.hdfs.LeaseRenewer, int) @bci=2, line=69 (Interpreted frame) org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=8, line=273 (Interpreted frame) java.lang.Thread.run() @bci=11, line=662 (Interpreted frame) Locked ownable synchronizers: None Thread 25815: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run() @bci=245, line=3000 (Compiled frame) Locked ownable synchronizers: None Thread 25813: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25812: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25790: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) Thread 25788: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25786: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25761: (state = BLOCKED) sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise) sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame) sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame) sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame) sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Interpreted frame) org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=332 (Interpreted frame) org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Compiled frame) org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=155 (Compiled frame) org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, line=128 (Compiled frame) java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=116 (Interpreted frame) org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, int) @bci=4, line=364 (Interpreted frame) java.io.BufferedInputStream.fill() @bci=175, line=218 (Compiled frame) java.io.BufferedInputStream.read() @bci=12, line=237 (Compiled frame) java.io.DataInputStream.readInt() @bci=4, line=370 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.receiveResponse() @bci=19, line=845 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=790 (Compiled frame) ------------------------------------------------------------------------------------------------------------------------------- And Jobtracker heap summary as follows: using thread-local object allocation. Parallel GC with 10 thread(s) Heap Configuration: MinHeapFreeRatio = 40 MaxHeapFreeRatio = 70 MaxHeapSize = 10737418240 (10240.0MB) NewSize = 1310720 (1.25MB) MaxNewSize = 17592186044415 MB OldSize = 5439488 (5.1875MB) NewRatio = 2 SurvivorRatio = 8 PermSize = 21757952 (20.75MB) MaxPermSize = 85983232 (82.0MB) Heap Usage: PS Young Generation Eden Space: capacity = 6488064 (6.1875MB) used = 6488064 (6.1875MB) free = 0 (0.0MB) 100.0% used From Space: capacity = 9764864 (9.3125MB) used = 0 (0.0MB) free = 9764864 (9.3125MB) 0.0% used To Space: capacity = 9764864 (9.3125MB) used = 0 (0.0MB) free = 9764864 (9.3125MB) 0.0% used PS Old Generation capacity = 7158300672 (6826.6875MB) used = 7158240200 (6826.629829406738MB) free = 60472 (0.05767059326171875MB) 99.99915521849708% used PS Perm Generation capacity = 26738688 (25.5MB) used = 26428648 (25.204322814941406MB) free = 310040 (0.29567718505859375MB) 98.8404816272212% used Please help. It affects our production system. Thanks, Viswa
            viswaj viswanathan added a comment -

            Hi Chris,

            Appreciate your response.

            Please help.

            viswaj viswanathan added a comment - Hi Chris, Appreciate your response. Please help.

            People

              sandyr Sandy Ryza
              sandyr Sandy Ryza
              Votes:
              0 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: