Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5351

JobTracker memory leak caused by CleanupQueue reopening FileSystem

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.2
    • Fix Version/s: 1-win, 1.2.1
    • Component/s: jobtracker
    • Labels:
      None

      Description

      When a job is completed, closeAllForUGI is called to close all the cached FileSystems in the FileSystem cache. However, the CleanupQueue may run after this occurs and call FileSystem.get() to delete the staging directory, adding a FileSystem to the cache that will never be closed.

      People on the user-list have reported this causing their JobTrackers to OOME every two weeks.

      1. JobInProgress_JobHistory.patch
        3 kB
        jandyfish
      2. MAPREDUCE-5351-addendum-1.patch
        2 kB
        Sandy Ryza
      3. MAPREDUCE-5351-addendum.patch
        0.6 kB
        Sandy Ryza
      4. MAPREDUCE-5351-2.patch
        6 kB
        Sandy Ryza
      5. MAPREDUCE-5351-1.patch
        6 kB
        Sandy Ryza
      6. MAPREDUCE-5351.patch
        3 kB
        Sandy Ryza

        Issue Links

          Activity

          Hide
          Sandy Ryza added a comment -

          Uploaded a patch that should fix the issue, but haven't yet tried to verify it. I'm not sure how to write a test for this.

          Show
          Sandy Ryza added a comment - Uploaded a patch that should fix the issue, but haven't yet tried to verify it. I'm not sure how to write a test for this.
          Hide
          Alejandro Abdelnur added a comment -

          The changes in CleanupQueue.java, wouldn't be enough to close the FS instance got there instead doing a closeAll?

          On how to test it, could we temporary (for testing) instrument the FS cache size and monitor it does grow over time?

          Show
          Alejandro Abdelnur added a comment - The changes in CleanupQueue.java, wouldn't be enough to close the FS instance got there instead doing a closeAll? On how to test it, could we temporary (for testing) instrument the FS cache size and monitor it does grow over time?
          Hide
          Arun C Murthy added a comment -

          Sandy Ryza Agree with Alejandro Abdelnur... can you please help me understand the motivation for changes in JIP?

          It would be nice to put this into 1.2.1, I believe Matt is trying to get it done over the weekend... any chance we can expedite this?

          Show
          Arun C Murthy added a comment - Sandy Ryza Agree with Alejandro Abdelnur ... can you please help me understand the motivation for changes in JIP? It would be nice to put this into 1.2.1, I believe Matt is trying to get it done over the weekend... any chance we can expedite this?
          Hide
          Sandy Ryza added a comment -

          The motivation for the changes in JIP is that we don't want to close the FileSystem and then immediately reopen it from the CleanupQueue code. Just closing the FS instance (not doing closeAllForUGI) should be fine.

          I'll upload a patch today or tomorrow.

          Show
          Sandy Ryza added a comment - The motivation for the changes in JIP is that we don't want to close the FileSystem and then immediately reopen it from the CleanupQueue code. Just closing the FS instance (not doing closeAllForUGI) should be fine. I'll upload a patch today or tomorrow.
          Hide
          Arun C Murthy added a comment -

          Ok, makes sense. Thanks.

          Show
          Arun C Murthy added a comment - Ok, makes sense. Thanks.
          Hide
          Sandy Ryza added a comment -

          Uploaded a new patch with a test

          Show
          Sandy Ryza added a comment - Uploaded a new patch with a test
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12590103/MAPREDUCE-5351-1.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3811//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590103/MAPREDUCE-5351-1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3811//console This message is automatically generated.
          Hide
          Arun C Murthy added a comment -

          Sandy Ryza Thanks for the quick turnaround.

          One nit - you should, at least, mark the new api in FS as @Private.

          Show
          Arun C Murthy added a comment - Sandy Ryza Thanks for the quick turnaround. One nit - you should, at least, mark the new api in FS as @Private.
          Hide
          Sandy Ryza added a comment -

          Thanks for looking it over. Uploading a patch that marks it as private.

          Show
          Sandy Ryza added a comment - Thanks for looking it over. Uploading a patch that marks it as private.
          Hide
          Arun C Murthy added a comment -

          +1 lgtm.

          Alejandro Abdelnur You ok with this too?

          Show
          Arun C Murthy added a comment - +1 lgtm. Alejandro Abdelnur You ok with this too?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12590105/MAPREDUCE-5351-2.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3812//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590105/MAPREDUCE-5351-2.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3812//console This message is automatically generated.
          Hide
          Arun C Murthy added a comment -

          I just committed this. Thanks Sandy Ryza!


          Alejandro Abdelnur - Sorry, I didn't wait longer to unblock Matt for the release. Please file a follow-on if you have any concernse. Thanks, and sorry again!

          Show
          Arun C Murthy added a comment - I just committed this. Thanks Sandy Ryza ! Alejandro Abdelnur - Sorry, I didn't wait longer to unblock Matt for the release. Please file a follow-on if you have any concernse. Thanks, and sorry again!
          Hide
          Arpit Gupta added a comment -

          Reopening as with this fix we are seeing jobs fail with the following exception

          13/07/02 16:06:57 DEBUG mapred.JobClient: Printing tokens for job: job_201307020820_0012
          13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar sending #32
          13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar got value #32
          13/07/02 16:06:57 DEBUG retry.RetryUtils: RETRY 0) policy=TryOnceThenFail, exception=org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                  at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                  at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                  at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                  at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                  at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                  at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:396)
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
                  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
          
          13/07/02 16:06:57 INFO mapred.JobClient: Cleaning up the staging area hdfs://host:8020/user/hortonar/.staging/job_201307020820_0012
          13/07/02 16:06:57 ERROR security.UserGroupInformation: PriviledgedActionException as:hortonar cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                  at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                  at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                  at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                  at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                  at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                  at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:396)
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
                  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
          
          org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                  at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                  at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                  at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                  at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                  at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                  at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:396)
          
          Show
          Arpit Gupta added a comment - Reopening as with this fix we are seeing jobs fail with the following exception 13/07/02 16:06:57 DEBUG mapred.JobClient: Printing tokens for job: job_201307020820_0012 13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar sending #32 13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar got value #32 13/07/02 16:06:57 DEBUG retry.RetryUtils: RETRY 0) policy=TryOnceThenFail, exception=org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438) 13/07/02 16:06:57 INFO mapred.JobClient: Cleaning up the staging area hdfs: //host:8020/user/hortonar/.staging/job_201307020820_0012 13/07/02 16:06:57 ERROR security.UserGroupInformation: PriviledgedActionException as:hortonar cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396)
          Hide
          Sandy Ryza added a comment -

          Arpit Gupta, how often is this occurring for you?

          Show
          Sandy Ryza added a comment - Arpit Gupta , how often is this occurring for you?
          Hide
          Arpit Gupta added a comment -

          Its happening very frequently that quite a few mr jobs failed because of this exception. Various example jobs, jobs through pig, oozie etc also failed because of this.

          Show
          Arpit Gupta added a comment - Its happening very frequently that quite a few mr jobs failed because of this exception. Various example jobs, jobs through pig, oozie etc also failed because of this.
          Hide
          Sandy Ryza added a comment -

          Thanks. I'll look into this.

          Show
          Sandy Ryza added a comment - Thanks. I'll look into this.
          Hide
          Sandy Ryza added a comment -

          Attaching an addendum patch that should fix the issue. I was able to reproduce the issue by running pi jobs repeatedly. With the patch, the exception no longer occurs.

          Show
          Sandy Ryza added a comment - Attaching an addendum patch that should fix the issue. I was able to reproduce the issue by running pi jobs repeatedly. With the patch, the exception no longer occurs.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12590494/MAPREDUCE-5351-addendum.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3822//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590494/MAPREDUCE-5351-addendum.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3822//console This message is automatically generated.
          Hide
          Arun C Murthy added a comment -

          Sandy Ryza - JIP.cleanupJob always results in PathDeletionContext with ugi being null... how does this fix the original problem? I'm missing something? Tx.

          Show
          Arun C Murthy added a comment - Sandy Ryza - JIP.cleanupJob always results in PathDeletionContext with ugi being null... how does this fix the original problem? I'm missing something? Tx.
          Hide
          Sandy Ryza added a comment -

          In JIP.cleanupJob:

                  Path tempDir = jobtracker.getSystemDirectoryForJob(getJobID());
                  CleanupQueue.getInstance().addToQueue(
                      new PathDeletionContext(tempDir, conf));
          
                  // delete the staging area for the job and cancel delegation token
                  String jobTempDir = conf.get("mapreduce.job.dir");
                  if (jobTempDir != null && conf.getKeepTaskFilesPattern() == null &&
                      !conf.getKeepFailedTaskFiles()) {
                    Path jobTempDirPath = new Path(jobTempDir);
                    tempDirFs = jobTempDirPath.getFileSystem(conf);
                    CleanupQueue.getInstance().addToQueue(
                        new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
                  }
          

          The CleanupQueue is used twice, once with the UGI set and once without.

          Show
          Sandy Ryza added a comment - In JIP.cleanupJob: Path tempDir = jobtracker.getSystemDirectoryForJob(getJobID()); CleanupQueue.getInstance().addToQueue( new PathDeletionContext(tempDir, conf)); // delete the staging area for the job and cancel delegation token String jobTempDir = conf.get( "mapreduce.job.dir" ); if (jobTempDir != null && conf.getKeepTaskFilesPattern() == null && !conf.getKeepFailedTaskFiles()) { Path jobTempDirPath = new Path(jobTempDir); tempDirFs = jobTempDirPath.getFileSystem(conf); CleanupQueue.getInstance().addToQueue( new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId)); } The CleanupQueue is used twice, once with the UGI set and once without.
          Hide
          Arun C Murthy added a comment -

          Duh, good point.

          Can you please add a comment to the addendum patch explaining the rationale for the check? Also, a test case specifically covering this bug would be good. Thanks!

          Show
          Arun C Murthy added a comment - Duh, good point. Can you please add a comment to the addendum patch explaining the rationale for the check? Also, a test case specifically covering this bug would be good. Thanks!
          Hide
          Sandy Ryza added a comment -

          Uploaded a new patch that updates the comments and the test case.

          Show
          Sandy Ryza added a comment - Uploaded a new patch that updates the comments and the test case.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12590672/MAPREDUCE-5351-addendum-1.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3826//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590672/MAPREDUCE-5351-addendum-1.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3826//console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          The original patch as well as the addendum look good to me. The new test looks good too. +1 for pushing the addendum.

          Show
          Vinod Kumar Vavilapalli added a comment - The original patch as well as the addendum look good to me. The new test looks good too. +1 for pushing the addendum.
          Hide
          Matt Foley added a comment -

          Both patch and addendum patch have been applied by Arun, to both branch-1.2 and branch-1. Resolving as Fixed. Thanks, Sandy!

          Show
          Matt Foley added a comment - Both patch and addendum patch have been applied by Arun, to both branch-1.2 and branch-1. Resolving as Fixed. Thanks, Sandy!
          Hide
          Chris Nauroth added a comment -

          I have also committed this patch branch-1-win.

          Show
          Chris Nauroth added a comment - I have also committed this patch branch-1-win.
          Hide
          jandyfish added a comment -

          find a bug :
          open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。
          i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.

          Show
          jandyfish added a comment - find a bug : open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。 i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.
          Hide
          jandyfish added a comment -

          find a bug :
          open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。
          i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.

          Show
          jandyfish added a comment - find a bug : open a filesystem when JobHistory create userLogFile,but this filesystem instance will not be close,and it still cache in FileSystem.cache。 i comminited this svn patch JobInProcess_JobHistory.path。i hope it can help.
          Hide
          Chris Nauroth added a comment -

          Hi, jandyfish. Thank you for reporting this and posting a patch.

          Have you seen the related issue MAPREDUCE-5508? I believe that fixes the bug you saw. The MAPREDUCE-5508 patch has already been reviewed and committed. We tested it at scale (hundreds of nodes/thousands of job executions) and saw no occurrences of the leak.

          Show
          Chris Nauroth added a comment - Hi, jandyfish . Thank you for reporting this and posting a patch. Have you seen the related issue MAPREDUCE-5508 ? I believe that fixes the bug you saw. The MAPREDUCE-5508 patch has already been reviewed and committed. We tested it at scale (hundreds of nodes/thousands of job executions) and saw no occurrences of the leak.
          Hide
          jandyfish added a comment -

          Hi,Chris Nauroth.
          When I solve this problem, do not find MAPREDUCE-5508 patch. And then submit patch when not paying attention
          The MAPREDUCE-5508 path is perfect and work well 。Thank you.

          Show
          jandyfish added a comment - Hi,Chris Nauroth. When I solve this problem, do not find MAPREDUCE-5508 patch. And then submit patch when not paying attention The MAPREDUCE-5508 path is perfect and work well 。Thank you.
          Hide
          viswanathan added a comment -

          Hi Chris/Jandy,

          Do I need to update the JobInProgress_JobHistory.patch also to resolve this issue, because I didn't face OOME issue after updating the MAPREDUCE-5508 but Jobtracker hangs after a 7 to 8 days.

          Please help.

          Thanks,
          Viswa.J

          Show
          viswanathan added a comment - Hi Chris/Jandy, Do I need to update the JobInProgress_JobHistory.patch also to resolve this issue, because I didn't face OOME issue after updating the MAPREDUCE-5508 but Jobtracker hangs after a 7 to 8 days. Please help. Thanks, Viswa.J
          Hide
          Chris Nauroth added a comment -

          No, AFAIK the JobInProgress_JobHistory.patch is not required, and this leak has been fixed by the patches already committed to svn for MAPREDUCE-5351 and MAPREDUCE-5508. If you're seeing a hang, but not a memory leak, then this sounds like a different issue. For example, MAPREDUCE-5606 is another known bug that can cause a JT to hang.

          Show
          Chris Nauroth added a comment - No, AFAIK the JobInProgress_JobHistory.patch is not required, and this leak has been fixed by the patches already committed to svn for MAPREDUCE-5351 and MAPREDUCE-5508 . If you're seeing a hang, but not a memory leak, then this sounds like a different issue. For example, MAPREDUCE-5606 is another known bug that can cause a JT to hang.
          Hide
          viswanathan added a comment -

          Hi Chris,

          For that JT hangs, you have asked to disable the hadoop user history
          location. So doing this will solve the problem. What exactly happens if I
          set this value to NONE. By setting this value, JT heap memory will not
          increase as much like now.

          Also I'm using the version 1.2.1, when the JT memory reaches 6.68/8.89 GB
          it start hangs every ten days. Not able to submit the job and UI is not
          loading at all.

          Please help.

          Show
          viswanathan added a comment - Hi Chris, For that JT hangs, you have asked to disable the hadoop user history location. So doing this will solve the problem. What exactly happens if I set this value to NONE. By setting this value, JT heap memory will not increase as much like now. Also I'm using the version 1.2.1, when the JT memory reaches 6.68/8.89 GB it start hangs every ten days. Not able to submit the job and UI is not loading at all. Please help.
          Hide
          viswanathan added a comment -

          Hi Chris,

          JT memory reaches 6.68/8.89 GB and not able to submit the job and UI is not
          loading at all. But didn't see any JT OOM exceptions.

          Have taken the thread dump of Jobtracker, and the JT thread dump as follows:

          Deadlock Detection:

          Can't print deadlocks:null
          Thread 25817: (state = BLOCKED)

          • java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
          • org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=274, line=397 (Compiled frame)
          • org.apache.hadoop.hdfs.LeaseRenewer.access$600(org.apache.hadoop.hdfs.LeaseRenewer, int) @bci=2, line=69 (Interpreted frame)
          • org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=8, line=273 (Interpreted frame)
          • java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

          Locked ownable synchronizers:

          • None

          Thread 25815: (state = BLOCKED)

          • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
          • org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run() @bci=245, line=3000 (Compiled frame)

          Locked ownable synchronizers:

          • None

          Thread 25813: (state = BLOCKED)

          • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
          • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
          • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

          Locked ownable synchronizers:

          • None

          Thread 25812: (state = BLOCKED)

          • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
          • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
          • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

          Locked ownable synchronizers:

          • None

          Thread 25790: (state = BLOCKED)

          • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)

          Thread 25788: (state = BLOCKED)

          • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
          • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
          • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

          Locked ownable synchronizers:

          • None

          Thread 25786: (state = BLOCKED)

          • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
          • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
          • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

          Locked ownable synchronizers:

          • None

          Thread 25761: (state = BLOCKED)

          • sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
          • sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
          • sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
          • sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
          • sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Interpreted frame)
          • org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=332 (Interpreted frame)
          • org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Compiled frame)
          • org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=155 (Compiled frame)
          • org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, line=128 (Compiled frame)
          • java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=116 (Interpreted frame)
          • org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, int) @bci=4, line=364 (Interpreted frame)
          • java.io.BufferedInputStream.fill() @bci=175, line=218 (Compiled frame)
          • java.io.BufferedInputStream.read() @bci=12, line=237 (Compiled frame)
          • java.io.DataInputStream.readInt() @bci=4, line=370 (Compiled frame)
          • org.apache.hadoop.ipc.Client$Connection.receiveResponse() @bci=19, line=845 (Compiled frame)
          • org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=790 (Compiled frame)

          -------------------------------------------------------------------------------------------------------------------------------

          And Jobtracker heap summary as follows:

          using thread-local object allocation.
          Parallel GC with 10 thread(s)

          Heap Configuration:
          MinHeapFreeRatio = 40
          MaxHeapFreeRatio = 70
          MaxHeapSize = 10737418240 (10240.0MB)
          NewSize = 1310720 (1.25MB)
          MaxNewSize = 17592186044415 MB
          OldSize = 5439488 (5.1875MB)
          NewRatio = 2
          SurvivorRatio = 8
          PermSize = 21757952 (20.75MB)
          MaxPermSize = 85983232 (82.0MB)

          Heap Usage:
          PS Young Generation
          Eden Space:
          capacity = 6488064 (6.1875MB)
          used = 6488064 (6.1875MB)
          free = 0 (0.0MB)
          100.0% used
          From Space:
          capacity = 9764864 (9.3125MB)
          used = 0 (0.0MB)
          free = 9764864 (9.3125MB)
          0.0% used
          To Space:
          capacity = 9764864 (9.3125MB)
          used = 0 (0.0MB)
          free = 9764864 (9.3125MB)
          0.0% used
          PS Old Generation
          capacity = 7158300672 (6826.6875MB)
          used = 7158240200 (6826.629829406738MB)
          free = 60472 (0.05767059326171875MB)
          99.99915521849708% used
          PS Perm Generation
          capacity = 26738688 (25.5MB)
          used = 26428648 (25.204322814941406MB)
          free = 310040 (0.29567718505859375MB)
          98.8404816272212% used

          Please help. It affects our production system.

          Thanks,
          Viswa

          Show
          viswanathan added a comment - Hi Chris, JT memory reaches 6.68/8.89 GB and not able to submit the job and UI is not loading at all. But didn't see any JT OOM exceptions. Have taken the thread dump of Jobtracker, and the JT thread dump as follows: Deadlock Detection: Can't print deadlocks:null Thread 25817: (state = BLOCKED) java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=274, line=397 (Compiled frame) org.apache.hadoop.hdfs.LeaseRenewer.access$600(org.apache.hadoop.hdfs.LeaseRenewer, int) @bci=2, line=69 (Interpreted frame) org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=8, line=273 (Interpreted frame) java.lang.Thread.run() @bci=11, line=662 (Interpreted frame) Locked ownable synchronizers: None Thread 25815: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run() @bci=245, line=3000 (Compiled frame) Locked ownable synchronizers: None Thread 25813: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25812: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25790: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) Thread 25788: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25786: (state = BLOCKED) java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame) Locked ownable synchronizers: None Thread 25761: (state = BLOCKED) sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise) sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame) sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame) sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame) sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Interpreted frame) org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=332 (Interpreted frame) org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Compiled frame) org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=155 (Compiled frame) org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, line=128 (Compiled frame) java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=116 (Interpreted frame) org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, int) @bci=4, line=364 (Interpreted frame) java.io.BufferedInputStream.fill() @bci=175, line=218 (Compiled frame) java.io.BufferedInputStream.read() @bci=12, line=237 (Compiled frame) java.io.DataInputStream.readInt() @bci=4, line=370 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.receiveResponse() @bci=19, line=845 (Compiled frame) org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=790 (Compiled frame) ------------------------------------------------------------------------------------------------------------------------------- And Jobtracker heap summary as follows: using thread-local object allocation. Parallel GC with 10 thread(s) Heap Configuration: MinHeapFreeRatio = 40 MaxHeapFreeRatio = 70 MaxHeapSize = 10737418240 (10240.0MB) NewSize = 1310720 (1.25MB) MaxNewSize = 17592186044415 MB OldSize = 5439488 (5.1875MB) NewRatio = 2 SurvivorRatio = 8 PermSize = 21757952 (20.75MB) MaxPermSize = 85983232 (82.0MB) Heap Usage: PS Young Generation Eden Space: capacity = 6488064 (6.1875MB) used = 6488064 (6.1875MB) free = 0 (0.0MB) 100.0% used From Space: capacity = 9764864 (9.3125MB) used = 0 (0.0MB) free = 9764864 (9.3125MB) 0.0% used To Space: capacity = 9764864 (9.3125MB) used = 0 (0.0MB) free = 9764864 (9.3125MB) 0.0% used PS Old Generation capacity = 7158300672 (6826.6875MB) used = 7158240200 (6826.629829406738MB) free = 60472 (0.05767059326171875MB) 99.99915521849708% used PS Perm Generation capacity = 26738688 (25.5MB) used = 26428648 (25.204322814941406MB) free = 310040 (0.29567718505859375MB) 98.8404816272212% used Please help. It affects our production system. Thanks, Viswa
          Hide
          viswanathan added a comment -

          Hi Chris,

          Appreciate your response.

          Please help.

          Show
          viswanathan added a comment - Hi Chris, Appreciate your response. Please help.

            People

            • Assignee:
              Sandy Ryza
              Reporter:
              Sandy Ryza
            • Votes:
              0 Vote for this issue
              Watchers:
              18 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development