  Hadoop Map/Reduce
  MAPREDUCE-5351

JobTracker memory leak caused by CleanupQueue reopening FileSystem

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.2
    • Fix Version/s: 1-win, 1.2.1
    • Component/s: jobtracker
    • Labels: None

    Description

      When a job is completed, closeAllForUGI is called to close all the cached FileSystems in the FileSystem cache. However, the CleanupQueue may run after this occurs and call FileSystem.get() to delete the staging directory, adding a FileSystem to the cache that will never be closed.

      People on the user-list have reported this causing their JobTrackers to OOME every two weeks.
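
      A minimal sketch of the ordering problem described above (illustrative only; the class name and staging path are made up, and only the FileSystem/UGI calls are real Hadoop APIs):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.security.UserGroupInformation;

        // Hypothetical illustration of the leak; the ordering of the two steps
        // is what matters, not the surrounding class structure.
        public class StagingCleanupLeakSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
            Path staging = new Path("/tmp/hadoop/staging/job_0001"); // made-up path

            // 1. Job completes: the JobTracker closes every cached FileSystem
            //    belonging to the job's UGI.
            FileSystem.closeAllForUGI(ugi);

            // 2. The asynchronous CleanupQueue runs later and asks the cache for
            //    a FileSystem to delete the staging directory. The old entries
            //    are gone, so a brand-new FileSystem is created and cached...
            FileSystem fs = staging.getFileSystem(conf);
            fs.delete(staging, true);

            // 3. ...and nothing closes it afterwards, so one cached FileSystem
            //    per completed job accumulates until the JobTracker exhausts
            //    its heap.
          }
        }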

      Attachments

        1. JobInProgress_JobHistory.patch
          3 kB
          jandyfish
        2. MAPREDUCE-5351.patch
          3 kB
          Sandy Ryza
        3. MAPREDUCE-5351-1.patch
          6 kB
          Sandy Ryza
        4. MAPREDUCE-5351-2.patch
          6 kB
          Sandy Ryza
        5. MAPREDUCE-5351-addendum.patch
          0.6 kB
          Sandy Ryza
        6. MAPREDUCE-5351-addendum-1.patch
          2 kB
          Sandy Ryza

        Issue Links

          Activity

            viswaj viswanathan added a comment -

            Hi Chris,

            Appreciate your response.

            Please help.

            viswaj viswanathan added a comment -

            Hi Chris,

            The JT memory reaches 6.68/8.89 GB, we are not able to submit jobs, and the UI does
            not load at all. But we didn't see any JT OOM exceptions.

            I have taken a thread dump of the JobTracker; it is as follows:

            Deadlock Detection:

            Can't print deadlocks:null
            Thread 25817: (state = BLOCKED)

            • java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=274, line=397 (Compiled frame)
            • org.apache.hadoop.hdfs.LeaseRenewer.access$600(org.apache.hadoop.hdfs.LeaseRenewer, int) @bci=2, line=69 (Interpreted frame)
            • org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=8, line=273 (Interpreted frame)
            • java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

            Locked ownable synchronizers:

            • None

            Thread 25815: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run() @bci=245, line=3000 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25813: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25812: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25790: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)

            Thread 25788: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25786: (state = BLOCKED)

            • java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
            • org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)

            Locked ownable synchronizers:

            • None

            Thread 25761: (state = BLOCKED)

            • sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
            • sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
            • sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
            • sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
            • sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Interpreted frame)
            • org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=332 (Interpreted frame)
            • org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Compiled frame)
            • org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=155 (Compiled frame)
            • org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, line=128 (Compiled frame)
            • java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=116 (Interpreted frame)
            • org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, int) @bci=4, line=364 (Interpreted frame)
            • java.io.BufferedInputStream.fill() @bci=175, line=218 (Compiled frame)
            • java.io.BufferedInputStream.read() @bci=12, line=237 (Compiled frame)
            • java.io.DataInputStream.readInt() @bci=4, line=370 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.receiveResponse() @bci=19, line=845 (Compiled frame)
            • org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=790 (Compiled frame)

            -------------------------------------------------------------------------------------------------------------------------------

            And the JobTracker heap summary is as follows:

            using thread-local object allocation.
            Parallel GC with 10 thread(s)

            Heap Configuration:
            MinHeapFreeRatio = 40
            MaxHeapFreeRatio = 70
            MaxHeapSize = 10737418240 (10240.0MB)
            NewSize = 1310720 (1.25MB)
            MaxNewSize = 17592186044415 MB
            OldSize = 5439488 (5.1875MB)
            NewRatio = 2
            SurvivorRatio = 8
            PermSize = 21757952 (20.75MB)
            MaxPermSize = 85983232 (82.0MB)

            Heap Usage:
            PS Young Generation
            Eden Space:
            capacity = 6488064 (6.1875MB)
            used = 6488064 (6.1875MB)
            free = 0 (0.0MB)
            100.0% used
            From Space:
            capacity = 9764864 (9.3125MB)
            used = 0 (0.0MB)
            free = 9764864 (9.3125MB)
            0.0% used
            To Space:
            capacity = 9764864 (9.3125MB)
            used = 0 (0.0MB)
            free = 9764864 (9.3125MB)
            0.0% used
            PS Old Generation
            capacity = 7158300672 (6826.6875MB)
            used = 7158240200 (6826.629829406738MB)
            free = 60472 (0.05767059326171875MB)
            99.99915521849708% used
            PS Perm Generation
            capacity = 26738688 (25.5MB)
            used = 26428648 (25.204322814941406MB)
            free = 310040 (0.29567718505859375MB)
            98.8404816272212% used

            Please help. It affects our production system.

            Thanks,
            Viswa

            viswaj viswanathan added a comment -

            Hi Chris,

            Regarding the JT hangs, you asked us to disable the hadoop user history
            location. Will doing this solve the problem? What exactly happens if I
            set this value to NONE? Will setting it keep the JT heap memory from
            growing as much as it does now?

            Also, I'm using version 1.2.1. When the JT memory reaches 6.68/8.89 GB,
            it starts hanging, roughly every ten days. We are not able to submit
            jobs and the UI does not load at all.

            Please help.

            cnauroth Chris Nauroth added a comment -

            No, AFAIK the JobInProgress_JobHistory.patch is not required, and this leak has been fixed by the patches already committed to svn for MAPREDUCE-5351 and MAPREDUCE-5508. If you're seeing a hang, but not a memory leak, then this sounds like a different issue. For example, MAPREDUCE-5606 is another known bug that can cause a JT to hang.

            viswaj viswanathan added a comment -

            Hi Chris/Jandy,

            Do I need to apply the JobInProgress_JobHistory.patch as well to resolve this issue? I no longer see the OOME after applying MAPREDUCE-5508, but the JobTracker hangs after 7 to 8 days.

            Please help.

            Thanks,
            Viswa.J

            jandyfish jandyfish added a comment -

            Hi, Chris Nauroth.
            When I was working on this problem I did not find the MAPREDUCE-5508 patch, and submitted my patch without noticing it.
            The MAPREDUCE-5508 patch is perfect and works well. Thank you.

            cnauroth Chris Nauroth added a comment -

            Hi, jandyfish. Thank you for reporting this and posting a patch.

            Have you seen the related issue MAPREDUCE-5508? I believe that fixes the bug you saw. The MAPREDUCE-5508 patch has already been reviewed and committed. We tested it at scale (hundreds of nodes/thousands of job executions) and saw no occurrences of the leak.

            jandyfish jandyfish added a comment -

            Found a bug:
            A FileSystem is opened when JobHistory creates the userLogFile, but that FileSystem instance is never closed and stays cached in the FileSystem cache.
            I have attached the patch JobInProgress_JobHistory.patch; I hope it can help.

            cnauroth Chris Nauroth added a comment -

            I have also committed this patch to branch-1-win.

            mattf Matthew Foley added a comment -

            Both patch and addendum patch have been applied by Arun, to both branch-1.2 and branch-1. Resolving as Fixed. Thanks, Sandy!


            vinodkv Vinod Kumar Vavilapalli added a comment -

            The original patch as well as the addendum look good to me. The new test looks good too. +1 for pushing the addendum.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590672/MAPREDUCE-5351-addendum-1.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3826//console

            This message is automatically generated.

            sandyr Sandy Ryza added a comment -

            Uploaded a new patch that updates the comments and the test case.

            acmurthy Arun Murthy added a comment -

            Duh, good point.

            Can you please add a comment to the addendum patch explaining the rationale for the check? Also, a test case specifically covering this bug would be good. Thanks!

            sandyr Sandy Ryza added a comment -

            In JIP.cleanupJob:

                    Path tempDir = jobtracker.getSystemDirectoryForJob(getJobID());
                    CleanupQueue.getInstance().addToQueue(
                        new PathDeletionContext(tempDir, conf));
            
                    // delete the staging area for the job and cancel delegation token
                    String jobTempDir = conf.get("mapreduce.job.dir");
                    if (jobTempDir != null && conf.getKeepTaskFilesPattern() == null &&
                        !conf.getKeepFailedTaskFiles()) {
                      Path jobTempDirPath = new Path(jobTempDir);
                      tempDirFs = jobTempDirPath.getFileSystem(conf);
                      CleanupQueue.getInstance().addToQueue(
                          new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
                    }
            

            The CleanupQueue is used twice, once with the UGI set and once without.
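
            For context, a rough sketch of how a queued deletion re-acquires a FileSystem when it runs (paraphrased, not the exact branch-1 source; names are approximate), which is what repopulates the cache after closeAllForUGI:

                import java.io.IOException;
                import java.security.PrivilegedExceptionAction;

                import org.apache.hadoop.conf.Configuration;
                import org.apache.hadoop.fs.FileSystem;
                import org.apache.hadoop.fs.Path;
                import org.apache.hadoop.security.UserGroupInformation;

                // Simplified stand-in for CleanupQueue's PathDeletionContext.
                class PathDeletionContextSketch {
                  private final Path fullPath;
                  private final Configuration conf;
                  private final UserGroupInformation ugi; // null for the system-dir case

                  PathDeletionContextSketch(Path fullPath, Configuration conf,
                                            UserGroupInformation ugi) {
                    this.fullPath = fullPath;
                    this.conf = conf;
                    this.ugi = ugi;
                  }

                  void deletePath() throws IOException, InterruptedException {
                    UserGroupInformation owner =
                        (ugi == null) ? UserGroupInformation.getLoginUser() : ugi;
                    owner.doAs(new PrivilegedExceptionAction<Void>() {
                      public Void run() throws IOException {
                        // Path.getFileSystem() goes through FileSystem.get(), so if
                        // the UGI's cached entries were already closed by
                        // closeAllForUGI(), a fresh FileSystem is created and cached
                        // here -- and never closed again.
                        FileSystem fs = fullPath.getFileSystem(conf);
                        fs.delete(fullPath, true);
                        return null;
                      }
                    });
                  }
                }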

            acmurthy Arun Murthy added a comment -

            sandyr - JIP.cleanupJob always results in a PathDeletionContext with ugi being null... how does this fix the original problem? Am I missing something? Tx.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590494/MAPREDUCE-5351-addendum.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3822//console

            This message is automatically generated.

            sandyr Sandy Ryza added a comment -

            Attaching an addendum patch that should fix the issue. I was able to reproduce the issue by running pi jobs repeatedly. With the patch, the exception no longer occurs.

            sandyr Sandy Ryza added a comment -

            Thanks. I'll look into this.

            arpitgupta Arpit Gupta added a comment -

            It's happening very frequently; quite a few MR jobs have failed because of this exception. Various example jobs, as well as jobs run through Pig, Oozie, etc., also failed because of it.

            sandyr Sandy Ryza added a comment -

            arpitgupta, how often is this occurring for you?

            arpitgupta Arpit Gupta added a comment -

            Reopening, as with this fix we are seeing jobs fail with the following exception:

            13/07/02 16:06:57 DEBUG mapred.JobClient: Printing tokens for job: job_201307020820_0012
            13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar sending #32
            13/07/02 16:06:57 DEBUG ipc.Client: IPC Client (47) connection to host/ip:50300 from hortonar got value #32
            13/07/02 16:06:57 DEBUG retry.RetryUtils: RETRY 0) policy=TryOnceThenFail, exception=org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                    at java.security.AccessController.doPrivileged(Native Method)
                    at javax.security.auth.Subject.doAs(Subject.java:396)
                    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
                    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
            
            13/07/02 16:06:57 INFO mapred.JobClient: Cleaning up the staging area hdfs://host:8020/user/hortonar/.staging/job_201307020820_0012
            13/07/02 16:06:57 ERROR security.UserGroupInformation: PriviledgedActionException as:hortonar cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                    at java.security.AccessController.doPrivileged(Native Method)
                    at javax.security.auth.Subject.doAs(Subject.java:396)
                    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
                    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
            
            org.apache.hadoop.ipc.RemoteException: java.io.IOException: Filesystem closed
                    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:383)
                    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1633)
                    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:364)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1166)
                    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:350)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3599)
                    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3561)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
                    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
                    at java.security.AccessController.doPrivileged(Native Method)
                    at javax.security.auth.Subject.doAs(Subject.java:396)
            
            acmurthy Arun Murthy added a comment -

            I just committed this. Thanks sandyr!


            tucu00 - Sorry, I didn't wait longer, in order to unblock Matt for the release. Please file a follow-on if you have any concerns. Thanks, and sorry again!

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590105/MAPREDUCE-5351-2.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3812//console

            This message is automatically generated.

            acmurthy Arun Murthy added a comment -

            +1 lgtm.

            tucu00 You ok with this too?

            sandyr Sandy Ryza added a comment -

            Thanks for looking it over. Uploading a patch that marks it as private.

            acmurthy Arun Murthy added a comment -

            sandyr Thanks for the quick turnaround.

            One nit - you should, at least, mark the new api in FS as @Private.
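
            For reference, a small sketch of what such an annotation looks like; the class and method here are hypothetical stand-ins for whatever the patch adds to FileSystem, and only the @InterfaceAudience.Private annotation is the point:

                import org.apache.hadoop.classification.InterfaceAudience;

                public class FileSystemHelperSketch {
                  // Hypothetical helper; the annotation marks it as Hadoop-internal
                  // API that downstream users should not rely on.
                  @InterfaceAudience.Private
                  public static void newInternalHelper() {
                    // ...
                  }
                }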

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12590103/MAPREDUCE-5351-1.patch
            against trunk revision .

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3811//console

            This message is automatically generated.

            sandyr Sandy Ryza added a comment -

            Uploaded a new patch with a test

            acmurthy Arun Murthy added a comment -

            Ok, makes sense. Thanks.

            sandyr Sandy Ryza added a comment -

            The motivation for the changes in JIP is that we don't want to close the FileSystem and then immediately reopen it from the CleanupQueue code. Just closing the FS instance (not doing closeAllForUGI) should be fine.

            I'll upload a patch today or tomorrow.
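
            A rough sketch of that idea (illustrative only, not the committed patch; names are made up): the deletion task closes the one FileSystem it obtained, rather than calling closeAllForUGI for the whole UGI.

                import java.io.IOException;
                import java.security.PrivilegedExceptionAction;

                import org.apache.hadoop.conf.Configuration;
                import org.apache.hadoop.fs.FileSystem;
                import org.apache.hadoop.fs.Path;
                import org.apache.hadoop.security.UserGroupInformation;

                class SingleFsCleanupSketch {
                  // Illustrative only: delete the path under the job's UGI, then close
                  // just the FileSystem instance used for that deletion so it does not
                  // linger in the FileSystem cache.
                  static void deleteAndClose(final Path p, final Configuration conf,
                                             UserGroupInformation ugi)
                      throws IOException, InterruptedException {
                    ugi.doAs(new PrivilegedExceptionAction<Void>() {
                      public Void run() throws IOException {
                        FileSystem fs = p.getFileSystem(conf);
                        try {
                          fs.delete(p, true);
                        } finally {
                          fs.close(); // close() also evicts this instance from the cache
                        }
                        return null;
                      }
                    });
                  }
                }

            As the discussion above (the "Filesystem closed" failures and the addendum patch) shows, this needs care: the instance being closed may be the same cached object that other JobTracker code is still using.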

            acmurthy Arun Murthy added a comment -

            sandyr Agree with tucu00... can you please help me understand the motivation for changes in JIP?

            It would be nice to get this into 1.2.1; I believe Matt is trying to get it done over the weekend... any chance we can expedite this?


            tucu00 Alejandro Abdelnur added a comment -

            For the changes in CleanupQueue.java, wouldn't it be enough to close the FS instance obtained there, instead of doing a closeAll?

            On how to test it: could we temporarily (for testing only) instrument the FS cache size and monitor whether it grows over time?
            sandyr Sandy Ryza added a comment -

            Uploaded a patch that should fix the issue, but haven't yet tried to verify it. I'm not sure how to write a test for this.


            People

              Assignee: sandyr Sandy Ryza
              Reporter: sandyr Sandy Ryza
              Votes: 0
              Watchers: 19

              Dates

                Created:
                Updated:
                Resolved: