Hadoop HDFS / HDFS-925

Make it harder to accidentally close a shared DFSClient

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.21.0
    • Fix Version/s: None
    • Component/s: hdfs-client
    • Labels: None

      Description

      Every so often I get stack traces telling me that the DFSClient is closed, usually in {{org.apache.hadoop.hdfs.DFSClient.checkOpen()}}. The root cause is usually that one thread has closed a shared client while another thread still holds a reference to it. If the other thread then asks for a new client, it gets one and the cache is repopulated; but if it already has one, I get to see a stack trace.

      It's effectively a race condition between clients in different threads.
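
      A minimal sketch of the race, assuming a reachable HDFS at an illustrative URI; the sleep and thread structure are only for demonstration:

{code:java}
// Sketch of the race described above: both threads receive the same cached
// FileSystem from FileSystem.get(), so one thread's close() invalidates the
// instance the other thread is still using. The URI is illustrative.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedClientRace {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final URI uri = URI.create("hdfs://localhost:8020/");

    Thread worker = new Thread(new Runnable() {
      public void run() {
        try {
          FileSystem fs = FileSystem.get(uri, conf); // shared, cached instance
          Thread.sleep(1000);                        // let the other thread close it
          fs.exists(new Path("/tmp"));               // "Filesystem closed" from checkOpen()
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    });
    worker.start();

    FileSystem fs = FileSystem.get(uri, conf);       // the *same* cached instance
    fs.close();                                      // closes it for every holder
    worker.join();
  }
}
{code}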

      1. HDFS-925.patch
        5 kB
        Steve Loughran
      2. HDFS-925.patch
        4 kB
        Steve Loughran
      3. HDFS-925.patch
        2 kB
        Steve Loughran
      4. HDFS-925.patch
        2 kB
        Steve Loughran
      5. HADOOP-5933.patch
        2 kB
        Steve Loughran
      6. HADOOP-5933.patch
        1 kB
        Steve Loughran

          Activity

          Steve Loughran added a comment -

          Some options:

          1. If the log is at debug level, generate an exception in close() and save it until the next checkOpen() call is reached; use that exception as the nested cause of the exception raised there.
          2. Some complicated reference-counting mechanism, with its own leakage problems.
          3. Add the ability to reopen things if they were in the cache and got purged.

          I've done the first of these (sketched below) to track down problems, and while I now know where I shouldn't be calling close(), there's a risk that my code will now leak filesystem clients.
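
          A minimal sketch of option 1, using hypothetical names on a simplified client rather than the actual DFSClient fields:

{code:java}
// Illustrative only: capture the close site when debug logging is on, and
// surface it as the nested cause from checkOpen(). Names are hypothetical.
import java.io.IOException;

class TracingClient {
  private static final boolean DEBUG = true;   // stand-in for LOG.isDebugEnabled()

  private volatile boolean clientRunning = true;
  private volatile Exception closeSite;        // records who called close()

  public void close() {
    if (DEBUG) {
      closeSite = new Exception("client closed here"); // captures the caller's stack
    }
    clientRunning = false;
  }

  void checkOpen() throws IOException {
    if (!clientRunning) {
      IOException ioe = new IOException("Filesystem closed");
      if (closeSite != null) {
        ioe.initCause(closeSite);              // points the reader at the close()
      }
      throw ioe;
    }
  }
}
{code}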

          Steve Loughran added a comment -

          This is the first solution: some extra diagnostics. Its main cost when the log is not set to debug is one extra reference.

          I don't really like log settings changing program behaviour, so I'm not sure anyone will want to check this patch in; it's just what I put together to track down my problems. The real problem is that the caching system isn't compatible with users of DFSClient calling close() on their clients.

          Hong Tang added a comment -

          I am not sure it is a good idea to alter code execution paths based on logging levels: it can lead to code that is harder to maintain and test, and to logging messages that are harder to understand (without reading the code, a user would expect debug-level logging to be a superset of info/warn logging).

          Raghu Angadi added a comment -

          > I am not sure if it would be a good idea to alter code execution paths based on logging levels

          +1. If this feature is committed, the behavior should be the same with or without debug enabled. As a practical matter, it is pretty hard to ask users to enable debug, since that prints boatloads of other stuff.

          +1 for the feature. Given how hard it is for users to debug such problems, this seems useful. Users still need to add code to call getCause(), but that is OK.

          Raghu Angadi added a comment -

          > If the other thread then asks for a new client it will get one and the cache repopulated but if has one already, then I get to see a stack trace.

          Steve, what is the issue here? I didn't think there was a cache for DFSClients. Can you post the stack trace from your test?

          Steve Loughran added a comment -

          There isn't direct caching for DFSClients, but there is in FileSystem.get(), which is how I've been getting DFSClient instances (and those of other filesystems).

          Up until March I could have different components get filesystems, do some work, and then close them, but now in SVN HEAD I'm seeing stack traces when different threads try to work with filesystem instances they have been holding on to. One thread running a TaskTracker is happily spinning away, until something on a different thread does a quick check that a specific file exists in the filesystem and closes the client afterwards.

          My stack traces are here: http://jira.smartfrog.org/jira/browse/SFOS-1208

          The semantics of FileSystem.get() have changed; if I moved my code to the new FileSystem.newInstance() method then things should work again. That doesn't mean we don't benefit from tracing who closed the instance, only that anyone else doing work in different threads who gets filesystem clients by way of FileSystem.get() is going to hit the same problems. I just saw them first.
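
          A short sketch of the distinction, assuming the 0.21+ API; the URI is illustrative:

{code:java}
// Sketch: the cached instance from FileSystem.get() is shared by every
// caller, while FileSystem.newInstance() hands back a private instance.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class GetVersusNewInstance {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://localhost:8020/");

    FileSystem shared1 = FileSystem.get(uri, conf);
    FileSystem shared2 = FileSystem.get(uri, conf);
    System.out.println(shared1 == shared2);  // true: the same cached object

    FileSystem mine = FileSystem.newInstance(uri, conf);
    System.out.println(mine == shared1);     // false: a private instance

    mine.close();    // safe: affects only this instance
    shared1.close(); // dangerous: invalidates shared2 and every other holder
  }
}
{code}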

          Steve Loughran added a comment -

          Revisiting this code, we could log the location where a filesystem was closed regardless of debug settings. This adds the cost of creating a new exception to the close() operation, but it ensures consistent behavior and simplifies testing: create a filesystem, close it twice, and see what comes back.

          Steve Loughran added a comment -

          This revision always logs the location where the client was closed, and has a package-scoped accessor for the value. It should be possible to write a test for this now.

          Steve Loughran added a comment -

          Patch in sync with HEAD.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12435708/HDFS-925.patch
          against trunk revision 908628.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/230/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/230/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/230/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/230/console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12435708/HDFS-925.patch
          against trunk revision 1051669.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.balancer.TestBalancer
          org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
          org.apache.hadoop.hdfs.TestHDFSTrash

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/32//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/32//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/32//console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          Is this still a problem?
          Could you please provide a test case for the failure condition?
          In the patch, I didn't like that you effectively pass a message via the Exception variable.

          Steve Loughran added a comment -

          I'm seeing this less as a problem, mostly because I'm doing less Hadoop work directly, and what I am doing happens more in separate processes. I fear the problem still exists, though.

          Steve Loughran added a comment -

          Patch against trunk

          Steve Loughran added a comment -

          This patch adds a test; it checks for the text and the nested exception.

          Steve Loughran added a comment -

          The latest patch adds a test for the expected behaviour:

          1. If you call FileSystem.get(conf) you get a shared client; if you ask for a new instance, you get a different instance.

          2. If you close any of the shared instances, all other attempts to use that shared instance will fail.

          3. If you call FileSystem.newInstance(conf), that instance is independent.

          4. If, after closing a shared client instance, you call FileSystem.get(conf), you get a new instance.

          This whole thing dates back to the incompatible change that made shared clients cached. It doesn't work reliably across threads: either the clients are reference counted, or you call newInstance() to get a thread-safe instance. I think I've moved my code to the new API call; it's been out there for a while now, but this helps track down problems for people who haven't moved.

          Even if people don't want the extra diagnostics (pity), the test could be used to verify the semantics of FileSystem.get() and FileSystem.newInstance().
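
          A minimal sketch of such a test, assuming JUnit 4 and the old MiniDFSCluster constructor; this is illustrative, not the patch attached to this issue:

{code:java}
// Illustrative test of the four behaviours listed above.
import static org.junit.Assert.*;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.Test;

public class TestSharedClientSemantics {

  @Test
  public void testSharedAndPrivateInstances() throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
    try {
      FileSystem shared1 = FileSystem.get(conf);
      FileSystem shared2 = FileSystem.get(conf);
      assertSame("get() should return the cached instance", shared1, shared2);

      FileSystem priv = FileSystem.newInstance(conf);
      assertNotSame("newInstance() should be private", shared1, priv);

      shared1.close();
      try {
        shared2.exists(new Path("/"));
        fail("expected an IOException from the closed shared client");
      } catch (IOException expected) {
        // with the patch, the nested cause names the close() site
      }

      assertTrue(priv.exists(new Path("/"))); // private instance is unaffected
      priv.close();

      // after a close, get() repopulates the cache with a fresh instance
      assertNotSame(shared1, FileSystem.get(conf));
    } finally {
      cluster.shutdown();
    }
  }
}
{code}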

          Steve Loughran added a comment -

          Reviewing the JT code, this is still a problem: the JobTracker does call FileSystem.get(); MAPREDUCE-437 discusses this. This patch merely makes it easier to diagnose who closed the shared instance; it does not prevent it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12470272/HDFS-925.patch
          against trunk revision 1072023.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/200//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/200//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/200//console

          This message is automatically generated.

          Steve Loughran added a comment -

          Given that in 0.20.x there's a hack to get a private client, and there's a proper API for this in 0.21+, I'm going to propose closing this as WONTFIX. Instead, move services like the JT, and everything in the mini clusters, to using private instances that are closed when the in-VM clusters are terminated.


            People

            • Assignee: Steve Loughran
            • Reporter: Steve Loughran
            • Votes: 0
            • Watchers: 6
