Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.1, 0.24.0
    • Fix Version/s: 0.23.1
    • Component/s: mrv2, nodemanager
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: Fixed a deadlock in the NodeManager's local directories handling service.
    • Target Version/s:

      Description

      MAPREDUCE-3121 cloned the Configuration object in LocalDirsHandlerService.init() so that other components could not access it. But since that configuration is used to create the local FileSystem object in LocalDirAllocator.AllocatorPerContext, and the same FileSystem object is reused by ShuffleHandler.Shuffle.localDirAllocator, a deadlock occurs when LocalDirsHandlerService and ShuffleHandler access this configuration object together with the AllocatorPerContext object.
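      The cycle described above can be illustrated with a minimal, hypothetical Java sketch; plain Object monitors stand in for the shared Configuration and the AllocatorPerContext instance, and nothing below is actual NodeManager code:

      // Two threads acquire the same two monitors in opposite order. With the sleeps below,
      // both threads end up blocked forever, which is the shape of the deadlock in this issue.
      public class LockOrderDeadlockSketch {
        public static void main(String[] args) {
          final Object conf = new Object();             // stands in for the shared Configuration
          final Object allocatorContext = new Object(); // stands in for AllocatorPerContext

          Thread containersLauncher = new Thread(() -> {
            synchronized (conf) {                       // holds the Configuration monitor ...
              pause(100);
              synchronized (allocatorContext) { }       // ... then needs the allocator context
            }
          }, "ContainersLauncher");

          Thread shuffleWorker = new Thread(() -> {
            synchronized (allocatorContext) {           // holds the allocator context ...
              pause(100);
              synchronized (conf) { }                   // ... then needs the Configuration monitor
            }
          }, "New I/O server worker");

          containersLauncher.start();
          shuffleWorker.start();
        }

        private static void pause(long millis) {
          try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
        }
      }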

      1. deadlock.txt (7 kB) - Ravi Gummadi
      2. 3519.v1.patch (3 kB) - Ravi Gummadi
      3. 3519.patch (4 kB) - Ravi Gummadi

        Activity

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #923 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/923/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212680
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #890 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/890/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212680
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Build #121 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/121/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.
        svn merge -c 1212680 --ignore-ancestry ../../trunk

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212681
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #103 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/103/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.
        svn merge -c 1212680 --ignore-ancestry ../../trunk

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212681
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #1416 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1416/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212680
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Commit #276 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/276/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.
        svn merge -c 1212680 --ignore-ancestry ../../trunk

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212681
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #1394 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1394/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212680
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Common-0.23-Commit #266 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/266/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.
        svn merge -c 1212680 --ignore-ancestry ../../trunk

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212681
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #1468 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1468/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212680
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Commit #256 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/256/)
        MAPREDUCE-3519. Fixed a deadlock in NodeManager LocalDirectories's handling service. Contributed by Ravi Gummadi.
        svn merge -c 1212680 --ignore-ancestry ../../trunk

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1212681
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        Vinod Kumar Vavilapalli added a comment -

        I just committed this to trunk and branch-0.23. Thanks Ravi!

        Vinod Kumar Vavilapalli added a comment -

        Configuration seems to have a host of other synchronization problems; we are either lucky not to hit them or simply unaware of them.

        Looked at the latest patch. This is fine, especially in light of Ravi's comment above.

        +1.

        Robert Joseph Evans added a comment -

        I am +1 on this. Thanks for the quick fix.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12506731/3519.v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1415//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1415//console

        This message is automatically generated.

        Ravi Gummadi added a comment -

        OK. I created HADOOP-7900 for the issue of 2 calls to conf.get() in confChanged(). Uploaded a patch there with the suggested code change.

        Robert Joseph Evans added a comment -

        I moved this issue to Blocker because we have seen it show up in our tests: it stops some tests from completing and is blocking our test runs.

        Robert Joseph Evans added a comment -

        I like the patch. I also traced down from where the synchronized call happens, looking for a potential race condition between reads and writes. The only place I found that uses conf from LocalDirsHandlerService.getLocalPathForWrite is LocalDirAllocator.confChanged. Inside that code we access conf three times: once to get the local file system, which I don't see as an issue, but twice to get the value of the contextConfigItemName, first to see if the value has changed and second to turn that value into an array. It would be better to change the code so that it calls conf.get only once for that value and then turns it into a trimmed array of Strings by calling StringUtils.getTrimmedStrings, which is what the second method does internally.
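        A rough sketch of that suggestion is below; the class, field, and parameter names are illustrative rather than the actual LocalDirAllocator code, and the real change was done separately (see the HADOOP-7900 comment above):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.util.StringUtils;

        // Sketch only: read the property a single time, then derive both the "has it changed?"
        // check and the trimmed array from that one value instead of calling conf.get() twice.
        class ConfChangedSketch {
          private String savedLocalDirs = "";

          String[] confChanged(Configuration conf, String contextCfgItemName) {
            String newLocalDirs = conf.get(contextCfgItemName);            // single read of the property
            if (newLocalDirs == null) {
              newLocalDirs = "";
            }
            String[] dirs = StringUtils.getTrimmedStrings(newLocalDirs);   // what the second call did internally
            if (!newLocalDirs.equals(savedLocalDirs)) {
              savedLocalDirs = newLocalDirs;
              // ... rebuild the per-context directory state from 'dirs' ...
            }
            return dirs;
          }
        }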

        Ravi Gummadi added a comment -

        Attaching a patch that removes the synchronized(conf) blocks from LocalDirsHandlerService.java.

        Ravi Gummadi added a comment -

        Changing Configuration.java to have a ReadWriteLock, and thereby affecting everybody (not just hadoop-mapreduce-project), seems complex and needs more thought about dependencies, because almost every class in Hadoop uses Configuration. We can move to this ReadWriteLock model in the long run (if possible).

        In the short term, I don't see a simple way to handle this deadlock without removing the synchronized(conf) blocks in the methods of LocalDirsHandlerService.java.

        Further, I don't see a problem with removing these synchronized(conf) blocks from LocalDirsHandlerService.java. Nowhere else in the Hadoop source code are there such synchronization blocks around reads from or writes to configuration object(s). Since the methods in Configuration.java like getProps() are already synchronized, and setting a property via Configuration.properties.setProperty() and reading one via Configuration.properties.getProperty() appear to be atomic, I don't see an issue in removing these synchronized(conf) blocks from LocalDirsHandlerService.java.

        Please provide your comments/suggestions.
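        For illustration, a minimal sketch of the direction proposed here, with simplified names and setup (this is not the actual 3519.v1.patch):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.LocalDirAllocator;
        import org.apache.hadoop.fs.Path;

        // Call the allocator without wrapping it in synchronized (conf), so this thread never
        // holds the Configuration monitor while LocalDirAllocator takes its own
        // AllocatorPerContext monitor; Configuration's getters are already synchronized.
        class DirsHandlerSketch {
          private final Configuration conf;
          private final LocalDirAllocator localDirsAllocator;

          DirsHandlerSketch(Configuration conf, String localDirsKey) {
            this.conf = conf;
            this.localDirsAllocator = new LocalDirAllocator(localDirsKey);
          }

          Path getLocalPathForWrite(String pathStr) throws IOException {
            // before the fix, this call sat inside a synchronized (conf) { ... } block
            return localDirsAllocator.getLocalPathForWrite(pathStr, conf);
          }
        }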

        Ravi Gummadi added a comment -

        As reads of the Configuration object happen often and writes rarely do, having separate read and write locks should improve performance significantly.

        Robert Joseph Evans added a comment -

        We have seen one other deadlock in the past when trying to lock the Configuration object. The other one I am thinking of was specifically within Configuration itself, when one thread was writing out the configuration to a stream and the reader of that stream, in the same process, was trying to read in a configuration value. That was also fixed by making sure that the configuration objects were separate instances.

        In YARN, with the message-passing framework and how initialization happens automatically, we are moving much more towards a single global Configuration object shared by all parts of the system. I think it is time for us to look at moving to read/write locks on the Configuration class. This will not solve all cases of deadlocks, because we could still have a writer holding the lock while someone else trying to read or write holds a different lock. But both of the situations I have seen so far are two readers deadlocked on reads, no writes at all; with read/write locks both reads would be able to finish without blocking. This would also have the advantage of a potential performance boost for anyone using Configuration. It would break backwards compatibility for anyone who is doing what we see here and synchronizing on Configuration, so it would have to be part of the release notes and we would have to inform downstream projects. But I think the benefits outweigh the drawbacks, especially for YARN where we are already changing a lot.
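        A self-contained sketch of the read/write-lock idea (a hypothetical stand-in class, not part of this patch and not Configuration itself):

        import java.util.Properties;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        // With a ReadWriteLock, any number of readers can hold the read lock at once, so two
        // threads that only read never block each other; writers still get exclusive access.
        class RwLockedProps {
          private final Properties props = new Properties();
          private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

          String get(String name) {
            lock.readLock().lock();            // shared: concurrent readers are allowed
            try {
              return props.getProperty(name);
            } finally {
              lock.readLock().unlock();
            }
          }

          void set(String name, String value) {
            lock.writeLock().lock();           // exclusive: readers and other writers wait
            try {
              props.setProperty(name, value);
            } finally {
              lock.writeLock().unlock();
            }
          }
        }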

        Ravi Gummadi added a comment -

        The attached patch may not solve the issue, because the ShuffleHandler still gets access to the updated conf through the FileSystem object (the FileSystem object saves the config and reuses it). So whichever component creates the local FileSystem object first, its config will be reused for all other accesses to the local FileSystem, and the ShuffleHandler thus still ends up trying to take the lock on the cloned conf.

        Investigating further...
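        A small hypothetical demo of that caching behavior (class name and setup invented for illustration; FileSystem.getLocal() typically returns a cached instance, so the configuration passed by the first caller is the one every later caller ends up sharing):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.LocalFileSystem;

        public class SharedLocalFsConfDemo {
          public static void main(String[] args) throws Exception {
            Configuration clonedConf = new Configuration();  // stands in for the clone made in init()
            Configuration otherConf  = new Configuration();  // stands in for another component's conf

            LocalFileSystem fs1 = FileSystem.getLocal(clonedConf);
            LocalFileSystem fs2 = FileSystem.getLocal(otherConf);

            // With the default cache enabled, both calls return the same instance, and that
            // instance keeps the configuration it was first initialized with.
            System.out.println("same FileSystem instance: " + (fs1 == fs2));
            System.out.println("second caller sees first conf: " + (fs2.getConf() == clonedConf));
          }
        }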

        Ravi Gummadi added a comment -

        Attaching the deadlock details with better formatting, as the earlier comment is a little hard to follow.

        Ravi Gummadi added a comment -

        Attaching patch with the fix.

        With this patch, nothing other than LocalDirsHandlerService can access the configuration that gets updated based on disk health.

        Please review and provide your comments.

        Ravi Gummadi added a comment -

        Here is one deadlock that was observed:

        Found one Java-level deadlock:
        =============================
        "ContainersLauncher #32":
        waiting to lock monitor 0x08e84674 (object 0xb5bec810, a org.apache.hadoop.conf.Configuration),
        which is held by "ContainersLauncher #27"
        "ContainersLauncher #27":
        waiting to lock monitor 0x08ea65b0 (object 0xb5cd2898, a org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext),
        which is held by "New I/O server worker #1-3"
        "New I/O server worker #1-3":
        waiting to lock monitor 0x08e84674 (object 0xb5bec810, a org.apache.hadoop.conf.Configuration),
        which is held by "ContainersLauncher #27"

        Java stack information for the threads listed above:
        ===================================================
        "ContainersLauncher #32":
        at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLogPathForWrite(LocalDirsHandlerService.java:283)

        • waiting to lock <0xb5bec810> (a org.apache.hadoop.conf.Configuration)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:126)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)
          "ContainersLauncher #27":
          at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:331)
        • waiting to lock <0xb5cd2898> (a org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext)
          at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
          at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
          at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
          at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:262)
        • locked <0xb5bec810> (a org.apache.hadoop.conf.Configuration)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:150)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)
          "New I/O server worker #1-3":
          at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1373)
        • waiting to lock <0xb5bec810> (a org.apache.hadoop.conf.Configuration)
          at org.apache.hadoop.conf.Configuration.get(Configuration.java:569)
          at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:586)
          at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:730)
          at org.apache.hadoop.fs.FileSystem.getDefaultBlockSize(FileSystem.java:1805)
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:429)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:315)
          at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1069)
          at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:425)
        • locked <0xb5cd2898> (a org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext)
          at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:164)
          at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.sendMapOutput(ShuffleHandler.java:464)
          at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:397)
          at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
          at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
          at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:148)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
          at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
          at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:116)
          at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
          at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
          at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndfireMessageReceived(ReplayingDecoder.java:522)
          at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:506)
          at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:443)
          at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
          at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:540)
          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
          at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
          at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
          at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
          at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
          at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
          at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)

        Found 1 deadlock.


          People

          • Assignee: Ravi Gummadi
          • Reporter: Ravi Gummadi
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
              Updated:
              Resolved:
