Hadoop Common
  1. Hadoop Common
  2. HADOOP-6269

Missing synchronization for defaultResources in Configuration.addResource

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.20.2
    • Component/s: conf
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Configuration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions.

      1. HADOOP-6269-2v20.patch
        1 kB
        Chris Douglas
      2. HADOOP-6269-2.patch
        1 kB
        Sreekanth Ramakrishnan
      3. HADOOP-6269-1.patch
        2 kB
        Sreekanth Ramakrishnan
      4. hadoop-6269.txt
        1 kB
        Todd Lipcon
      5. hadoop-6269.txt
        3 kB
        Todd Lipcon

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #91 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/91/)
          . Fix threading issue with defaultResource in Configuration.
          Contributed by Sreekanth Ramakrishnan

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #91 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/91/ ) . Fix threading issue with defaultResource in Configuration. Contributed by Sreekanth Ramakrishnan
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #165 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/165/)
          . Fix threading issue with defaultResource in Configuration.
          Contributed by Sreekanth Ramakrishnan

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk #165 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/165/ ) . Fix threading issue with defaultResource in Configuration. Contributed by Sreekanth Ramakrishnan
          Hide
          Chris Douglas added a comment -

          +1

          I committed this. Thanks, Sreekanth!

          Show
          Chris Douglas added a comment - +1 I committed this. Thanks, Sreekanth!
          Hide
          Sreekanth Ramakrishnan added a comment -

          Patch fixes a ConcurrentModificationException which occurs due to a timing issue when a thread tries to write into a list which iterator is iterating, and is not easily replicated in a JUNIT environment, hence no test case attached.

          Show
          Sreekanth Ramakrishnan added a comment - Patch fixes a ConcurrentModificationException which occurs due to a timing issue when a thread tries to write into a list which iterator is iterating, and is not easily replicated in a JUNIT environment, hence no test case attached.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423795/HADOOP-6269-2.patch
          against trunk revision 832249.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423795/HADOOP-6269-2.patch against trunk revision 832249. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/123/console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          New patch looks good to me. +1 for the simpler approach

          Show
          Todd Lipcon added a comment - New patch looks good to me. +1 for the simpler approach
          Hide
          Sreekanth Ramakrishnan added a comment -

          Attaching patch, revert the removal of synchronized on addDefaultResources

          Show
          Sreekanth Ramakrishnan added a comment - Attaching patch, revert the removal of synchronized on addDefaultResources
          Hide
          Hemanth Yamijala added a comment -

          I looked at the patch using CopyOnWriteArrayList. It seems like a reasonable approach to me, except for one point that I will mention in a bit. First I would like to see what others think about the usage of CopyOnWriteArrayList.

          The defaultResources structure is modified only once per resource in the lifetime of an application - like a hadoop daemon or client. It is iterated over more number of times - typically every time a configuration instance is loaded in the application. Going from the javadoc for the CopyOnWriteArrayList, it does seem the usage is appropriate. The result of using this structure is a simpler code change that is fixing the problem at hand. Given these points, does the patch seem like a right direction ? (Todd, your thoughts ?)

          Now, regarding the patch, even if we do agree on using it, I suppose we might still need to synchronize addDefaultResources, because the contains check needs to be synchronized with the add check, no ?

          Show
          Hemanth Yamijala added a comment - I looked at the patch using CopyOnWriteArrayList. It seems like a reasonable approach to me, except for one point that I will mention in a bit. First I would like to see what others think about the usage of CopyOnWriteArrayList. The defaultResources structure is modified only once per resource in the lifetime of an application - like a hadoop daemon or client. It is iterated over more number of times - typically every time a configuration instance is loaded in the application. Going from the javadoc for the CopyOnWriteArrayList, it does seem the usage is appropriate. The result of using this structure is a simpler code change that is fixing the problem at hand. Given these points, does the patch seem like a right direction ? (Todd, your thoughts ?) Now, regarding the patch, even if we do agree on using it, I suppose we might still need to synchronize addDefaultResources, because the contains check needs to be synchronized with the add check, no ?
          Hide
          Sreekanth Ramakrishnan added a comment -

          Verified the patch with MAPREDUCE-1062 the exception which was reproducible every time has disappeared. Can someone review it?

          Show
          Sreekanth Ramakrishnan added a comment - Verified the patch with MAPREDUCE-1062 the exception which was reproducible every time has disappeared. Can someone review it?
          Hide
          Sreekanth Ramakrishnan added a comment -

          Attaching a file changing type of defaultResources to CopyOnWriteArrayList and removed synchronization totally. Am not attaching a test case as it is difficult to write a test case which would modify the defaultResources while loadResources is being called in Configuration

          Show
          Sreekanth Ramakrishnan added a comment - Attaching a file changing type of defaultResources to CopyOnWriteArrayList and removed synchronization totally. Am not attaching a test case as it is difficult to write a test case which would modify the defaultResources while loadResources is being called in Configuration
          Hide
          Sreekanth Ramakrishnan added a comment -

          True, it was just a suggestion that if we can avoid class level locking it is better to do or remove synchronization fully. If we use one of concurrent data structures in java, we can avoid class level locking or any kind of locking right?

          Show
          Sreekanth Ramakrishnan added a comment - True, it was just a suggestion that if we can avoid class level locking it is better to do or remove synchronization fully. If we use one of concurrent data structures in java, we can avoid class level locking or any kind of locking right?
          Hide
          Todd Lipcon added a comment -

          Hi Sreekanth,

          What's the concern with the synchronization in the patch? This is not at all a hot section (it should be called maybe 3-5 times in the course of an entire execution) so I don't think the lock granularity really matters.

          Show
          Todd Lipcon added a comment - Hi Sreekanth, What's the concern with the synchronization in the patch? This is not at all a hot section (it should be called maybe 3-5 times in the course of an entire execution) so I don't think the lock granularity really matters.
          Hide
          Sreekanth Ramakrishnan added a comment -

          Took a look at the patch.

          • Can we avoid taking a lock at Class level instead do it at much finer level i.e. taking a lock at defaultResources and acquire lock only on it?
          • Or isn't better to directly use java.util.concurrent.CopyOnWriteArrayList here? It does same thing which is being done in patch without synchronization?
          Show
          Sreekanth Ramakrishnan added a comment - Took a look at the patch. Can we avoid taking a lock at Class level instead do it at much finer level i.e. taking a lock at defaultResources and acquire lock only on it? Or isn't better to directly use java.util.concurrent.CopyOnWriteArrayList here? It does same thing which is being done in patch without synchronization?
          Hide
          Todd Lipcon added a comment -

          OK. Here's the summary of a correct (I believe) fix:

          • To avoid deadlock we need a lock ordering
          • We'll order instance locks before class locks (ie you must never lock an instance when holding a class lock)

          The issue is that addDefaultResource is currently static synchronized and locks a class instance by calling conf.reloadConfiguration (synchronized non-static)

          Attaching a patch which breaks this synchronization by holding the lock only long enough to make a defensive copy of the shared list. I don't think this is a performance issue since addDefaultResource is called very rarely.

          Tests pass. Will test on the application that keeps turning up this bug here. Review would be appreciated.

          Show
          Todd Lipcon added a comment - OK. Here's the summary of a correct (I believe) fix: To avoid deadlock we need a lock ordering We'll order instance locks before class locks (ie you must never lock an instance when holding a class lock) The issue is that addDefaultResource is currently static synchronized and locks a class instance by calling conf.reloadConfiguration (synchronized non-static) Attaching a patch which breaks this synchronization by holding the lock only long enough to make a defensive copy of the shared list. I don't think this is a performance issue since addDefaultResource is called very rarely. Tests pass. Will test on the application that keeps turning up this bug here. Review would be appreciated.
          Hide
          Todd Lipcon added a comment -

          Under some more testing this patch ends up occasionally producing a deadlock:

          Thread 30 (pool-2-thread-1):
            State: BLOCKED
            Blocked count: 4
            Waited count: 0
            Blocked on org.apache.hadoop.mapred.JobConf@30e3c624
            Blocked by 1 (main)
            Stack:
              org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:326)   <-- synch on object
              org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:257) <-- synch on class
              org.apache.hadoop.thriftfs.ThriftPluginServer.<clinit>(ThriftPluginServer.java:64)
              org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin.start(ThriftJobTrackerPlugin.java:415)
              org.apache.hadoop.util.PluginDispatcher$2.run(PluginDispatcher.java:104)
              org.apache.hadoop.util.PluginDispatcher$2.run(PluginDispatcher.java:101)
              org.apache.hadoop.util.PluginDispatcher$1.run(PluginDispatcher.java:84)
              java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              java.lang.Thread.run(Thread.java:619)
          
          
          Thread 1 (main):
            State: BLOCKED
            Blocked count: 14
            Waited count: 10
            Blocked on java.lang.Class@4987b287
            Blocked by 30 (pool-2-thread-1)
            Stack:
              org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1056) <-- synch on class
              org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1005)  <-- synch on object
              org.apache.hadoop.conf.Configuration.get(Configuration.java:382)
              org.apache.hadoop.conf.Configuration.getInt(Configuration.java:451)
              org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
              org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:168)
              org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
              org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1363)
              org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
              org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1378)
              org.apache.hadoop.fs.FileSystem.get(FileSystem.java:192)
              org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
              org.apache.hadoop.mapred.JobTracker$RecoveryManager.updateRestartCount(JobTracker.java:1166)
              org.apache.hadoop.mapred.JobTracker.offerService(JobTracker.java:1822)
              org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3711)
          

          So please don't commit

          I'm now doing a more thorough audit of the synchronization in Configuration so we can hope to solve the issue completely.

          Show
          Todd Lipcon added a comment - Under some more testing this patch ends up occasionally producing a deadlock: Thread 30 (pool-2-thread-1): State: BLOCKED Blocked count: 4 Waited count: 0 Blocked on org.apache.hadoop.mapred.JobConf@30e3c624 Blocked by 1 (main) Stack: org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:326) <-- synch on object org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:257) <-- synch on class org.apache.hadoop.thriftfs.ThriftPluginServer.<clinit>(ThriftPluginServer.java:64) org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin.start(ThriftJobTrackerPlugin.java:415) org.apache.hadoop.util.PluginDispatcher$2.run(PluginDispatcher.java:104) org.apache.hadoop.util.PluginDispatcher$2.run(PluginDispatcher.java:101) org.apache.hadoop.util.PluginDispatcher$1.run(PluginDispatcher.java:84) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:619) Thread 1 (main): State: BLOCKED Blocked count: 14 Waited count: 10 Blocked on java.lang.Class@4987b287 Blocked by 30 (pool-2-thread-1) Stack: org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1056) <-- synch on class org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1005) <-- synch on object org.apache.hadoop.conf.Configuration.get(Configuration.java:382) org.apache.hadoop.conf.Configuration.getInt(Configuration.java:451) org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180) org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:168) org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1363) org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67) org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1378) org.apache.hadoop.fs.FileSystem.get(FileSystem.java:192) org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) org.apache.hadoop.mapred.JobTracker$RecoveryManager.updateRestartCount(JobTracker.java:1166) org.apache.hadoop.mapred.JobTracker.offerService(JobTracker.java:1822) org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3711) So please don't commit I'm now doing a more thorough audit of the synchronization in Configuration so we can hope to solve the issue completely.
          Hide
          Todd Lipcon added a comment -

          Konstantin: Given that we couldn't reproduce this regularly, I think it was in actuality a different thread doing the modification. With a brief look I don't see any way in which loadResource itself could modify defaultResources - it doesn't run user code or load any classes.

          Show
          Todd Lipcon added a comment - Konstantin: Given that we couldn't reproduce this regularly, I think it was in actuality a different thread doing the modification. With a brief look I don't see any way in which loadResource itself could modify defaultResources - it doesn't run user code or load any classes.
          Hide
          Konstantin Shvachko added a comment -

          +1 The patch does the right thing synchronizing under-synchronized static variable defaultResources.

          BUT,

          The most common cause of ConcurrentModificationException is not the lack of synchronization, but rather the fact that something is modifying the collection on which you are currently iterating (in the same thread).
          In your case if somebody is modifying defaultResources inside loadResource() then you get ConcurrentModificationException. And synchronizing the loop will not help it.
          I am just checking whether this fixes what it intends to fix.

          Show
          Konstantin Shvachko added a comment - +1 The patch does the right thing synchronizing under-synchronized static variable defaultResources. BUT, The most common cause of ConcurrentModificationException is not the lack of synchronization, but rather the fact that something is modifying the collection on which you are currently iterating (in the same thread). In your case if somebody is modifying defaultResources inside loadResource() then you get ConcurrentModificationException . And synchronizing the loop will not help it. I am just checking whether this fixes what it intends to fix.
          Hide
          Todd Lipcon added a comment -

          -1 tests included. The patch doesn't appear to include any new or modified tests.

          There's no good way to regression test this since it's a very rare race. The patch is very small and should be hand verifiable, and correct operation of Configuration is ensured by hundreds of tests using it.

          Show
          Todd Lipcon added a comment - -1 tests included. The patch doesn't appear to include any new or modified tests. There's no good way to regression test this since it's a very rare race. The patch is very small and should be hand verifiable, and correct operation of Configuration is ensured by hundreds of tests using it.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419943/hadoop-6269.txt
          against trunk revision 816398.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12419943/hadoop-6269.txt against trunk revision 816398. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/44/console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          Patch against trunk

          Show
          Todd Lipcon added a comment - Patch against trunk

            People

            • Assignee:
              Sreekanth Ramakrishnan
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development