Hadoop Common
  1. Hadoop Common
  2. HADOOP-951

java.util.ConcurrentModificationException in FSNamesystem.chooseTargets

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.10.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      2007-01-26 01:14:37,509 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 call error: java.io.IOException: java.util.ConcurrentModificationException
      java.io.IOException: java.util.ConcurrentModificationException
      at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:449)
      at java.util.AbstractList$Itr.next(AbstractList.java:420)
      at org.apache.hadoop.dfs.FSNamesystem.chooseTargets(FSNamesystem.java:2282)
      at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:484)
      at org.apache.hadoop.dfs.NameNode.create(NameNode.java:238)
      at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:585)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:337)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:538)

      Not sure if it's related, but this exception happend when namenode was replicating many blocks.

      1. jstack.27108
        14 kB
        dhruba borthakur

        Issue Links

          Activity

          Hide
          dhruba borthakur added a comment -

          This problem does not happen any more because this entire section has been re-written.

          Show
          dhruba borthakur added a comment - This problem does not happen any more because this entire section has been re-written.
          Hide
          dhruba borthakur added a comment -

          I saw another instance of this one. The unit test caused the namenode to to into a 99% CPU. This thread has the FSnamesystem lock and was looping. All datanodes timed out. The stack trace shows that thread 27163 is using all 99% CPU and the state is "IN_JAVA". But it does not show the stack trace in detail. I am assuming that this is another case of using a HashMap without appropriate locking. The complete stack trace is attached here. Here is the Java bug that explains this behaviour:

          http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6423457

          Show
          dhruba borthakur added a comment - I saw another instance of this one. The unit test caused the namenode to to into a 99% CPU. This thread has the FSnamesystem lock and was looping. All datanodes timed out. The stack trace shows that thread 27163 is using all 99% CPU and the state is "IN_JAVA". But it does not show the stack trace in detail. I am assuming that this is another case of using a HashMap without appropriate locking. The complete stack trace is attached here. Here is the Java bug that explains this behaviour: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6423457
          Hide
          Milind Bhandarkar added a comment -

          The rack-awareness patch in HADOOP-692 completely rewrites chooseTargets method. Do we still want to fix this bug ? Because it would be irrelevant when that patch is applied.

          Show
          Milind Bhandarkar added a comment - The rack-awareness patch in HADOOP-692 completely rewrites chooseTargets method. Do we still want to fix this bug ? Because it would be irrelevant when that patch is applied.
          Hide
          Konstantin Shvachko added a comment -

          IMO this issue is the race condition bug and should be fixed in a traditional way.

          ConcurrentHashMap is an interesting concept, we should definitely take it into consideration
          when we decide to optimize our data structures and locking.
          I cannot estimate the amount of changes the name-node code will require at this point
          if we introduce ConcurrentHashMap because of its different semantics.
          And we should not forget in chooseTargets() the name-node is under the global lock.
          Fine grained locking under that lock seems to be a waste of resources.

          Show
          Konstantin Shvachko added a comment - IMO this issue is the race condition bug and should be fixed in a traditional way. ConcurrentHashMap is an interesting concept, we should definitely take it into consideration when we decide to optimize our data structures and locking. I cannot estimate the amount of changes the name-node code will require at this point if we introduce ConcurrentHashMap because of its different semantics. And we should not forget in chooseTargets() the name-node is under the global lock. Fine grained locking under that lock seems to be a waste of resources.
          Hide
          Tom White added a comment -

          One of the differences in using ConcurrentHashMap is that it offers slightly weaker semantics for some Map operations. For example, iterators are "weakly consistent" - that is, they don't freeze the collection while it is being iterated over. Often this doesn't matter - indeed it's one of the trade offs that we might be looking for - but it is worth bearing in mind.

          Brian Goetz has written extensively about the concurrency collections. These are worth a read (despite being a few years old): http://www-128.ibm.com/developerworks/java/library/j-jtp07233.html and http://www-128.ibm.com/developerworks/java/library/j-jtp08223/
          His book, Java Concurrency in Practice (http://jcip.net/) is good and very detailed.

          I guess we can find out memory usage by running some benchmarks. I haven't seen anything on the web comparing ConcurrentHashMap to HashMap in terms of memory usage.

          Show
          Tom White added a comment - One of the differences in using ConcurrentHashMap is that it offers slightly weaker semantics for some Map operations. For example, iterators are "weakly consistent" - that is, they don't freeze the collection while it is being iterated over. Often this doesn't matter - indeed it's one of the trade offs that we might be looking for - but it is worth bearing in mind. Brian Goetz has written extensively about the concurrency collections. These are worth a read (despite being a few years old): http://www-128.ibm.com/developerworks/java/library/j-jtp07233.html and http://www-128.ibm.com/developerworks/java/library/j-jtp08223/ His book, Java Concurrency in Practice ( http://jcip.net/ ) is good and very detailed. I guess we can find out memory usage by running some benchmarks. I haven't seen anything on the web comparing ConcurrentHashMap to HashMap in terms of memory usage.
          Hide
          Owen O'Malley added a comment -

          Catching ConcurrentModificationException is not acceptable. You are not guaranteed to always get the exception. The race condition needs to be identified and prevented.

          I haven't used the java.util.concurrent containers yet, so I don't know what the gotchas are. At the very least, we need to be conscious of memory usage in the namenode.

          Show
          Owen O'Malley added a comment - Catching ConcurrentModificationException is not acceptable. You are not guaranteed to always get the exception. The race condition needs to be identified and prevented. I haven't used the java.util.concurrent containers yet, so I don't know what the gotchas are. At the very least, we need to be conscious of memory usage in the namenode.
          Hide
          Tom White added a comment -

          Is there any reason that the heartbeats need to be a list? If not, then we could use java.util.concurrent.ConcurrentHashMap as a set (use a constant value). This would give much more concurrency than locking the whole collection. This change would also be beneficial for the other instance maps in FSNamesystem.

          Show
          Tom White added a comment - Is there any reason that the heartbeats need to be a list? If not, then we could use java.util.concurrent.ConcurrentHashMap as a set (use a constant value). This would give much more concurrency than locking the whole collection. This change would also be beneficial for the other instance maps in FSNamesystem.
          Hide
          dhruba borthakur added a comment -

          pendingTransfers currently iterates over all the elements of 'heartbeats'. Instead can it be changed to invoke heartbeats.get and then catch the ConcurrentModificationException? Maybe we can get away with not locking 'heartbeats' in the most frequently traveled codepath.

          Show
          dhruba borthakur added a comment - pendingTransfers currently iterates over all the elements of 'heartbeats'. Instead can it be changed to invoke heartbeats.get and then catch the ConcurrentModificationException? Maybe we can get away with not locking 'heartbeats' in the most frequently traveled codepath.
          Hide
          Konstantin Shvachko added a comment -

          chooseTargets() iterates over the heartbeats list. Looks like somebody else modified the list during iteration.
          chooseTargets() should synchronize on the heartbeats list, since we have the heartbeats lock separate from
          the global namespace lock now.
          We should also check other places where heartbeats is used, and make sure both locks are enforced.

          Show
          Konstantin Shvachko added a comment - chooseTargets() iterates over the heartbeats list. Looks like somebody else modified the list during iteration. chooseTargets() should synchronize on the heartbeats list, since we have the heartbeats lock separate from the global namespace lock now. We should also check other places where heartbeats is used, and make sure both locks are enforced.

            People

            • Assignee:
              Unassigned
              Reporter:
              Koji Noguchi
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development