Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      The heartbeat monitor thread encounters a ConcurrentModificationException while iterating over the "heartbeats" data structure. This occurred while the namenode was being restarted. There are actually two bugs here:

      1. The Heartbeat Monitor thread needs to catch Exceptions and continue, instead of exiting.
      2. The heartbeats data structure is protected by the heartbeats lock. The registerDatanode() method invokes removeDatanode() without acquiring the heartbeats monitor lock. This causes the ConcurrentModificationException.
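
The two fixes described above can be sketched roughly as follows. This is a simplified illustration, not the FSNamesystem code or the attached patch: the class HeartbeatSketch, the String node names, and the runMonitor helper are hypothetical, and the list is assumed to serve as its own monitor lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the namenode's heartbeat bookkeeping; not the actual FSNamesystem code.
class HeartbeatSketch {
    // The list doubles as its own monitor lock, playing the role of the "heartbeats lock".
    private final List<String> heartbeats = new ArrayList<>();

    // Fix 2: registration removes any stale entry while holding the heartbeats lock.
    void registerDatanode(String node) {
        synchronized (heartbeats) {
            removeDatanode(node);   // Java monitors are reentrant, so the nested lock is safe
            heartbeats.add(node);
        }
    }

    void removeDatanode(String node) {
        synchronized (heartbeats) {
            heartbeats.remove(node);
        }
    }

    // Iterates under the same lock, so no other thread can modify the list mid-iteration.
    int heartbeatCheck() {
        synchronized (heartbeats) {
            int seen = 0;
            for (String ignored : heartbeats) {
                seen++;
            }
            return seen;
        }
    }

    // Fix 1: a failed check is reported and the loop continues instead of exiting.
    // Returns the number of checks that threw.
    static int runMonitor(Runnable check, int rounds) {
        int failures = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                check.run();
            } catch (Exception e) {
                failures++;
                System.err.println("heartbeat check failed, continuing: " + e);
            }
        }
        return failures;
    }
}
```

Because every reader and writer synchronizes on the same object, the iteration in heartbeatCheck() can no longer observe a removal in progress. The try-catch in runMonitor is a second line of defense in case some other unexpected exception escapes a check.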

      1. heartbeatmonitor3.patch
        3 kB
        dhruba borthakur
      2. heartbeatmonitor-0.12.3.patch
        3 kB
        dhruba borthakur

        Issue Links

          Activity

          dhruba borthakur added a comment -

          namenode .out file.

          Exception in thread "org.apache.hadoop.dfs.FSNamesystem$HeartbeatMonitor@5b9d2de4" java.util.ConcurrentModificationException
              at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
              at java.util.AbstractList$Itr.next(AbstractList.java:343)
              at org.apache.hadoop.dfs.FSNamesystem.heartbeatCheck(FSNamesystem.java:1933)
              at org.apache.hadoop.dfs.FSNamesystem$HeartbeatMonitor.run(FSNamesystem.java:1697)
              at java.lang.Thread.run(Thread.java:619)
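
For readers unfamiliar with this exception, the sketch below triggers the same failure deterministically in a single thread by removing an element from an ArrayList while a for-each loop is iterating over it. The class ComodificationDemo and the node names are hypothetical, and on newer JDKs the frames in the resulting trace may be named differently from the ones above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demo class; it only illustrates the failure mode, not the namenode code path.
class ComodificationDemo {
    // Returns true if modifying the list mid-iteration raised ConcurrentModificationException.
    static boolean removeWhileIterating() {
        List<String> heartbeats = new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));
        try {
            for (String node : heartbeats) {
                if (node.equals("dn1")) {
                    heartbeats.remove(node);  // structural change behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // The iterator's next() call detects that the list changed since it was created.
            return true;
        }
    }
}
```

In the namenode the modification came from another thread rather than from the loop body, but the iterator detects it the same way.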

          Koji Noguchi added a comment -

          The namenode just prints to stderr (the .out file) and keeps on running without the HeartbeatMonitor thread.
          As a result, the namenode tries to assign blocks to dead datanodes.

          dhruba borthakur added a comment -

          Use a try-catch to ensure that the heartbeat monitor continues to run. Protect removeDatanode() by acquiring the heartbeats monitor lock.

          dhruba borthakur added a comment -

          Incorporated Raghu's comment about protecting the node.isAlive field with the heartbeats monitor lock.

          Raghu Angadi added a comment -

          Another minor change:

          Also, since this patch catches all exceptions inside a couple of threads (just like other threads), could we log the exceptions at error level instead of info? This way we can differentiate these unexpected exceptions from other expected ones while grepping the logs.
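
A rough illustration of the suggestion, assuming java.util.logging rather than whatever logging library the namenode actually uses: its closest equivalent to an "error" level is Level.SEVERE. The class MonitorLogging, the logger name, and the messages are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; the logger name and messages are illustrative only.
class MonitorLogging {
    private static final Logger LOG = Logger.getLogger("HeartbeatMonitorSketch");

    // Runs one check and returns the level at which its outcome was logged.
    static Level runOnce(Runnable check) {
        try {
            check.run();
            LOG.log(Level.INFO, "heartbeat check completed");
            return Level.INFO;
        } catch (Exception e) {
            // Unexpected failures go to the error-equivalent level so they stand out in the logs.
            LOG.log(Level.SEVERE, "unexpected exception in heartbeat monitor", e);
            return Level.SEVERE;
        }
    }
}
```

With the levels separated this way, a search for SEVERE entries returns only the unexpected failures.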

          Hadoop QA added a comment -

          +1

          http://issues.apache.org/jira/secure/attachment/12356658/heartbeatmonitor2.patch applied and successfully tested against trunk revision r534234.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/104/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/104/console

          dhruba borthakur added a comment -

          Incorporated Raghu's comments about logging levels.

          Doug Cutting added a comment -

          I just committed this. Thanks, Dhruba!

          Hadoop QA added a comment -

          Integrated in Hadoop-Nightly #77 (see http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/77/)

          dhruba borthakur added a comment -

          Patch for 0.12.3 release.


            People

            • Assignee:
              dhruba borthakur
              Reporter:
              dhruba borthakur
            • Votes:
              0
              Watchers:
              0
