Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      The heartbeat monitor thread encounters a ConcurrentModificationException while iterating over the "heartbeats" data structure. This occurred while the namenode was being restarted. There are actually two bugs here:

      1. The Heartbeat Monitor thread needs to catch Exceptions and continue, instead of exiting.
      2. The heartbeats data structure is protected by the heartbeats lock. The registerDatanode() method invokes removeDatanode() without acquiring the heartbeats monitor lock. This causes the ConcurrentModificationException.
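
The two fixes described above can be sketched roughly as follows. This is a minimal illustration, not the committed patch: `HeartbeatSketch` is a hypothetical stand-in for FSNamesystem, the `String` node type and the `runMonitor` helper are assumptions, and only the names `heartbeats`, `registerDatanode`, `removeDatanode` and `heartbeatCheck` come from the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FSNamesystem. Only the names heartbeats,
// registerDatanode, removeDatanode and heartbeatCheck come from the issue;
// the String node type and runMonitor are assumptions for illustration.
final class HeartbeatSketch {
    private final List<String> heartbeats = new ArrayList<>();

    // Fix 2: mutate heartbeats only while holding the heartbeats monitor lock.
    void registerDatanode(String node) {
        synchronized (heartbeats) {
            removeDatanode(node);   // drop a stale entry left from before a restart
            heartbeats.add(node);
        }
    }

    void removeDatanode(String node) {
        synchronized (heartbeats) { // Java monitors are reentrant, so nesting is safe
            heartbeats.remove(node);
        }
    }

    // Iterates under the same lock, so no other thread can modify the list mid-scan.
    int heartbeatCheck() {
        synchronized (heartbeats) {
            int live = 0;
            for (String node : heartbeats) {
                live++;
            }
            return live;
        }
    }

    // Fix 1: a failed check is reported and the loop carries on.
    // A real monitor would loop until shutdown and sleep between checks.
    int runMonitor(int rounds, Runnable check) {
        int failures = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                check.run();
            } catch (Exception e) {
                failures++;
                System.err.println("heartbeat check failed: " + e);
            }
        }
        return failures;
    }
}
```

Because both the iteration and every mutation synchronize on the same object, a registration that arrives during a scan waits for the scan to finish instead of invalidating its iterator.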

      1. heartbeatmonitor-0.12.3.patch
        3 kB
        dhruba borthakur
      2. heartbeatmonitor3.patch
        3 kB
        dhruba borthakur

        Issue Links

          Activity

          dhruba borthakur created issue -
          dhruba borthakur made changes -
          Field Original Value New Value
          Summary heartbeat monitor thread goea away heartbeat monitor thread goes away
          dhruba borthakur made changes -
          Link This issue is related to HADOOP-1255 [ HADOOP-1255 ]
          dhruba borthakur added a comment -

          namenode .out file:

          Exception in thread "org.apache.hadoop.dfs.FSNamesystem$HeartbeatMonitor@5b9d2de4" java.util.ConcurrentModificationException
              at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
              at java.util.AbstractList$Itr.next(AbstractList.java:343)
              at org.apache.hadoop.dfs.FSNamesystem.heartbeatCheck(FSNamesystem.java:1933)
              at org.apache.hadoop.dfs.FSNamesystem$HeartbeatMonitor.run(FSNamesystem.java:1697)
              at java.lang.Thread.run(Thread.java:619)
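
The trace ends in the fail-fast check that list iterators perform on next(). As a rough illustration of that check only, the sketch below triggers the same exception in a single thread by removing an element while iterating. In the issue the modification came from another thread, and the class name, method name and node names here are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demo class; not part of Hadoop.
final class ComodificationDemo {
    // Returns true if iterating while removing raised the fail-fast exception.
    static boolean removeWhileIterating() {
        List<String> heartbeats = new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));
        try {
            for (String node : heartbeats) {
                if (node.equals("dn1")) {
                    // Structural change made behind the iterator's back.
                    heartbeats.remove(node);
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // Thrown by the iterator's next() on the step after the removal.
            return true;
        }
    }
}
```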

          Koji Noguchi added a comment -

          Namenode just prints to stderr(.out file) and keeps on running without HeartbeatMonitor thread.
          As a result, namenode tries to assign blocks to the dead datanodes.

          Koji Noguchi made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          dhruba borthakur added a comment -

          Use a try-catch to ensure that the heartbeat monitor continues to run. Protect removeDataNodes by using the heartbeats monitor lock.

          dhruba borthakur made changes -
          Attachment heartbeatmonitor.patch [ 12356587 ]
          dhruba borthakur added a comment -

          Incorporated Raghu's comments about protecting the node.isAlive field by using the heartbeats monitor lock.

          dhruba borthakur made changes -
          Attachment heartbeatmonitor2.patch [ 12356658 ]
          dhruba borthakur made changes -
          Assignee dhruba borthakur [ dhruba ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Raghu Angadi added a comment -

          Another minor change:

          Also, since this patch catches all exceptions inside a couple of threads (just like other threads), could we log the exceptions at error level instead of info? This way we can differentiate these unexpected exceptions from other expected ones while grepping the logs.
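
A small sketch of the logging suggestion follows. It uses java.util.logging so that it stays self-contained, with Level.SEVERE standing in for "error level"; the logger Hadoop actually used is not shown in this issue, and the class name, logger name and `runCheck` helper are assumptions.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper; java.util.logging is a stand-in for the project's logger.
final class MonitorLogging {
    static final Logger LOG = Logger.getLogger("HeartbeatMonitorSketch");

    // Runs one check. An unexpected exception is logged at SEVERE (the
    // java.util.logging counterpart of "error") rather than INFO, so it
    // stands out when grepping the logs.
    static boolean runCheck(Runnable check) {
        try {
            check.run();
            return true;
        } catch (Exception e) {
            LOG.log(Level.SEVERE, "Unexpected exception in heartbeat monitor", e);
            return false;
        }
    }
}
```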

          Hadoop QA added a comment -

          +1 http://issues.apache.org/jira/secure/attachment/12356658/heartbeatmonitor2.patch applied and successfully tested against trunk revision r534234.
          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/104/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/104/console

          dhruba borthakur added a comment -

          Incorporated Raghu's comments about logging levels.

          dhruba borthakur made changes -
          Attachment heartbeatmonitor3.patch [ 12356662 ]
          dhruba borthakur made changes -
          Attachment heartbeatmonitor.patch [ 12356587 ]
          dhruba borthakur made changes -
          Attachment heartbeatmonitor2.patch [ 12356658 ]
          Doug Cutting committed 534624 (2 files)
          Reviews: none

          HADOOP-1312. Fix a ConcurrentModificationException in NameNode that killed the heartbeat monitoring thread. Contributed by Dhruba.

          Doug Cutting added a comment -

          I just committed this. Thanks, Dhruba!

          Doug Cutting made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.13.0 [ 12312348 ]
          Hadoop QA added a comment -

          Integrated in Hadoop-Nightly #77 (see http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/77/)

          dhruba borthakur added a comment -

          Patch for 0.12.3 release.

          dhruba borthakur made changes -
          Attachment heartbeatmonitor-0.12.3.patch [ 12356809 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]

            People

            • Assignee:
              dhruba borthakur
              Reporter:
              dhruba borthakur
            • Votes:
              0
              Watchers:
              0
