Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      The heartbeat monitor thread encounters a ConcurrentModificationException while iterating over the "heartbeats" data structure. This occurred while the namenode was being restarted. There are actually two bugs here:

      1. The Heartbeat Monitor thread needs to catch Exceptions and continue, instead of exiting.
      2. The heartbeats data structure is protected by the heartbeats lock. The registerDatanode() method invokes removeDatanode() without acquiring the heartbeats monitor lock. This causes the ConcurrentModificationException.
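The two fixes described above can be sketched roughly as follows. This is a minimal illustration, not the FSNamesystem code or the attached patch: the class `HeartbeatSketch`, the helper `runCheckSafely`, and the use of a plain `List<String>` for the heartbeats structure are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the namenode's heartbeat bookkeeping.
class HeartbeatSketch {
    // Shared list guarded by its own monitor, playing the role of "heartbeats".
    private final List<String> heartbeats = new ArrayList<>();

    // Fix 2: removal takes the same lock that the monitor iterates under.
    void removeDatanode(String node) {
        synchronized (heartbeats) {
            heartbeats.remove(node);
        }
    }

    // Re-registration drops any stale entry first, all under the heartbeats lock.
    // Java monitors are reentrant, so the nested removeDatanode() call is safe.
    void registerDatanode(String node) {
        synchronized (heartbeats) {
            removeDatanode(node);
            heartbeats.add(node);
        }
    }

    // Iterates under the lock, so a concurrent register/remove cannot
    // invalidate the iterator. Returns the number of entries visited.
    int heartbeatCheck() {
        synchronized (heartbeats) {
            int seen = 0;
            for (String node : heartbeats) {
                seen++;
            }
            return seen;
        }
    }

    // Fix 1: catch and report any exception instead of letting the thread die.
    // Returns true if the check completed, false if it threw.
    static boolean runCheckSafely(Runnable check) {
        try {
            check.run();
            return true;
        } catch (Exception e) {
            System.err.println("heartbeat check failed, continuing: " + e);
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatSketch s = new HeartbeatSketch();
        // One thread re-registers datanodes while the main thread plays monitor.
        Thread registrar = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                s.registerDatanode("dn" + (i % 8));
            }
        });
        registrar.start();
        for (int i = 0; i < 10_000; i++) {
            runCheckSafely(() -> s.heartbeatCheck());
        }
        registrar.join();
        System.out.println("live datanodes: " + s.heartbeatCheck());
    }
}
```

Either fix alone is incomplete: the lock removes the known cause of the exception, while the catch keeps the monitor alive if some other unexpected exception appears later.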

      1. heartbeatmonitor3.patch
        3 kB
        dhruba borthakur
      2. heartbeatmonitor-0.12.3.patch
        3 kB
        dhruba borthakur

        Issue Links

          Activity

          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Open -> Patch Available | 19h 59m | 1 | dhruba borthakur | 02/May/07 20:13
          Patch Available -> Resolved | 2h 24m | 1 | Doug Cutting | 02/May/07 22:37
          Resolved -> Closed | 36d 23h 2m | 1 | Doug Cutting | 08/Jun/07 21:40
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          dhruba borthakur made changes -
          Attachment heartbeatmonitor-0.12.3.patch [ 12356809 ]
          dhruba borthakur added a comment -

          Patch for 0.12.3 release.
          Hadoop QA added a comment -

          Integrated in Hadoop-Nightly #77 (see http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/77/)
          Doug Cutting made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.13.0 [ 12312348 ]
          Doug Cutting added a comment -

          I just committed this. Thanks, Dhruba!
          dhruba borthakur made changes -
          Attachment heartbeatmonitor2.patch [ 12356658 ]
          dhruba borthakur made changes -
          Attachment heartbeatmonitor.patch [ 12356587 ]
          dhruba borthakur made changes -
          Attachment heartbeatmonitor3.patch [ 12356662 ]
          dhruba borthakur added a comment -

          Incorporated Raghu's comments about logging levels.
          Hadoop QA added a comment -

          +1 http://issues.apache.org/jira/secure/attachment/12356658/heartbeatmonitor2.patch applied and successfully tested against trunk revision r534234.
          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/104/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/104/console
          Raghu Angadi added a comment -

          Another minor change:

          Also, since this patch catches all exceptions inside a couple of threads (just like other threads), could we log the exceptions at error level instead of info? This way we can differentiate these unexpected exceptions from other expected ones while grepping the logs.
          dhruba borthakur made changes -
          Assignee dhruba borthakur [ dhruba ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          dhruba borthakur made changes -
          Attachment heartbeatmonitor2.patch [ 12356658 ]
          dhruba borthakur added a comment -

          Incorporated Raghu's comments about protecting the node.isAlive field with the heartbeats monitor lock.
          dhruba borthakur made changes -
          Attachment heartbeatmonitor.patch [ 12356587 ]
          dhruba borthakur added a comment -

          Use a try-catch to ensure that the heartbeat monitor continues to run. Protect removeDataNodes by using the heartbeats monitor lock.
          Koji Noguchi made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          Koji Noguchi added a comment -

          The namenode just prints to stderr (the .out file) and keeps on running without the HeartbeatMonitor thread.
          As a result, the namenode tries to assign blocks to dead datanodes.
          dhruba borthakur added a comment -

          namenode .out file.
          Exception in thread "org.apache.hadoop.dfs.FSNamesystem$HeartbeatMonitor@5b9d2de4" java.util.ConcurrentModificationException
              at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
              at java.util.AbstractList$Itr.next(AbstractList.java:343)
              at org.apache.hadoop.dfs.FSNamesystem.heartbeatCheck(FSNamesystem.java:1933)
              at org.apache.hadoop.dfs.FSNamesystem$HeartbeatMonitor.run(FSNamesystem.java:1697)
              at java.lang.Thread.run(Thread.java:619)
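
For readers unfamiliar with the failure in the trace above: the list iterator's `next()` calls `checkForComodification`, which throws if the list was structurally modified after the iterator was created. The following is a minimal single-threaded sketch of that behaviour. The class `ComodificationDemo` and the node names are hypothetical, and the namenode case involved a second thread doing the removal rather than the iterating thread itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demo; not taken from the namenode source.
class ComodificationDemo {
    // Returns true if iterating while removing triggers the fail-fast check.
    static boolean iterateWhileRemoving() {
        List<String> heartbeats = new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));
        try {
            for (String node : heartbeats) {
                if (node.equals("dn1")) {
                    // Structural change behind the iterator's back.
                    heartbeats.remove(node);
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // Thrown by the iterator's next() on the following step.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + iterateWhileRemoving());
    }
}
```

The fail-fast check is best-effort, which is why the original bug surfaced only occasionally, here during a namenode restart.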
          dhruba borthakur made changes -
          Link This issue is related to HADOOP-1255 [ HADOOP-1255 ]
          dhruba borthakur made changes -
          Field: Summary
          Original Value: heartbeat monitor thread goea away
          New Value: heartbeat monitor thread goes away
          dhruba borthakur created issue -

            People

            • Assignee:
              dhruba borthakur
              Reporter:
              dhruba borthakur
            • Votes:
              0
              Watchers:
              0
