  1. Hadoop Common
  2. HADOOP-15684

triggerActiveLogRoll gets stuck on a dead NameNode when a ConnectTimeoutException occurs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha1
    • Fix Version/s: 3.2.0, 3.0.4, 3.1.2
    • Component/s: ha
    • Hadoop Flags: Reviewed
    • Release Note: When NameNode A sends a RollEditLog request to a remote NN and either the remote NN is in standby state or an IOException occurs, A should move on to the next NN instead of getting stuck on the problematic one. This patch is based on trunk.
    • Flags: Patch

    Description

      When a NameNode calls triggerActiveLogRoll and the cachedActiveProxy points to a dead NameNode, the call throws a ConnectTimeoutException. The expected behavior is to try the next NN, but the current logic does not do so; instead, it keeps retrying the dead node, mistakenly treating it as the active one.
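The expected behavior described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch and not the real EditLogTailer code: `NameNodeProxy`, `LogRollTrigger`, and `maxRounds` are hypothetical names introduced only for this example. The idea is that a cached proxy is reused while it works, but is dropped as soon as a call fails, so the next attempt goes to the next NN in the list.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical stand-in for an RPC proxy to a single NameNode. */
interface NameNodeProxy {
    String address();

    /** Asks the NameNode to roll its edit log; throws if it is unreachable or not active. */
    void rollEditLog() throws IOException;
}

/** Sketch: reuse a cached proxy, but forget it and move on when a call fails. */
class LogRollTrigger {
    private final List<NameNodeProxy> candidates;
    private final int maxRounds;          // hypothetical bound on passes over the list
    private int next = 0;                 // index of the next candidate to try
    private NameNodeProxy cached = null;  // last proxy that answered successfully

    LogRollTrigger(List<NameNodeProxy> candidates, int maxRounds) {
        this.candidates = candidates;
        this.maxRounds = maxRounds;
    }

    /** Returns the address of the NameNode that accepted the roll request. */
    String triggerRoll() throws IOException {
        IOException last = null;
        int attempts = candidates.size() * maxRounds;
        for (int i = 0; i < attempts; i++) {
            if (cached == null) {
                // No known-good proxy: pick the next candidate in round-robin order.
                cached = candidates.get(next);
                next = (next + 1) % candidates.size();
            }
            try {
                cached.rollEditLog();
                return cached.address();
            } catch (IOException e) {
                // Key step: drop the failed proxy so the next attempt targets another NN
                // instead of retrying the same dead node.
                last = e;
                cached = null;
            }
        }
        throw new IOException("No NameNode accepted the roll request", last);
    }
}
```

With a list such as [dead NN, live NN], the first attempt fails on the dead node, the cached proxy is cleared, and the second attempt reaches the live node, which then stays cached for later calls.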

       

      2018-08-17 10:02:12,001 WARN [Edit log tailer] org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN

      org.apache.hadoop.net.ConnectTimeoutException: Call From SourceMachine001/SourceIP001 to TargetMachine001.ap.gbl:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)

       

      C:\Users\rotang>ping TargetMachine001

      Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
      Request timed out.
      Request timed out.
      Request timed out.
      Request timed out.

       Attached are a log file showing how the standby repeatedly retries the dead NameNode, and a patch with the fix.

       I replaced the actual machine names and IPs with SourceMachine001/SourceIP001 and TargetMachine001/TargetIP001.

       

      How to Repro:

      In a healthy HA cluster, take down the active NN (and do not let it come back during the test). The standby NNs will then keep trying the dead (old active) NN, because it is the cached one.

      Attachments

        1. 0001-RollEditLog-try-next-NN-when-exception-happens.patch
          6 kB
          Rong Tang
        2. hadoop--rollingUpgrade-SourceMachine001.log
          15 kB
          Rong Tang
        3. HADOOP-15684.000.patch
          6 kB
          Rong Tang
        4. HADOOP-15684.001.patch
          7 kB
          Rong Tang
        5. HADOOP-15684.002.patch
          11 kB
          Rong Tang
        6. HADOOP-15684.003.patch
          11 kB
          Rong Tang
        7. HADOOP-15684.004.patch
          11 kB
          Rong Tang
        8. HADOOP-15684.005.patch
          10 kB
          Íñigo Goiri


          People

            Assignee: Rong Tang (trjianjianjiao)
            Reporter: Rong Tang (trjianjianjiao)
            Votes: 0
            Watchers: 10

            Dates

              Created:
              Updated:
              Resolved:
