  1. Hadoop HDFS
  2. HDFS-17232

RBF: Fix NoNamenodesAvailableException that persists for a long time when using observers


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: rbf
    • Hadoop Flags: Reviewed

    Description

      Describe

      I previously fixed the long-lasting NoNamenodesAvailableException that occurs during failover when observers are not used. When observers are used, however, several problems remain.

      1. When an observer fails and there is no active namenode at that moment, rotating the cache does not help: because the read uses observers, the next request shuffles the observer namenode back to the front of the cache, so the retry is still sent to the failed observer.
      2. If there are multiple observers and an exception occurs while accessing one of them when there is no active namenode, a NoNamenodesAvailableException is thrown and the request is retried. However, because observer reads put the observer nodes at the front of the cache, the retry may still fail.
      3. When there are multiple observers, one of them is unavailable, and there is no active namenode, we should continue with the next observer. That way the unavailable observer can be marked as unavailable and subsequent requests can avoid it.
      4. If the failure is caused by an illegal operation, that is, an operation that throws an exception even when it is sent to the active namenode, the result is also a NoNamenodesAvailableException. If the cache is rotated in that situation, the next legal request is sent to a namenode that really is a standby and fails. Illegal operations should therefore not rotate the cache.
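      The behaviour asked for in points 3 and 4 can be sketched as a small toy model. This is only an illustration under stated assumptions, not the code in the attached patch: the class `ObserverFallbackSketch`, the `Failure` enum, and both methods are hypothetical names, and the real router works with exceptions and namenode contexts rather than plain strings.

```java
import java.util.List;
import java.util.Set;

/** Toy model of the handling suggested in points 3 and 4; not Hadoop code. */
final class ObserverFallbackSketch {

  /** Hypothetical failure kinds; the real router distinguishes exceptions instead. */
  enum Failure { NODE_UNAVAILABLE, ILLEGAL_OPERATION }

  /** Point 4: rotate the cache only when the node is at fault, not the request. */
  static boolean shouldRotateCache(Failure failure) {
    return failure == Failure.NODE_UNAVAILABLE;
  }

  /**
   * Point 3: walk the observers in order. Observers already marked unavailable
   * are skipped, an observer found to be down is marked and skipped, and the
   * first healthy observer is returned. Returns null if none is usable.
   */
  static String pickObserver(List<String> observers,
                             Set<String> down,
                             Set<String> markedUnavailable) {
    for (String observer : observers) {
      if (markedUnavailable.contains(observer)) {
        continue; // a previous request already saw this observer fail
      }
      if (down.contains(observer)) {
        markedUnavailable.add(observer); // remember the failure for later requests
        continue; // keep going instead of giving up on the first failure
      }
      return observer;
    }
    return null;
  }
}
```

      With observers `observer-1` (down) and `observer-2`, `pickObserver` marks `observer-1` and returns `observer-2`, and `shouldRotateCache(Failure.ILLEGAL_OPERATION)` is false, so an illegal request leaves the cache order untouched.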

       

      Detailed bug description: HDFS-17166

       

      • Case 1:
      • router's cache: [ observer-1 (problematic), standby-2, standby-3 (actually active) ]
      • client read -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ standby-2, standby-3 (actually active), observer-1 (problematic) ]
      • client retries read -> shuffleObserverNN -> [ observer-1 (problematic), standby-2, standby-3 (actually active) ] -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ standby-2, standby-3 (actually active), observer-1 (problematic) ]
      • ...
      • client (retries > max.attempts) -> read failed
         
      • Case 2:
      • router's cache: [ observer-1 (problematic), observer-2, standby-3, standby-4 (actually active) ]
      • client read -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ observer-2, standby-3, standby-4 (actually active), observer-1 (problematic) ]
      • client retries read -> shuffleObserverNN -> [ observer-1 (problematic), observer-2, standby-3, standby-4 (actually active) ] (may happen) -> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ observer-2, standby-3, standby-4 (actually active), observer-1 (problematic) ]
      • ...
      • client may exceed the limit (retries > max.attempts) -> read failed

      • Case 3:
      • router's cache: [ standby-1, standby-2 (actually active) ]
      • client request -> standby-1 throws NoNamenodesAvailableException -> rotate the cache -> [ standby-2 (actually active), standby-1 ]
      • client retries request -> standby-2 (actually active) -> success
      • client illegal request -> standby-2 (actually active) throws NoNamenodesAvailableException -> rotate the cache -> [ standby-1, standby-2 (actually active) ]
      • client legal request -> standby-1 throws NoNamenodesAvailableException -> failed
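      The retry loop in the cases above can be replayed with a small toy model. It is a sketch under stated assumptions, not the router's implementation: the class `CacheRotationSketch` and its methods are hypothetical, namenodes are plain strings, the node at the head of the list is assumed to receive the request, and observers are moved to the front in a fixed order, whereas the "(may happen)" in case 2 suggests the real ordering is not deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the router's namenode cache; not Hadoop code. */
final class CacheRotationSketch {

  /** "Rotate the cache": move the head of the list to the tail. */
  static List<String> rotate(List<String> cache) {
    List<String> out = new ArrayList<>(cache.subList(1, cache.size()));
    out.add(cache.get(0));
    return out;
  }

  /** Move every entry named "observer..." to the front, keeping relative order. */
  static List<String> shuffleObserversFirst(List<String> cache) {
    List<String> out = new ArrayList<>();
    for (String nn : cache) {
      if (nn.startsWith("observer")) {
        out.add(nn);
      }
    }
    for (String nn : cache) {
      if (!nn.startsWith("observer")) {
        out.add(nn);
      }
    }
    return out;
  }

  /**
   * Replays up to maxAttempts observer reads. Each attempt puts observers
   * first, sends the read to the head, and rotates the cache if the head is
   * the broken node. Returns how many attempts hit the broken node.
   */
  static int failedAttempts(List<String> cache, String broken, int maxAttempts) {
    int failures = 0;
    List<String> current = cache;
    for (int i = 0; i < maxAttempts; i++) {
      current = shuffleObserversFirst(current);
      if (!current.get(0).equals(broken)) {
        break; // the read reached a different node, stop counting
      }
      failures++;
      current = rotate(current); // undone by the shuffle on the next attempt
    }
    return failures;
  }
}
```

      For case 1, `failedAttempts(List.of("observer-1", "standby-2", "standby-3"), "observer-1", 3)` hits the broken observer on all three attempts, because the shuffle restores the order that the rotation changed. For case 3, `rotate(List.of("standby-2", "standby-1"))` puts `standby-1` back at the head, which is why an illegal request that triggers a rotation breaks the next legal one.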

      How to reproduce

      I have provided a unit test: TestNoNamenodesAvailableLongTime

      Running this new unit test against the original code reproduces the problems above.

       

       

      Attachments

        1. HDFS-17232.001.patch
          30 kB
          Jian Zhang

            People

              Assignee: keepromise Jian Zhang
              Reporter: keepromise Jian Zhang