  1. Hadoop HDFS
  2. HDFS-13891 HDFS RBF stabilization phase I
  3. HDFS-14230

RBF: Throw RetriableException instead of IOException when no namenodes available

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.1.1, 2.9.2, 3.0.3
    • Fix Version/s: 3.3.0, HDFS-13891
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Failover usually happens when namenodes are upgraded. During a failover there is no active namenode for a few seconds, and any access to HDFS through the router fails in that window. This can cause jobs to fail or hang. Some Hive job logs follow:

      2019-01-03 16:12:08,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 133.33 sec
      MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
      Ended Job = job_1542178952162_24411913
      Launching Job 4 out of 6
      Exception in thread "Thread-86" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode available under nameservice Cluster3
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
      Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
          at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
          at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
          at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
      
      

      Looking deeper into the code, maybe the router could throw StandbyException when no namenodes are available. The client would then retry and fail only after some retries, instead of failing immediately.
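
      The idea can be illustrated with a minimal, self-contained sketch. It does not reproduce the attached patches or the actual RouterRpcClient code. The local RetriableException class is a stand-in for org.apache.hadoop.ipc.RetriableException so that the example runs without Hadoop on the classpath, and invoke, callWithRetries, and demo are hypothetical names chosen for illustration. A real client would also wait between attempts according to its retry policy, which is omitted here.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

class NoNamenodeRetrySketch {

    /** Local stand-in (assumption) for org.apache.hadoop.ipc.RetriableException. */
    static class RetriableException extends IOException {
        RetriableException(String msg) {
            super(msg);
        }
    }

    /** A remote call that may fail with an IOException. */
    interface Call {
        String run() throws IOException;
    }

    /**
     * Hypothetical router-side check: when no namenode is active, signal a
     * retriable condition instead of a plain IOException.
     */
    static String invoke(String nameservice, List<String> activeNamenodes) throws IOException {
        if (activeNamenodes.isEmpty()) {
            throw new RetriableException(
                "No namenode available under nameservice " + nameservice);
        }
        return activeNamenodes.get(0);
    }

    /**
     * Hypothetical client-side loop: retry only retriable failures, allowing
     * at most maxRetries retries after the first attempt. Any other
     * IOException propagates immediately.
     */
    static String callWithRetries(Call call, int maxRetries) throws IOException {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.run();
            } catch (RetriableException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, give up
                }
            }
        }
    }

    /** Simulates a failover: no active namenode for the first two attempts. */
    static String demo() {
        int[] attempts = {0};
        try {
            String nn = callWithRetries(() -> {
                attempts[0]++;
                List<String> active = attempts[0] < 3
                    ? Collections.<String>emptyList()
                    : Collections.singletonList("nn1");
                return invoke("Cluster3", active);
            }, 5);
            return nn + " after " + attempts[0] + " attempts";
        } catch (IOException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

      In this sketch the first two attempts hit the failover window and are retried, and the third attempt succeeds once a namenode is active again. With a plain IOException, the first attempt would have failed the caller.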

      Attachments

        1. HDFS-14230-HDFS-13891.006.patch
          17 kB
          Hui Fei
        2. HDFS-14230-HDFS-13891.005.patch
          17 kB
          Hui Fei
        3. HDFS-14230-HDFS-13891.004.patch
          17 kB
          Hui Fei
        4. HDFS-14230-HDFS-13891.003.patch
          11 kB
          Hui Fei
        5. HDFS-14230-HDFS-13891.002.patch
          10 kB
          Hui Fei
        6. HDFS-14230-HDFS-13891.001.patch
          10 kB
          Hui Fei

            People

              Assignee: ferhui Hui Fei
              Reporter: ferhui Hui Fei