Hadoop HDFS > HDFS-13891 HDFS RBF stabilization phase I > HDFS-14230

RBF: Throw RetriableException instead of IOException when no namenodes available


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.1.1, 2.9.2, 3.0.3
    • Fix Version/s: 3.3.0, HDFS-13891
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Failover usually happens when upgrading namenodes, and for a few seconds there is no active namenode. Accessing HDFS through the Router fails at that moment, which can make jobs fail or hang. Logs from some Hive jobs are as follows:

      2019-01-03 16:12:08,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 133.33 sec
      MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
      Ended Job = job_1542178952162_24411913
      Launching Job 4 out of 6
      Exception in thread "Thread-86" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode available under nameservice Cluster3
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
      Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
          at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
          at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
          at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
      
      

      Digging into the code: maybe we can throw a StandbyException when no namenode is available. The client would then retry, and only fail after its retries are exhausted.
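The idea behind the fix can be sketched as follows. This is a minimal illustration, not the actual patch: it uses a stand-in `RetriableException` class and a toy `shouldRetry` policy instead of Hadoop's `org.apache.hadoop.ipc` classes, to show why surfacing a retriable exception (rather than a plain `IOException`) lets the client ride out a short failover window.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

public class RouterRetrySketch {

    /** Simplified stand-in for org.apache.hadoop.ipc.RetriableException. */
    static class RetriableException extends IOException {
        RetriableException(String msg) { super(msg); }
    }

    /** Before the fix: a plain IOException, which the retry policy treats as fatal. */
    static void invokeBeforeFix(List<String> activeNamenodes, String ns) throws IOException {
        if (activeNamenodes.isEmpty()) {
            throw new IOException("No namenode available under nameservice " + ns);
        }
    }

    /** After the fix: a RetriableException, which the retry policy will retry. */
    static void invokeAfterFix(List<String> activeNamenodes, String ns) throws IOException {
        if (activeNamenodes.isEmpty()) {
            throw new RetriableException("No namenode available under nameservice " + ns);
        }
    }

    /** Toy client-side retry policy: retry only retriable failures. */
    static boolean shouldRetry(IOException e) {
        return e instanceof RetriableException;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();  // failover window: no active namenode
        boolean beforeRetries = false;
        try {
            invokeBeforeFix(none, "Cluster3");
        } catch (IOException e) {
            beforeRetries = shouldRetry(e);
        }
        boolean afterRetries = false;
        try {
            invokeAfterFix(none, "Cluster3");
        } catch (IOException e) {
            afterRetries = shouldRetry(e);
        }
        System.out.println("before fix retriable: " + beforeRetries);
        System.out.println("after fix retriable: " + afterRetries);
    }
}
```

With the old behavior the client aborts on the first "No namenode available" response; with a retriable exception the same request is retried and succeeds once failover completes.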

        Attachments

        1. HDFS-14230-HDFS-13891.006.patch
          17 kB
          Hui Fei
        2. HDFS-14230-HDFS-13891.005.patch
          17 kB
          Hui Fei
        3. HDFS-14230-HDFS-13891.004.patch
          17 kB
          Hui Fei
        4. HDFS-14230-HDFS-13891.003.patch
          11 kB
          Hui Fei
        5. HDFS-14230-HDFS-13891.002.patch
          10 kB
          Hui Fei
        6. HDFS-14230-HDFS-13891.001.patch
          10 kB
          Hui Fei

          Issue Links

            Activity

              People

              • Assignee:
                ferhui Hui Fei
              • Reporter:
                ferhui Hui Fei
              • Votes:
                0
              • Watchers:
                5

                Dates

                • Created:
                  Updated:
                  Resolved: