HDFS-14230: RBF: Throw RetriableException instead of IOException when no namenodes available

Sub-task of HDFS-13891 (HDFS RBF stabilization phase I), project Hadoop HDFS


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.1.1, 2.9.2, 3.0.3
    • Fix Version/s: 3.3.0, HDFS-13891
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Failover usually happens when upgrading namenodes, and for a few seconds there is no active namenode. Accessing HDFS through the Router fails during this window, which can cause jobs to fail or hang. Logs from some Hive jobs follow:

      2019-01-03 16:12:08,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 133.33 sec
      MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
      Ended Job = job_1542178952162_24411913
      Launching Job 4 out of 6
      Exception in thread "Thread-86" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode available under nameservice Cluster3
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
          at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
      Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
          at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
          at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
          at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
          at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
          at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
      
      

      Digging into the code: maybe we can throw a StandbyException when no namenodes are available, so that the client fails only after some retries rather than immediately. A sketch of the idea follows.
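
      A minimal sketch of the idea, assuming a simplified, hypothetical stand-in for the availability check in RouterRpcClient#invokeMethod (the real method takes more parameters and drives per-namenode retries); the class and method names below are illustrative, but org.apache.hadoop.ipc.RetriableException is the real class the fix ends up throwing:

      import java.io.IOException;
      import java.util.List;

      import org.apache.hadoop.ipc.RetriableException;

      // Hypothetical, simplified stand-in for the "no namenode available"
      // check inside RouterRpcClient#invokeMethod.
      public class NoNamenodeCheck {
        static void checkNamenodesAvailable(String nsId, List<?> namenodes)
            throws IOException {
          if (namenodes == null || namenodes.isEmpty()) {
            // Before: a plain IOException, which the client surfaces to the
            // job immediately (the RemoteException in the log above).
            //   throw new IOException(
            //       "No namenode available under nameservice " + nsId);

            // After: RetriableException still extends IOException, but the
            // client-side retry policy recognizes it and retries with
            // backoff, riding out a short failover window during an upgrade.
            throw new RetriableException(
                "No namenode available under nameservice " + nsId);
          }
        }
      }

      For context: in hadoop-common's RetryPolicies, the FailoverOnNetworkExceptionRetry policy retries calls that fail with a RetriableException (including one wrapped in a RemoteException), while other IOExceptions are rethrown to the application, which is why the Hive job above failed outright.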

      Attachments

        1. HDFS-14230-HDFS-13891.006.patch
          17 kB
          Hui Fei
        2. HDFS-14230-HDFS-13891.005.patch
          17 kB
          Hui Fei
        3. HDFS-14230-HDFS-13891.004.patch
          17 kB
          Hui Fei
        4. HDFS-14230-HDFS-13891.003.patch
          11 kB
          Hui Fei
        5. HDFS-14230-HDFS-13891.002.patch
          10 kB
          Hui Fei
        6. HDFS-14230-HDFS-13891.001.patch
          10 kB
          Hui Fei


          People

            Assignee: Hui Fei (ferhui)
            Reporter: Hui Fei (ferhui)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
