Apache Ozone / HDDS-3902

OM HA client failover switches to a wrong OM server

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Component: OM HA

    Description

      I found this problem while testing the PR/branch for HDDS-3878, but it seems to be independent of that change.

      1. ozone sh volume create /vol1 works well with HA
      2. ozone freon omkg (rpc client) doesn't work

      ozone freon omkg | grep "Failing over"
      2020-06-30 14:15:31 DEBUG OMFailoverProxyProvider:271 - Failing over OM proxy to index: 1, nodeId: om2
      2020-06-30 14:15:31 DEBUG OMFailoverProxyProvider:271 - Failing over OM proxy to index: 2, nodeId: om3
      2020-06-30 14:15:34 DEBUG OMFailoverProxyProvider:271 - Failing over OM proxy to index: 0, nodeId: omNodeIdDummy
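
To read the failover order out of a longer DEBUG log, the matching lines can be reduced to (timestamp, index, nodeId) tuples. The following is a minimal sketch, assuming the log format shown in this report; the function name `failover_sequence` and the regular expression are illustrative and not part of Ozone.

```python
import re

# Pattern for the two OMFailoverProxyProvider DEBUG messages quoted in this
# report. The line number after the class name (e.g. 271, 299) may differ
# between builds, so it is matched loosely.
FAILOVER_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) DEBUG OMFailoverProxyProvider:\d+ - "
    r"(?:Failing over OM proxy to index: |Incrementing OM proxy index to )"
    r"(?P<index>\d+), nodeId: (?P<node>\S+)"
)


def failover_sequence(lines):
    """Return (timestamp, proxy index, nodeId) for each failover line, in log order."""
    events = []
    for line in lines:
        match = FAILOVER_RE.match(line.strip())
        if match:
            events.append((match["ts"], int(match["index"]), match["node"]))
    return events


if __name__ == "__main__":
    # Lines copied from the log excerpts in this report.
    sample = [
        "2020-06-30 14:15:31 DEBUG OMFailoverProxyProvider:271 - Failing over OM proxy to index: 1, nodeId: om2",
        "2020-06-30 14:15:31 DEBUG OMFailoverProxyProvider:271 - Failing over OM proxy to index: 2, nodeId: om3",
        "2020-06-30 14:15:34 DEBUG OMFailoverProxyProvider:271 - Failing over OM proxy to index: 0, nodeId: omNodeIdDummy",
    ]
    for ts, index, node in failover_sequence(sample):
        print(ts, index, node)
```

Lines that are not failover messages (IPC client traffic, stack traces) are skipped, so the whole client log can be passed in unfiltered.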
      

      om2 seems to be the leader, but for some reason the failover logic switches back to an unknown node:

      2020-06-30 14:16:35 DEBUG OMFailoverProxyProvider:271 - Failing over OM proxy to index: 2, nodeId: om3
      2020-06-30 14:16:35 DEBUG Client:63 - getting client out of cache: org.apache.hadoop.ipc.Client@f5acb9d
      2020-06-30 14:16:35 DEBUG Client:497 - The ping interval is 60000 ms.
      2020-06-30 14:16:35 DEBUG Client:795 - Connecting to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862
      2020-06-30 14:16:35 DEBUG Client:1074 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root: starting, having connections 3
      2020-06-30 14:16:35 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root sending #0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root got value #0
      2020-06-30 14:16:36 DEBUG ProtobufRpcEngine:254 - Call: submitRequest took 439ms
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root sending #1 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root got value #1
      2020-06-30 14:16:36 DEBUG ProtobufRpcEngine:254 - Call: submitRequest took 2ms
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root sending #2 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root got value #2
      2020-06-30 14:16:36 DEBUG ProtobufRpcEngine:254 - Call: submitRequest took 1ms
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root sending #3 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-2.ozone-om.default.svc.cluster.local/10.42.0.175:9862 from root got value #3
      2020-06-30 14:16:36 DEBUG ProtobufRpcEngine:254 - Call: submitRequest took 1ms
      2020-06-30 14:16:36 DEBUG Client:63 - getting client out of cache: org.apache.hadoop.ipc.Client@f5acb9d
      2020-06-30 14:16:36 DEBUG Groups:312 - GroupCacheLoader - load.
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #5 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #11 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #8 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #12 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #10 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #6 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #9 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #7 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #4 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1137 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root sending #13 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root got value #5
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root got value #8
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root got value #11
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root got value #10
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root got value #12
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root got value #7
      2020-06-30 14:16:36 DEBUG Hadoop3OmTransport:140 - RetryProxy: OM:om1 is not the leader. Suggested leader is OM:om3.
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:198)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:141)
              at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:74)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:113)
              at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
      
      2020-06-30 14:16:36 DEBUG Client:1191 - IPC Client (363509958) connection to ozone-om-0.ozone-om.default.svc.cluster.local/10.42.0.173:9862 from root got value #9
      2020-06-30 14:16:36 DEBUG OMFailoverProxyProvider:299 - Incrementing OM proxy index to 0, nodeId: omNodeIdDummy
      

      As you can see, after a few failovers the client finally found om2 and a few requests were handled. But after that the client switched back to om0 (???)
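
The last log lines show a "not the leader" response that suggests om3, followed by the proxy index being incremented to 0. The difference between simply advancing the index and following the suggested leader can be modelled as below. This is a hypothetical sketch for reasoning about the log, not the actual OMFailoverProxyProvider code; both function names are invented here, and the index-to-nodeId mapping is taken from the DEBUG output above.

```python
# Index -> nodeId mapping as printed by the client in this report.
NODE_IDS = ["omNodeIdDummy", "om2", "om3"]


def next_proxy_round_robin(current_index, node_ids):
    """Advance to the next proxy index, wrapping around at the end of the list."""
    return (current_index + 1) % len(node_ids)


def next_proxy_with_hint(current_index, node_ids, suggested_leader=None):
    """Prefer the leader suggested by the server; fall back to round-robin.

    The fallback also covers a hint that the client cannot resolve, for
    example when the client knows the node under a different nodeId.
    """
    if suggested_leader in node_ids:
        return node_ids.index(suggested_leader)
    return next_proxy_round_robin(current_index, node_ids)


if __name__ == "__main__":
    current = 2  # client is on index 2 (nodeId om3) when the response arrives
    print("round-robin:", NODE_IDS[next_proxy_round_robin(current, NODE_IDS)])
    print("with hint:  ", NODE_IDS[next_proxy_with_hint(current, NODE_IDS, "om3")])
```

In this model, plain round-robin from index 2 wraps to index 0 (omNodeIdDummy), which matches the final log line, while following the hint stays on om3. Whether this is what actually happened in the client is not established by this report.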

          People

            sgal Szabolcs Gál
            elek Marton Elek