Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-6517 Snapshot support for Ozone
  3. HDDS-8952

MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Blocker
    • Resolution: Duplicate
    • None
    • None
    • None
    • None

    Description

      MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.

      To show the problem, we have a simple test that just writes 10,000 keys.  It regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster.  

      This is a problem because the MiniOzoneHAClusterImpl is critical for testing the snapshot bootstrap code.  We believe it is the root cause of this issue: https://issues.apache.org/jira/browse/HDDS-8876

      This simple test shows the problem: https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple

      In my experience it hangs 20-30% of the time when running on repeat in intellij.

      When it does hang the client thread is in the process of writing/commiting the key, which the server side translator thread is waiting on a future.  I believe that future is waiting on a ratis response from the active follower.  I've include stack traces for both threads below when the test is hung.

      Occasionally when it hangs we see the client thread in exactly the same place but there is no corresponding server side translator thread.  We are guessing in this case the server side thread gets killed by an unhandled exception, but the client side thread doesn't notice and just waits forever instead of retrying.

      Here are the stack traces for the two threads.

      Client thread:
      "main@1" prio=5 tid=0x1 nid=NA waiting
        java.lang.Thread.State: WAITING
            at java.lang.Object.wait(Object.java:-1)
            at java.lang.Object.wait(Object.java:502)
            at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)
            at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)
            at org.apache.hadoop.ipc.Client.call(Client.java:1530)
            at org.apache.hadoop.ipc.Client.call(Client.java:1427)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
            at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)
            at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
            - locked <0x208df> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
            at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)
            at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
            at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)
            at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)
            at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)
            at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)
            at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)
            - locked <0x208e0> (a org.apache.hadoop.ozone.client.io.KeyOutputStream)
            at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)
            - locked <0x208e1> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)
            at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)
            at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)

       

       

      Server thread:

      "IPC Server handler 5 on default port 15133@101855" daemon prio=5 tid=0x159ab nid=NA waiting
        java.lang.Thread.State: WAITING
            at sun.misc.Unsafe.park(Unsafe.java:-1)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
            at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
            at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
            at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
            at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
            at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)
            at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)
            at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)
            at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)
            at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)
            at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
            at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)
            at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)

      Attachments

        Issue Links

          Activity

            People

              xbis Christos Bisias
              georgeJahad George Jahad
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: