Details
-
Sub-task
-
Status: Resolved
-
Blocker
-
Resolution: Duplicate
-
None
-
None
-
None
-
None
Description
MiniOzoneHAClusterImpl frequently hangs when writing keys to bucket.
To show the problem, we have a simple test that just writes 10,000 keys. It regularly hangs/times out when running on the MiniOzoneHAClusterImpl cluster.
This is a problem because the MiniOzoneHAClusterImpl is critical for testing the snapshot bootstrap code. We believe it is the root cause of this issue: https://issues.apache.org/jira/browse/HDDS-8876
This simple test shows the problem: https://github.com/GeorgeJahad/ozone/compare/153659032b..testSimple
In my experience it hangs 20-30% of the time when running on repeat in intellij.
When it does hang the client thread is in the process of writing/commiting the key, which the server side translator thread is waiting on a future. I believe that future is waiting on a ratis response from the active follower. I've include stack traces for both threads below when the test is hung.
Occasionally when it hangs we see the client thread in exactly the same place but there is no corresponding server side translator thread. We are guessing in this case the server side thread gets killed by an unhandled exception, but the client side thread doesn't notice and just waits forever instead of retrying.
Here are the stack traces for the two threads.
Client thread:
"main@1" prio=5 tid=0x1 nid=NA waiting
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Object.java:-1)
at java.lang.Object.wait(Object.java:502)
at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1572)
at org.apache.hadoop.ipc.Client.call(Client.java:1530)
at org.apache.hadoop.ipc.Client.call(Client.java:1427)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)
at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source:-1)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
- locked <0x208df> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy54.submitRequest(Unknown Source:-1)
at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:304)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.updateKey(OzoneManagerProtocolClientSideTranslatorPB.java:802)
at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:760)
at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:341)
at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:557)
- locked <0x208e0> (a org.apache.hadoop.ozone.client.io.KeyOutputStream)
at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:86)
- locked <0x208e1> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)
at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:247)
at org.apache.hadoop.ozone.om.TestOMRatisSnapshots.writeKeys(TestOMRatisSnapshots.java:1085)
Server thread:
"IPC Server handler 5 on default port 15133@101855" daemon prio=5 tid=0x159ab nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:293)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:250)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:215)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:200)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1300.1755468564.apply(Unknown Source:-1)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:142)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java:-1)
Attachments
Issue Links
- duplicates
-
HDDS-8880 Intermittent fork timeout in TestOMRatisSnapshots
- Resolved