Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Scenario: bidirectional HBase replication, with HBase running on Ozone in both clusters.
After running for almost a day and transferring approximately 100 GB of data, all RegionServer and Master roles on Cluster 2 went down.
The following exception appeared in most of the failed roles' stack traces; a sample snippet from one of the RegionServers:
java.io.IOException: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Unable to allocate a container to the block of size: 268435456, replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed. Pipeline d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2 is not ready in 60000 ms
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:241)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleRetry(KeyOutputStream.java:413)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleException(KeyOutputStream.java:358)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:496)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:461)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
    at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
Caused by: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Unable to allocate a container to the block of size: 268435456, replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed. Pipeline d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2 is not ready in 60000 ms
    at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:755)
    at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleSubmitRequestAndSCMSafeModeRetry(OzoneManagerProtocolClientSideTranslatorPB.java:2328)
    at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:791)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:303)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:397)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:220)
    ... 12 more

2024-07-19 19:08:18,913 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region newtableloadtest,09999999,1721297943018.441fd24db6169f1d4c5ad7112b27d3b8.
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=298963, requesting roll of WAL
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1208)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1081)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:982)
    at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: : Stream is closed! Key: hbase/WALs/ccycloud-2.ozn-hbaserepl2.xyz,22101,1721293279189/ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.regiongroup-0.1721415945748
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.checkNotClosed(KeyOutputStream.java:736)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:200)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:94)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.lambda$write$1(OzoneFSOutputStream.java:58)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
    at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:54)
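When triaging dumps like the ones above, it helps to strip the frames and look only at the exception chain (the top-level message plus each "Caused by:"). A minimal, hypothetical helper for that, in plain Python with no Ozone or HBase dependency (the `exception_chain` name and the sample trace are illustrative, not from any Ozone tooling):

```python
def exception_chain(trace: str) -> list:
    """Return the top-level exception message plus each 'Caused by:' message,
    skipping '    at ...' stack frames and '... N more' ellipses."""
    chain = []
    for line in trace.splitlines():
        stripped = line.strip()
        if stripped.startswith("at ") or stripped.startswith("... "):
            continue  # drop stack frames and suppressed-frame markers
        if stripped.startswith("Caused by: "):
            chain.append(stripped[len("Caused by: "):])
        elif not chain and stripped:
            chain.append(stripped)  # the top-level exception line
    return chain

# Abbreviated sample modeled on the RegionServer trace above.
sample = """java.io.IOException: INTERNAL_ERROR OMException: Unable to allocate a container
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:241)
Caused by: java.io.IOException: : Stream is closed!
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:200)
    ... 12 more"""

print(exception_chain(sample))
# → ['java.io.IOException: INTERNAL_ERROR OMException: Unable to allocate a container',
#    'java.io.IOException: : Stream is closed!']
```

Applied to the traces above, this reduces both failures to the same root causes: the OM could not allocate a container (no OPEN pipeline within 60000 ms), after which the WAL stream was closed and every subsequent write failed.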
On the SCM leader, logs like the following could be seen:
2024-07-19 19:07:52,242 ERROR [IPC Server handler 81 on 9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider: Unable to allocate a block for the size: 268435456, repConfig: RATIS/THREE
2024-07-19 19:08:01,783 INFO [node3-EventQueue-PipelineReportForPipelineReportHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineReportHandler: Reported pipeline PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 is not found
2024-07-19 19:08:01,784 INFO [IPC Server handler 99 on 9860]-org.apache.hadoop.ipc.Server: IPC Server handler 99 on 9860, call Call#3336 Retry#0 org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol.submitRequest from 10.140.86.142:60838
org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 not found
at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:151)
at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:138)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:92)
at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:75)
at com.sun.proxy.$Proxy25.getPipeline(Unknown Source)
at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:335)
at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.getPipeline(SCMClientProtocolServer.java:761)
at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.getPipeline(StorageContainerLocationProtocolServerSideTranslatorPB.java:960)
at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:607)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:232)
at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
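To correlate the client-side allocation failures with SCM's view, a rough log scan over the SCM leader's log can count block-allocation errors and tally which pipeline IDs were reported missing. This is a hypothetical triage sketch (the patterns assume the log4j message formats quoted above; `summarize_scm_log` is not part of any Ozone tooling):

```python
import re
from collections import Counter

# Patterns matching the SCM messages quoted above (format is an assumption).
ALLOC_FAIL = re.compile(r"Unable to allocate a block for the size")
PIPELINE_MISSING = re.compile(r"PipelineID=([0-9a-f-]+) (?:is not found|not found)")

def summarize_scm_log(lines):
    """Count block-allocation failures and tally pipeline IDs reported missing."""
    alloc_failures = 0
    missing = Counter()
    for line in lines:
        if ALLOC_FAIL.search(line):
            alloc_failures += 1
        m = PIPELINE_MISSING.search(line)
        if m:
            missing[m.group(1)] += 1
    return alloc_failures, missing

# Abbreviated sample lines modeled on the SCM leader log above.
sample = [
    "2024-07-19 19:07:52,242 ERROR ... Unable to allocate a block for the size: 268435456, repConfig: RATIS/THREE",
    "2024-07-19 19:08:01,783 INFO ... Reported pipeline PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 is not found",
    "... PipelineNotFoundException: PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 not found",
]
fails, missing = summarize_scm_log(sample)
print(fails, dict(missing))
# → 1 {'770772b8-ea18-4ca4-a5f7-76ceb53a8c01': 2}
```

A scan like this makes it easy to see whether the `PipelineNotFoundException` bursts line up in time with the "Unable to allocate a block" errors, and whether the same pipeline IDs recur across RegionServer failures.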