  1. Apache Ozone
  2. HDDS-11219

[HBase Replication] RS and Master Nodes down with "Waiting for one of pipelines to be OPEN failed"

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SCM
    • Labels: None

    Description

      Scenario: Bidirectional HBase replication, with HBase running on Ozone on both clusters.

      After running for almost a day and transferring approximately 100 GB of data, all RegionServer (RS) and Master nodes of Cluster 2 went down.
      Most of the stack traces from the failed roles contained the following error. A sample snippet from one of the RegionServers:

      java.io.IOException: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Unable to allocate a container to the block of size: 268435456, replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed. Pipeline d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2 is not ready in 60000 ms
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:241)
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleRetry(KeyOutputStream.java:413)
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleException(KeyOutputStream.java:358)
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:496)
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.hsync(KeyOutputStream.java:461)
              at org.apache.hadoop.ozone.client.io.OzoneOutputStream.hsync(OzoneOutputStream.java:118)
              at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
              at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
              at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hsync(OzoneFSOutputStream.java:80)
              at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.hflush(OzoneFSOutputStream.java:75)
              at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:136)
              at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:84)
              at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:669)
      Caused by: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Unable to allocate a container to the block of size: 268435456, replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed. Pipeline d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,416f70bc-8f6d-4f56-827f-e672d56507b2 is not ready in 60000 ms
              at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:755)
              at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleSubmitRequestAndSCMSafeModeRetry(OzoneManagerProtocolClientSideTranslatorPB.java:2328)
              at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:791)
              at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:303)
              at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:397)
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:220)
              ... 12 more
      2024-07-19 19:08:18,913 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region newtableloadtest,09999999,1721297943018.441fd24db6169f1d4c5ad7112b27d3b8.
      org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=298963, requesting roll of WAL
              at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1208)
              at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1081)
              at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:982)
              at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
              at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.IOException: : Stream is closed! Key: hbase/WALs/ccycloud-2.ozn-hbaserepl2.xyz,22101,1721293279189/ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.ccycloud-2.ozn-hbaserepl2.xyz%2C22101%2C1721293279189.regiongroup-0.1721415945748
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.checkNotClosed(KeyOutputStream.java:736)
              at org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:200)
              at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:94)
              at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.lambda$write$1(OzoneFSOutputStream.java:58)
              at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:184)
              at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
              at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:54) 
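
For triage, it can help to pull the structured fields out of the exception message above (the pipeline IDs the allocation waited on, the timeout, the block size, and the replication config). The following is a minimal sketch, not part of Ozone or HBase: the function name `parse_pipeline_wait_error` and both regular expressions are assumptions based only on the message wording quoted in this report, which may differ in other Ozone versions.

```python
import re

# Assumed patterns, derived only from the error text quoted in this report.
PIPELINE_WAIT_RE = re.compile(
    r"Pipeline (?P<ids>[0-9a-f-]+(?:,[0-9a-f-]+)*) is not ready in (?P<timeout_ms>\d+) ms"
)
BLOCK_RE = re.compile(
    r"block of size: (?P<size>\d+), replicationConfig: (?P<repl>[A-Z]+/[A-Z]+)"
)


def parse_pipeline_wait_error(message):
    """Return the fields of a 'Waiting for one of pipelines to be OPEN failed' error, or None."""
    wait = PIPELINE_WAIT_RE.search(message)
    block = BLOCK_RE.search(message)
    if wait is None or block is None:
        return None
    return {
        "pipeline_ids": wait.group("ids").split(","),
        "timeout_ms": int(wait.group("timeout_ms")),
        "block_size_bytes": int(block.group("size")),
        "replication": block.group("repl"),
    }


# The message below is copied from the RegionServer stack trace in this report.
sample = (
    "java.io.IOException: INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: "
    "Unable to allocate a container to the block of size: 268435456, "
    "replicationConfig: RATIS/THREE. Waiting for one of pipelines to be OPEN failed. "
    "Pipeline d12aa22f-4439-4321-98cc-e245280b88dd,ae3ea458-ab25-4fb3-a380-941bee9c1fdb,"
    "06c3b21c-9721-49e5-9b24-04a3836036d3,3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c,"
    "416f70bc-8f6d-4f56-827f-e672d56507b2 is not ready in 60000 ms"
)

info = parse_pipeline_wait_error(sample)
print(info)
```

For the message in this report, the sketch yields five pipeline IDs, a 60000 ms wait, and a requested block size of 268435456 bytes (256 MiB) with RATIS/THREE replication.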

      The SCM leader logs show entries like the following:

      2024-07-19 19:07:52,242 ERROR [IPC Server handler 81 on 9863]-org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider: Unable to allocate a block for the size: 268435456, repConfig: RATIS/THREE
      2024-07-19 19:08:01,783 INFO [node3-EventQueue-PipelineReportForPipelineReportHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineReportHandler: Reported pipeline PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 is not found
      2024-07-19 19:08:01,784 INFO [IPC Server handler 99 on 9860]-org.apache.hadoop.ipc.Server: IPC Server handler 99 on 9860, call Call#3336 Retry#0 org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol.submitRequest from 10.140.86.142:60838
      org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 not found
              at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.getPipeline(PipelineStateMap.java:151)
              at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.getPipeline(PipelineStateManagerImpl.java:138)
              at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:92)
              at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:75)
              at com.sun.proxy.$Proxy25.getPipeline(Unknown Source)
              at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.getPipeline(PipelineManagerImpl.java:335)
              at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.getPipeline(SCMClientProtocolServer.java:761)
              at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.getPipeline(StorageContainerLocationProtocolServerSideTranslatorPB.java:960)
              at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:607)
              at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
              at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:232)
              at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899) 
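
One quick check when reading the two excerpts together is whether the pipeline that SCM reports as "not found" is one of the pipelines the client-side allocation was waiting on. The sketch below does that comparison for the lines quoted in this report. The helper name `pipelines_reported_not_found` and the regular expression are assumptions for illustration, and the result describes only these excerpts; it does not establish a root cause.

```python
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Assumed pattern: matches both "PipelineID=<id> is not found" and "PipelineID=<id> not found".
NOT_FOUND_RE = re.compile(r"PipelineID=(" + UUID + r").*not found")


def pipelines_reported_not_found(scm_log_lines):
    """Collect pipeline IDs that appear in SCM 'not found' log lines."""
    found = set()
    for line in scm_log_lines:
        match = NOT_FOUND_RE.search(line)
        if match:
            found.add(match.group(1))
    return found


# Lines copied from the SCM leader log excerpt in this report.
scm_lines = [
    "2024-07-19 19:08:01,783 INFO [node3-EventQueue-PipelineReportForPipelineReportHandler]-"
    "org.apache.hadoop.hdds.scm.pipeline.PipelineReportHandler: Reported pipeline "
    "PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 is not found",
    "org.apache.hadoop.hdds.scm.pipeline.PipelineNotFoundException: "
    "PipelineID=770772b8-ea18-4ca4-a5f7-76ceb53a8c01 not found",
]

# Pipeline IDs listed in the RegionServer exception message in this report.
waited_on = {
    "d12aa22f-4439-4321-98cc-e245280b88dd",
    "ae3ea458-ab25-4fb3-a380-941bee9c1fdb",
    "06c3b21c-9721-49e5-9b24-04a3836036d3",
    "3b07bee5-d3e7-4cfd-ad8a-5f336f6cf53c",
    "416f70bc-8f6d-4f56-827f-e672d56507b2",
}

missing = pipelines_reported_not_found(scm_lines)
overlap = missing & waited_on
print("reported not found:", sorted(missing))
print("also in the client's wait list:", sorted(overlap))
```

In the excerpts quoted here, the pipeline reported as not found does not appear among the five pipelines named in the RegionServer error, so a fuller SCM log would be needed to relate the two.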

      cc: weichiu sammichen ashishk 

          People

            Assignee: Unassigned
            Reporter: Pratyush Bhatt (pratyush.bhatt)