Apache Ozone / HDDS-9319

EC Reconstruction fails when chunk length is 0 bytes


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: EC

    Description

      EC offline reconstruction fails with java.io.IOException: Failed to get chunkInfo[123]: len == 0. The DN log then reports that there are insufficient datanodes to read the EC block, even though 3 of the 5 DNs are up.

      The likely cause is a chunk whose length is zero bytes: the zero-length chunkInfo fails validation, so the location is marked as failed, and with rs-3-2 that leaves fewer readable sources than the 3 data blocks required for reconstruction.

      EC Policy: rs-3-2-1024k
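The failure mode described above can be sketched as follows. This is a minimal, self-contained illustration, not Ozone's actual API: the class, method names, and numbers here are hypothetical stand-ins for the validation in BlockInputStream and the source counting in ECBlockReconstructedStripeInputStream.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of why one zero-length chunkInfo can make rs-3-2
// reconstruction fail even with 3 of 5 DNs up.
public class EcSourceCheck {
    // rs-3-2: at least 3 readable sources are needed to reconstruct.
    static final int DATA_BLOCKS = 3;

    // Mirrors the shape of the failure in the log: a chunk whose
    // length is 0 is rejected during response validation.
    static void validateChunkLen(long len, int index) throws IOException {
        if (len == 0) {
            throw new IOException("Failed to get chunkInfo[" + index + "]: len == 0");
        }
    }

    // Count sources that survive validation; any source whose chunk
    // fails validation is marked as failed and excluded.
    static int readableSources(List<Long> chunkLens) {
        int ok = 0;
        for (int i = 0; i < chunkLens.size(); i++) {
            try {
                validateChunkLen(chunkLens.get(i), i);
                ok++;
            } catch (IOException e) {
                // location marked as failed, as in the DN log above
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // 3 DNs are up, but one of them serves a zero-length chunk,
        // so only 2 sources remain readable -- fewer than DATA_BLOCKS.
        List<Long> lens = List.of(1048576L, 0L, 1048576L);
        int ok = readableSources(lens);
        if (ok < DATA_BLOCKS) {
            System.out.println("insufficient datanodes: only " + ok
                + " readable sources, need " + DATA_BLOCKS);
        }
    }
}
```

Under this reading, the InsufficientLocationsException is a secondary symptom: the root issue is that a legitimately empty (zero-length) chunk is treated as a read failure rather than as valid data.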

      DN.log: 

      2023-09-15 13:47:57,844 ERROR [ec-reconstruct-reader-TID-0]-org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: b3a4dcf4-916d-4a97-adde-daaf9785b237, Nodes: b3a4dcf4-916d-4a97-adde-daaf9785b237, ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2023-09-15T13:47:57.796097Z[UTC]].
      2023-09-15 13:47:57,845 INFO [ContainerReplicationThread-0]-org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream: ECBlockReconstructedStripeInputStream{conID: 1012 locID: 111677748019201071}@740ac7aa: error reading [2], marked as failed
      org.apache.hadoop.ozone.client.io.BadDataLocationException: java.io.IOException: Failed to get chunkInfo[123]: len == 0
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readIntoBuffer(ECBlockReconstructedStripeInputStream.java:633)
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.lambda$loadDataBuffersFromStream$2(ECBlockReconstructedStripeInputStream.java:566)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.io.IOException: Failed to get chunkInfo[123]: len == 0
      	at org.apache.hadoop.hdds.scm.storage.BlockInputStream.validate(BlockInputStream.java:278)
      	at org.apache.hadoop.hdds.scm.storage.BlockInputStream.lambda$static$0(BlockInputStream.java:265)
      	at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:407)
      	at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:347)
      	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
      	at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
      	at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:342)
      	at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:323)
      	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:208)
      	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$getBlock$0(ContainerProtocolCalls.java:186)
      	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.tryEachDatanode(ContainerProtocolCalls.java:146)
      	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:185)
      	at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:255)
      	at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:146)
      	at org.apache.hadoop.hdds.scm.storage.BlockInputStream.readWithStrategy(BlockInputStream.java:308)
      	at org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:66)
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readFromCurrentLocation(ECBlockReconstructedStripeInputStream.java:644)
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readIntoBuffer(ECBlockReconstructedStripeInputStream.java:620)
      	... 5 more
      2023-09-15 13:47:57,864 WARN [ContainerReplicationThread-0]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator: Exception while reconstructing the container 1012. Cleaning up all the recovering containers in the reconstruction process.
      org.apache.hadoop.ozone.client.io.InsufficientLocationsException: There are insufficient datanodes to read the EC block
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.init(ECBlockReconstructedStripeInputStream.java:224)
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.read(ECBlockReconstructedStripeInputStream.java:382)
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.recoverChunks(ECBlockReconstructedStripeInputStream.java:331)
      	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:288)
      	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:170)
      	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
      	at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      2023-09-15 13:47:57,916 INFO [ChunkReader-5]-org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer: Moving container /hadoop-ozone/datanode/data/hdds/CID-2e752843-8947-454d-992f-013b3824468b/current/containerDir1/1012 to state DELETED from state:RECOVERING
      2023-09-15 13:47:57,920 WARN [ContainerReplicationThread-0]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask: FAILED reconstructECContainersCommand: containerID=1012, replication=rs-3-2-1024k, missingIndexes=[1, 2], sources={3=b3a4dcf4-916d-4a97-adde-daaf9785b237, 4=4641d611-0a8b-4d23-b47a-d25d71403aaf, 5=eb073807-4edc-4753-a64a-a323a317ea2f}, targets={1=bc497718-d822-4592-b406-ef32d3173cd9, 2=2f307ebf-3bf3-412c-aaac-675718da3beb} after 372 ms
      org.apache.hadoop.ozone.client.io.InsufficientLocationsException: There are insufficient datanodes to read the EC block
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.init(ECBlockReconstructedStripeInputStream.java:224)
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.read(ECBlockReconstructedStripeInputStream.java:382)
      	at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.recoverChunks(ECBlockReconstructedStripeInputStream.java:331)
      	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:288)
      	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:170)
      	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
      	at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)



          People

            Assignee: Stephen O'Donnell (sodonnell)
            Reporter: Varsha Ravi (varsha.ravi)
