[HDDS-10682] EC Reconstruction creates empty chunks at the end of blocks with partial stripes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0, 1.4.1
Component/s: None
Labels:
- pull-request-available

Target Version/s: 1.4.1

Description

Consider an EC block that is larger than one full stripe, but whose last stripe is partial, so that the last stripe does not use every replica index.

If a replica that has no data in that final stripe is reconstructed, an empty chunk is written to the end of the block's chunk list.
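To make the layout concrete, here is a minimal sketch of how many chunks each data replica would be expected to hold for a block that ends in a partial stripe. It is an illustration only and is not the change made in the linked pull request: the class EcChunkLayout and its methods are hypothetical names, replica indexes are assumed to be 1-based, the example assumes 3 data replicas with a 1 MiB chunk size, and parity replicas are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Ozone: models the chunk layout of the
// data replicas of an EC block whose last stripe is partial.
final class EcChunkLayout {

  // Bytes held by data replica `replicaIndex` (assumed 1-based).
  static long bytesOnReplica(long blockLength, int dataCount,
                             long chunkSize, int replicaIndex) {
    long stripeSize = (long) dataCount * chunkSize;
    long fullStripes = blockLength / stripeSize;
    long remainder = blockLength % stripeSize;   // bytes in the partial stripe
    long before = (long) (replicaIndex - 1) * chunkSize;
    // Share of the partial stripe that lands on this index (may be zero).
    long inLast = Math.max(0L, Math.min(chunkSize, remainder - before));
    return fullStripes * chunkSize + inLast;
  }

  // Number of non-empty chunks the replica should contain.
  static long expectedChunkCount(long blockLength, int dataCount,
                                 long chunkSize, int replicaIndex) {
    long bytes = bytesOnReplica(blockLength, dataCount, chunkSize, replicaIndex);
    return (bytes + chunkSize - 1) / chunkSize;  // ceiling division
  }

  // Returns a copy of the chunk lengths without zero-length chunks at the end.
  static List<Long> dropTrailingEmpty(List<Long> chunkLengths) {
    int end = chunkLengths.size();
    while (end > 0 && chunkLengths.get(end - 1) == 0L) {
      end--;
    }
    return new ArrayList<>(chunkLengths.subList(0, end));
  }

  public static void main(String[] args) {
    long chunkSize = 1024L * 1024L;               // assumed 1 MiB chunks
    int dataCount = 3;                            // assumed 3 data replicas
    long blockLength = 3 * chunkSize + chunkSize / 2; // one full stripe + 512 KiB
    for (int index = 1; index <= dataCount; index++) {
      System.out.println("index " + index + ": "
          + expectedChunkCount(blockLength, dataCount, chunkSize, index)
          + " chunk(s)");
    }
  }
}
```

In this example only index 1 receives data from the partial stripe, so it holds two chunks while indexes 2 and 3 hold one each. A reconstructed replica at index 2 or 3 whose chunk list ends in a zero-length second chunk would match the situation described here; dropTrailingEmpty shows one way such a list could be normalised.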

While this does not cause any immediate problem, later reconstructions that attempt to read from this block can fail with an error like:

2024-04-09 01:06:21,855 ERROR [ec-reconstruct-reader-TID-4]-org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: 7f6f1fc9-ed26-4e19-86b6-47435b027f6a, Nodes: 7f6f1fc9-ed26-4e19-86b6-47435b027f6a(ccycloud-4.quasar-jyswng.root.comops.site/10.140.150.0), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2024-04-09T01:06:21.724509Z[UTC]].
2024-04-09 01:06:21,859 INFO [ContainerReplicationThread-1]-org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream: ECBlockReconstructedStripeInputStream{conID: 10007 locID: 113750153625610009}@756a3998: error reading [1], marked as failed
org.apache.hadoop.ozone.client.io.BadDataLocationException: java.io.IOException: Failed to get chunkInfo[77]: len == 0
        at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readIntoBuffer(ECBlockReconstructedStripeInputStream.java:644)
        at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.lambda$loadDataBuffersFromStream$2(ECBlockReconstructedStripeInputStream.java:577)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.IOException: Failed to get chunkInfo[77]: len == 0
        at org.apache.hadoop.hdds.scm.storage.BlockInputStream.validate(BlockInputStream.java:278)
        at org.apache.hadoop.hdds.scm.storage.BlockInputStream.lambda$static$0(BlockInputStream.java:265)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:407)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:347)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:342)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:323)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:208)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$getBlock$0(ContainerProtocolCalls.java:186)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.tryEachDatanode(ContainerProtocolCalls.java:146)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:185)
        at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:255)
        at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:146)
        at org.apache.hadoop.hdds.scm.storage.BlockInputStream.readWithStrategy(BlockInputStream.java:308)
        at org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:66)
        at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readFromCurrentLocation(ECBlockReconstructedStripeInputStream.java:655)
        at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readIntoBuffer(ECBlockReconstructedStripeInputStream.java:631)
        ... 5 more

If other spare replicas are available, reconstruction continues using them; otherwise it cannot complete.

At this stage, I am not sure if this can affect reading a block via the normal read path.

Issue Links

links to

GitHub Pull Request #6515

People

Assignee: Stephen O'Donnell

Reporter: Stephen O'Donnell

Dates

Created: 11/Apr/24 16:45

Updated: 18/Apr/24 14:23

Resolved: 12/Apr/24 20:03