Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
When a container is closed because it is full, the DN replies to the client with a ContainerNotOpenException. This does not mean that the DN has failed, so it should not be excluded from new block group allocation. Otherwise many HEALTHY DNs may end up excluded, and allocation of a new block group may fail in a small cluster.
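The intended behavior can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `ExcludeListSketch`, `onWriteFailure` and the nested `ContainerNotOpenException` are hypothetical stand-ins, and excluding the closed container instead of the DN is an assumption about what a reasonable client would do.

```java
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: decide what to exclude after a failed write to a DN. */
final class ExcludeListSketch {

    /** Stand-in for the exception a DN returns when the container is no longer open. */
    static class ContainerNotOpenException extends IOException {
        ContainerNotOpenException(String message) {
            super(message);
        }
    }

    private final Set<String> excludedDatanodes = new LinkedHashSet<>();
    private final Set<Long> excludedContainers = new LinkedHashSet<>();

    /** Records a write failure reported for the given DN and container. */
    void onWriteFailure(String datanodeId, long containerId, Exception cause) {
        if (cause instanceof ContainerNotOpenException) {
            // The container was closed (e.g. it is full). The DN itself is healthy,
            // so only the container is avoided (assumption) and the DN stays usable.
            excludedContainers.add(containerId);
        } else {
            // Any other failure is treated as a DN problem in this sketch.
            excludedDatanodes.add(datanodeId);
        }
    }

    Set<String> excludedDatanodes() {
        return excludedDatanodes;
    }

    Set<Long> excludedContainers() {
        return excludedContainers;
    }

    public static void main(String[] args) {
        ExcludeListSketch sketch = new ExcludeListSketch();
        sketch.onWriteFailure("dn-1", 7L, new ContainerNotOpenException("container 7 is closed"));
        sketch.onWriteFailure("dn-2", 8L, new IOException("connection reset"));
        System.out.println("excluded DNs: " + sketch.excludedDatanodes());
        System.out.println("excluded containers: " + sketch.excludedContainers());
    }
}
```

With this split, a DN whose container merely filled up remains a candidate for the next block group, while a DN that failed for another reason is still excluded.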
E.g., with 45 DNs (docker simulated) and the following in ozone-site.xml:
<property>
<name>ozone.scm.container.size</name>
<value>256MB</value>
</property>
<property>
<name>ozone.scm.block.size</name>
<value>16MB</value>
</property>
a test with Freon ockg:
./bin/ozone freon ockg --type=EC --replication=rs-10-4-1024k -p test -n 10 -t 10 -s $((4 * 1024 * 1024 * 1024))
would result in 5-8 failures (out of 10 keys) with only HDDS-6364 patched.
INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Allocated 0 blocks. Requested 1 blocks
    at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:660)
    at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:695)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:309)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:371)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.rewriteStripeToNewBlockGroup(ECKeyOutputStream.java:244)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.handleStripeFailure(ECKeyOutputStream.java:586)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.checkAndWriteParityCells(ECKeyOutputStream.java:306)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.write(ECKeyOutputStream.java:192)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:50)
    at org.apache.hadoop.ozone.freon.ContentGenerator.write(ContentGenerator.java:76)
    at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:146)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:143)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: java.lang.IllegalArgumentException: Expected writeOffset= 1069543424 Expected offset=1059061760
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
        at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:564)
        at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:61)
        at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:151)
        ... 8 more
One ore more freon test is failed.
2022-02-24 08:41:44,272 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=313491.661668, max=577254.304029, mean=563762.9508485134, stddev=44787.24799551536, median=575542.093982, p75=577254.304029, p95=577254.304029, p98=577254.304029, p99=577254.304029, p999=577254.304029, mean_rate=0.017322637056902915, m1=0.029562618662863496, m5=0.014855802773079099, m15=0.007191674083204336, rate_unit=events/second, duration_unit=milliseconds
2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 578
2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 6
2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 4
But with this fix and HDDS-6364 together, all 10 keys succeed across many rounds.
2022-02-24 10:56:45,013 [Thread-4] INFO freon.ProgressBar: Progress: 90.00 % (9 out of 10)
2022-02-24 10:56:46,013 [Thread-4] INFO freon.ProgressBar: Progress: 100.00 % (10 out of 10)
2022-02-24 10:56:46,257 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=958022.893372, max=1038271.448129, mean=1018238.201558835, stddev=22083.604143242464, median=1029968.020144, p75=1034239.403617, p95=1038271.448129, p98=1038271.448129, p99=1038271.448129, p999=1038271.448129, mean_rate=0.009623163938983789, m1=0.09995782091693355, m5=0.02731461121892791, m15=0.009684867189776935, rate_unit=events/second, duration_unit=milliseconds
2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 1040
2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 0
2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 10
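For a rough sense of why the reproduction setup runs out of DNs so easily, the numbers can be worked through as below. This is only a back-of-the-envelope sketch: the class and method names are hypothetical, and it assumes that each EC block group places one replica on each of 14 distinct DNs, that a block group holds `data count * ozone.scm.block.size` bytes of user data, and that a container fills after roughly `container size / block size` full blocks.

```java
/** Hypothetical arithmetic for the reproduction: 45 DNs, rs-10-4, 16MB blocks, 256MB containers. */
final class EcReproArithmetic {
    static final long MB = 1024L * 1024L;

    static final int DATANODES = 45;                  // docker simulated DNs
    static final int DATA = 10;                       // data replicas in rs-10-4
    static final int PARITY = 4;                      // parity replicas in rs-10-4
    static final long BLOCK_SIZE = 16 * MB;           // ozone.scm.block.size
    static final long CONTAINER_SIZE = 256 * MB;      // ozone.scm.container.size
    static final long KEY_SIZE = 4L * 1024 * 1024 * 1024; // -s argument passed to freon

    /** Distinct DNs needed for one block group (assumes one replica per DN). */
    static int nodesPerGroup() {
        return DATA + PARITY;
    }

    /** User data per block group (assumes each data replica holds one block). */
    static long dataBytesPerGroup() {
        return DATA * BLOCK_SIZE;
    }

    /** Block groups needed for one key, rounded up. */
    static long groupsPerKey() {
        return (KEY_SIZE + dataBytesPerGroup() - 1) / dataBytesPerGroup();
    }

    /** Full blocks that fit in one container before it is closed as full (assumption). */
    static long blocksPerContainer() {
        return CONTAINER_SIZE / BLOCK_SIZE;
    }

    /** Largest number of excluded DNs that still leaves enough for one block group. */
    static int maxExcludableDatanodes() {
        return DATANODES - nodesPerGroup();
    }

    public static void main(String[] args) {
        System.out.println("DNs per block group: " + nodesPerGroup());
        System.out.println("block groups per key: " + groupsPerKey());
        System.out.println("blocks per container: " + blocksPerContainer());
        System.out.println("max excludable DNs: " + maxExcludableDatanodes());
    }
}
```

Under these assumptions each 4GB key needs 26 block groups and each container fills after about 16 blocks, so container-full closes are frequent, and once more than 31 of the 45 DNs have been excluded a new block group can no longer be placed.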