Apache Ozone / HDDS-3816 Erasure Coding / HDDS-6373

EC: Exclude the pipeline upon container close instead of excluding DNs.


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • EC-Branch

    Description

      When a container is closed because it is full, the DN replies to the client with a ContainerNotOpenException. This does not mean that the DN has failed, so it should not be excluded from new block group allocation. Otherwise many HEALTHY DNs may end up excluded, and allocating a new block group may fail in a small cluster.
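
The distinction described above can be sketched as follows. This is a minimal illustration only: `ExcludeSketch`, `Failure`, `Excludes`, and `onWriteFailure` are hypothetical names, not the Ozone client classes, and the actual patch may be structured differently. The assumption is that the client keeps an exclude list with separate sets for pipelines and datanodes and sends it with the next block group allocation request.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are NOT the Ozone client classes.
class ExcludeSketch {

  /** Simplified failure categories a client might observe while writing a stripe. */
  enum Failure { CONTAINER_NOT_OPEN, OTHER }

  /** Hypothetical exclude list sent along with the next block group allocation. */
  static final class Excludes {
    final Set<String> pipelines = new HashSet<>();
    final Set<String> datanodes = new HashSet<>();
  }

  /**
   * A closed container says nothing about datanode health, so only the
   * pipeline is excluded. Any other failure is treated as a datanode problem.
   */
  static void onWriteFailure(Failure failure, String pipelineId,
      Set<String> failedDatanodes, Excludes excludes) {
    if (failure == Failure.CONTAINER_NOT_OPEN) {
      excludes.pipelines.add(pipelineId);
    } else {
      excludes.datanodes.addAll(failedDatanodes);
    }
  }

  public static void main(String[] args) {
    Excludes excludes = new Excludes();
    onWriteFailure(Failure.CONTAINER_NOT_OPEN, "pipeline-1",
        Collections.singleton("dn-1"), excludes);
    onWriteFailure(Failure.OTHER, "pipeline-2",
        Collections.singleton("dn-7"), excludes);
    System.out.println("excluded pipelines: " + excludes.pipelines);
    System.out.println("excluded datanodes: " + excludes.datanodes);
  }
}
```

In this sketch the datanodes behind `pipeline-1` stay eligible for later allocations, while `dn-7` is excluded because its failure was not a container close.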

      E.g.

      45 DNs (simulated with Docker), ozone-site.xml:

        <property>
          <name>ozone.scm.container.size</name>
          <value>256MB</value>
        </property>

        <property>
          <name>ozone.scm.block.size</name>
          <value>16MB</value>
        </property>

      Test with Freon ockg:

      ./bin/ozone freon ockg --type=EC --replication=rs-10-4-1024k -p test -n 10 -t 10 -s $((4 * 1024 * 1024 * 1024))

      With only HDDS-6364 patched, this results in 5-8 failures:

      INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Allocated 0 blocks. Requested 1 blocks
              at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:660)
              at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:695)
              at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:309)
              at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:371)
              at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.rewriteStripeToNewBlockGroup(ECKeyOutputStream.java:244)
              at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.handleStripeFailure(ECKeyOutputStream.java:586)
              at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.checkAndWriteParityCells(ECKeyOutputStream.java:306)
              at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.write(ECKeyOutputStream.java:192)
              at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:50)
              at org.apache.hadoop.ozone.freon.ContentGenerator.write(ContentGenerator.java:76)
              at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:146)
              at com.codahale.metrics.Timer.time(Timer.java:101)
              at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:143)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
              Suppressed: java.lang.IllegalArgumentException: Expected writeOffset= 1069543424 Expected offset=1059061760
                      at com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
                      at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:564)
                      at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:61)
                      at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:151)
                      ... 8 more
      One ore more freon test is failed.
      2022-02-24 08:41:44,272 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=313491.661668, max=577254.304029, mean=563762.9508485134, stddev=44787.24799551536, median=575542.093982, p75=577254.304029, p95=577254.304029, p98=577254.304029, p99=577254.304029, p999=577254.304029, mean_rate=0.017322637056902915, m1=0.029562618662863496, m5=0.014855802773079099, m15=0.007191674083204336, rate_unit=events/second, duration_unit=milliseconds
      2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 578
      2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 6
      2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 4 
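
A rough back-of-the-envelope check suggests why the 45-DN cluster runs out of candidates so quickly. The sketch below is an illustration under stated assumptions, not a model of SCM's placement logic: it assumes that each container close excludes all 14 DNs (10 data + 4 parity) of the affected rs-10-4 block group, that the excluded sets do not overlap, and that exclusions accumulate for the lifetime of one key write. `ExclusionArithmetic` and its methods are hypothetical names.

```java
// Hypothetical helper for illustration; not part of Ozone.
class ExclusionArithmetic {

  /**
   * Counts how many block groups can be placed on disjoint sets of datanodes
   * before fewer than nodesPerGroup non-excluded datanodes remain.
   */
  static int groupsBeforeExhaustion(int totalDatanodes, int nodesPerGroup) {
    int remaining = totalDatanodes;
    int groups = 0;
    while (remaining >= nodesPerGroup) {
      remaining -= nodesPerGroup; // every DN of the group ends up excluded
      groups++;
    }
    return groups;
  }

  /** Blocks that fit into one container, ignoring any metadata overhead. */
  static int blocksPerContainer(int containerSizeMb, int blockSizeMb) {
    return containerSizeMb / blockSizeMb;
  }

  public static void main(String[] args) {
    int nodesPerGroup = 10 + 4; // rs-10-4: 10 data + 4 parity
    System.out.println("blocks per container: " + blocksPerContainer(256, 16));
    System.out.println("groups before exhaustion: "
        + groupsBeforeExhaustion(45, nodesPerGroup));
  }
}
```

Under these assumptions a container fills after about 16 blocks, and three container closes during a single key write exclude 42 of the 45 DNs, leaving too few nodes for another rs-10-4 block group.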

      With this fix and HDDS-6364 applied together, all 10 executions succeed across many rounds.

      2022-02-24 10:56:45,013 [Thread-4] INFO freon.ProgressBar: Progress: 90.00 % (9 out of 10)
      2022-02-24 10:56:46,013 [Thread-4] INFO freon.ProgressBar: Progress: 100.00 % (10 out of 10)
      2022-02-24 10:56:46,257 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=958022.893372, max=1038271.448129, mean=1018238.201558835, stddev=22083.604143242464, median=1029968.020144, p75=1034239.403617, p95=1038271.448129, p98=1038271.448129, p99=1038271.448129, p999=1038271.448129, mean_rate=0.009623163938983789, m1=0.09995782091693355, m5=0.02731461121892791, m15=0.009684867189776935, rate_unit=events/second, duration_unit=milliseconds
      2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 1040
      2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 0
      2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 10 

       

            People

              markgui Mark Gui