  1. Apache Ozone
  2. HDDS-9876

OzoneManagerStateMachine should add response to OzoneManagerDoubleBuffer for every write request


Details

    Description

      This task is to resolve the issues in HDDS-9342.

      HDDS-2680 introduced logic in OzoneManagerStateMachine that calculates lastAppliedTermIndex from two maps, applyTransactionMap and ratisTransactionMap. Every write request that arrives from Ratis through applyTransaction adds its trxLogIndex to applyTransactionMap. Every write request flushed by OzoneManagerDoubleBuffer#flushBatch has its trxLogIndex removed from applyTransactionMap when flushBatch calls ozoneManagerRatisSnapShot.updateLastAppliedIndex(flushedEpochs).

      If a write request from Ratis does not go through OzoneManagerDoubleBuffer#flushBatch, its trxLogIndex stays in applyTransactionMap forever. Because lastAppliedIndex can only advance incrementally, any trxLogIndex that is never confirmed by an OzoneManagerDoubleBuffer flush stops lastAppliedIndex from growing past it. Write requests after the unconfirmed one can still be flushed, but their trxLogIndex values are added to ratisTransactionMap, which therefore keeps growing.
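      The stall described above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions: the class AppliedIndexTracker and its methods are hypothetical, the two collections merely borrow the names applyTransactionMap and ratisTransactionMap, and the real OzoneManagerStateMachine code is more involved (for example, it also tracks terms).

```java
import java.util.TreeSet;

/** Hypothetical, simplified model of the index tracking; not the real Ozone code. */
class AppliedIndexTracker {
    // Indexes seen by applyTransaction but not yet confirmed by a flush.
    private final TreeSet<Long> applyTransactionMap = new TreeSet<>();
    // Flushed indexes that cannot be applied yet because an earlier index is missing.
    private final TreeSet<Long> ratisTransactionMap = new TreeSet<>();
    private long lastAppliedIndex;

    AppliedIndexTracker(long initialIndex) {
        this.lastAppliedIndex = initialIndex;
    }

    /** Called for every write request applied from the log. */
    void applyTransaction(long trxLogIndex) {
        applyTransactionMap.add(trxLogIndex);
    }

    /** Called when the double buffer confirms that a batch of indexes was flushed. */
    void onFlush(long... flushedIndexes) {
        for (long index : flushedIndexes) {
            if (applyTransactionMap.remove(index)) {
                ratisTransactionMap.add(index);
            }
        }
        // lastAppliedIndex only advances over a contiguous run of flushed indexes.
        while (ratisTransactionMap.remove(lastAppliedIndex + 1)) {
            lastAppliedIndex++;
        }
    }

    long getLastAppliedIndex() { return lastAppliedIndex; }
    int unconfirmedCount() { return applyTransactionMap.size(); }
    int pendingFlushedCount() { return ratisTransactionMap.size(); }

    public static void main(String[] args) {
        AppliedIndexTracker tracker = new AppliedIndexTracker(0L);
        for (long i = 1; i <= 5; i++) {
            tracker.applyTransaction(i);
        }
        // Index 3 failed before reaching the double buffer, so it is never flushed.
        tracker.onFlush(1L, 2L, 4L, 5L);
        System.out.println(tracker.getLastAppliedIndex()); // prints 2
        System.out.println(tracker.pendingFlushedCount()); // prints 2 (indexes 4 and 5)
        System.out.println(tracker.unconfirmedCount());    // prints 1 (index 3)
    }
}
```

      In this model, every later flush only adds to the pending set, because index 3 never arrives to close the gap.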

      How can a write request end up unconfirmed by an OzoneManagerDoubleBuffer flush? Here is one case reproduced locally.
      T1: create bucket1
      T2: client1 sends a delete bucket "bucket1" request to OM. OM verifies that bucket1 exists, then submits the request to Ratis.
      T3: client2 sends a create key "bucket1/key1" request to OM. OM verifies that bucket1 exists, then submits the request to Ratis.
      T4: OzoneManagerStateMachine executes the delete bucket "bucket1" request successfully and returns the response to client1.
      T5: OzoneManagerStateMachine executes the create key "bucket1/key1" request; "bucket1" cannot be found, so execution fails and a failure is returned to client2.
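      The timeline above is a check-then-act race: both requests pass validation before either one is applied. A minimal sketch of that ordering follows, assuming an in-memory set stands in for the bucket table and a list of suppliers stands in for the replicated log. The class BucketRaceDemo and the result strings are illustrative, not Ozone APIs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical replay of the T1..T5 timeline; not the real Ozone code. */
class BucketRaceDemo {
    static List<String> run() {
        Set<String> buckets = new HashSet<>();
        List<Supplier<String>> log = new ArrayList<>();

        // T1: create bucket1.
        buckets.add("bucket1");

        // T2: the delete request passes the existence check and is queued.
        if (buckets.contains("bucket1")) {
            log.add(() -> buckets.remove("bucket1") ? "OK" : "BUCKET_NOT_FOUND");
        }

        // T3: the create key request passes the same check, because the
        // delete has only been queued and not applied yet.
        if (buckets.contains("bucket1")) {
            log.add(() -> buckets.contains("bucket1") ? "OK" : "BUCKET_NOT_FOUND");
        }

        // T4 and T5: the queued entries are applied in log order.
        List<String> results = new ArrayList<>();
        for (Supplier<String> entry : log) {
            results.add(entry.get());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [OK, BUCKET_NOT_FOUND]
    }
}
```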

      In T5, the failure stack is:

      2023-10-18 19:04:10,131 [OM StateMachine ApplyTransaction Thread - 0] WARN org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Failed to write, Exception occurred 
      BUCKET_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Bucket not found: s3v/prod-voyager
      at org.apache.hadoop.ozone.om.OzoneManagerUtils.reportNotFound(OzoneManagerUtils.java:87)
      at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketInfo(OzoneManagerUtils.java:72)
      at org.apache.hadoop.ozone.om.OzoneManagerUtils.resolveBucketInfoLink(OzoneManagerUtils.java:148)
      at org.apache.hadoop.ozone.om.OzoneManagerUtils.getResolvedBucketInfo(OzoneManagerUtils.java:124)
      at org.apache.hadoop.ozone.om.OzoneManagerUtils.getBucketLayout(OzoneManagerUtils.java:106)
      at org.apache.hadoop.ozone.om.request.BucketLayoutAwareOMKeyRequestFactory.createRequest(BucketLayoutAwareOMKeyRequestFactory.java:230)
      at org.apache.hadoop.ozone.om.ratis.utils.OzoneManagerRatisUtils.createClientRequest(OzoneManagerRatisUtils.java:336)
      at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:380)
      at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:572)
      at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:362)
      at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      In OzoneManagerStateMachine.runCommand, when an IOException is thrown from OzoneManagerRequestHandler.handleWriteRequest, runCommand constructs an OMResponse and returns it to the client, but it does not add the response to OzoneManagerDoubleBuffer. OzoneManagerDoubleBuffer is therefore not aware of this request or its trxLogIndex. As a consequence, this trxLogIndex stays in applyTransactionMap forever.
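      One possible shape for the change named in the issue title is sketched below: the error path also hands a response to the double buffer, so the trxLogIndex is eventually confirmed by a flush. This is an assumption about the general approach, not the actual patch. The types DoubleBuffer and WriteHandler, the String responses, and this runCommand signature are hypothetical stand-ins, and the sketch simplifies by adding to the buffer in runCommand on both paths.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the intended behavior; not the real Ozone code. */
class RunCommandSketch {
    /** Stand-in for OzoneManagerDoubleBuffer: records which indexes were queued. */
    static class DoubleBuffer {
        final List<Long> queued = new ArrayList<>();

        void add(String response, long trxLogIndex) {
            queued.add(trxLogIndex);
        }
    }

    /** Stand-in for the write request handler, which may throw IOException. */
    interface WriteHandler {
        String handle(String request, long trxLogIndex) throws IOException;
    }

    static String runCommand(String request, long trxLogIndex,
                             WriteHandler handler, DoubleBuffer buffer) {
        String response;
        try {
            response = handler.handle(request, trxLogIndex);
        } catch (IOException e) {
            // Build an error response instead of returning early.
            response = "ERROR: " + e.getMessage();
        }
        // Every write request, successful or failed, reaches the double buffer,
        // so its trxLogIndex can be confirmed by a later flush.
        buffer.add(response, trxLogIndex);
        return response;
    }

    public static void main(String[] args) {
        DoubleBuffer buffer = new DoubleBuffer();
        String response = runCommand("createKey bucket1/key1", 3L,
            (req, idx) -> { throw new IOException("Bucket not found"); }, buffer);
        System.out.println(response);      // prints ERROR: Bucket not found
        System.out.println(buffer.queued); // prints [3]
    }
}
```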

            People

              Sammi Chen