Apache Ozone / HDDS-10548

[snapshot-LR] OM is getting shutdown due to Snapshot chain corruption in an LR setup

Details

    Description

      OM shuts down due to snapshot chain corruption in an LR setup.

      Scenario:

      • Generate data from parallel threads across various volumes/buckets
      • Perform parallel snapshot create/delete/list operations on those buckets
      • Perform parallel snapdiff operations on each bucket
      • Perform parallel reads of snapshot contents
      • Introduce OM and cluster restarts in between, along with DN decommissioning and balancer restarts
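
      For anyone trying to reproduce this, the snapshot create/list/delete part of the scenario could be driven roughly as in the sketch below. This is only an illustration under assumptions: the SnapshotOps interface and the SnapshotStressSketch class are hypothetical names, not Ozone APIs, and would need to be backed by a real client. Data generation, snapdiff, snapshot reads, restarts, decommissioning, and the balancer are not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class SnapshotStressSketch {

    /** Hypothetical abstraction over the cluster; NOT an Ozone API. */
    interface SnapshotOps {
        void createSnapshot(String bucket, String name) throws Exception;
        void deleteSnapshot(String bucket, String name) throws Exception;
        List<String> listSnapshots(String bucket) throws Exception;
    }

    /**
     * Runs create/list/delete rounds for every bucket in parallel and
     * returns the number of bucket tasks that failed with an exception.
     */
    static int run(SnapshotOps ops, List<String> buckets, int rounds, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        for (String bucket : buckets) {
            // The lambda returns a value, so it is submitted as a Callable
            // and may throw checked exceptions.
            futures.add(pool.submit(() -> {
                for (int i = 0; i < rounds; i++) {
                    String name = "snap-" + i;
                    ops.createSnapshot(bucket, name);
                    ops.listSnapshots(bucket);
                    ops.deleteSnapshot(bucket, name);
                }
                return null;
            }));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        int failures = 0;
        for (Future<?> f : futures) {
            try {
                f.get();
            } catch (Exception e) {
                failures++; // the task for one bucket aborted
            }
        }
        return failures;
    }
}
```

      One task per bucket keeps the operations for a given bucket ordered while different buckets still run concurrently; a more aggressive variant might interleave creates and deletes on the same bucket.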

      OM leader error stack trace:

      2024-03-14 04:07:13,525 INFO [JvmPauseMonitor0]-org.apache.ratis.util.JvmPauseMonitor: JvmPauseMonitor-om232: Started
      2024-03-14 04:07:13,534 INFO [main]-org.apache.hadoop.ozone.om.OzoneManager: Starting secret key client.
      2024-03-14 04:07:13,615 ERROR [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Terminating with exit status 1: OM Ratis Server has received unrecoverable error, to avoid further DB corruption, terminating OM. Error Response received is:cmdType: CreateSnapshot
      traceID: ""
      success: false
      message: "java.io.IOException: Snapshot chain is corrupted.\n\tat org.apache.hadoop.ozone.om.SnapshotChainManager.validateSnapshotChain(SnapshotChainManager.java:550)\n\tat org.apache.hadoop.ozone.om.SnapshotChainManager.getLatestPathSnapshotId(SnapshotChainManager.java:378)\n\tat org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest.addSnapshotInfoToSnapshotChainAndCache(OMSnapshotCreateRequest.java:232)\n\tat org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest.validateAndUpdateCache(OMSnapshotCreateRequest.java:162)\n\tat org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:378)\n\tat org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)\n\tat org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:363)\n\tat java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"
      status: INTERNAL_ERROR
      INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: java.io.IOException: Snapshot chain is corrupted.
              at org.apache.hadoop.ozone.om.SnapshotChainManager.validateSnapshotChain(SnapshotChainManager.java:550)
              at org.apache.hadoop.ozone.om.SnapshotChainManager.getLatestPathSnapshotId(SnapshotChainManager.java:378)
              at org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest.addSnapshotInfoToSnapshotChainAndCache(OMSnapshotCreateRequest.java:232)
              at org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest.validateAndUpdateCache(OMSnapshotCreateRequest.java:162)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:378)
              at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)
              at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:363)
              at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
              at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.terminate(OzoneManagerStateMachine.java:404)
              at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$2(OzoneManagerStateMachine.java:379)
              at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
              at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
              at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
              at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1609)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      2024-03-14 04:07:13,620 INFO [shutdown-hook-0]-org.apache.ranger.audit.provider.AuditProviderFactory: ==> JVMShutdownHook.run()
      2024-03-14 04:07:13,621 INFO [shutdown-hook-0]-org.apache.ranger.audit.provider.AuditProviderFactory: JVMShutdownHook: Signalling async audit cleanup to start. 
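
      The trace shows the CreateSnapshot request failing inside SnapshotChainManager.validateSnapshotChain, after which the state machine terminates OM. As a rough illustration of what a corrupted chain can look like, here is a toy model in which each snapshot records the ID of its predecessor. ToySnapshotChain and its methods are hypothetical and are not taken from the Ozone source; only the exception message mirrors the log above. The remove method deliberately skips relinking neighbors so that a broken chain can be produced.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class ToySnapshotChain {
    // snapshot ID -> ID of the previous snapshot (null for the first one)
    private final Map<String, String> previousById = new LinkedHashMap<>();

    void add(String id, String previousId) {
        previousById.put(id, previousId);
    }

    /** Removes an entry WITHOUT relinking its neighbors (simulates a bug). */
    void remove(String id) {
        previousById.remove(id);
    }

    /** Throws if the entries do not form one unbroken chain. */
    void validate() throws IOException {
        if (previousById.isEmpty()) {
            return;
        }
        Map<String, String> nextById = new HashMap<>();
        String head = null;
        int heads = 0;
        for (Map.Entry<String, String> e : previousById.entrySet()) {
            String prev = e.getValue();
            if (prev == null) {
                head = e.getKey();
                heads++;
            } else if (nextById.put(prev, e.getKey()) != null) {
                // two snapshots claim the same predecessor: the chain forks
                throw new IOException("Snapshot chain is corrupted.");
            }
        }
        if (heads != 1) {
            throw new IOException("Snapshot chain is corrupted.");
        }
        // Walk forward from the head; every entry must be reachable.
        int visited = 1;
        String current = head;
        while (nextById.containsKey(current)) {
            current = nextById.get(current);
            visited++;
        }
        if (visited != previousById.size()) {
            // dangling predecessor or a detached group of entries
            throw new IOException("Snapshot chain is corrupted.");
        }
    }
}
```

      In this model, adding s1, s2 (after s1), and s3 (after s2) validates cleanly, while removing s2 without relinking leaves s3 pointing at a missing predecessor and validate() throws. Whether the failure reported here arises the same way is not established by the log.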

            People

              Assignee: Unassigned
              Reporter: Jyotirmoy Sinha (jyosin)