Apache Ozone / HDDS-10738

Unable to load snapshot exception encountered in a LR setup

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Components: OM, Snapshot

    Description

      Scenario:

      • Generate data from parallel threads across multiple volumes/buckets
      • Run snapshot create/delete/list operations in parallel on those buckets
      • Run snapdiff operations in parallel on each bucket
      • Read snapshot contents in parallel
      • Introduce OM and cluster restarts during the run, along with DN decommissioning and balancer restarts
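The parallel part of this scenario could be driven by a small harness like the one below. This is only a sketch under stated assumptions, not the reporter's test code: the class name `ParallelSnapshotWorkload`, the method `runAll`, and the `/vol-example/buck-N` paths are hypothetical, and the task body is a placeholder where the real snapshot create/list/snapdiff/read calls would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSnapshotWorkload {

  /** Runs all tasks on a fixed-size pool and returns one message per failed task. */
  public static List<String> runAll(List<Callable<Void>> tasks, int threads)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<String> failures = new ArrayList<>();
    try {
      // invokeAll blocks until every task has either completed or thrown.
      for (Future<Void> result : pool.invokeAll(tasks)) {
        try {
          result.get();
        } catch (ExecutionException e) {
          // Record the failure and keep going so that one bad bucket
          // does not hide errors on the others.
          failures.add(String.valueOf(e.getCause()));
        }
      }
    } finally {
      pool.shutdown();
    }
    return failures;
  }

  public static void main(String[] args) throws Exception {
    List<Callable<Void>> tasks = new ArrayList<>();
    for (int i = 0; i < 4; i++) {
      final String bucket = "/vol-example/buck-" + i; // hypothetical path
      tasks.add(() -> {
        // Placeholder: snapshot create, list, snapdiff and read on this bucket.
        System.out.println("snapshot operations on " + bucket);
        return null;
      });
    }
    System.out.println("failures: " + runAll(tasks, 4));
  }
}
```

Collecting failures per task instead of failing fast makes it easier to see whether the error is confined to a few snapshots or spread across buckets.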

      Observation: When multiple threads run snapshot operations in parallel, the contents of a snapshot path are still not accessible more than 20 minutes after the snapshot was created.

      Snapshot creation log (OM):

      2024-04-22 11:18:49,123 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created snapshot: 'snap1713809817' with snapshotId: '849a95ab-c5bc-4b78-9d0a-fdad34fd331a' under path 'vol-dp4tz/buck-cp6e6' 

      OM error stack trace:

      2024-04-22 11:42:04,768 INFO [IPC Server handler 37 on 9862]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Checkpoint directory: 60 didn't get created in /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a secs.
      2024-04-22 11:42:04,768 ERROR [IPC Server handler 37 on 9862]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve snapshot: /vol-dp4tz/buck-cp6e6/snap1713809817
      TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. Snapshot checkpoint directory '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a' does not exist yet. Please wait a few more seconds before retrying
              at org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.checkSnapshotDirExist(SnapshotUtils.java:113)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
              at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
              at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
              at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
              at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
              at org.apache.hadoop.ozone.om.OzoneManager.getReader(OzoneManager.java:4634)
              at org.apache.hadoop.ozone.om.OzoneManager.getFileStatus(OzoneManager.java:3572)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOzoneFileStatus(OzoneManagerRequestHandler.java:1002)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:257)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:220)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:174)
              at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
              at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:143)
              at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
      2024-04-22 11:42:04,770 WARN [IPC Server handler 37 on 9862]-org.apache.hadoop.ipc.Server: IPC Server handler 37 on 9862, call Call#2 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 10.17.207.24:57514
      java.lang.IllegalStateException: TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. Snapshot checkpoint directory '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a' does not exist yet. Please wait a few more seconds before retrying 

      The error above occurs repeatedly for multiple snapshots. It shows up mostly when snapshot operations run in parallel across various volumes/buckets, and not when they run serially.
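For context, the INFO line from `RDBCheckpointUtils` reads as if the OM waited for the checkpoint directory and then gave up (the `60` and the path in that message look swapped, so the wait was presumably 60 seconds). The general pattern of polling for a directory with a deadline can be sketched as follows. This is an illustration using only JDK calls, not Ozone's actual implementation; the class `CheckpointDirWait`, the method `waitForDirectory`, and the directory names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

public class CheckpointDirWait {

  /**
   * Polls until dir exists as a directory or the timeout elapses.
   * Returns true if the directory was seen, false on timeout.
   */
  public static boolean waitForDirectory(Path dir, Duration timeout, Duration pollInterval)
      throws InterruptedException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (Files.isDirectory(dir)) {
        return true;
      }
      // Compare via subtraction so the check stays correct if nanoTime wraps.
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      Thread.sleep(pollInterval.toMillis());
    }
  }

  public static void main(String[] args) throws Exception {
    Path base = Files.createTempDirectory("checkpointState");
    Path present = Files.createDirectory(base.resolve("om.db-present")); // hypothetical name
    Path missing = base.resolve("om.db-missing");                        // never created

    System.out.println(waitForDirectory(present, Duration.ofSeconds(1), Duration.ofMillis(50)));
    System.out.println(waitForDirectory(missing, Duration.ofMillis(200), Duration.ofMillis(50)));
  }
}
```

A wait like this only helps when the directory is created late. In this report the directory was still absent more than 20 minutes after the snapshot was created, which suggests the checkpoint was never created or was removed, not that the timeout was too short.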

          People

            hemantk Hemant Kumar
            jyosin Jyotirmoy Sinha