Apache Ozone / HDDS-9126

[ozone-snapshot] Unordered deletion of snapshots corrupting OM

Details

    Description

      Test scenario :

      The test test_unordered_deletion deletes snapshots in random order. While doing so, it hits the exception below in the OM more often than not.

      Once the error occurs, the OM goes into an unhealthy state, and none of the subsequent tests can run.
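
      The test code itself is not attached to this issue. As a rough illustration of the scenario only, the sketch below shuffles a list of snapshot names and issues one delete per snapshot. The function names, the injectable `run` parameter, and the exact `ozone sh snapshot delete <bucket> <name>` command form are assumptions for illustration, not taken from the test.

```python
import random
import subprocess
from typing import Callable, List, Optional, Sequence


def build_delete_commands(bucket_path: str,
                          snapshots: Sequence[str],
                          seed: Optional[int] = None) -> List[List[str]]:
    """Return one delete command per snapshot, in shuffled order.

    The command form is an assumption about the Ozone shell; adjust it
    to match the cluster's CLI.
    """
    order = list(snapshots)
    # A seeded Random makes a failing order reproducible.
    random.Random(seed).shuffle(order)
    return [["ozone", "sh", "snapshot", "delete", bucket_path, name]
            for name in order]


def delete_unordered(bucket_path: str,
                     snapshots: Sequence[str],
                     run: Callable = subprocess.run,
                     seed: Optional[int] = None) -> List[List[str]]:
    """Run the shuffled delete commands and return them in execution order."""
    commands = build_delete_commands(bucket_path, snapshots, seed)
    for cmd in commands:
        # check=True raises if a delete command exits non-zero.
        run(cmd, check=True)
    return commands
```

      Logging the seed together with the returned command list would make it possible to replay the exact deletion order that triggered the failure.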

      The snapshot is deleted:

      2023-08-06 06:33:27,113 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotDeleteRequest: Deleted snapshot 'snap-ae5or' under path 'vol-w19gk/buck-f9sqw'
      

      Soon after, during a copy:

      2023-08-06 06:39:06,314|INFO|MainThread|machine.py:188 - run()||GUID=5210f279-e5c7-4ee9-b652-b49a6b0eb07a|RUNNING: /opt/cloudera/parcels/CDH/bin/ozone fs -cp ofs://ozone1/vol-w19gk/buck-f9sqw/.snapshot/snap-5qmtv/key_1691303390 ofs://ozone1/vol-w19gk/buck-f9sqw/
      

      OM log stacktrace:

      2023-08-06 06:33:38,126 INFO [SstFilteringService#0]-org.apache.hadoop.hdds.utils.db.RocksDatabase: Deleting sst file /000396.sst corresponding to column family keyTable from db: /var/lib/hadoop-ozone/om/data293349/db.snapshots/checkpointState/om.db-0ccb08e9-c5ab-45bb-a71e-8444a2142511
      2023-08-06 06:33:38,127 INFO [SstFilteringService#0]-org.apache.hadoop.hdds.utils.db.managed.ManagedRocksObjectUtils: Waited for 1 milliseconds for file /var/lib/hadoop-ozone/om/data293349/db.snapshots/checkpointState/om.db-0ccb08e9-c5ab-45bb-a71e-8444a2142511/000396.sst deletion.
      2023-08-06 06:34:37,938 INFO [SstFilteringService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache: Loading snapshot. Table key: /vol-w19gk/buck-f9sqw/snap-ae5or
      2023-08-06 06:34:37,938 INFO [SstFilteringService#0]-org.apache.hadoop.ozone.om.helpers.OmKeyInfo: OmKeyInfo.getCodec ignorePipeline = true
      2023-08-06 06:34:37,989 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Error during Snapshot sst filtering
      FILE_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. Snapshot with table key '/vol-w19gk/buck-f9sqw/snap-ae5or' is no longer active
          at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:205)
          at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)
          at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:178)
          at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
          at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
      2023-08-06 06:35:30,232 INFO [pool-8-thread-1]-org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer: Removing SST files: [000410, 000453, 000496, 000253, 000374, 000535, 000611, 000456, 000417, 000658, 000338, 000459, 000380, 000185, 000124, 000245, 000443, 000200, 000563, 000364, 000562, 000128, 000447, 000248, 000688, 000324, 000522, 000367, 000209, 000407, 000129, 000602, 000290, 000296, 000692, 000130, 000372, 000690, 000172, 000293, 000157, 000355, 000399, 000674, 000233, 000277, 000310, 000398, 000552, 000596, 000474, 000352, 000550, 000315, 000359, 000634, 000236, 000599, 000554, 000638, 000637, 000559, 000514, 000518, 000160, 000681, 000163, 000284, 000162, 000344, 000663, 000264, 000462, 000425, 000667, 000225, 000302, 000467, 000588, 000301, 000506, 000307, 000504, 000668, 000628, 000193, 000391, 000197] as part of SST file pruning.
      2023-08-06 06:35:37,937 INFO [SstFilteringService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache: Loading snapshot. Table key: /vol-w19gk/buck-f9sqw/snap-ae5or
      2023-08-06 06:35:37,937 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Error during Snapshot sst filtering
      FILE_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. Snapshot with table key '/vol-w19gk/buck-f9sqw/snap-ae5or' is no longer active 
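
      In the excerpt above, SstFilteringService repeatedly tries to load snap-ae5or after the delete request for it was logged. To check the attached OM logs for the same pattern, a small script along the following lines could help. It is a sketch that assumes the two message formats shown in this issue, and the function name is hypothetical.

```python
import re
from typing import Dict, Iterable

# Message formats copied from the log lines quoted in this issue.
DELETED_RE = re.compile(
    r"Deleted snapshot '(?P<name>[^']+)' under path '(?P<path>[^']+)'")
INACTIVE_RE = re.compile(
    r"Snapshot with table key '(?P<key>[^']+)' is no longer active")


def count_loads_after_delete(lines: Iterable[str]) -> Dict[str, int]:
    """Count 'no longer active' errors per snapshot logged after its delete."""
    deleted = set()
    hits: Dict[str, int] = {}
    for line in lines:
        match = DELETED_RE.search(line)
        if match:
            # Table keys in the log have the form /<volume>/<bucket>/<snapshot>.
            deleted.add("/" + match["path"] + "/" + match["name"])
            continue
        match = INACTIVE_RE.search(line)
        if match and match["key"] in deleted:
            hits[match["key"]] = hits.get(match["key"], 0) + 1
    return hits
```

      For example, `count_loads_after_delete(open("ozone-om-quasar-csvjze-1.log"))` would report, per snapshot table key, how many failed load attempts followed its deletion in that log.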
      
      

      Attachments

        1. ozone-om-quasar-csvjze-2.log
          3.29 MB
          Soumitra Sulav
        2. ozone-om-quasar-csvjze-1.log
          5.34 MB
          Soumitra Sulav
        3. console.log
          6.31 MB
          Soumitra Sulav
        4. ozone-om-quasar-csvjze-3.log
          3.68 MB
          Soumitra Sulav
        5. ozone-scm-quasar-csvjze-3.log
          6.60 MB
          Soumitra Sulav
        6. ozone-scm-quasar-csvjze-2.log
          11.22 MB
          Soumitra Sulav
        7. ozone-scm-quasar-csvjze-1.log
          11.03 MB
          Soumitra Sulav

            People

              sadanand_shenoy Sadanand Shenoy
              ssulav Soumitra Sulav