Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-3723 OzoneManager HA (phase II)
  3. HDDS-3642

Stop/Pause Background services while replacing OM DB with checkpoint from Leader

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • 1.0.0
    • OM HA

    Description

      When a follower OM needs to replace its DB with a checkpoint from Leader (to catch up on the transactions), it should pause or stop services which read/ write to the DB. 

      During OM HA testing, found that OM could crash with JVM error on RocksDB. This happened because KeyDeletingService was trying to access a memory which is already freed up.

      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x00007f19de835af0, pid=1389, tid=1712
      #
      # JRE version: OpenJDK Runtime Environment (11.0.6+10) (build 11.0.6+10-LTS)
      # Java VM: OpenJDK 64-Bit Server VM (11.0.6+10-LTS, mixed mode, sharing, tiered, compressed oops, concurrent mark sweep gc, linux-amd64)
      # Problematic frame:
      # C  [librocksdbjni10001996641283911793.so+0x1aeaf0]  Java_org_rocksdb_RocksIterator_seekToFirst0+0x0
      #
      # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /opt/core.1389)
      #
      # An error report file with more information is saved as:
      # /opt/hs_err_pid1389.log
      
      

      From the hs_error log file:

      ---------------  T H R E A D  ---------------Current thread (0x00000000011a4000):  JavaThread "KeyDeletingService#1" daemon [_thread_in_native, id=1712, stack(0x00007f19d2443000,0x00007f19d2544000)]Stack: [0x00007f19d2443000,0x00007f19d2544000],  sp=0x00007f19d2541e78,  free space=1019k
      Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
      C  [librocksdbjni10001996641283911793.so+0x1aeaf0]  Java_org_rocksdb_RocksIterator_seekToFirst0+0x0
      j  org.rocksdb.AbstractRocksIterator.seekToFirst()V+26
      j  org.apache.hadoop.hdds.utils.db.RDBStoreIterator.<init>(Lorg/rocksdb/RocksIterator;)V+13
      j  org.apache.hadoop.hdds.utils.db.RDBTable.iterator()Lorg/apache/hadoop/hdds/utils/db/TableIterator;+30
      j  org.apache.hadoop.hdds.utils.db.TypedTable.iterator()Lorg/apache/hadoop/hdds/utils/db/TableIterator;+4
      j  org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(I)Ljava/util/List;+8
      j  org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(I)Ljava/util/List;+5
      j  org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call()Lorg/apache/hadoop/hdds/utils/BackgroundTaskResult;+39
      j  org.apache.hadoop.ozone.om.KeyDeletingService$KeyDeletingTask.call()Ljava/lang/Object;+1
      J 4791 c1 java.util.concurrent.FutureTask.run()V java.base@11.0.6 (123 bytes) @ 0x00007f19f0c7b414 [0x00007f19f0c7ad20+0x00000000000006f4]
      J 4802 c1 java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; java.base@11.0.6 (14 bytes) @ 0x00007f19f0c87214 [0x00007f19f0c870e0+0x0000000000000134]
      J 4791 c1 java.util.concurrent.FutureTask.run()V java.base@11.0.6 (123 bytes) @ 0x00007f19f0c7b414 [0x00007f19f0c7ad20+0x00000000000006f4]
      J 4802 c1 java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; java.base@11.0.6 (14 bytes) @ 0x00007f19f0c87214 [0x00007f19f0c870e0+0x0000000000000134]
      J 4791 c1 java.util.concurrent.FutureTask.run()V java.base@11.0.6 (123 bytes) @ 0x00007f19f0c7b414 [0x00007f19f0c7ad20+0x00000000000006f4]
      J 4954 c1 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V java.base@11.0.6 (57 bytes) @ 0x00007f19f0cfe10c [0x00007f19f0cfde40+0x00000000000002cc]
      
      

      Attachments

        Issue Links

          Activity

            People

              hanishakoneru Hanisha Koneru
              hanishakoneru Hanisha Koneru
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: