  1. Apache Ozone
  2. HDDS-6667

Recon can crash if processing a container report after installing an OM snapshot


Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Resolved
    • None
    • 1.2.0
    • Ozone Recon
    • None

    Description

      Two threads access Recon's RocksDB instance: one applies updates based on the OM DB state (ContainerKeyMapperTask), and the other applies updates based on container reports (ReconContainerReportHandler). When ContainerKeyMapperTask updates from a snapshot, it must account for keys that may have been deleted. The snapshot alone does not provide this information, so the task clears its existing container -> key mappings and rebuilds them from scratch. It does this by calling ContainerDBServiceProvider#initNewContainerDB, which deletes the whole Recon DB from disk and creates a new one. This leads to the following problem:

      1. ContainerKeyMapperTask#reprocess is called to do a snapshot based update from OM.
      2. ContainerKeyMapperTask deletes and recreates the Recon DB.
      3. Recon receives and processes a container report. When the report handler accesses the DB, it may be using a stale handle to the old DB, or it may hit the DB in the window between its deletion and re-creation.
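
      The race in the steps above can be modelled with a small, self-contained sketch. Everything in it is hypothetical: FakeDbHandle stands in for a RocksDB handle that must not be used after it is closed, and GuardedDb shows one possible way to keep readers and a rebuild from overlapping, using a read-write lock. It is not Recon's code and not necessarily the fix that was applied for this issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a DB handle that is invalid once closed. */
final class FakeDbHandle {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    private volatile boolean closed;

    String get(String key) {
        // A real native handle could crash the JVM here; the model throws instead.
        if (closed) {
            throw new IllegalStateException("stale handle");
        }
        return data.get(key);
    }

    void put(String key, String value) {
        if (closed) {
            throw new IllegalStateException("stale handle");
        }
        data.put(key, value);
    }

    void close() {
        closed = true;
    }
}

/** Hypothetical wrapper: normal operations share a lock, a rebuild is exclusive. */
final class GuardedDb {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private FakeDbHandle handle = new FakeDbHandle();

    /** Container-report path: may run concurrently with other reads and writes. */
    String get(String key) {
        lock.readLock().lock();
        try {
            return handle.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    void put(String key, String value) {
        lock.readLock().lock();
        try {
            handle.put(key, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Snapshot path: drop the old DB and create a new one with no other access in flight. */
    void reinit() {
        lock.writeLock().lock();
        try {
            handle.close();
            handle = new FakeDbHandle();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

      In this model, callers never hold a handle of their own: they always go through GuardedDb, so they cannot keep a reference to the old DB across a reinit() call, and no access can land between the close and the re-creation.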

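      This scenario caused a RocksDB crash in Recon. The crash dump below shows the failing call stack, which starts in the container report handler and ends in a native RocksDB get: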

      C  [librocksdbjni4235643658444878552.so+0x242ea2]  Java_org_rocksdb_RocksDB_get__J_3BIIJ+0x62
      J 7320  org.rocksdb.RocksDB.get(J[BIIJ)[B (0 bytes) @ 0x00007f461e4ff36d [0x00007f461e4ff280+0xed]
      J 13283 C2 org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(Ljava/lang/Object;)Ljava/lang/Object; (36 bytes) @ 0x00007f461f32b730 [0x00007f461f32b420+0x310]
      J 13545 C2 org.apache.hadoop.ozone.recon.spi.impl.ContainerDBServiceProviderImpl.getContainerReplicaHistory(Ljava/lang/Long;)Ljava/util/Map; (90 bytes) @ 0x00007f461e8d77ac [0x00007f461e8d7440+0x36c]
      J 8503 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerManager.upsertContainerHistory(JLjava/util/UUID;JJ)V (111 bytes) @ 0x00007f461e8bf3c4 [0x00007f461e8bf2e0+0xe4]
      J 11064 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerManager.removeContainerReplica(Lorg/apache/hadoop/hdds/scm/container/ContainerID;Lorg/apache/hadoop/hdds/scm/container/ContainerReplica;)V (97 bytes) @ 0x00007f461ef420f8 [0x00007f461ef41a40+0x6b8]
      J 13568 C2 org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processMissingReplicas(Lorg/apache/hadoop/hdds/protocol/DatanodeDetails;Ljava/util/Set;)V (93 bytes) @ 0x00007f461e6afa68 [0x00007f461e6aeec0+0xba8]
      J 16028 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerReportHandler.onMessage(Ljava/lang/Object;Lorg/apache/hadoop/hdds/server/events/EventPublisher;)V (10 bytes) @ 0x00007f461f936188 [0x00007f461f9348c0+0x18c8]
      J 13493 C2 org.apache.hadoop.hdds.server.events.SingleThreadExecutor$$Lambda$313.run()V (20 bytes) @ 0x00007f461e390f5c [0x00007f461e390ec0+0x9c]
      J 17137% C2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f461fc093e4 [0x00007f461fc090e0+0x304]
      

            People

              erose Ethan Rose
              erose Ethan Rose