Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Resolved
Description
Two threads access Recon's RocksDB instance: one applies updates based on the OM DB state (ContainerKeyMapperTask), and the other applies updates based on container reports (ReconContainerReportHandler). When ContainerKeyMapperTask updates from a snapshot, it must account for keys that may have been deleted. The snapshot alone does not provide this information, so the task clears its existing container -> key mappings and rebuilds them from scratch. It does this by calling ContainerDBServiceProvider#initNewContainerDB, which deletes the whole Recon DB from disk and creates a new one. This leads to the following problem:
1. ContainerKeyMapperTask#reprocess is called to do a snapshot based update from OM.
2. ContainerKeyMapperTask deletes and recreates the Recon DB.
3. Recon receives and processes a container report. When it needs to update the DB, it may use a stale handle to the old DB, or it may try to access the DB after it has been deleted but before it has been recreated.
This scenario caused a RocksDB crash in Recon, as shown in the following dump:
C [librocksdbjni4235643658444878552.so+0x242ea2] Java_org_rocksdb_RocksDB_get__J_3BIIJ+0x62
J 7320 org.rocksdb.RocksDB.get(J[BIIJ)[B (0 bytes) @ 0x00007f461e4ff36d [0x00007f461e4ff280+0xed]
J 13283 C2 org.apache.hadoop.hdds.utils.db.TypedTable.getFromTable(Ljava/lang/Object;)Ljava/lang/Object; (36 bytes) @ 0x00007f461f32b730 [0x00007f461f32b420+0x310]
J 13545 C2 org.apache.hadoop.ozone.recon.spi.impl.ContainerDBServiceProviderImpl.getContainerReplicaHistory(Ljava/lang/Long;)Ljava/util/Map; (90 bytes) @ 0x00007f461e8d77ac [0x00007f461e8d7440+0x36c]
J 8503 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerManager.upsertContainerHistory(JLjava/util/UUID;JJ)V (111 bytes) @ 0x00007f461e8bf3c4 [0x00007f461e8bf2e0+0xe4]
J 11064 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerManager.removeContainerReplica(Lorg/apache/hadoop/hdds/scm/container/ContainerID;Lorg/apache/hadoop/hdds/scm/container/ContainerReplica;)V (97 bytes) @ 0x00007f461ef420f8 [0x00007f461ef41a40+0x6b8]
J 13568 C2 org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processMissingReplicas(Lorg/apache/hadoop/hdds/protocol/DatanodeDetails;Ljava/util/Set;)V (93 bytes) @ 0x00007f461e6afa68 [0x00007f461e6aeec0+0xba8]
J 16028 C2 org.apache.hadoop.ozone.recon.scm.ReconContainerReportHandler.onMessage(Ljava/lang/Object;Lorg/apache/hadoop/hdds/server/events/EventPublisher;)V (10 bytes) @ 0x00007f461f936188 [0x00007f461f9348c0+0x18c8]
J 13493 C2 org.apache.hadoop.hdds.server.events.SingleThreadExecutor$$Lambda$313.run()V (20 bytes) @ 0x00007f461e390f5c [0x00007f461e390ec0+0x9c]
J 17137% C2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f461fc093e4 [0x00007f461fc090e0+0x304]
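One way to close the window described in steps 1-3 is to make the handle swap exclusive with respect to ordinary reads and writes. The following is a minimal sketch of that idea only, not the change that was made in Recon. GuardedContainerDb, reinitialize, and the in-memory map standing in for the RocksDB handle are all hypothetical names introduced for illustration. Ordinary operations take the shared side of a ReentrantReadWriteLock, while the delete-and-recreate step takes the exclusive side, so a container report can never observe a handle that is being replaced.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical stand-in for a DB provider whose underlying store can be
 * dropped and recreated. The map plays the role of the RocksDB handle;
 * none of these names are Recon's real API.
 */
class GuardedContainerDb {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Stand-in for the DB handle. Replaced wholesale by reinitialize().
  private Map<Long, String> handle = new ConcurrentHashMap<>();

  /** Ordinary read (e.g. from a report handler): shared lock. */
  String get(long containerId) {
    lock.readLock().lock();
    try {
      return handle.get(containerId);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Ordinary write: also shared, since the store itself is thread-safe. */
  void put(long containerId, String value) {
    lock.readLock().lock();
    try {
      handle.put(containerId, value);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Delete-and-recreate step: exclusive, so no reader sees a stale handle. */
  void reinitialize() {
    lock.writeLock().lock();
    try {
      handle = new ConcurrentHashMap<>();
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    GuardedContainerDb db = new GuardedContainerDb();

    // Plays the role of the container report thread.
    Thread reportHandler = new Thread(() -> {
      for (long i = 0; i < 10_000; i++) {
        db.put(i, "replica");
        db.get(i);
      }
    });

    // Plays the role of the snapshot-based rebuild thread.
    Thread mapperTask = new Thread(() -> {
      for (int i = 0; i < 100; i++) {
        db.reinitialize();
      }
    });

    reportHandler.start();
    mapperTask.start();
    reportHandler.join();
    mapperTask.join();
    System.out.println("both threads finished");
  }
}
```

The sketch protects only the handle swap. Data written by the report thread before a swap is still discarded, so an approach that clears just the container -> key mappings instead of the whole DB may be preferable in practice.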
Issue Links
- is fixed by: HDDS-5332 Add a new column family and a service provider in Recon DB for Namespace Summaries (Resolved)