Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.3.0
- Fix Version/s: None
- Component/s: None
- Configuration: FSO enabled, OM HA, SCM HA
Description
Overview
- All S3 requests timed out and received no response.
- No failover to the follower OMs occurred; the deadlock affected all OMs.
- Because of the deadlock, none of the OM APIs could serve any responses.
- After switching to a non-HA setup, the deadlock does not recur.
Environment
- OM x3 with HA: nodeA, nodeB, nodeC
- SCM x3 with HA: nodeA, nodeB, nodeC
- DN x36: node[0:35]
- S3G x36: node[0:35]
Problem Details
All requests in the IPC Server handlers were stuck in a wait state, waiting either to acquire a bucket lock for key reads/writes or to get a response from OM Ratis. However, it is unclear which component holds the lock, because the lock state is spread across all OMs.
The attached archive includes stack traces from each OM taken while the deadlock was occurring. On each node, the stack traces were captured ten times at one-second intervals (the one-liner used is attached in a comment below). The OM nodes are assigned the following roles:
- Leader: nodeB
- Follower: nodeA, nodeC
The log file (nodeB.jstack.log.2.0) shows a jstack result from nodeB (the OM leader) containing two kinds of waiting threads. We inspected some of the threads waiting for a bucket lock with the jdb debugger, and all of them were waiting for the lock on the same bucket to be released. However, it is unclear which thread acquired that lock.
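To make Case #1 below concrete, here is a minimal, self-contained Java sketch of the pattern we believe we are hitting. This is illustrative only and reflects our assumption, not the actual OzoneManagerLock implementation; all class and method names in it are invented for the sketch. Once one writer holds a bucket's write lock and never releases it, every later reader of the same bucket parks inside the read-lock acquisition, matching the "waiting on condition" state in the jstack output:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only, not the real OzoneManagerLock: one read/write
// lock per bucket key; a writer that never unlocks parks every later reader.
public class BucketLockSketch {
  private final Map<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReadWriteLock lockFor(String bucketKey) {
    return locks.computeIfAbsent(bucketKey, k -> new ReentrantReadWriteLock());
  }

  public void acquireReadLock(String bucketKey) {
    // Blocks indefinitely while a writer holds the lock, like the
    // IPC Server handler threads in Case #1.
    lockFor(bucketKey).readLock().lock();
  }

  public static void main(String[] args) throws InterruptedException {
    BucketLockSketch lock = new BucketLockSketch();
    String bucket = "/vol1/bucket1";

    // Simulate the suspected bug: a writer takes the lock and never releases it.
    lock.lockFor(bucket).writeLock().lock();

    Thread reader = new Thread(
        () -> lock.acquireReadLock(bucket), // parks here forever
        "IPC Server handler (sketch)");
    reader.setDaemon(true); // let the JVM exit after demonstrating the hang
    reader.start();

    Thread.sleep(1000);
    // Prints WAITING: the same parked state the jstack output shows.
    System.out.println(reader.getName() + " state: " + reader.getState());
  }
}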
Case #1: Waiting to acquire a bucket lock
"IPC Server handler 33 on default port 9862" #119 daemon prio=5 os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition [0x00007f60d2214000]
...
at org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
at org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
at org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
at org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown Source)
at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
at org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
...
Case #2: Waiting for Ratis
"IPC Server handler 35 on default port 9862" #121 daemon prio=5 os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition [0x00007f60d2014000]
...
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
...
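Case #2 is consistent with the handler thread parking on the Ratis reply with no upper bound on the wait. Below is a minimal sketch of that pattern, under our assumption that the submission path blocks on a future; submitToRatis is a hypothetical stand-in for the real Ratis submission API. If the leader cannot make progress, an unbounded get() hangs the handler forever, while a bounded wait at least surfaces the stall:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: a handler waiting on a Ratis reply future with no
// timeout blocks forever if the reply never arrives, as in Case #2.
public class RatisWaitSketch {

  // Hypothetical stand-in for the real Ratis submission path; the future
  // never completes, simulating a leader that cannot make progress.
  static CompletableFuture<String> submitToRatis(String request) {
    return new CompletableFuture<>();
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<String> reply = submitToRatis("CreateKey");

    // reply.get() with no timeout would reproduce the hang: the calling
    // thread parks until the future completes, i.e. forever here.
    try {
      String response = reply.get(3, TimeUnit.SECONDS);
      System.out.println("reply: " + response);
    } catch (TimeoutException e) {
      System.out.println("no reply from Ratis within 3s; an unbounded wait would have hung");
    }
  }
}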
Workaround
We found two quick fixes:
- Reboot all OMs in the appropriate order: when the OMs are stuck, rebooting all of them, starting with the old OM leader, recovers from the timeouts. However, the deadlock recurs within a few days.
- Switch to non-HA: after switching to a non-HA setup, we have not encountered the timeout issue.