Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.3.0
- Fix Version/s: None
- Component/s: None
- Configuration: FSO enabled, OM HA, SCM HA
Description
Overview
- All S3 requests timed out and received no response.
- No failover to the follower OMs occurred; the deadlock affected all OMs.
- Because of the deadlock, none of the OM APIs could serve any responses.
- After switching to a non-HA setup, the deadlock does not recur.
Environment
- OM x3 with HA: nodeA, nodeB, nodeC
- SCM x3 with HA: nodeA, nodeB, nodeC
- DN x36: node[0:35]
- S3G x36: node[0:35]
Problem Details
All requests in the IPC Server handlers were stuck in a wait state, waiting either to acquire a bucket lock for key reads/writes or to get a response from OM Ratis. However, it is unclear which component holds the lock, because the lock state is spread across all OMs.
The attached archive includes stack traces from each OM taken while the deadlock was occurring. On each node, the stack traces were captured ten times at one-second intervals (the one-liner used is attached in a comment below). The OM nodes are assigned the following roles:
- Leader: nodeB
- Follower: nodeA, nodeC
The log file (nodeB.jstack.log.2.0) shows a jstack result from nodeB (the OM leader) containing two kinds of waiting threads. We inspected some of the threads waiting for a bucket lock with the jdb debugger, and all of them were waiting for the lock on the same bucket to be released. However, it is unclear which thread acquired that lock.
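To make Case #1 below concrete, here is a minimal, self-contained Java sketch of the pattern we believe we are hitting. This is illustrative only and reflects our assumption, not the actual OzoneManagerLock implementation; all class and method names in it are invented for the sketch. Once one writer holds a bucket's write lock and never releases it, every later reader of the same bucket parks inside the read-lock acquisition, matching the "waiting on condition" state in the jstack output:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only, not the real OzoneManagerLock: one read/write
// lock per bucket key; a writer that never unlocks parks every later reader.
public class BucketLockSketch {
  private final Map<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReadWriteLock lockFor(String bucketKey) {
    return locks.computeIfAbsent(bucketKey, k -> new ReentrantReadWriteLock());
  }

  public void acquireReadLock(String bucketKey) {
    // Blocks indefinitely while a writer holds the lock, like the
    // IPC Server handler threads in Case #1.
    lockFor(bucketKey).readLock().lock();
  }

  public static void main(String[] args) throws InterruptedException {
    BucketLockSketch lock = new BucketLockSketch();
    String bucket = "/vol1/bucket1";

    // Simulate the suspected bug: a writer takes the lock and never releases it.
    lock.lockFor(bucket).writeLock().lock();

    Thread reader = new Thread(
        () -> lock.acquireReadLock(bucket), // parks here forever
        "IPC Server handler (sketch)");
    reader.setDaemon(true); // let the JVM exit after demonstrating the hang
    reader.start();

    Thread.sleep(1000);
    // Prints WAITING: the same parked state the jstack output shows.
    System.out.println(reader.getName() + " state: " + reader.getState());
  }
}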
Case #1: Waiting to acquire a bucket lock
"IPC Server handler 33 on default port 9862" #119 daemon prio=5 os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition [0x00007f60d2214000]
...
at org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
at org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
at org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
at org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown Source)
at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
at org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
...
Case #2: Waiting for Ratis
"IPC Server handler 35 on default port 9862" #121 daemon prio=5 os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition [0x00007f60d2014000]
...
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
...
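Case #2 is consistent with the handler thread parking on the Ratis reply with no upper bound on the wait. Below is a minimal sketch of that pattern, under our assumption that the submission path blocks on a future; submitToRatis is a hypothetical stand-in for the real Ratis submission API. If the leader cannot make progress, an unbounded get() hangs the handler forever, while a bounded wait at least surfaces the stall:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: a handler waiting on a Ratis reply future with no
// timeout blocks forever if the reply never arrives, as in Case #2.
public class RatisWaitSketch {

  // Hypothetical stand-in for the real Ratis submission path; the future
  // never completes, simulating a leader that cannot make progress.
  static CompletableFuture<String> submitToRatis(String request) {
    return new CompletableFuture<>();
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<String> reply = submitToRatis("CreateKey");

    // reply.get() with no timeout would reproduce the hang: the calling
    // thread parks until the future completes, i.e. forever here.
    try {
      String response = reply.get(3, TimeUnit.SECONDS);
      System.out.println("reply: " + response);
    } catch (TimeoutException e) {
      System.out.println("no reply from Ratis within 3s; an unbounded wait would have hung");
    }
  }
}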
Workaround
We found two quick fixes:
- Reboot all OMs in the appropriate order: when the OMs are stuck, rebooting all of them, starting with the old OM leader, recovers from the timeouts. However, the deadlock recurs within a few days.
- Switch to non-HA: after switching to a non-HA setup, we have not encountered the timeout issue.