Apache Ozone / HDDS-7925

Deadlocks among all OMs in OM HA


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: OM HA
    • Labels: None
    • Environment: Configuration: FSO enabled, OM HA, SCM HA

    Description

      Overview

      • All S3 requests timed out and received no responses.
      • No failover to the follower OMs occurred; the deadlock pinned all OMs.
      • Because of the deadlock, none of the OM APIs could serve any responses.
      • Switching to non-HA does not reproduce the deadlock.

      Environment

      • OM x3 with HA: nodeA, nodeB, nodeC
      • SCM x3 with HA: nodeA, nodeB, nodeC
      • DN x36: node[0:35]
      • S3G x36: node[0:35]

      Problem Details

      All requests in the IPC Server handlers were stuck in a wait state, either waiting to acquire a bucket lock for key reads/writes or waiting for a response from OM Ratis. However, which component holds the lock is unclear, because the locking is spread across all OMs.

      The attached archive includes stack traces from each OM when the deadlock occurred. The stack traces were taken ten times per OM at one-second intervals (the one-liner used is attached in a comment below). The OM nodes are assigned the following roles:

      • Leader: nodeB
      • Follower: nodeA, nodeC

      In the log file (nodeB.jstack.log.2.0), the jstack result from nodeB (the OM leader) shows two kinds of waiting threads. We inspected some of the threads waiting for a bucket lock with the jdb debugger, and all of them are waiting for the lock on the same bucket to be released. However, who acquired that lock is unclear.

      Case #1: Waiting to acquire a bucket lock

      "IPC Server handler 33 on default port 9862" #119 daemon prio=5 os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition  [0x00007f60d2214000]
      ...
          at org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
          at org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
          at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
          at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
          at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
          at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
          at org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
          at org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown Source)
          at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
          at org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
          at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
      ...
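
      As a point of reference, the following self-contained Java sketch (hypothetical names; it is not the actual OzoneManagerLock implementation) models the per-bucket read/write lock pattern visible in this stack: a read path takes the bucket read lock inside an acquire/try/finally block, and a writer that is never released leaves every later reader parked in "waiting on condition", exactly like the handlers above.

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.locks.ReadWriteLock;
      import java.util.concurrent.locks.ReentrantReadWriteLock;
      import java.util.function.Supplier;

      // Simplified, hypothetical model of a per-bucket read/write lock table.
      // It is NOT the real OzoneManagerLock; names and structure are illustrative only.
      public class BucketLockSketch {
        private final Map<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

        private ReadWriteLock lockFor(String volume, String bucket) {
          return locks.computeIfAbsent(volume + "/" + bucket, k -> new ReentrantReadWriteLock());
        }

        // Mirrors the acquire/try/finally/release shape of the read path in Case #1:
        // the bucket read lock is taken before the bucket info lookup runs.
        public <T> T withReadLock(String volume, String bucket, Supplier<T> action) {
          ReadWriteLock lock = lockFor(volume, bucket);
          lock.readLock().lock();       // parks here if a writer holds the same bucket lock
          try {
            return action.get();
          } finally {
            lock.readLock().unlock();   // a release that never happens leaves readers stuck forever
          }
        }

        public static void main(String[] args) throws InterruptedException {
          BucketLockSketch sketch = new BucketLockSketch();
          // Simulate a write lock that is never released: every later reader blocks,
          // matching the "waiting on condition" IPC handlers in the jstack output.
          sketch.lockFor("vol1", "bucket1").writeLock().lock();
          Thread reader = new Thread(() ->
              sketch.withReadLock("vol1", "bucket1", () -> "bucketInfo"));
          reader.setDaemon(true);
          reader.start();
          reader.join(1000);
          System.out.println("reader still blocked on bucket lock: " + reader.isAlive());
        }
      }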

      Case #2: Waiting for Ratis

      "IPC Server handler 35 on default port 9862" #121 daemon prio=5 os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition  [0x00007f60d2014000]
      ...
          at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
          at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
          at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
          at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
      ...
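
      For completeness, here is a minimal, hypothetical sketch of the second wait shape (not the real OzoneManagerRatisServer code): the handler thread blocks on the reply future from the Ratis pipeline, so if log application never progresses, the thread stays parked in "waiting on condition" just like handler 35 above.

      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      // Hypothetical stand-in for the submit-and-wait path in Case #2; the real code
      // lives in OzoneManagerRatisServer, and this only models its blocking behavior.
      public class RatisSubmitSketch {
        // Pretend submission to the Ratis pipeline; in the deadlock scenario the
        // state machine never applies the entry, so the reply future never completes.
        static CompletableFuture<String> submitToRatis(String request) {
          return new CompletableFuture<>();
        }

        public static void main(String[] args) throws Exception {
          CompletableFuture<String> reply = submitToRatis("CreateKey");
          try {
            // The IPC handler effectively performs a blocking wait like this one;
            // with no completion it shows up as "waiting on condition" in jstack.
            reply.get(2, TimeUnit.SECONDS);
          } catch (TimeoutException e) {
            System.out.println("handler would still be parked waiting for the Ratis reply");
          }
        }
      }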

      Workaround

      We found two quick workarounds:

      • Reboot all OMs in an appropriate order: when the deadlock occurs, rebooting all OMs, starting with the old OM leader, clears the timeouts. However, the problem recurs within a few days.
      • Switch to non-HA: after switching to a non-HA setup, we have not hit the timeout issue again.

      Attachments

        1. HDDS-7925-logs-2023-01-27.tar.gz (452 kB, Kohei Sugihara)
        2. om-audit.2.tar.gz (2.82 MB, Kohei Sugihara)
        3. om-logs.2.tar.gz (186 kB, Kohei Sugihara)


            People

              Assignee: Duong (duongnguyen)
              Reporter: Kohei Sugihara (ksugihara)
