HBASE-17341: Add a timeout during replication endpoint termination

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4, 2.0.0
    • Fix Version/s: 1.3.0, 1.2.5, 0.98.24, 1.1.9, 2.0.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      In ReplicationSource#terminate(), a Future is obtained from ReplicationEndpoint#stop(), and Future.get() is then called on it with no timeout. If something goes wrong inside the endpoint's stop(), that call can block indefinitely.

      A hang there has serious implications, because the blocked thread may be the ZK event thread (e.g. a watcher calls ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() -> blocked). While it is blocked, no other events in the ZK event queue are processed, which for HBase means that other ZK watches, such as replication watch notifications, snapshot watch notifications, and even RegionServer shutdown, are all blocked as well.
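
The stall described above can be reproduced in miniature without ZooKeeper. The sketch below is only an illustration: a single-threaded executor stands in for the ZK event thread, and the class and method names (EventThreadStall, secondEventRuns) are hypothetical and not part of HBase. The first "event" blocks the way an un-timed Future.get() would, so the second queued event is not delivered within the wait window.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EventThreadStall {

    /**
     * Queues two events on a single event thread. The first one blocks forever.
     * Returns true only if the second event ran within waitMs.
     */
    public static boolean secondEventRuns(long waitMs) {
        // Stand-in for the single ZK event thread (an assumption for illustration).
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        CountDownLatch neverReleased = new CountDownLatch(1);
        CountDownLatch secondEventRan = new CountDownLatch(1);
        try {
            // First event: blocks like a Future.get() with no timeout.
            eventThread.execute(() -> {
                try {
                    neverReleased.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Second event: queued behind the first on the same thread.
            eventThread.execute(secondEventRan::countDown);
            return secondEventRan.await(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupt the blocked callback so the demo can exit.
            eventThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("second event ran: " + secondEventRuns(200));
    }
}
```

Because both callbacks share one thread, the second one cannot start until the first returns, which is the same reason a blocked terminate() holds up every later watch notification.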

      The short-term fix addressed here is simply to add a timeout to Future.get(). The severity of the consequences, however, suggests that a broader refactoring of ZKWatcher usage in HBase may be in order, to protect against situations like this.
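
As a rough illustration of the short-term fix, the sketch below bounds the wait on a stop future using the standard Future.get(long, TimeUnit) overload. It is not the attached patch: the class name BoundedTerminate, the helper awaitStop, and the timeout values are assumptions made for the example, and a plain CompletableFuture stands in for the future returned by the endpoint's stop().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedTerminate {

    /**
     * Waits for stopFuture for at most timeoutMs milliseconds.
     * Returns true if it completed normally, false on timeout, failure, or interrupt.
     */
    public static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Stop did not finish in time: give up waiting so the caller is released.
            return false;
        } catch (ExecutionException e) {
            // Stop itself failed; the caller can log this and move on.
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Future<Void> finished = CompletableFuture.completedFuture(null);
        Future<Void> stuck = new CompletableFuture<>(); // never completed

        System.out.println("finished stop: " + awaitStop(finished, 100));
        System.out.println("stuck stop:    " + awaitStop(stuck, 100));
    }
}
```

With a bounded wait, a misbehaving endpoint costs the calling thread at most the timeout rather than holding it indefinitely. What to do after a timeout (log, retry, or abandon the endpoint) is a separate design decision that this sketch does not address.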

        Attachments

        1. HBASE-17341.branch-1.1.v1.patch
          5 kB
          Vincent Poon
        2. HBASE-17341.branch-1.1.v2.patch
          5 kB
          Vincent Poon
        3. HBASE-17341.master.v1.patch
          5 kB
          Vincent Poon
        4. HBASE-17341.master.v2.patch
          5 kB
          Vincent Poon

            People

            • Assignee: Vincent Poon (vincentpoon)
            • Reporter: Vincent Poon (vincentpoon)
