[HBASE-21796] RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: Zookeeper
Labels: None

Hadoop Flags: Reviewed
Release Note:

Introduces retry logic when observing the AUTH_FAILED state from ZooKeeper. The number of retries can be controlled by "hbase.zookeeper.authfailed.retries.number" (default 15) and the pause between retries controlled by "hbase.zookeeper.authfailed.pause" (default 100ms).
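
As an illustration of the two settings named in the release note, an hbase-site.xml fragment might look like the following. The property names and defaults come from the release note. That the pause is written as a plain number of milliseconds is an assumption, not something the note states.

```xml
<!-- Sketch only: values shown are the defaults quoted in the release note. -->
<property>
  <name>hbase.zookeeper.authfailed.retries.number</name>
  <value>15</value>
</property>
<property>
  <!-- Assumed to be expressed in milliseconds -->
  <name>hbase.zookeeper.authfailed.pause</name>
  <value>100</value>
</property>
```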

Description

We've observed a situation inside a RegionServer in which an HConnection is left in a broken state after its ZooKeeper client receives AUTH_FAILED, in this case in the Phoenix secondary indexing code path. The HConnection used to write the secondary index updates failed every time the client re-attempted the write, yet the HConnection instance gave no outward sign that anything was wrong with it.

The ZooKeeper programmer docs tell us that if a ZooKeeper instance goes to the AUTH_FAILED state, we must open a new ZooKeeper instance: https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions

When a new HConnection (or one without a cached meta location) tries to access ZooKeeper to find meta's location or the cluster ID, it spins indefinitely: we can never reach ZooKeeper because our client is permanently broken by the AUTH_FAILED state. For the Phoenix use case (where we're trying to use this HConnection within the RS), this breaks things pretty fast.

The circumstances that caused us to observe this are not an HBase (or Phoenix or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of networking issues on a user's system. Despite this, we can make our handling of this situation better.

We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper object when we need one (e.g. session expired/closed). We can extend this same logic to also re-create the ZK client object if we observe an AUTH_FAILED state.
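
A minimal sketch of that idea follows. It is not the code from the attached patches. `AuthFailedRetrySketch`, `Client`, `State`, and `failingFirst` are hypothetical names, and `Client` is a stand-in for the real ZooKeeper handle so the example runs without the ZooKeeper library. (The real client reports its state through `ZooKeeper.getState()`, whose `ZooKeeper.States` enum includes `AUTH_FAILED`.) The retry count and pause mirror the two settings from the release note.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from the HBase patch. */
final class AuthFailedRetrySketch {

    /** Simplified stand-in for the client states that matter here. */
    enum State { CONNECTED, AUTH_FAILED }

    /** Minimal stand-in for a ZooKeeper handle. */
    interface Client {
        State getState();
        void close();
    }

    private final Supplier<Client> factory;
    private final int maxRetries;    // cf. hbase.zookeeper.authfailed.retries.number
    private final long pauseMillis;  // cf. hbase.zookeeper.authfailed.pause
    private Client client;

    AuthFailedRetrySketch(Supplier<Client> factory, int maxRetries, long pauseMillis) {
        this.factory = factory;
        this.maxRetries = maxRetries;
        this.pauseMillis = pauseMillis;
    }

    /**
     * Returns a usable client. A client in AUTH_FAILED can never recover, so it is
     * closed and replaced with a new one, at most maxRetries times.
     */
    synchronized Client checkClient() {
        if (client == null) {
            client = factory.get();
        }
        int attempts = 0;
        while (client.getState() == State.AUTH_FAILED) {
            if (attempts >= maxRetries) {
                throw new IllegalStateException(
                    "Client still in AUTH_FAILED after " + maxRetries + " retries");
            }
            attempts++;
            client.close();
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                throw new IllegalStateException("Interrupted while pausing before retry", e);
            }
            client = factory.get();
        }
        return client;
    }

    /** Test helper: the first `failures` clients report AUTH_FAILED, later ones CONNECTED. */
    static Supplier<Client> failingFirst(int failures, int[] created) {
        return () -> {
            int n = ++created[0];
            return new Client() {
                public State getState() {
                    return n <= failures ? State.AUTH_FAILED : State.CONNECTED;
                }
                public void close() { }
            };
        };
    }

    public static void main(String[] args) {
        int[] created = {0};
        AuthFailedRetrySketch zk =
            new AuthFailedRetrySketch(failingFirst(2, created), 15, 100L);
        Client c = zk.checkClient();
        System.out.println(c.getState() + " after creating " + created[0] + " clients");
    }
}
```

Bounding the retries matters as much as re-creating the client: without the limit, a persistent authentication problem would just move the indefinite spin from the caller into the recovery loop.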

Attachments

- HBASE-21796.001.branch-1.patch (28/Jan/19 17:33, 26 kB, Josh Elser)
- HBASE-21796.002.branch-1.patch (29/Jan/19 15:54, 26 kB, Josh Elser)
- HBASE-21796.003.branch-1.patch (20/Feb/19 19:26, 36 kB, Josh Elser)
- HBASE-21796.004.branch-1.patch (22/Feb/19 23:31, 37 kB, Josh Elser)
- HBASE-21796.005.branch-1.patch (28/Feb/19 19:19, 37 kB, Josh Elser)

People

Assignee: Josh Elser

Reporter: Josh Elser

Votes: 0

Watchers: 9

Dates

Created: 28/Jan/19 17:01

Updated: 13/Mar/19 06:23

Resolved: 13/Mar/19 02:14