  Hive / HIVE-25663

Need to modify table/partition lock acquisition retry for Zookeeper lock manager


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: Locking

    Description

      LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE;
      SET hive.query.timeout.seconds=5;
      SELECT * FROM default.my_table WHERE log_date='2021-10-30' LIMIT 10;
      

      If you execute the three SQLs above in the same session, the last SELECT is cancelled with a timeout error. The problem is that when you execute 'show locks', you will see a SHARED lock on default.my_table which remains for 100 minutes, if you are using ZooKeeperHiveLockManager.

      Let me explain the problem step by step.

       

      The SELECT SQL, which reads some data from a partitioned table,

      SELECT * FROM my_table WHERE log_date='2021-10-30' LIMIT 10

      needs to acquire two SHARED locks, in order:

      • default.my_table
      • default.my_table@log_date=2021-10-30

      Before executing the SQL, an EXCLUSIVE lock on the partition already exists. We can simulate it easily with a DDL like the one below:

      LOCK TABLE default.my_table PARTITION (log_date='2021-10-30') EXCLUSIVE

       

      The SELECT SQL acquires the SHARED lock on the table, but it cannot acquire the SHARED lock on the partition. It keeps retrying the acquisition as specified by two configurations. With the default values, it retries for up to 100 minutes.

      • hive.lock.sleep.between.retries=60s
      • hive.lock.numretries=100
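
      With these defaults, the worst-case wait is simply the product of the two settings. A minimal sketch of that arithmetic (hypothetical helper class, not actual Hive code):

```java
public class LockRetryBudget {
    // Hypothetical helper mirroring the two configurations above:
    // in the worst case every attempt fails and the lock manager
    // sleeps between all of the retries.
    static long totalRetryMillis(long sleepBetweenRetriesMillis, int numRetries) {
        return sleepBetweenRetriesMillis * numRetries;
    }

    public static void main(String[] args) {
        long sleepMillis = 60_000L; // hive.lock.sleep.between.retries=60s
        int numRetries = 100;       // hive.lock.numretries=100
        // 100 retries x 60s = 6,000 seconds = 100 minutes
        System.out.println(totalRetryMillis(sleepMillis, numRetries) / 60_000L + " minutes");
    }
}
```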

       

      If query.timeout is set to 5 seconds, the SELECT SQL is cancelled 5 seconds later and the client returns with a timeout error. But the SHARED lock on my_table still remains for 100 minutes, because the current ZooKeeperHiveLockManager just logs the InterruptedException and goes on retrying the lock. This also means the SQL processing thread keeps working for 100 minutes even though the SQL was cancelled. If the same SQL is executed 3 times, you can see 3 threads, each with a thread dump like the one below:

      "HiveServer2-Background-Pool: Thread-154" #154 prio=5 os_prio=0 tid=0x00007f0ac91cb000 nid=0x13d25 waiting on condition [0x00007f0aa2ce2000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
       at java.lang.Thread.sleep(Native Method)
       at org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:303)
       at org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager.lock(ZooKeeperHiveLockManager.java:207)
       at org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager.acquireLocks(DummyTxnManager.java:199)
       at org.apache.hadoop.hive.ql.Driver.acquireLocks(Driver.java:1610)
       at org.apache.hadoop.hive.ql.Driver.lockAndRespond(Driver.java:1796)
       at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1966)
       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1710)
       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1704)
       at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
       at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:217)
       at org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:87)
       at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:309)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
       at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:322)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)

       

      I think ZooKeeperHiveLockManager should not swallow unexpected exceptions such as InterruptedException.

      It should retry only for expected ones.
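
      A hypothetical sketch of the proposed behavior (simulated lock attempt, not the actual ZooKeeperHiveLockManager code): retry only on expected lock contention, and let InterruptedException propagate so a cancelled query stops retrying immediately.

```java
public class InterruptAwareRetry {

    /** Thrown by the (simulated) lock attempt when the lock is held by someone else. */
    static class LockBusyException extends Exception {}

    interface LockAttempt {
        /** Returns true when the lock is acquired. */
        boolean tryLock() throws LockBusyException;
    }

    static boolean lockWithRetries(LockAttempt attempt, int numRetries, long sleepMillis)
            throws InterruptedException {
        for (int i = 0; i <= numRetries; i++) {
            try {
                if (attempt.tryLock()) {
                    return true;
                }
            } catch (LockBusyException expected) {
                // Expected contention: fall through and retry.
            }
            if (i < numRetries) {
                // Propagate InterruptedException instead of swallowing it, so a
                // cancelled query exits the retry loop and can release the locks
                // it already holds.
                Thread.sleep(sleepMillis);
            }
        }
        return false; // retries exhausted without acquiring the lock
    }
}
```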

      Attachments

        1. image-2021-10-30-11-54-42-164.png
          144 kB
          Eugene Chung


            People

              Assignee: Eugene Chung (euigeun_chung)
              Reporter: Eugene Chung (euigeun_chung)
              Votes: 0
              Watchers: 1


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h