Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
2.3.5, 2.4.4, 3.0.0-alpha-2
-
None
-
None
-
Reviewed
Description
In Remote Procedure dispatcher, before submitting (sub)Procedure to Thread pool, we enqueue it as DelayedWithTimeout object on DelayQueue.
TimeoutExecutorThread keeps dequeuing elements from this DelayQueue and submit the Procedure to the threadpool. The expiration of DelayedWithTimeout is determined by getDelay(TimeUnit):
@Override public long getDelay(final TimeUnit unit) { return DelayedUtil.getRemainingTime(unit, getTimeout()); }
/** * @return Time remaining as milliseconds. */ public static long getRemainingTime(final TimeUnit resultUnit, final long timeout) { final long currentTime = EnvironmentEdgeManager.currentTime(); if (currentTime >= timeout) { return 0; } return resultUnit.convert(timeout - currentTime, TimeUnit.MILLISECONDS); }
Hence, in order for the elements to get dequeued on time, it is necessary that EnvironmentEdgeManager.currentTime() returns the current time in millis.
As part of unit test, if we use our own custom EnvironmentEdge and inject it using EnvironmentEdgeManager.injectEdge before creating any tables, it is possible that we continue returning same value (based on custom impl) with EnvironmentEdgeManager.currentTime(). If that is the case, getRemainingTime as mentioned above, will never return 0 and hence, the procedure wrapped in DelayedWithTimeout might never be dequeued from DelayQueue because it's delay will not expire.
As of today, our system goes in hanging state while waiting for table regions to be available (as mentioned above, DelayedWithTimeout object never gets dequeued from DelayQueue).
Thread dump might show something like this consistently:
"ProcedureDispatcherTimeoutThread" #319 daemon prio=5 os_prio=31 tid=0x00007fcaf0cae800 nid=0x21d03 waiting on condition [0x0000700019293000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007225a0090> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.DelayQueue.take(DelayQueue.java:223) at org.apache.hadoop.hbase.procedure2.util.DelayedUtil.takeWithoutInterrupt(DelayedUtil.java:82) at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$TimeoutExecutorThread.run(RemoteProcedureDispatcher.java:314) Locked ownable synchronizers: - None
Although running into situation like this is not likely possible unless custom EnvironmentEdge is used as mentioned above, we should improve our dequeue logic as well as log important message to show where we are stuck.