Details
-
Bug
-
Status: Closed
-
Trivial
-
Resolution: Fixed
-
1.0.0-beta1
Description
If an exception is thrown in org.apache.hudi.client.transaction.lock.LockManager#lock it is possible for lockDurationTimer in HoodieLockMetrics to be closed by the user before it is started, which throws and bubbles up a `HoodieException("Timer was not started")` exception (rather than the actual exception that occurred when trying to acquire lock). Specifically, this can happen due to following scenario
- The BaseHoodieTableServiceClient calls `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering` , which in turn calls `org.apache.hudi.client.transaction.TransactionManager#beginTransaction` within a `try` block
- During `org.apache.hudi.client.transaction.TransactionManager#beginTransaction` the LockManager lock API `org.apache.hudi.client.transaction.lock.LockManager#lock` is called
- Inside ``org.apache.hudi.client.transaction.lock.LockManager#lock` , the `java.util.concurrent.locks.Lock#tryLock(long, java.util.concurrent.TimeUnit)` throws some exception. Because of this exception, the statement
`metrics.updateLockAcquiredMetric();` is not executed. This means that the `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called for the timer HoodieLockMetrics member variable `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer` - The exception in (3) bubbles up back to `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering`. Since this is in a `try` block, the `catch` and `finally` blocks are executed. When `finally` is executed though, `org.apache.hudi.client.transaction.TransactionManager#endTransaction` is called
- During `org.apache.hudi.client.transaction.TransactionManager#endTransaction` the LockManager unlock API `org.apache.hudi.client.transaction.lock.LockManager#unlock` is called. During the execution of `metrics.updateLockHeldTimerMetrics();` , The method `org.apache.hudi.common.util.HoodieTimer#endTimer` is called for `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer` . This throws an exception ` `HoodieException("Timer was not started")` This is because the corresponding `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called
The issue here is that the caller ('BaseHoodieTableServiceClient` in this case) should have ended up re-throwing the exception thrown in (3) while trying to start the transaction in `org.apache.hudi.client.transaction.TransactionManager#startTransaction`. Instead though, because the caller safely "cleaned up" by calling `org.apache.hudi.client.transaction.TransactionManager#endTransaction` (in a `finally`), the `HoodieException("Timer was not started")` exception was raised instead, suppressing the exception from (3), which is the actual root cause issue. Instead, the execution of `org.apache.hudi.client.transaction.TransactionManager#endTransaction` should have executed without throwing this additional exception, which would have lead the caller to throw the exception in (3) before exiting.
Although resolving this would not prevent the overall operation from failing, it would provide better observability on the actual root cause exception (the one from (3)).
Attachments
Issue Links
- links to