Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-7308

LockManager::unlock should not call updateLockHeldTimerMetrics if lockDurationTimer has not been started

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      If an exception is thrown in org.apache.hudi.client.transaction.lock.LockManager#lock it is possible for lockDurationTimer in HoodieLockMetrics to be closed by the user before it is started, which throws and bubbles up a `HoodieException("Timer was not started")`  exception (rather than the actual exception that occurred when trying to acquire lock). Specifically, this can happen due to following scenario

      1. The BaseHoodieTableServiceClient calls `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering` , which in turn calls `org.apache.hudi.client.transaction.TransactionManager#beginTransaction` within a `try` block
      2. During `org.apache.hudi.client.transaction.TransactionManager#beginTransaction` the LockManager lock API `org.apache.hudi.client.transaction.lock.LockManager#lock` is called
      3. Inside ``org.apache.hudi.client.transaction.lock.LockManager#lock` , the `java.util.concurrent.locks.Lock#tryLock(long, java.util.concurrent.TimeUnit)` throws some exception. Because of this exception, the statement
        `metrics.updateLockAcquiredMetric();` is not executed. This means that the `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called for the timer HoodieLockMetrics member variable `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer` 
      4. The exception in (3) bubbles up back to `org.apache.hudi.client.BaseHoodieTableServiceClient#completeClustering`. Since this is in a `try` block, the `catch` and `finally` blocks are executed. When `finally` is executed though, `org.apache.hudi.client.transaction.TransactionManager#endTransaction` is called
      5. During `org.apache.hudi.client.transaction.TransactionManager#endTransaction` the LockManager unlock API `org.apache.hudi.client.transaction.lock.LockManager#unlock` is called. During the  execution of `metrics.updateLockHeldTimerMetrics();`  ,  The method `org.apache.hudi.common.util.HoodieTimer#endTimer` is called for `org.apache.hudi.client.transaction.lock.metrics.HoodieLockMetrics#lockDurationTimer` . This throws an exception ` `HoodieException("Timer was not started")` This is  because the corresponding `org.apache.hudi.common.util.HoodieTimer#startTimer` method was never called

      The issue here is that the caller ('BaseHoodieTableServiceClient` in this case) should have ended up re-throwing the exception thrown in (3) while trying to start the transaction in `org.apache.hudi.client.transaction.TransactionManager#startTransaction`. Instead though, because the caller safely "cleaned up" by calling `org.apache.hudi.client.transaction.TransactionManager#endTransaction` (in a `finally`), the `HoodieException("Timer was not started")` exception was raised instead, suppressing the exception from (3), which is the actual root cause issue. Instead, the execution of `org.apache.hudi.client.transaction.TransactionManager#endTransaction` should have executed without throwing this additional exception, which would have lead the caller to throw the exception in (3) before exiting.

      Although resolving this would not prevent the overall operation from failing, it would provide better observability on the actual root cause exception (the one from (3)). 

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            krishen Krishen Bhan
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment