Details
-
Bug
-
Status: Closed
-
Minor
-
Resolution: Done
-
2.0.0
-
None
Description
was: Suspect that failure of background command results in locks being held as attempt to sync exception.
having looked at the code, can't find a root cause for this per se, but there were several bugs:
1) we call transactionService.callInTran(...) 3 times, if there's a failure then the two innermost correctly detect and result in setting the xactn to MUST_ABORT, the outer most one didn't and attempted to commit ... but then the MUST_ABORT is noticed and so it is effectively a commit after all.
2) if there is a failure then the startedAt on the CommandLogEntry, which initially is popluated when the execution starts, but ends up being rolled back to null if there's a failure. Meantime, the recovery processing was setting the completedAt and the exception stacktrace, but the startedAt is left untouched, ie null.
Fortuitiously, the query to find non-started queries looks at the startedAt rather than completedAt, and so re-executes.
3) the logic to execute each command in its own interaction was using REQUIRES_NEW, which suspends the top-level transaction that's started automatically when the interactionService.call..() is called. the failure within the transactions sets that new transaction as MUST_ABORT, but once finished, the original transaction is resumed and that is still healthy - resulting in the onCompletion publishing to be performed as normal. This would include the audit trail (EntityChangeTrackerDefault).
the fixes are:
1) make sure that rollback is called for the outermost.
2) provide a new config option (causeway.extensions.command-log.run-background-commands.on-failure-policy) to decide whether to STOP_THE_LINE (this was the accidental but most safe original behaviour) or to CONTINUE_WITH_NEXT. If the latter, then we do correctly log the exception and move on.
3) remove the unnecessary nested transactions with REQUIRES_NEW, so that the failure correctly is captured. This means that no command complete listeners (eg auditing) is performed if there was a failure.
to summarise, the new behaviour is (or is intended to be):
- if onFailurePolicy = STOP_THE_LINE (the safe, fail-fast mode)
- execute command (CommandLogEntry)
- if succeeds, then run publish on interactoin completion (auditing etc)
- if deadlocks, attempt up to 3x else treat as failure
- if fails, then do nothing.
- The quartz job will pick up the same command again in 10 seconds.
- execute command (CommandLogEntry)
- if onFailurePolicy = CONTINUE_WITH_NEXT
- execute command
- if succeeds, then run publish on interactoin completion (auditing etc)
- if deadlocks, attempt up to 3x else treat as failure
- if fails, then update CommandLogEntry as completed and its exception prop to hold the stack trace.
- the quartz job will pick up the next command in 10 seconds.
- execute command