[CAUSEWAY-3775] Improve background commands job failure handling - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Done
Affects Version/s: 2.0.0
Fix Version/s: 2.1.0, 3.1.0
Component/s: Core, Ext Core CommandLog
Labels:
None

Description

was: Suspect that failure of background command results in locks being held as attempt to sync exception.

having looked at the code, can't find a root cause for this per se, but there were several bugs:

1) we call transactionService.callInTran(...) 3 times, if there's a failure then the two innermost correctly detect and result in setting the xactn to MUST_ABORT, the outer most one didn't and attempted to commit ... but then the MUST_ABORT is noticed and so it is effectively a commit after all.

2) if there is a failure then the startedAt on the CommandLogEntry, which initially is popluated when the execution starts, but ends up being rolled back to null if there's a failure. Meantime, the recovery processing was setting the completedAt and the exception stacktrace, but the startedAt is left untouched, ie null.

Fortuitiously, the query to find non-started queries looks at the startedAt rather than completedAt, and so re-executes.

3) the logic to execute each command in its own interaction was using REQUIRES_NEW, which suspends the top-level transaction that's started automatically when the interactionService.call..() is called. the failure within the transactions sets that new transaction as MUST_ABORT, but once finished, the original transaction is resumed and that is still healthy - resulting in the onCompletion publishing to be performed as normal. This would include the audit trail (EntityChangeTrackerDefault).

the fixes are:

1) make sure that rollback is called for the outermost.

2) provide a new config option (causeway.extensions.command-log.run-background-commands.on-failure-policy) to decide whether to STOP_THE_LINE (this was the accidental but most safe original behaviour) or to CONTINUE_WITH_NEXT. If the latter, then we do correctly log the exception and move on.

3) remove the unnecessary nested transactions with REQUIRES_NEW, so that the failure correctly is captured. This means that no command complete listeners (eg auditing) is performed if there was a failure.

to summarise, the new behaviour is (or is intended to be):

if onFailurePolicy = STOP_THE_LINE (the safe, fail-fast mode)
- execute command (CommandLogEntry)
  - if succeeds, then run publish on interactoin completion (auditing etc)
  - if deadlocks, attempt up to 3x else treat as failure
  - if fails, then do nothing.
- The quartz job will pick up the same command again in 10 seconds.
if onFailurePolicy = CONTINUE_WITH_NEXT
- execute command
  - if succeeds, then run publish on interactoin completion (auditing etc)
  - if deadlocks, attempt up to 3x else treat as failure
  - if fails, then update CommandLogEntry as completed and its exception prop to hold the stack trace.
- the quartz job will pick up the next command in 10 seconds.

Attachments

Activity

People

Assignee:: Daniel Keir Haywood

Reporter:: Daniel Keir Haywood

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 11/Jun/24 10:07

Updated:: 05/Jul/24 14:12

Resolved:: 05/Jul/24 14:12