Description
mKahaDB keeps a transaction journal (the txStore journal) to track the outcome of local JMS transactions that span kahaDB instances, i.e. a send to A and B where each has its own persistence adapter.
To ensure N instances are in sync, a local 2PC is required.
On recovery, if an outcome is recorded in the txStore journal, the commit must be replayed. If no outcome has been recorded, rollback is assumed.
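The recovery rule above can be sketched as follows. This is a minimal illustration only, not the mKahaDB implementation: the class name, the `recover` method and the use of plain string transaction ids are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the recovery rule; not ActiveMQ code.
final class LocalTxRecoverySketch {
    enum Outcome { COMMIT, ROLLBACK }

    // recordedOutcomes: transaction ids whose outcome was found in the tx journal.
    // pendingTxIds: prepared transactions recovered from the nested stores.
    static Map<String, Outcome> recover(Set<String> recordedOutcomes, List<String> pendingTxIds) {
        Map<String, Outcome> decisions = new LinkedHashMap<>();
        for (String txId : pendingTxIds) {
            // A recorded outcome means the commit must be replayed;
            // no record means rollback is assumed.
            decisions.put(txId, recordedOutcomes.contains(txId) ? Outcome.COMMIT : Outcome.ROLLBACK);
        }
        return decisions;
    }
}
```

Note that an empty `recordedOutcomes` set, which is what a deleted journal looks like to this rule, turns every pending transaction into a rollback.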
If the txJournal is corrupt, and the corruption is detected, the broker fails to start.
However, manual recovery via JMX is not possible without a live broker.
Each kahaDB instance would need to be restarted in isolation to force heuristic completion.
Deleting the corrupt journal to allow restart is potentially problematic if there are pending outcomes.
It can lead to partial outcomes: recovery assumes it has good information, so with no information from an empty tx journal it assumes rollback of any relevant pending transaction.
Note: the failure window is very small, but it is present and can lead to message loss once the txStore data is lost or deleted.
workaround:
Only delete a corrupt txStore journal once there are no recovered XA transactions in play in any of the nested kahaDB instances. This is visible from the recovery logging.
If there are recovered XA transactions with formatId=61616 (the internal id used to identify these local 2PC transactions), then careful recovery of the relevant kahaDB instance directory in standalone mode (without mKahaDB) will be required if the transactions should commit.
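The workaround check can be expressed as a small sketch. Only the formatId value 61616 comes from the description above. The `RecoveredTx` type, its fields and both method names are hypothetical, and the list of recovered transactions is assumed to have been collected by hand from the recovery logging.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the workaround check; not ActiveMQ code.
final class RecoveredXidCheck {
    // Internal format id identifying local 2PC transactions (from the description).
    static final int LOCAL_2PC_FORMAT_ID = 61616;

    // One recovered XA transaction as read from the recovery logging (assumed shape).
    static final class RecoveredTx {
        final String kahaDbInstance;
        final int formatId;

        RecoveredTx(String kahaDbInstance, int formatId) {
            this.kahaDbInstance = kahaDbInstance;
            this.formatId = formatId;
        }
    }

    // The corrupt txStore journal may only be deleted when nothing was recovered at all.
    static boolean safeToDeleteTxJournal(List<RecoveredTx> recovered) {
        return recovered.isEmpty();
    }

    // kahaDB instances that need standalone recovery if their transactions should commit.
    static List<String> instancesNeedingStandaloneRecovery(List<RecoveredTx> recovered) {
        List<String> instances = new ArrayList<>();
        for (RecoveredTx tx : recovered) {
            if (tx.formatId == LOCAL_2PC_FORMAT_ID && !instances.contains(tx.kahaDbInstance)) {
                instances.add(tx.kahaDbInstance);
            }
        }
        return instances;
    }
}
```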
solution:
The journal needs to detect corruption more reliably and must not do any recovery processing in the absence of correct (non-corrupt) information, leaving pending transactions to be heuristically completed via JMX.
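A minimal sketch of the proposed behaviour, under the assumption that corruption detection yields a simple boolean. The class, the `LEAVE_PENDING` action and the `recover` signature are hypothetical and only illustrate the intent, not an actual fix.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of corruption-aware recovery; not ActiveMQ code.
final class CorruptionAwareRecoverySketch {
    enum Action { COMMIT, ROLLBACK, LEAVE_PENDING }

    static Map<String, Action> recover(boolean journalIntact,
                                       Set<String> recordedOutcomes,
                                       List<String> pendingTxIds) {
        Map<String, Action> decisions = new LinkedHashMap<>();
        for (String txId : pendingTxIds) {
            if (!journalIntact) {
                // No trustworthy information: make no decision and leave the
                // transaction for heuristic completion via JMX.
                decisions.put(txId, Action.LEAVE_PENDING);
            } else if (recordedOutcomes.contains(txId)) {
                decisions.put(txId, Action.COMMIT);
            } else {
                decisions.put(txId, Action.ROLLBACK);
            }
        }
        return decisions;
    }
}
```

The design point is that a missing outcome is only treated as a rollback when the journal is known to be good; a corrupt journal never produces an automatic decision.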