Description
Oozie can run into problems when it cannot update its database properly. Recently, we have experienced erratic behavior with two setups:
- MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking, which might cause a transaction to roll back if two or more parallel transactions are running and one of them cannot complete because of a conflict.
- MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get a "Communications link failure" exception during the failover. (See the sketch after this list for how such failures could be recognized.)
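For illustration, here is a minimal sketch of how these two failure modes could be classified as retryable. The class name RetryableErrors is hypothetical (this is not existing Oozie code), and the SQLState values are assumptions: 40001 is the standard state for serialization failures such as Galera's optimistic-locking conflicts, and 08S01 is what MySQL Connector/J typically reports for "Communications link failure".
{code:java}
import java.sql.SQLException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;

/** Hypothetical helper that decides whether a failed DB operation is worth retrying. */
public final class RetryableErrors {

    private RetryableErrors() {
    }

    public static boolean isRetryable(Throwable t) {
        // Walk the cause chain: the interesting SQLException is often wrapped
        // by the persistence layer.
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof SQLTransientException
                    || cause instanceof SQLRecoverableException) {
                return true;
            }
            if (cause instanceof SQLException) {
                String state = ((SQLException) cause).getSQLState();
                // 40001: serialization failure (e.g. Galera conflict rollback)
                // 08S01: communications link failure (e.g. failover)
                if ("40001".equals(state) || "08S01".equals(state)) {
                    return true;
                }
            }
        }
        return false;
    }
}
{code}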
The problem is that these failed DB transactions can later cause workflows (which are started/restarted by the RecoveryService) to get stuck. It is not clear to us exactly how this happens, but it is related to the fact that certain DB updates are never executed.
The proposed solution is to add retry logic with exponential backoff for failed DB updates. We could start with a 100 ms wait time that is doubled on every retry; the operation is considered a failure if it still fails after 10 attempts. These values should be configurable. We should discuss the initial values in the scope of this JIRA.
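A minimal sketch of such a retry loop, using the values suggested above (100 ms initial wait, doubled each retry, give up after 10 attempts). The class name RetryingExecutor and its API are hypothetical, not a proposed final design:
{code:java}
import java.util.concurrent.Callable;

/** Sketch of retry with exponential backoff; both parameters would be configurable. */
public final class RetryingExecutor {

    private final long initialWaitMs;   // e.g. 100 ms
    private final int maxAttempts;      // e.g. 10

    public RetryingExecutor(long initialWaitMs, int maxAttempts) {
        this.initialWaitMs = initialWaitMs;
        this.maxAttempts = maxAttempts;
    }

    public <T> T execute(Callable<T> dbOperation) throws Exception {
        long waitMs = initialWaitMs;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return dbOperation.call();
            }
            catch (Exception e) {
                // A real implementation would only retry transient failures,
                // e.g. via a check like RetryableErrors.isRetryable(e).
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMs);
                    waitMs *= 2;   // exponential backoff
                }
            }
        }
        throw lastFailure;   // still failing after maxAttempts: give up
    }
}
{code}
With these defaults the total backoff before giving up is 100 + 200 + ... + 25600 ms, roughly 51 seconds, which bounds how long a transient outage can be bridged.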
Note that this solution handles only transient failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie may become corrupted.