There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups:
- MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking, which can cause a transaction to roll back when two or more parallel transactions are running and one of them cannot complete because of a conflict.
- MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get a "Communications link failure" exception during the failover.
The problem is that failed DB transactions can later cause workflows (which are started/re-started by the RecoveryService) to get stuck. It's not clear to us exactly how this happens, but it has to do with the fact that certain DB updates are never executed.
The solution is to use some sort of retry logic with exponential backoff when a DB update fails. We could start with a 100ms wait time which is doubled at every retry. The operation can be considered a permanent failure if it still fails after 10 attempts. These values should be configurable; we can discuss the initial defaults in the scope of this JIRA.
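As a rough sketch, the proposed backoff could look something like the following. The class and method names here are hypothetical (this is not existing Oozie code), and the 100ms initial wait and 10-attempt limit are just the starting values suggested above:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the proposed retry logic; names are illustrative.
public class RetryableDbUpdate {

    // Suggested starting values from this JIRA; both would be configurable.
    static final long INITIAL_WAIT_MS = 100;
    static final int MAX_ATTEMPTS = 10;

    /**
     * Runs the given DB operation, retrying with exponential backoff on
     * failure. Gives up and rethrows the last exception after MAX_ATTEMPTS.
     */
    public static <T> T withRetries(Callable<T> dbOperation) throws Exception {
        long waitMs = INITIAL_WAIT_MS;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return dbOperation.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < MAX_ATTEMPTS) {
                    Thread.sleep(waitMs);
                    waitMs *= 2;  // exponential backoff: double the wait each retry
                }
            }
        }
        // Still failing after MAX_ATTEMPTS: the operation is a permanent failure.
        throw lastFailure;
    }
}
```

A wrapper like this would only cover the transient cases described above (Galera conflict rollbacks, failover-time "Communications link failure"); a permanently unreachable DB would still surface as a failure after the last attempt.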
Note that this solution only handles transient failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie becomes corrupted.