OOZIE-2854: Oozie should handle transient database problems

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0.0b1
    • Component/s: core
    • Labels:
      None

      Description

      There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups:

      • MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking, which might cause a transaction to roll back if two or more parallel transactions are running and one of them cannot complete because of a conflict.
      • MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get a "Communications link failure" exception during the failover.

      The problem is that failed DB transactions might later cause workflows (which are started or restarted by RecoveryService) to get stuck. It's not clear to us how this happens, but it has to do with the fact that certain DB updates are not executed.

      The proposed solution is to retry a failed DB update with exponential backoff. We could start with a 100 ms wait time that is doubled at every retry. The operation is considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss the initial values in the scope of this JIRA.
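
The retry policy described above could be sketched roughly as follows. This is only an illustration, not the code from the attached patches: the class `RetryWithBackoff`, the `Sleeper` interface and the method names are hypothetical, and the sketch retries on any exception, whereas a real implementation would probably retry only on errors known to be transient. It also assumes that no wait happens after the final failed attempt.

```java
import java.util.concurrent.Callable;

// Hypothetical helper illustrating retry with exponential backoff.
// None of these names come from Oozie itself.
public class RetryWithBackoff {

    /** Waits between attempts; injectable so tests can avoid real sleeping. */
    public interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /**
     * Runs the operation up to maxAttempts times. After each failed attempt
     * except the last, waits and then doubles the wait time.
     */
    public static <T> T call(Callable<T> operation, int maxAttempts,
                             long initialWaitMs, Sleeper sleeper) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long waitMs = initialWaitMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Assumption: every exception is treated as transient here.
                if (attempt == maxAttempts) {
                    throw e; // give up: the operation is considered a failure
                }
                sleeper.sleep(waitMs);
                waitMs *= 2; // exponential backoff
            }
        }
        throw new IllegalStateException("unreachable");
    }

    /** Total time spent waiting if every attempt fails. */
    public static long totalWaitMs(int maxAttempts, long initialWaitMs) {
        long total = 0;
        long waitMs = initialWaitMs;
        for (int i = 1; i < maxAttempts; i++) {
            total += waitMs;
            waitMs *= 2;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated DB update that fails twice before succeeding.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new Exception("simulated transient DB failure");
            }
            return "committed";
        }, 10, 100L, Thread::sleep);
        System.out.println(result + " after " + calls[0] + " attempts");
        System.out.println("worst-case wait: " + totalWaitMs(10, 100L) + " ms");
    }
}
```

With the values proposed above (100 ms initial wait, 10 attempts), there are nine waits of 100, 200, ..., 25600 ms, so a persistently failing operation would be given up after roughly 51 seconds of waiting in total. That figure may be useful when discussing the initial values.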

      Note that this solution is only intended to handle transient failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie is corrupted.

        Attachments

        1. OOZIE-2854-POC-001.patch
          18 kB
          Peter Bacsko
        2. OOZIE-2854-005.patch
          24 kB
          Peter Bacsko
        3. OOZIE-2854-004.patch
          25 kB
          Peter Bacsko
        4. OOZIE-2854-003.patch
          23 kB
          Peter Bacsko
        5. OOZIE-2854-002.patch
          23 kB
          Peter Bacsko
        6. OOZIE-2854-001.patch
          24 kB
          Peter Bacsko
        7. OOZIE-2854.013.patch
          136 kB
          Andras Piros
        8. OOZIE-2854.012.patch
          134 kB
          Andras Piros
        9. OOZIE-2854.011.patch
          132 kB
          Andras Piros
        10. OOZIE-2854.010.patch
          132 kB
          Andras Piros
        11. OOZIE-2854.009.patch
          132 kB
          Andras Piros
        12. OOZIE-2854.008.patch
          127 kB
          Andras Piros
        13. OOZIE-2854.007.patch
          126 kB
          Andras Piros
        14. OOZIE-2854.006.patch
          115 kB
          Andras Piros

            People

            • Assignee: Andras Piros (andras.piros)
            • Reporter: Peter Bacsko (pbacsko)