Oozie / OOZIE-2854

Oozie should handle transient database problems

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0.0b1
    • Component/s: core
    • Labels: None

    Description

      There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups:

      • MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking, which might cause a transaction to roll back if two or more transactions run in parallel and one of them cannot complete because of a conflict.
      • MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get a "Communications link failure" exception during the failover.
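      To make the two failure modes above concrete, here is a minimal sketch of how a caller might decide whether a database error is worth retrying. The class `TransientDbErrors` and its method are hypothetical and not part of Oozie. The sketch assumes the JDBC driver reports connection failures with an SQLState in class "08" and rolled-back conflicting transactions with SQLState "40001", as the SQL standard defines them; what a particular driver or cluster actually reports should be verified.

```java
import java.sql.SQLException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;

/** Hypothetical helper (not an Oozie class): guesses whether a DB failure may be transient. */
public final class TransientDbErrors {

    private TransientDbErrors() {
    }

    public static boolean isTransient(Throwable error) {
        // Persistence layers usually wrap the JDBC exception, so walk the cause chain.
        for (Throwable cur = error; cur != null; cur = cur.getCause()) {
            // JDBC 4 marker types for failures that may succeed when retried.
            if (cur instanceof SQLTransientException || cur instanceof SQLRecoverableException) {
                return true;
            }
            if (cur instanceof SQLException) {
                String state = ((SQLException) cur).getSQLState();
                // Assumption: class "08" = connection exception,
                // "40001" = serialization failure (transaction rolled back on conflict).
                if (state != null && (state.startsWith("08") || state.equals("40001"))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

      Any other error, such as a syntax error or a constraint violation, would fail again on retry, so the sketch reports it as non-transient.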

      The problem is that failed DB transactions might later cause workflows (which are started or restarted by RecoveryService) to get stuck. It's not clear to us how this happens, but it has to do with the fact that certain DB updates are not executed.

      The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100 ms wait time that is doubled at every retry. The operation is considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss initial values in the scope of this JIRA.
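      The proposed retry scheme can be sketched as follows. This is an illustration under stated assumptions, not the code from the attached patches: the class `BackoffRetry` and its method names are hypothetical, and the sketch retries on every exception, whereas a real implementation would retry only failures judged transient.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch (not an Oozie class): retry with exponential backoff. */
public final class BackoffRetry {
    private final int maxAttempts;
    private final long initialWaitMs;

    public BackoffRetry(int maxAttempts, long initialWaitMs) {
        if (maxAttempts < 1 || initialWaitMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and initialWaitMs >= 0");
        }
        this.maxAttempts = maxAttempts;
        this.initialWaitMs = initialWaitMs;
    }

    /** Runs the operation, doubling the wait after each failed attempt. */
    public <T> T call(Callable<T> operation) throws Exception {
        long waitMs = initialWaitMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                // Never retry an interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // still failing after the last attempt
                }
                Thread.sleep(waitMs);
                waitMs *= 2;
            }
        }
    }

    /** Worst-case total time spent waiting between attempts. */
    public long maxTotalWaitMs() {
        long total = 0;
        long waitMs = initialWaitMs;
        for (int i = 1; i < maxAttempts; i++) { // one wait fewer than attempts
            total += waitMs;
            waitMs *= 2;
        }
        return total;
    }
}
```

      With the values suggested above (10 attempts, 100 ms initial wait) there are 9 waits, from 100 ms up to 25,600 ms, so `new BackoffRetry(10, 100).maxTotalWaitMs()` returns 51100: an operation would be retried for roughly 51 seconds, plus its own execution time, before being reported as failed.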

      Note that this solution is to handle transient failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie is corrupted.

      Attachments

        1. OOZIE-2854.006.patch
          115 kB
          Andras Piros
        2. OOZIE-2854.007.patch
          126 kB
          Andras Piros
        3. OOZIE-2854.008.patch
          127 kB
          Andras Piros
        4. OOZIE-2854.009.patch
          132 kB
          Andras Piros
        5. OOZIE-2854.010.patch
          132 kB
          Andras Piros
        6. OOZIE-2854.011.patch
          132 kB
          Andras Piros
        7. OOZIE-2854.012.patch
          134 kB
          Andras Piros
        8. OOZIE-2854.013.patch
          136 kB
          Andras Piros
        9. OOZIE-2854-001.patch
          24 kB
          Peter Bacsko
        10. OOZIE-2854-002.patch
          23 kB
          Peter Bacsko
        11. OOZIE-2854-003.patch
          23 kB
          Peter Bacsko
        12. OOZIE-2854-004.patch
          25 kB
          Peter Bacsko
        13. OOZIE-2854-005.patch
          24 kB
          Peter Bacsko
        14. OOZIE-2854-POC-001.patch
          18 kB
          Peter Bacsko

        Activity

          People

            Assignee: Andras Piros
            Reporter: Peter Bacsko
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: