  1. Oozie
  2. OOZIE-3260

[sla] Remove stale item above max retries on JPA related errors from in-memory SLA map

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.0
    • Fix Version/s: 5.1.0
    • Component/s: coordinator, core, workflow
    • Labels: None

    Description

      Even with OOZIE-3134 implemented, there are still cases where SLACalculatorMemory#slaMap and the database contents get out of sync. Possible causes include, but are not limited to:

      • rows of the SLA_SUMMARY table have been purged manually from the database
      • no corresponding WF_JOBS or COORD_JOBS entries exist in the database anymore
      • the WF_JOBS or COORD_JOBS row tracked by a SLACalcStatus instance inside SLACalculatorMemory#slaMap has not yet been persisted to the database when SLACalculatorMemory.HistoryPurgeWorker processes the SLA entry. Depending on, for example, how many coordinator actions are being materialized, SLACalcStatus entries inserted into the in-memory map can be processed before their corresponding CoordActionBean entries have been persisted to the database

      In those rare cases, we see JPAExecutorException instances like:

      2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] <t 1527981517, conn 1584126245> [0 ms] spent
      2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [0000438-170916014916144-oozie-oozi-C@556]
      org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
              at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
              at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
              at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
      

      The solution is to track how many times a SLACalcStatus entry has failed to be processed and, once the preconfigured oozie.sla.service.SLAService.maximum.retry.count is reached, to remove the SLACalculatorMemory#slaMap entries that cause those JPAExecutorException instances, so that they do not bloat the log files. The items being logged no longer exist anyway.

      It is still possible that, by the time multiple CoordActionBean instances being inserted are written to the database, they no longer have SLACalcStatus entries inside SLACalculatorMemory#slaMap, and thus no SLA will be tracked for them. In those rare cases, the preconfigured maximum retry count can be increased.

      Note that the current implementation of SLACalculatorMemory#updateJobSla() already removes the stale SLACalcStatus entry. The new functionality here is to introduce SLACalcStatus#retryCount and to extend the set of JPAExecutorException ErrorCode values of interest.
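
      The retry-and-evict idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: the class SlaRetrySketch, its Entry type (a stand-in for SLACalcStatus), and the methods track, isTracked, and onFailure are all hypothetical names. The set of error codes is also an assumption; only E0604 appears in the log excerpt. Eviction once the counter exceeds the maximum follows the issue title ("above max retries").

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of per-entry retry counting with eviction from an in-memory map.
class SlaRetrySketch {

    // Stand-in for SLACalcStatus; only the retry counter is modelled here.
    static final class Entry {
        final AtomicInteger retryCount = new AtomicInteger();
    }

    // Assumed set of error codes that mark an entry as possibly stale.
    // Only E0604 ("Job does not exist") is taken from the log excerpt.
    static final Set<String> CODES_OF_INTEREST = Set.of("E0604");

    private final Map<String, Entry> slaMap = new ConcurrentHashMap<>();
    private final int maxRetryCount;

    SlaRetrySketch(int maxRetryCount) {
        this.maxRetryCount = maxRetryCount;
    }

    void track(String jobId) {
        slaMap.putIfAbsent(jobId, new Entry());
    }

    boolean isTracked(String jobId) {
        return slaMap.containsKey(jobId);
    }

    // Record a failed processing attempt for jobId.
    // Returns true if the entry was evicted because it went above the maximum.
    boolean onFailure(String jobId, String errorCode) {
        Entry entry = slaMap.get(jobId);
        if (entry == null || !CODES_OF_INTEREST.contains(errorCode)) {
            return false; // unknown job, or an error this sketch does not count
        }
        if (entry.retryCount.incrementAndGet() > maxRetryCount) {
            slaMap.remove(jobId); // stop retrying and stop logging for this entry
            return true;
        }
        return false;
    }
}
```

      With a maximum of 2, the first two counted failures keep the entry in the map and the third removes it; failures with other error codes leave the counter untouched.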

      Attachments

        1. OOZIE-3260.002.patch
          21 kB
          Andras Piros
        2. OOZIE-3260.001.patch
          21 kB
          Andras Piros

            People

              Assignee: Andras Piros (andras.piros)
              Reporter: Andras Piros (andras.piros)
              Votes: 0
              Watchers: 3
