Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
5.0.0
-
None
Description
Despite having implemented OOZIE-3134, there are still cases where SLACalculatorMemory#slaMap and database contents still get out of sync. Some possibilities including but not limited to:
- database contents of SLA_SUMMARY table have been purged manually from DB
- no corresponding WF_JOBS or COORD_JOBS entries exist anymore in DB
- the WF_JOBS or COORD_JOBS instance that is being tracked by the SLACalcStatus instances inside SLACalculatorMemory#slaMap is not yet persisted to database when the SLA entry is already processed by SLACalculatorMemory.HistoryPurgeWorker. Depending on e.g. how many coordinator actions are being materialized, it can very well happen that SLACalcStatus entries inserted to the in-memory map will be processed before their corresponding CoordActionBean entries are yet to be persisted to database
In those rare cases, we see JPAExecutorException instances like:
2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] <t 1527981517, conn 1584126245> [0 ms] spent 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [0000438-170916014916144-oozie-oozi-C@556] org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id] at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161) at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480) at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
Solution here is to track the number of times the SLACalcStatus entry has not been processed successfully, and when a preconfigured oozie.sla.service.SLAService.maximum.retry.count is reached, remove any SLACalculatorMemory#slaMap entries that are causing those JPAExecutorException instances, to not cause huge logfiles. The items to be logged don't exist, anyways.
It's still possible that multiple CoordActionBean instances being inserted won't have SLACalcStatus entries inside SLACalculatorMemory#slaMap by the time written to database, and thus, no SLA will be tracked. In those rare cases, preconfigured maximum retry count can be extended.
Note that current implementation of SLACalculatorMemory#updateJobSla() already removes the stale SLACalcStatus entry. The new functionality here is to introduce SLACalcStatus#retryCount, and extend the JPAExecutorException {{ErrorCode}}s of interest.
Attachments
Attachments
Issue Links
- relates to
-
OOZIE-1442 Purge rogue and stale entries from history set
- Open
-
OOZIE-3134 Potential inconsistency between the in-memory SLA map and the Oozie database
- Closed
-
OOZIE-3276 Refactor SLACalculatorMemory
- Open
-
OOZIE-2364 Remove deprecated SLAEventBean and related code
- Patch Available