[OOZIE-3260] [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map - ASF JIRA

Voters

Watch issue

Watchers

Create sub-task

Link

Clone

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0.0
Fix Version/s: 5.1.0
Component/s: coordinator, core, workflow
Labels:
None

Description

Despite having implemented ~~OOZIE-3134~~, there are still cases where SLACalculatorMemory#slaMap and database contents still get out of sync. Some possibilities including but not limited to:

database contents of SLA_SUMMARY table have been purged manually from DB
no corresponding WF_JOBS or COORD_JOBS entries exist anymore in DB
the WF_JOBS or COORD_JOBS instance that is being tracked by the SLACalcStatus instances inside SLACalculatorMemory#slaMap is not yet persisted to database when the SLA entry is already processed by SLACalculatorMemory.HistoryPurgeWorker. Depending on e.g. how many coordinator actions are being materialized, it can very well happen that SLACalcStatus entries inserted to the in-memory map will be processed before their corresponding CoordActionBean entries are yet to be persisted to database

In those rare cases, we see JPAExecutorException instances like:

2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] <t 1527981517, conn 1584126245> [0 ms] spent
2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [0000438-170916014916144-oozie-oozi-C@556]
org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
        at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
        at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
        at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)

Solution here is to track the number of times the SLACalcStatus entry has not been processed successfully, and when a preconfigured oozie.sla.service.SLAService.maximum.retry.count is reached, remove any SLACalculatorMemory#slaMap entries that are causing those JPAExecutorException instances, to not cause huge logfiles. The items to be logged don't exist, anyways.

It's still possible that multiple CoordActionBean instances being inserted won't have SLACalcStatus entries inside SLACalculatorMemory#slaMap by the time written to database, and thus, no SLA will be tracked. In those rare cases, preconfigured maximum retry count can be extended.

Note that current implementation of SLACalculatorMemory#updateJobSla() already removes the stale SLACalcStatus entry. The new functionality here is to introduce SLACalcStatus#retryCount, and extend the JPAExecutorException {{ErrorCode}}s of interest.