[OOZIE-3260] [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0.0
Fix Version/s: 5.1.0
Component/s: coordinator, core, workflow
Labels: None

Description

Despite OOZIE-3134 having been implemented, there are still cases where SLACalculatorMemory#slaMap and the database contents get out of sync. Possible causes include, but are not limited to:

- the contents of the SLA_SUMMARY table have been purged manually from the database
- the corresponding WF_JOBS or COORD_JOBS entries no longer exist in the database
- the WF_JOBS or COORD_JOBS instance tracked by a SLACalcStatus instance inside SLACalculatorMemory#slaMap is not yet persisted to the database when the SLA entry is processed by SLACalculatorMemory.HistoryPurgeWorker. Depending on, for example, how many coordinator actions are being materialized, SLACalcStatus entries inserted into the in-memory map can be processed before their corresponding CoordActionBean entries have been persisted to the database

In those rare cases, we see JPAExecutorException instances like:

2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] <t 1527981517, conn 1584126245> [0 ms] spent
2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [0000438-170916014916144-oozie-oozi-C@556]
org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
        at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
        at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
        at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)

The solution is to track the number of times a SLACalcStatus entry has failed to be processed and, once the configured oozie.sla.service.SLAService.maximum.retry.count is reached, to remove the SLACalculatorMemory#slaMap entries that cause those JPAExecutorException instances, so that they do not bloat the log files. The items being logged do not exist anyway.

It is still possible that some of the CoordActionBean instances being inserted will not have SLACalcStatus entries inside SLACalculatorMemory#slaMap by the time they are written to the database, and thus no SLA will be tracked for them. In those rare cases, the configured maximum retry count can be increased.

Note that the current implementation of SLACalculatorMemory#updateJobSla() already removes the stale SLACalcStatus entry. The new functionality here is to introduce SLACalcStatus#retryCount and to extend the set of JPAExecutorException ErrorCode values of interest.
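
The retry-counting idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from the attached patches: the class SlaRetrySketch, the TrackedStatus stand-in for SLACalcStatus, and the Predicate that simulates the database lookup are all hypothetical. Whether the entry is dropped when the counter reaches or exceeds the limit is also an assumption; the sketch removes it once the limit is exceeded.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch of retry-count based eviction; names are illustrative only.
class SlaRetrySketch {

    /** Stand-in for SLACalcStatus, reduced to the retry counter. */
    static final class TrackedStatus {
        private int retryCount;

        int incrementAndGetRetryCount() {
            return ++retryCount;
        }
    }

    // Stand-in for SLACalculatorMemory#slaMap, keyed by job ID.
    private final Map<String, TrackedStatus> slaMap = new ConcurrentHashMap<>();

    // Would come from oozie.sla.service.SLAService.maximum.retry.count.
    private final int maxRetryCount;

    SlaRetrySketch(int maxRetryCount) {
        this.maxRetryCount = maxRetryCount;
    }

    void track(String jobId) {
        slaMap.putIfAbsent(jobId, new TrackedStatus());
    }

    boolean isTracked(String jobId) {
        return slaMap.containsKey(jobId);
    }

    /**
     * One pass over the in-memory map. The predicate simulates the JPA lookup:
     * returning false plays the role of a JPAExecutorException such as E0604.
     */
    void updateAll(Predicate<String> lookupSucceeds) {
        // ConcurrentHashMap iterators tolerate removal during iteration.
        for (Map.Entry<String, TrackedStatus> e : slaMap.entrySet()) {
            if (lookupSucceeds.test(e.getKey())) {
                continue; // normal SLA processing would happen here
            }
            if (e.getValue().incrementAndGetRetryCount() > maxRetryCount) {
                slaMap.remove(e.getKey()); // stale entry: stop retrying and logging
            }
        }
    }

    public static void main(String[] args) {
        SlaRetrySketch sketch = new SlaRetrySketch(2);
        sketch.track("job-missing-in-db");
        sketch.track("job-present-in-db");
        for (int pass = 0; pass < 3; pass++) {
            sketch.updateAll(id -> id.equals("job-present-in-db"));
        }
        System.out.println(sketch.isTracked("job-missing-in-db")); // false
        System.out.println(sketch.isTracked("job-present-in-db")); // true
    }
}
```

With a limit of 2, the entry whose lookup always fails survives two passes and is dropped on the third, while the entry that can be found in the database stays in the map. Raising the limit gives slow-to-persist entries more passes before they are discarded.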

Attachments

OOZIE-3260.002.patch (06/Jun/18 09:55, 21 kB, Andras Piros)
OOZIE-3260.001.patch (31/May/18 10:23, 21 kB, Andras Piros)

Issue Links

relates to

OOZIE-1442 Purge rogue and stale entries from history set (Open)
OOZIE-3134 Potential inconsistency between the in-memory SLA map and the Oozie database (Closed)
OOZIE-3276 Refactor SLACalculatorMemory (Open)
OOZIE-2364 Remove deprecated SLAEventBean and related code (Patch Available)

People

Assignee: Andras Piros
Reporter: Andras Piros
Votes: 0
Watchers: 3

Dates

Created: 25/May/18 19:02
Updated: 18/Jan/19 09:03
Resolved: 12/Jun/18 12:39