Details
-
Bug
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
0.15.0
-
None
-
None
Description
Context:
Currently, due to NameNode (NN) availability issues, acquiring the job lock fails, which in turn causes the job to fail.
select deployment_id, status, count(*)
  from gobblin_job_queue
  where created_date >= '2021-09-01' and created_date < '2021-10-01'
    and failure_exception like '%NullPointerException%'
  group by deployment_id, status
  order by deployment_id, status;
+---------------+--------+----------+
| deployment_id | status | count(*) |
+---------------+--------+----------+
|             1 | FAILED |      253 |
|             2 | FAILED |        6 |
|           230 | FAILED |      157 |
|         22702 | FAILED |       11 |
|         22703 | FAILED |       13 |
|         22704 | FAILED |        2 |
+---------------+--------+----------+
6 rows in set (1.04 sec)

mysql> select deployment_id, status, count(*)
  from gobblin_job_queue
  where created_date >= '2021-08-01' and created_date < '2021-09-01'
    and failure_exception like '%NullPointerException%'
  group by deployment_id, status
  order by deployment_id, status;
+---------------+--------+----------+
| deployment_id | status | count(*) |
+---------------+--------+----------+
|             1 | FAILED |     1091 |
|             3 | FAILED |     1598 |
|           230 | FAILED |    15870 |
+---------------+--------+----------+
3 rows in set (1.18 sec)
Acceptance Criteria:
Job lock acquisition should be made resilient to NN issues, probably by moving locks to ZooKeeper (ZK) or by retrying lock acquisition when NN issues (IOExceptions) occur.
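The retry option could look roughly like the sketch below. It is only an illustration, not Gobblin's actual lock code: the class name LockRetrySketch, the LockAttempt interface, and the acquireWithRetry helper are all hypothetical, and the attempt count and backoff values are placeholders. The idea is to retry only on IOException, back off exponentially between attempts, and rethrow the last IOException once the attempts are exhausted.

```java
import java.io.IOException;

/** Hypothetical sketch: retry lock acquisition when the underlying call throws IOException. */
final class LockRetrySketch {

    /** Hypothetical stand-in for a lock call that may fail while the NN is unavailable. */
    @FunctionalInterface
    interface LockAttempt {
        boolean tryLock() throws IOException;
    }

    /**
     * Calls attempt.tryLock() up to maxAttempts times.
     * Returns the result of the first call that does not throw.
     * Sleeps between failed attempts, doubling the delay each time.
     * Rethrows the last IOException if every attempt fails.
     */
    static boolean acquireWithRetry(LockAttempt attempt, int maxAttempts, long initialBackoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMillis = initialBackoffMillis;
        IOException lastFailure = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.tryLock();
            } catch (IOException e) {
                lastFailure = e;
                if (i < maxAttempts) {
                    // Wait before the next attempt; no sleep after the final failure.
                    Thread.sleep(backoffMillis);
                    backoffMillis *= 2;
                }
            }
        }
        // Reached only when all maxAttempts calls threw IOException.
        throw lastFailure;
    }
}
```

A caller would wrap the existing lock call in a lambda, for example acquireWithRetry(() -> existingLock.tryLock(), 5, 1000L), where existingLock stands for whatever lock object the job launcher already uses. A lock that is simply held by another job returns false and is not retried here; only IOExceptions trigger a retry.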
Attachments
Issue Links
- is a clone of
-
GOBBLIN-1963 Following the restart, jobs that were previously in the "RUNNING," "LAUNCHED," or "SUBMITTED" state failed to resume.
- Open
- is cloned by
-
GOBBLIN-1965 Need additional Hive data movement CDC check improvement to support table regex lookup
- Open