Details
- Bug
- Status: Open
- Minor
- Resolution: Unresolved
- 0.15.0
- None
- None
Description
Context:
Following a restart, the Gobblin service is currently unable to process jobs that were in the RUNNING/LAUNCHED/SUBMITTED state before the restart, so these jobs remain stuck.
Example scenario:
A job is in the LAUNCHED state and, while it is calculating CDC, the Application Master is re-attempted because of a name node issue (any environment issue could trigger this).
The job state in the DB:
mysql> select * from gobblin_job_queue where job_name='DM-JOB-fpti-druid-dp-venmo' order by created_date desc limit 10;
+------------------------------------------+----------------------------+---------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------------------------------------+---------------------+---------------------+
| queue_id                                 | job_name                   | deployment_id | failure_exception | configs                                                                                                                                                                                                                         | status    | job_id                                       | created_date        | updated_date        |
+------------------------------------------+----------------------------+---------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------------------------------------+---------------------+---------------------+
| DM-JOB-fpti-druid-dp-venmo_1630444318758 | DM-JOB-fpti-druid-dp-venmo |             2 | NULL              | {"dataset":{"batch_id":"20210831211155","name":"default._druid-test_dataproc-jobs_venmo","snapshot_id":"20210831211155"},"gobblin":{"client":{"id":"AIRFLOW_PAZ_DMP_DO"},"deployment":{"name":"DMP228"}},"namespace":"Chunnel"} | LAUNCHED  | job_DM-JOB-fpti-druid-dp-venmo_1630444325903 | 2021-08-31 21:12:00 | 2021-08-31 21:12:38 |
Acceptance Criteria:
- Gobblin jobs should be resumed even if the GobblinAppMaster restarts before the jobs are finalized.
- The system should automatically resume jobs that were in the RUNNING/LAUNCHED/SUBMITTED state after a restart.
- The solution should clean up lingering locks acquired in the previous run.
- It should not pick up jobs, or clean locks, that other deployments have picked up as part of work stealing.
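The criteria above can be sketched roughly as follows. This is only an illustrative Python sketch, not Gobblin code: `QueuedJob`, `jobs_to_resume`, and `recover_after_restart` are hypothetical names, the lock set keyed by `job_name` is an assumption, and the sketch assumes that a job taken over through work stealing carries the other deployment's `deployment_id` in `gobblin_job_queue`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# States that the acceptance criteria say must be resumed after a restart.
RESUMABLE_STATES = {"RUNNING", "LAUNCHED", "SUBMITTED"}


@dataclass
class QueuedJob:
    """Hypothetical view of one gobblin_job_queue row (subset of columns)."""
    queue_id: str
    job_name: str
    deployment_id: int
    status: str


def jobs_to_resume(rows: Iterable[QueuedJob], my_deployment_id: int) -> List[QueuedJob]:
    """Select unfinished jobs that belong to this deployment only."""
    return [
        row for row in rows
        if row.status in RESUMABLE_STATES and row.deployment_id == my_deployment_id
    ]


def recover_after_restart(rows: Iterable[QueuedJob],
                          my_deployment_id: int,
                          held_locks: Set[str]) -> List[str]:
    """Drop stale locks for our own unfinished jobs and return the job names to resume.

    Assumption: locks are keyed by job name. Locks for jobs owned by
    other deployments are left untouched.
    """
    resumed = []
    for job in jobs_to_resume(rows, my_deployment_id):
        held_locks.discard(job.job_name)  # lingering lock from the previous run
        resumed.append(job.job_name)
    return resumed
```

For example, with deployment 2 restarting, a LAUNCHED job owned by deployment 2 would be resumed and its lock cleared, while a RUNNING job owned by deployment 3 and its lock would be left alone.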
Attachments
Issue Links
- is cloned by
  - GOBBLIN-1964 [Gobblin] Job lock acquisition should be made resilient (Open)