Apache Gobblin / GOBBLIN-1963

Following the restart, jobs that were previously in the "RUNNING," "LAUNCHED," or "SUBMITTED" state failed to resume.


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.15.0
    • Fix Version/s: None
    • Component/s: misc
    • Labels: None

    Description

      Context:

      Following a restart, the Gobblin service cannot resume jobs that were in the RUNNING, LAUNCHED, or SUBMITTED state, so those jobs remain stuck.

      Example scenario:

      A job was in the LAUNCHED state and, while it was calculating CDC, the application master was re-attempted because of a NameNode issue (any environment issue could trigger this).

      Job state in the DB:

      mysql> select * from gobblin_job_queue where job_name='DM-JOB-fpti-druid-dp-venmo' order by created_date desc limit 10;
      +------------------------------------------+----------------------------+---------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------------------------------------+---------------------+---------------------+
      | queue_id | job_name | deployment_id | failure_exception | configs | status | job_id | created_date | updated_date |
      +------------------------------------------+----------------------------+---------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------------------------------------+---------------------+---------------------+
      | DM-JOB-fpti-druid-dp-venmo_1630444318758 | DM-JOB-fpti-druid-dp-venmo | 2 | NULL | {"dataset":{"batch_id":"20210831211155","name":"default._druid-test_dataproc-jobs_venmo","snapshot_id":"20210831211155"},"gobblin":{"client":{"id":"AIRFLOW_PAZ_DMP_DO"},"deployment":{"name":"DMP228"}},"namespace":"Chunnel"} | LAUNCHED | job_DM-JOB-fpti-druid-dp-venmo_1630444325903 | 2021-08-31 21:12:00 | 2021-08-31 21:12:38 |
      

       

      Acceptance Criteria:

      1. Gobblin jobs should be resumed even if GobblinAppMaster restarts before the jobs are finalized.
      2. The system should automatically resume jobs that were in the RUNNING, LAUNCHED, or SUBMITTED state after the restart.
      3. The solution should address lingering locks acquired in the previous run.
      4. It should not pick up jobs, or clean locks, that other deployments are already picking up as part of work stealing.
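
      The selection rule implied by criteria 2 and 4 could be sketched roughly as below. This is only an illustration, not Gobblin code: the class JobResumeSketch, the QueueRow type, and the method selectJobsToResume are hypothetical names, and the sketch assumes that the deployment_id column of gobblin_job_queue identifies the deployment that currently owns a row. Lock cleanup (criterion 3) is not modelled; under the same assumption it would apply only to the rows this method returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: choose which job-queue rows to resume after a restart. */
class JobResumeSketch {

    /** Statuses that the issue says should be resumed after a restart. */
    static final Set<String> RESUMABLE =
            new HashSet<>(Arrays.asList("RUNNING", "LAUNCHED", "SUBMITTED"));

    /** Minimal stand-in for one gobblin_job_queue row (only the columns used here). */
    static final class QueueRow {
        final String queueId;
        final int deploymentId;
        final String status;

        QueueRow(String queueId, int deploymentId, String status) {
            this.queueId = queueId;
            this.deploymentId = deploymentId;
            this.status = status;
        }
    }

    /**
     * Returns the rows that were still in flight at restart and belong to this
     * deployment. Rows owned by another deployment are skipped (assumption:
     * deployment_id marks ownership), so work being stolen elsewhere is left alone.
     */
    static List<QueueRow> selectJobsToResume(List<QueueRow> rows, int myDeploymentId) {
        List<QueueRow> toResume = new ArrayList<>();
        for (QueueRow row : rows) {
            boolean ownedByMe = row.deploymentId == myDeploymentId;
            if (ownedByMe && RESUMABLE.contains(row.status)) {
                toResume.add(row);
            }
        }
        return toResume;
    }
}
```

      With the row shown above (deployment_id 2, status LAUNCHED), a restarted service running as deployment 2 would select the job for resumption, while a service running as any other deployment would leave it untouched.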

       
       
       


            People

              Assignee: Unassigned
              Reporter: Apekshit Kumar