Apache Gobblin / GOBBLIN-1963

Following the restart, jobs that were previously in the "RUNNING," "LAUNCHED," or "SUBMITTED" state failed to resume.


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.15.0
    • Fix Version/s: None
    • Component/s: misc
    • Labels: None

    Description

      Context:

      After a restart, the Gobblin service cannot process jobs that were left in the RUNNING, LAUNCHED, or SUBMITTED state, so these jobs remain stuck.

      Example scenario:

      A job was in the LAUNCHED state and, while it was calculating CDC, the Application Master was re-attempted because of a NameNode issue (any environment issue could trigger the same re-attempt).

       

       

      The job state in the DB:

      mysql> select * from gobblin_job_queue where job_name='DM-JOB-fpti-druid-dp-venmo' order by created_date desc limit 10;
      +------------------------------------------+----------------------------+---------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------------------------------------+---------------------+---------------------+
      | queue_id | job_name | deployment_id | failure_exception | configs | status | job_id | created_date | updated_date |
      +------------------------------------------+----------------------------+---------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------------------------------------+---------------------+---------------------+
      | DM-JOB-fpti-druid-dp-venmo_1630444318758 | DM-JOB-fpti-druid-dp-venmo | 2 | NULL | {"dataset":{"batch_id":"20210831211155","name":"default._druid-test_dataproc-jobs_venmo","snapshot_id":"20210831211155"},"gobblin":{"client":{"id":"AIRFLOW_PAZ_DMP_DO"},"deployment":{"name":"DMP228"}},"namespace":"Chunnel"} | LAUNCHED | job_DM-JOB-fpti-druid-dp-venmo_1630444325903 | 2021-08-31 21:12:00 | 2021-08-31 21:12:38 |
      

       

      Acceptance Criteria:

      1. Gobblin jobs should be resumed even if GobblinAppMaster restarts before the jobs are finalized.
      2. Jobs that were in the RUNNING, LAUNCHED, or SUBMITTED state before the restart should be resumed automatically after it.
      3. The solution should address lingering locks acquired in the previous run.
      4. It should not pick up jobs, or clean up locks, that other deployments have already picked up as part of work stealing.
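
      The criteria above can be illustrated with a minimal Java sketch of the selection step a restarted service might run. Everything here is hypothetical and not part of Gobblin: the class `RestartRecoverySketch`, the `QueuedJob` type, the `selectJobsToResume` helper, and the terminal states COMPLETED and FAILED are assumptions made for illustration. It also assumes that the `deployment_id` column of `gobblin_job_queue` identifies the deployment that currently owns a job, so rows owned by another deployment (for example after work stealing) are left untouched.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types exist in Gobblin.
final class RestartRecoverySketch {

    // RUNNING, LAUNCHED and SUBMITTED come from this issue;
    // COMPLETED and FAILED are assumed terminal states.
    enum Status { SUBMITTED, LAUNCHED, RUNNING, COMPLETED, FAILED }

    // States that should be resumed after a restart (criterion 2).
    static final Set<Status> RESUMABLE =
        EnumSet.of(Status.SUBMITTED, Status.LAUNCHED, Status.RUNNING);

    // Assumed in-memory view of one gobblin_job_queue row.
    static final class QueuedJob {
        final String queueId;
        final int deploymentId;
        final Status status;

        QueuedJob(String queueId, int deploymentId, Status status) {
            this.queueId = queueId;
            this.deploymentId = deploymentId;
            this.status = status;
        }
    }

    // Keeps only non-finalized jobs owned by this deployment (criteria 2 and 4).
    static List<QueuedJob> selectJobsToResume(List<QueuedJob> rows, int ownDeploymentId) {
        List<QueuedJob> toResume = new ArrayList<>();
        for (QueuedJob row : rows) {
            if (row.deploymentId == ownDeploymentId && RESUMABLE.contains(row.status)) {
                toResume.add(row);
            }
        }
        return toResume;
    }

    public static void main(String[] args) {
        List<QueuedJob> rows = List.of(
            new QueuedJob("DM-JOB-fpti-druid-dp-venmo_1630444318758", 2, Status.LAUNCHED),
            new QueuedJob("hypothetical-job-a", 2, Status.COMPLETED),
            new QueuedJob("hypothetical-job-b", 3, Status.RUNNING));

        // Only the first row is owned by deployment 2 and not finalized.
        for (QueuedJob job : selectJobsToResume(rows, 2)) {
            System.out.println("resume " + job.queueId + " and release its stale lock");
        }
    }
}
```

      Filtering on ownership before touching any lock is one way to satisfy criteria 3 and 4 together: a lock is only considered stale when it belongs to a job that this deployment itself left unfinished.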

       
       
       

          People

            Assignee: Unassigned
            Reporter: Apekshit Kumar
