Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Before a job starts to run, `AbstractJobLauncher` attempts to clean up leftover staging data (usually left behind because the previous run failed) in `cleanLeftoverStagingData()`.
However, since the staging folder path contains the job ID and each run has a unique job ID, the staging folder it tries to clean is different from the one the previous run used, which means the leftover staging data is never cleaned.
Two possible approaches: (1) change the job ID component in the staging folder path to the job name, so that different runs of the same job share the same staging folder; (2) change `cleanLeftoverStagingData()` so that it finds the staging folder(s) of previous runs (e.g., using a glob pattern) and cleans them.
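For illustration, a minimal sketch of approach (2), assuming a hypothetical staging layout of `<stagingRoot>/<jobName>_<jobId>` (the real Gobblin layout may differ); it globs for directories from prior runs of the same job name and deletes everything except the current run's directory:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LeftoverStagingCleaner {

  /**
   * Deletes staging directories left behind by previous runs of the same job.
   * Assumes (hypothetically) that staging directories are laid out as
   * stagingRoot/jobName_jobId; the current run's directory is skipped.
   */
  public static void cleanLeftoverStagingData(Configuration conf, Path stagingRoot,
      String jobName, String currentJobId) throws IOException {
    FileSystem fs = stagingRoot.getFileSystem(conf);
    // Glob all staging dirs belonging to this job name, regardless of job ID.
    FileStatus[] candidates = fs.globStatus(new Path(stagingRoot, jobName + "_*"));
    if (candidates == null || candidates.length == 0) {
      return; // nothing left over
    }
    for (FileStatus status : candidates) {
      if (status.getPath().getName().equals(jobName + "_" + currentJobId)) {
        continue; // never delete the directory the current run will use
      }
      fs.delete(status.getPath(), true);
    }
  }
}
```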
Github Url : https://github.com/linkedin/gobblin/issues/857
Github Reporter : zliu41
Github Assignee : zliu41
Github Created At : 2016-03-16T19:02:35Z
Github Updated At : 2017-01-12T04:50:22Z
Comments
zliu41 wrote on 2016-03-16T19:04:41Z : @liyinan926 @sahilTakiar what do you think? Was there a reason that each run has a different staging folder? Is (1) safe to do in the continuous execution environment?
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-197493221
liyinan926 wrote on 2016-03-16T19:55:34Z : @zliu41 I couldn't remember that. This whole staging data cleanup thing has gone through several changes. If I'm not wrong, it used to use job name.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-197517580
jbaranick wrote on 2016-03-16T21:25:14Z : @zliu41 (1) doesn't seem like it would be safe if job locking isn't used, and I think the same holds true for (2). There needs to be a better way to discriminate between an obviously stale staging folder and one that might still be in use.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-197555622
stakiar wrote on 2016-03-17T01:58:33Z : @zliu41 IIRC, the `cleanupStagingData` call that happens before the job is launched is supposed to use the state from the previous execution, so that it only deletes staging data from the previous job. The fix should just require using the previous `WorkUnitState`s rather than the new ones.
This should also avoid any problems with job locking, since a job can only access the state of a previous execution that has already completed.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-197649827
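A minimal sketch of the fix @sahilTakiar suggests above, assuming the previous execution's `WorkUnitState`s have already been loaded from the state store (not shown), and that each one records its staging directory under `ConfigurationKeys.WRITER_STAGING_DIR` (an assumption about which property is recorded):

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import gobblin.configuration.ConfigurationKeys;
import gobblin.configuration.WorkUnitState;

public class PreviousRunStagingCleanup {

  /**
   * Deletes the staging directories recorded in the previous execution's
   * WorkUnitStates, rather than computing paths from the new job ID.
   */
  public static void cleanupPreviousStagingData(Configuration conf,
      List<WorkUnitState> previousWorkUnitStates) throws IOException {
    for (WorkUnitState state : previousWorkUnitStates) {
      String stagingDir = state.getProp(ConfigurationKeys.WRITER_STAGING_DIR);
      if (stagingDir == null) {
        continue; // nothing recorded for this work unit
      }
      Path stagingPath = new Path(stagingDir);
      FileSystem fs = stagingPath.getFileSystem(conf);
      if (fs.exists(stagingPath)) {
        fs.delete(stagingPath, true);
      }
    }
  }
}
```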
jbaranick wrote on 2016-03-17T02:02:45Z : What do we do if the state fails to write? Maybe that is out of scope for now.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-197651512
zliu41 wrote on 2016-03-21T18:56:53Z : @sahilTakiar @kadaan is there any use case for scheduling two runs of the same job simultaneously? That could mess up the state store and cause undefined behavior.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-199424542
stakiar wrote on 2016-03-21T21:32:31Z : Well, we have a Job Lock mechanism that prevents two instances of the same job from running simultaneously, and I believe multiple users rely on this feature.
The Job Lock is useful because some users want to schedule the job back to back; this is easy in Azkaban, but not so easy in Quartz (or at least we haven't found a way yet), which is where the Job Lock comes in handy.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-199495675
zliu41 wrote on 2016-03-24T18:19:28Z : @sahilTakiar so the vast majority of use cases shouldn't need two runs at the same time. How about making the task staging directory path configurable with two options: use the job ID or use the job name, the latter of which ensures different runs share the same staging directory.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-200958328
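A rough sketch of the configurable-path idea; `task.staging.use.job.name` is a purely hypothetical property name, not an existing Gobblin configuration key:

```java
import java.util.Properties;

public class StagingDirResolver {

  // Hypothetical property name, not an existing Gobblin configuration key.
  private static final String USE_JOB_NAME_FOR_STAGING = "task.staging.use.job.name";

  /**
   * Resolves the task staging directory for a run. When the flag is set, the
   * job name is used so consecutive runs share (and can clean up) the same
   * directory; otherwise the per-run job ID is used, preserving current behavior.
   */
  public static String resolveStagingDir(Properties jobProps, String stagingRoot,
      String jobName, String jobId) {
    boolean useJobName =
        Boolean.parseBoolean(jobProps.getProperty(USE_JOB_NAME_FOR_STAGING, "false"));
    return stagingRoot + "/" + (useJobName ? jobName : jobId);
  }
}
```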
jbaranick wrote on 2016-03-24T18:57:54Z : Do we need a new setting, or can this be determined by whether the job lock is enabled?
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-200970667
zliu41 wrote on 2016-03-25T01:48:56Z : @kadaan I think we do. For example, when running in Azkaban the job lock is not needed and is thus often disabled, but one may still want to use the job name instead of the job ID.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-201105437
stakiar wrote on 2016-03-25T03:10:20Z : @zliu41 can't we just use the state from the previous execution to determine the job ID of the last run? Then we can use the previous job ID to determine which files to delete. We already load the state of the previous execution through a state-store call.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-201119178
zliu41 wrote on 2016-03-25T03:13:42Z : @sahilTakiar if the previous run failed, the state was most likely not persisted.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-201119999
stakiar wrote on 2016-03-25T03:34:14Z : In that case, I think a better solution would be as follows:
- Before a job runs, Gobblin should create an entry in the state-store saying the current jobId is in a running state
- When the job ends it can update it as it usually would
- This way, even if the job fails while writing the data, the next run will still know the jobId of the previous run, and it can do any cleanup necessary
This solution is more involved, but it avoids adding a new configuration option to Gobblin. We probably want to add this feature to Gobblin at some point anyway, so we can have more real-time metadata in the state-store.
Github Url : https://github.com/linkedin/gobblin/issues/857#issuecomment-201121882
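A minimal sketch of the proposed lifecycle, using a hypothetical `RunningJobRegistry` rather than Gobblin's actual state-store API:

```java
public class RunningJobLifecycleSketch {

  /** Hypothetical store of "currently running" job IDs, keyed by job name. */
  interface RunningJobRegistry {
    String getRunningJobId(String jobName);          // null if no run is recorded
    void markRunning(String jobName, String jobId);  // written before the job starts
    void clearRunning(String jobName);               // written when the job finishes
  }

  public static void runJob(RunningJobRegistry registry, String jobName, String jobId) {
    // 1. A leftover entry means the previous run did not finish cleanly; its
    //    recorded job ID tells us exactly which staging data to clean up.
    String previousJobId = registry.getRunningJobId(jobName);
    if (previousJobId != null) {
      cleanStagingDataFor(previousJobId);
    }

    // 2. Record the current run before any data is written.
    registry.markRunning(jobName, jobId);

    // ... launch tasks, commit data, persist the job state ...

    // 3. Only a run that reaches the end clears its marker; a crash or failure
    //    leaves the entry behind for the next run to act on.
    registry.clearRunning(jobName);
  }

  private static void cleanStagingDataFor(String jobId) {
    // Placeholder: delete the staging directories associated with jobId.
  }
}
```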