The old conf files remain in the history folder and fail to be moved to the "done" subdirectory
There is no need to move the conf file to the done folder. In this case the job is run as a new job, and hence a new conf file is created for it. The job history file gets deleted because it is part of the recovery (checkpoint) process. The conf file doesn't play any role in recovery. Here is what is happening:
- jobtracker starts with id id1
- job job1 is submitted and creates history file hostname_id1_job1_user_jobname and conf file as hostname_id1_job1_conf.xml
- jobtracker restarts with id id2
- jobtracker tries to recover the job. There are 2 possibilities here:
  - If the job-initialization thread inits the job before the recovery-manager picks it up for recovery, the new history filename is hostname_id1_job1_user_jobname.recover and the conf file is hostname_id1_job1_conf.xml. In this case no garbage is left in the history folder.
  - If the recovery-manager picks up the job before the init-thread does, it assumes there is nothing to recover and deletes hostname_id1_job1_user_jobname (leaving hostname_id1_job1_conf.xml). When the job inits, it takes new filenames, i.e. hostname_id2_job1_user_jobname and hostname_id2_job1_conf.xml. Only in this case is the old conf file (hostname_id1_job1_conf.xml) left behind in the history folder.
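To make the two orderings concrete, here is a rough simulation of the history folder as a set of filenames. This is only a sketch of the naming scheme described above, not JobTracker code: the function names (`history_name`, `conf_name`, `simulate_restart`, `orphaned_confs`) are made up for illustration, and it assumes the original history file is still present next to the .recover file in the first case.

```python
CONF_SUFFIX = "_conf.xml"

def history_name(tracker_id, job, host="hostname", user="user", jobname="jobname"):
    # History file: hostname_<trackerid>_<job>_<user>_<jobname>
    return f"{host}_{tracker_id}_{job}_{user}_{jobname}"

def conf_name(tracker_id, job, host="hostname"):
    # Conf file: hostname_<trackerid>_<job>_conf.xml
    return f"{host}_{tracker_id}_{job}{CONF_SUFFIX}"

def simulate_restart(init_first):
    """Return the history folder contents after the jobtracker restarts as id2.

    init_first=True  : the init thread inits the job before the recovery-manager.
    init_first=False : the recovery-manager picks up the job first.
    """
    # State left behind by the first jobtracker (id1).
    folder = {history_name("id1", "job1"), conf_name("id1", "job1")}
    if init_first:
        # The job keeps its id1 names; history goes to a .recover file
        # (assumption: the original history file is still around).
        folder.add(history_name("id1", "job1") + ".recover")
    else:
        # Recovery-manager finds nothing to recover and deletes the old
        # history file, but not the old conf file.
        folder.discard(history_name("id1", "job1"))
        # The job then inits as a new job under id2.
        folder.add(history_name("id2", "job1"))
        folder.add(conf_name("id2", "job1"))
    return folder

def orphaned_confs(folder):
    # A conf file is orphaned if no history file shares its
    # hostname_<trackerid>_<job> prefix.
    history_files = [f for f in folder if not f.endswith(CONF_SUFFIX)]
    orphans = []
    for f in folder:
        if f.endswith(CONF_SUFFIX):
            prefix = f[: -len(CONF_SUFFIX)] + "_"
            if not any(h.startswith(prefix) for h in history_files):
                orphans.append(f)
    return sorted(orphans)

print(orphaned_confs(simulate_restart(init_first=True)))   # []
print(orphaned_confs(simulate_restart(init_first=False)))  # ['hostname_id1_job1_conf.xml']
```

Only the second ordering leaves hostname_id1_job1_conf.xml without a matching history file, which is the leftover seen in the history folder.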
AFAIK this is a timing issue. I think a proper fix for all these corner cases is