Affects Version/s: v3.0.0-alpha
Fix Version/s: None
Component/s: Job Engine
Our cluster has two nodes (named build1, build2) building cube jobs, and used DistributedScheduler.
There is a job, id 9f05b84b-cec9-81ee-9336-5a419e451a55, shown built on the build1 node.
The job displays Error, but the first sub task creating hive flat table display Ready, and can see the first task's yarn job running through yarn ui. After the yarn job is successful, the job re-runs the first sub-task, again and again.
Looking at the build1 log, the status of this job is changed from ready to running, then the first task status is ready to running, then the update job information is broadcast, then the update job information broadcast is received. But after twenty seconds, a broadcast of the updated job information was received.
After a few minutes, the first task is completed, but the log shows that the job status changed from Error to ready! Then the job status changed from ready to running, the first task starts running again .... Repeat the above log.
I suspect that other nodes have changed the job status. Looking at the build2 node log, there are a lot of exception logs, about there is no output for another job id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d:
In addition, each build2 receives the broadcast of the build1 update the job information， after twenty seconds, the log print changes the first task state runinng to ready and broadcasts.
Restarting the build2 node, not printing the Job Fetcher caught a exception , and the job 9f05b84b-cec9-81ee-9336-5a419e451a55 was successfully executed.
This is due to a job metadata synchronization exception, which triggers a job scheduling bug. Build1 node try to run the job, but another build node kills the job and changes the job status to Error, causing problems.
The build2 node may have a metadata synchronization problem, the job with the id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d exists in ExecutableDao's executableDigestMap, and does not exist in ExecutableDao's executableOutputDigestMap. Each time FetchRunner foreach the job, it throws an exception and fetchFailed is set to true.
When the build2 first processes the job that build1 is running, since fetchFailed is true, the job is not in the list of running jobs in build2, the job status is running, FetchRunner.jobStateCount() will kill the job, and set the running task status to ready, set the job status to error, broadcast.
After the first task of the job runs successfully on the build1, the task state is ready without change, and the job status is error，and executeResult returns successfully, then the job status is changed to ready. The job status Ready will not release the zk lock, build1 will continue to schedule the job to run, and then be killed by build2, again and again. The build job has not been able to run normally
There are two problems with FetcherRunner:
1. When FechRunnner foreach job, if the metadata of the job part is not found, an exception will be thrown. We can skip this job and foreach other jobs.
2. For DistributedScheduler, even if FetchFailed is true, not in runningJobs, the status is running, FetchRunner should not kill the job because the job may be scheduler by another kylin service
This jira solves the problem 1, another jira will solves the problem 2