Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
1.19.1
Description
Scheduled maintenance or buggy nodes on Kubernetes can result in random pod termination and, because of rolling restarts of the Kubernetes cluster nodes, eventually in a series of job restarts. The larger the job, the higher the chance that it is affected. Jobs should be able to auto-recover from these issues, but the restarts can cause unwanted turbulence in large-scale pipelines.
In this case, it is very difficult to identify what is causing the restarts without already knowing about the issue at the Kubernetes layer and the keyword to search for, because the event is logged at INFO level.
We need to log this at a higher level. If changing it from INFO to ERROR breaks monitoring, we should at least log it as a warning.
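To illustrate why the level matters, here is a minimal sketch using only java.util.logging. It is not the actual code path in the project: the class name, logger name, pod name, and message text are all hypothetical. The handler stands in for a monitoring filter that only keeps records at WARNING or above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class PodTerminationLogging {
    // Hypothetical logger name, chosen for illustration only.
    static final Logger LOG = Logger.getLogger("example.kubernetes.PodWatcher");

    /** Keeps every record that passes the handler's own level threshold. */
    static final class CollectingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                records.add(record);
            }
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    /** Hypothetical callback: logs a pod termination at the given level. */
    static void onPodTerminated(String podName, Level level) {
        LOG.log(level, "Pod {0} was terminated; the job may restart", podName);
    }

    /** Returns the records that a WARNING-and-above filter would see. */
    public static List<LogRecord> run(Level eventLevel) {
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.WARNING);  // assumed monitoring threshold
        LOG.setUseParentHandlers(false);  // keep the demo off the console
        LOG.setLevel(Level.ALL);          // the logger itself lets everything through
        LOG.addHandler(handler);
        try {
            onPodTerminated("taskmanager-1", eventLevel);
        } finally {
            LOG.removeHandler(handler);
        }
        return handler.records;
    }

    public static void main(String[] args) {
        System.out.println("logged at INFO:    " + run(Level.INFO).size() + " record(s) visible");
        System.out.println("logged at WARNING: " + run(Level.WARNING).size() + " record(s) visible");
    }
}
```

With the event logged at INFO, a filter set to WARNING drops it, which matches the difficulty described above. Logging at WARNING rather than ERROR keeps the event visible without tripping alerts that fire only on ERROR.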