[FLINK-36140] Log a warning when pods are terminated by kubernetes - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.19.1
Fix Version/s: 2.0-preview
Component/s: Deployment / Kubernetes
Labels:
- pull-request-available

Description

Scheduled maintenance or buggy nodes on Kubernetes can result in random pod termination and, when the cluster nodes are restarted in a rolling fashion, in a series of job restarts. The larger the job, the higher the chance that it is affected. Jobs should be able to recover automatically from these issues, but the restarts can cause unwanted turbulence in large-scale pipelines.

In this case, it is very difficult to identify what is causing the restarts without knowing about the issue at the Kubernetes layer and the keyword to search for, because the pod termination is logged only at INFO level.

We need to log this at a higher level. If changing it from INFO to ERROR breaks monitoring, we should at least log it as a warning.
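The proposed change can be illustrated with a minimal sketch. It is not the code from the linked pull request: the class `PodTerminationLogging`, the enum `PodEvent`, and the methods `levelFor` and `logPodEvent` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained (its `WARNING` level stands in for WARN). The idea is simply that a pod deletion observed from Kubernetes is logged above INFO, while other pod events stay at INFO.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; none of these names come from Flink.
class PodTerminationLogging {

    // Simplified stand-in for the kinds of pod events a watcher may observe.
    enum PodEvent { ADDED, MODIFIED, DELETED }

    private static final Logger LOG =
            Logger.getLogger(PodTerminationLogging.class.getName());

    // Choose the log level: a deleted pod is reported as a warning,
    // every other event remains at INFO.
    static Level levelFor(PodEvent event) {
        return event == PodEvent.DELETED ? Level.WARNING : Level.INFO;
    }

    // Log the event at the level chosen above.
    static void logPodEvent(String podName, PodEvent event) {
        LOG.log(levelFor(event), "Pod {0} received event {1}",
                new Object[] {podName, event});
    }

    public static void main(String[] args) {
        logPodEvent("taskmanager-1", PodEvent.MODIFIED); // logged at INFO
        logPodEvent("taskmanager-1", PodEvent.DELETED);  // logged at WARNING
    }
}
```

With a level mapping like this, a search or alert on warnings would surface pod terminations without needing to know the exact keyword in the INFO output.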

Issue Links

links to

GitHub Pull Request #25242

People

Assignee: Clara Xiong
Reporter: Clara Xiong
Votes: 0
Watchers: 2

Dates

Created: 22/Aug/24 21:25
Updated: 02/Sep/24 09:35
Resolved: 26/Aug/24 11:48