Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Duplicate
- 1.16.0
- None
Description
We have already observed several cases in which log files grew beyond 120G. This drove the worker's disk usage to 100%, causing the worker machine to go "offline", i.e. to become unavailable to pick up new tasks.
The excessive log spilling we initially observed is caused by a TaskManager failing fatally. As a result, the requested number of slots never becomes available and the test job ends up in an infinite failover/restart loop. See the comment section for further details.
Attachments
Issue Links
- duplicates
  - FLINK-28077 Tasks get stuck during cancellation in ChannelStateWriteRequestExecutorImpl (Closed)
- is related to
  - FLINK-24433 "No space left on device" in Azure e2e tests (Closed)
  - FLINK-25374 Azure pipeline get stalled on scanning project (Closed)