[FLINK-7340] Taskmanager hung after temporary DNS outage - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: None
Component/s: Runtime / Coordination
Labels:
None
Environment:

Non-HA Flink running in Kubernetes.

Description

After a Kubernetes node failure, several TaskManagers and the DNS system were automatically restarted. One TaskManager was unable to connect to the JobManager and continually logged the following errors:

2017-08-01 18:58:06.707 [flink-akka.actor.default-dispatcher-823] INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 595, timeout: 30000 milliseconds)
2017-08-01 18:58:06.713 [flink-akka.actor.default-dispatcher-834] INFO Remoting flink-akka.remote.default-remote-dispatcher-240 - Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable or has not been restarted. Keeping it quarantined.

After exec'ing into the container, I was able to telnet jobmanager 6123 successfully and dig jobmanager showed the correct IP in DNS. I suspect that the TaskManager cached a bad IP address for the JobManager when the DNS system was restarting and it used that cached address rather than respecting the 30s TTL and getting a new one for the next request. It may be a good idea for the TaskManager to explicitly perform a DNS lookup after JobManager connection failures.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Joshua Griffith

Votes:: 1 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 01/Aug/17 19:29

Updated:: 25/Oct/19 09:52

Resolved:: 25/Oct/19 09:52