[ZEPPELIN-5334] DNS race condition connecting to K8S interpreter pod - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.9.0
Fix Version/s: None
Component/s: interpreter-launcher
Labels: None

Description

Apologies in advance for a bug report that is hard to reproduce. I cannot reproduce it at will myself.

From time to time, running a paragraph from a fresh start fails with an error such as:

java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: java.net.UnknownHostException: livy-wjmmsl.spark.svc
    at org.apache.zeppelin.interpreter.remote.PooledRemoteClient.callRemoteFunction(PooledRemoteClient.java:115)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:99)

Examining the logs of the Zeppelin server reveals this sequence of events:

07:00:03.662, zeppelin-server: Interpreter pod created livy-wjmmsl.spark.svc:12321
07:00:03.709, dnsmasq: Received DNS query for "livy-wjmmsl.spark.svc."
07:00:03.709, dnsmasq: Querying nameserver 172.20.0.10:53 question livy-wjmmsl.spark.svc.
07:00:03.725, zeppelin-server: java.net.UnknownHostException: livy-wjmmsl.spark.svc

It seems that Zeppelin assumes that as soon as the pod is running, it can be resolved by its DNS name. However, CoreDNS needs to learn about the new pod from the API server and update its records, which takes a non-zero amount of time. In the log above, the lookup fails about 60 ms after the pod is created. I would propose retrying with an exponential backoff when resolving the DNS name of a new interpreter.
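
To illustrate the proposal, here is a minimal sketch of a DNS lookup retried with exponential backoff. It is not taken from the Zeppelin code base: the class name DnsRetry, the Resolver interface, the method names, and the default values (6 attempts, 50 ms initial delay, 2000 ms cap) are all assumptions chosen for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Zeppelin: retries a DNS lookup with
// exponentially growing pauses between attempts.
final class DnsRetry {

    // Hypothetical abstraction over the lookup so the loop can be tested without real DNS.
    @FunctionalInterface
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    // Delay before the retry that follows failed attempt number `attempt` (0-based):
    // initialMillis, 2x, 4x, ... capped at maxMillis.
    static long backoffDelayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Tries up to maxAttempts lookups, sleeping between failures.
    // Rethrows the last UnknownHostException if every attempt fails.
    static InetAddress resolveWithBackoff(String host, int maxAttempts,
                                          long initialMillis, long maxMillis,
                                          Resolver resolver)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        UnknownHostException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return resolver.resolve(host);
            } catch (UnknownHostException e) {
                last = e;
                // No point sleeping after the final attempt.
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(backoffDelayMillis(attempt, initialMillis, maxMillis));
                }
            }
        }
        throw last;
    }

    // Convenience overload using the JDK resolver and the assumed default values.
    static InetAddress resolveWithBackoff(String host)
            throws UnknownHostException, InterruptedException {
        return resolveWithBackoff(host, 6, 50, 2000, InetAddress::getByName);
    }
}
```

With these assumed defaults the pauses are 50, 100, 200, 400 and 800 ms, so a lookup would be retried for roughly 1.5 seconds before giving up. One caveat: the JVM caches failed lookups (security property networkaddress.cache.negative.ttl, 10 seconds by default), so retries that go through InetAddress.getByName may keep returning the cached failure unless that setting is lowered.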