Description
While testing Kudu with our internal QA tools (1), we found that both DNS failure injection and elastic partitioning between the clients and the master trigger a bug where a failed master lookup is silently dropped instead of being retried. Here's an example:
2016-04-13 22:18:55,506 WARN [New I/O boss #9] org.kududb.client.GetMasterRegistrationReceived: Error receiving a response from: francesco-ec2-kudu-centos66-11-1.vpc.cloudera.com:7051
org.kududb.client.ConnectionResetException: [Peer Kudu Master - francesco-ec2-kudu-centos66-11-1.vpc.cloudera.com:7051] Connection reset on [id: 0x9bd8ed44]
    at org.kududb.client.TabletClient.cleanup(TabletClient.java:630)
    (stack trace)
2016-04-13 22:18:55,507 WARN [New I/O boss #9] org.kududb.client.GetMasterRegistrationReceived: Unable to find the leader master (francesco-ec2-kudu-centos66-11-1.vpc.cloudera.com:7051), will retry
2016-04-13 22:18:55,507 DEBUG [New I/O boss #9] org.kududb.client.AsyncKuduClient: Going to sleep for 1017 at retry 2
2016-04-13 22:18:55,507 DEBUG [New I/O worker #7] org.kududb.client.TabletClient: [Peer Kudu Master - francesco-ec2-kudu-centos66-11-1.vpc.cloudera.com:7051] [id: 0x9bd8ed44] CLOSED
(unrelated debug logs)
2016-04-13 22:28:44,951 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: Couldn't flush the head row, KuduRpc(method=Write, tablet=null, attempt=1, DeadlineTracker(timeout=0, elapsed=600001), null) row_key=(int64 key1=-721818921243156941, int64 key2=5432210168070573172)
    at org.kududb.mapreduce.tools.IntegrationTestBigLinkedList$Generator$GeneratorMapper.map(IntegrationTestBigLinkedList.java:516)
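The client tries to reach the master, fails, and logs that it will retry in about a second ("Going to sleep for 1017 at retry 2"). After that nothing more happens for this lookup until ITBLL times out 10 minutes later (elapsed=600001).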
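To put a number on the silent window, here is a small sketch that parses the two timestamps from the log above and prints the gap between the last retry message and the ITBLL exception. It is not part of Kudu. The class name LogGap and the method between are made up for illustration, and the timestamp pattern is assumed from the quoted log lines.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGap {
    // Timestamp layout assumed from the log lines quoted in this report.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the time elapsed between two log timestamps.
    public static Duration between(String earlier, String later) {
        return Duration.between(
                LocalDateTime.parse(earlier, FMT),
                LocalDateTime.parse(later, FMT));
    }

    public static void main(String[] args) {
        // Last "Going to sleep" message vs. the YarnChild exception.
        Duration silent = between("2016-04-13 22:18:55,507",
                                  "2016-04-13 22:28:44,951");
        System.out.println(silent.toMillis() + " ms"); // prints "589444 ms"
    }
}
```

That is roughly 9 minutes 49 seconds of silence after a sleep that was announced as 1017.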