Saw failure on build server where replication didn't happen in an integration test. A tablet server was restarted as a part of this test.
As the tablet server was starting back up, the Master was trying to scan the ReplicationTable. Before the tserver came up "completely" (not sure on details), the Master started getting repeated RuntimeExceptions.
The TabletServer was still in the process of starting, but must have already obtained its lock (otherwise we couldn't have talked to it). It appears that the exceptions started printing repeatedly in the Master log before the tserver hit its main loop (lines 2414-2471 at f4024930).
I think there may be a separate issue with the client receiving those Exceptions before a tserver is "fully" up, but the Master thread needs to be resilient against these exceptions bubbling up.
Commit 73fc496a5474528d9a5a6de0e4027b506473f6e1 in accumulo's branch refs/heads/master from [~elserj]
[ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=73fc496 ]
ACCUMULO-2963 Update ReplicationDriver to try/catch each step in the main-loop.
An RTE bubbling up from any step inside the ReplicationDriver, for example one
coming from the BatchScanner on Thrift exception, will inadvertently kill the
entire Daemon thread that runs replication. Try/catch the exception, log it,
and then retry the operation on the next cycle.
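
The pattern described in the commit message can be sketched roughly as follows. This is only an illustration, not the actual ReplicationDriver code: the class name `ResilientLoopSketch`, the `runCycle` helper, and the use of plain `Runnable` steps are all assumptions made for the example. Each step gets its own try/catch, so a RuntimeException from one step is logged and the remaining steps still run; the failed step is simply attempted again on the next cycle.

```java
import java.util.ArrayList;
import java.util.List;

public class ResilientLoopSketch {

    /**
     * Runs every step once. A RuntimeException thrown by one step is recorded
     * in the log and does not prevent later steps from running or kill the
     * calling thread. Returns the number of steps that failed in this cycle.
     * (Hypothetical helper, not part of Accumulo.)
     */
    static int runCycle(List<Runnable> steps, List<String> log) {
        int failures = 0;
        for (Runnable step : steps) {
            try {
                step.run();
            } catch (RuntimeException e) {
                failures++;
                // Log and move on; the step is retried on the next cycle.
                log.add("Step failed, will retry next cycle: " + e.getMessage());
            }
        }
        return failures;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> log = new ArrayList<>();
        List<Runnable> steps = new ArrayList<>();
        // Hypothetical steps: the first fails, as a scan might while a tserver is still starting.
        steps.add(() -> { throw new RuntimeException("scan failed"); });
        steps.add(() -> System.out.println("second step ran"));

        // A bounded loop standing in for the daemon thread's endless main loop.
        for (int cycle = 0; cycle < 2; cycle++) {
            int failures = runCycle(steps, log);
            System.out.println("cycle " + cycle + " failures: " + failures);
            Thread.sleep(10); // placeholder for the real delay between cycles
        }
        log.forEach(System.out::println);
    }
}
```

Catching per step rather than around the whole loop body means one failing step does not skip the steps after it in the same cycle.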
|Remaining Estimate||0h [ 0 ]|
|Time Spent||10m [ 600 ]|
|Worklog Id||16511 [ 16511 ]|
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Resolution||Fixed [ 1 ]|
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
|45m 49s||1||Josh Elser||01/Jul/14 05:31|