I'm trying to figure out how to reliably reproduce this problem. It is not clear from the log whether nodes are being reconfigured on purpose or whether this is happening spontaneously.
The tail exiting is likely due to an exception on append that gets propagated to the driver, which then triggers a shutdown of the logical node's source and sink.
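To illustrate the suspected failure mode, here is a minimal sketch of a driver loop under the assumption that the driver pumps events from source to sink and closes both ends when append throws. The class and interface names (DriverSketch, Source, Sink, run) are hypothetical and are not Flume's actual driver API. The point is only that one exception from append is enough to stop the source, which from the outside looks like the tail exiting.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a driver loop; names do not match Flume's real classes. */
public class DriverSketch {
    interface Source { String next() throws IOException; void close(); }
    interface Sink { void append(String event) throws IOException; void close(); }

    /** Pumps events from src to snk and records what happened. */
    static List<String> run(Source src, Sink snk) {
        List<String> log = new ArrayList<>();
        try {
            String e;
            while ((e = src.next()) != null) {
                snk.append(e); // an exception here propagates out of the loop
            }
            log.add("source exhausted");
        } catch (IOException | RuntimeException ex) {
            log.add("driver caught: " + ex.getMessage());
        } finally {
            // Shutting down both ends is what makes the tail appear to "exit".
            src.close();
            snk.close();
            log.add("source and sink closed");
        }
        return log;
    }

    /** Source yields "a", "b", "c"; the sink fails on "b" (assumed failure for the demo). */
    static List<String> demo() {
        Source src = new Source() {
            private final String[] events = {"a", "b", "c"};
            private int i = 0;
            public String next() { return i < events.length ? events[i++] : null; }
            public void close() { }
        };
        Sink snk = new Sink() {
            public void append(String event) {
                if (event.equals("b")) {
                    throw new IllegalStateException("append failed on " + event);
                }
            }
            public void close() { }
        };
        return run(src, snk);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch event "c" is never read: the loop ends on the first append failure and the finally block closes the source.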
There seem to be several problems. Here are the current theories for the root cause:
1) An improperly handled exception (this could be a runtime exception or possibly an incorrectly handled InterruptedException). In the first set of problems, it looks like an IllegalStateException eventually gets thrown.
2) The stubborn append sink may have a race condition on its error recovery path.
3) There may be a race or error in the retransmit logic path.
4) It is possible that multiple instances of the NaiveWALManager are active simultaneously, causing the state transition problems.
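As a rough illustration of how theories 1 and 4 could connect, here is a minimal sketch assuming each WAL manager moves a tag between states with a compare-and-set and throws IllegalStateException when the tag is not in the expected state. WalStateSketch, Manager, SharedLog and the state names are hypothetical stand-ins, not NaiveWALManager's actual code. With two instances acting on the same shared state, the second one to move a tag finds it already moved and fails.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch: two WAL manager instances sharing one piece of durable state. */
public class WalStateSketch {
    // Assumed state names, for illustration only.
    enum State { WRITING, LOGGED, SENDING, SENT }

    /** Stand-in for shared on-disk state (e.g. one directory per state). */
    static final class SharedLog {
        final ConcurrentMap<String, State> tags = new ConcurrentHashMap<>();
    }

    static final class Manager {
        private final String name;
        private final SharedLog log;

        Manager(String name, SharedLog log) { this.name = name; this.log = log; }

        /** Atomically moves tag from 'from' to 'to'; fails if another instance moved it first. */
        void transition(String tag, State from, State to) {
            if (!log.tags.replace(tag, from, to)) {
                throw new IllegalStateException(name + ": expected " + tag + " in " + from
                        + " but found " + log.tags.get(tag));
            }
        }
    }

    static String demo() {
        SharedLog shared = new SharedLog();
        shared.tags.put("tag-1", State.LOGGED);
        Manager m1 = new Manager("m1", shared);
        Manager m2 = new Manager("m2", shared);
        m1.transition("tag-1", State.LOGGED, State.SENDING); // first instance succeeds
        try {
            // Second instance still believes the tag is LOGGED.
            m2.transition("tag-1", State.LOGGED, State.SENDING);
            return "no error";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If this theory holds, adding a guard that refuses to start a second manager over the same WAL directory would be one way to confirm or rule it out.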
Also note that several of the mechanisms in this trace have been significantly modified by
FLUME-569, FLUME-589, FLUME-595, and FLUME-597. These changes will likely make the problems manifest differently.