While trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar exceptions from two different threads. One of the threads started working after the unexplained Thrift exceptions from a tserver restart, and the other continued to fail repeatedly for the lifetime of the test.
I repeatedly saw this exception:
The part that struck me as odd was that the BatchWriter wasn't against the metadata table, but the replication table.
I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException, that BatchWriter becomes useless, because the internal member somethingFailed is never reset to false after the failure is reported. The same goes for serverSideErrors, unknownErrors, and lastUnknownErrors.
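To illustrate the behavior I think I'm seeing, here is a minimal sketch of a "sticky" failure flag. This is a toy stand-in and not the actual TabletServerBatchWriter code: the class name StickyFailureWriter, the simulateSendFailure method, and the use of IllegalStateException in place of MutationsRejectedException are all made up for the example. Only the somethingFailed name comes from the real class.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; NOT Accumulo's TabletServerBatchWriter.
class StickyFailureWriter {
    private final List<String> pending = new ArrayList<>();

    // Set when a send fails and never cleared afterwards (the suspected problem).
    private boolean somethingFailed = false;

    void addMutation(String mutation) {
        checkForFailures();
        pending.add(mutation);
    }

    // Hypothetical hook standing in for a background send that fails,
    // e.g. because a tserver restarted.
    void simulateSendFailure() {
        somethingFailed = true;
    }

    void flush() {
        checkForFailures();
        pending.clear();
    }

    // Reports the failure but does not reset the flag, so every later
    // call on this writer throws again.
    private void checkForFailures() {
        if (somethingFailed) {
            throw new IllegalStateException("mutations rejected");
        }
    }
}
```

With a writer shaped like this, a single failed send makes every subsequent addMutation or flush call throw, even after the server has recovered.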
If this is the case, this is a bug, because the BatchWriter should be resilient in this regard and not force the client to create a new instance. If that's infeasible, we should add exceptions to the BatchWriter that fail fast when a BatchWriter is used that would otherwise report the same failure over and over again.
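Until that is resolved, the client-side workaround would presumably be to throw away the failed writer and build a new one. A rough sketch of that pattern, with no Accumulo types involved: SimpleWriter and RecreatingWriter are hypothetical names for illustration, and the factory stands in for whatever the client uses to create a BatchWriter.

```java
import java.util.function.Supplier;

// Hypothetical minimal writer interface, standing in for a BatchWriter.
interface SimpleWriter {
    void write(String mutation) throws Exception;
}

// Wraps a writer and replaces it after any failure, so a writer that is
// stuck in a failed state is never reused.
class RecreatingWriter implements SimpleWriter {
    private final Supplier<SimpleWriter> factory;
    private SimpleWriter delegate;

    RecreatingWriter(Supplier<SimpleWriter> factory) {
        this.factory = factory;
        this.delegate = factory.get();
    }

    @Override
    public void write(String mutation) throws Exception {
        try {
            delegate.write(mutation);
        } catch (Exception e) {
            // Discard the failed writer; the next call gets a fresh one.
            delegate = factory.get();
            // Still surface the failure so the caller can decide whether to retry.
            throw e;
        }
    }
}
```

The caller still sees the first failure and has to resubmit the rejected mutations itself, but a later write goes to a fresh writer instead of hitting the same stale error.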
Linked issue: Document limitations on BatchWriter failure recovery (Resolved)