  1. Accumulo
  2. ACCUMULO-2990

BatchWriter never recovers from failure(s)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.5.1, 1.6.0
    • Fix Version/s: 2.0.0
    • Component/s: client
    • Labels:
      None

      Description

      In trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar exceptions from two different threads. One of the threads started working again after the unexplained thrift exceptions from a tserver restart, while the other continued to fail repeatedly for the lifetime of the test.

      I repeatedly saw this exception:

      2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
      org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
              at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
              at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
              at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
              at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
              at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
              at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
      

      The part that struck me as odd was that the BatchWriter wasn't against the metadata table, but the replication table.

      I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException, that BatchWriter becomes useless as the internal member somethingFailed is never reset back to false after the failure is reported. Same goes for serverSideErrors, unknownErrors, lastUnknownErrors, too.
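
      The behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the actual TabletServerBatchWriter code: only the field name somethingFailed comes from this report, while the classes StickyFailureWriter and StickyFailureDemo, the method simulateRejectedSend, and the use of IllegalStateException in place of MutationsRejectedException are hypothetical stand-ins.

```java
// Toy model only: this is NOT the real TabletServerBatchWriter.
// Only the flag name somethingFailed is taken from the report.
class StickyFailureWriter {
    private boolean somethingFailed = false;

    // Hypothetical hook standing in for a background send that was rejected.
    void simulateRejectedSend() {
        somethingFailed = true;
    }

    // Checks for earlier failures before accepting a new mutation.
    void addMutation(String mutation) {
        checkForFailures();
        // A real writer would buffer the mutation here.
    }

    private void checkForFailures() {
        if (somethingFailed) {
            // The flag is never cleared, so every later call throws again.
            throw new IllegalStateException("mutations rejected");
        }
    }
}

class StickyFailureDemo {
    // Returns how many of the given attempts were rejected after one failure.
    static int countRejections(int attempts) {
        StickyFailureWriter writer = new StickyFailureWriter();
        writer.simulateRejectedSend();
        int rejections = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("row" + i);
            } catch (IllegalStateException e) {
                rejections++;
            }
        }
        return rejections;
    }
}
```

      In this model a single rejected send makes every subsequent addMutation call fail, which matches the "will retry" loop seen in the WorkMaker log above.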

      If this is the case, this is a bug: the BatchWriter should be resilient in this regard and not force the client to create a new instance. If that's infeasible, we should add exceptions to the BatchWriter that fail fast when a client keeps using a BatchWriter that will only report the same failure over and over again.
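
      The two options above could look roughly like the following. Again this is a sketch on a toy model under stated assumptions, not a proposed patch: RecoveringWriter, FailFastWriter, simulateRejectedSend, and the exception types are all hypothetical, and a real fix would also have to decide what happens to mutations that were buffered when the failure occurred.

```java
// Option 1 (toy model): clear the failure state once it has been reported,
// so the same writer can be used again afterwards.
class RecoveringWriter {
    private boolean somethingFailed = false;
    private int accepted = 0;

    // Hypothetical hook standing in for a rejected background send.
    void simulateRejectedSend() {
        somethingFailed = true;
    }

    void addMutation(String mutation) {
        if (somethingFailed) {
            somethingFailed = false; // reset after reporting the failure once
            throw new IllegalStateException("mutations rejected");
        }
        accepted++;
    }

    int accepted() {
        return accepted;
    }
}

// Option 2 (toy model): report the failure once, then fail fast with a
// distinct exception that tells the caller the writer is no longer usable.
class FailFastWriter {
    private boolean somethingFailed = false;
    private boolean failureReported = false;

    // Hypothetical hook standing in for a rejected background send.
    void simulateRejectedSend() {
        somethingFailed = true;
    }

    void addMutation(String mutation) {
        if (failureReported) {
            throw new UnsupportedOperationException(
                "writer unusable after a reported failure; create a new one");
        }
        if (somethingFailed) {
            failureReported = true;
            throw new IllegalStateException("mutations rejected");
        }
        // A real writer would buffer the mutation here.
    }
}
```

      With the first variant the caller sees the failure once and can keep writing. With the second, the caller gets an unambiguous signal to replace the writer instead of retrying forever.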

              People

              • Assignee: Unassigned
              • Reporter: Josh Elser (elserj)
              • Votes: 1
              • Watchers: 5

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m