Accumulo / ACCUMULO-2990

BatchWriter never recovers from failure(s)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.5.1, 1.6.0
    • Fix Version/s: None
    • Component/s: client
    • Labels: None

    Description

      While trying to understand what's happening in ACCUMULO-2964, I noticed similar exceptions from two different threads. One of the threads started working again after the unexplained thrift exceptions from a tserver restart; the other continued to fail repeatedly for the lifetime of the test.

      I repeatedly saw this exception:

      2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
      org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
              at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
              at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
              at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
              at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
              at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
              at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
      

      The part that struck me as odd was that the BatchWriter wasn't against the metadata table, but the replication table.

      I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException, that BatchWriter becomes useless, because the internal member somethingFailed is never reset to false after the failure is reported. The same goes for serverSideErrors, unknownErrors, and lastUnknownErrors.
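
      To make the suspected behavior concrete, here is a minimal, single-threaded model of a writer whose failure flag is never cleared. It is not the actual TabletServerBatchWriter code: the classes StickyFailureWriter and StickyFailureDemo, the RejectedException type, and the simulateBackgroundFailure() hook are all hypothetical, and only the field name somethingFailed is taken from the report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, NOT the real TabletServerBatchWriter: it only mimics
// the "failure flag is never reset" behavior described in this report.
class StickyFailureWriter {
    // Stand-in for MutationsRejectedException (assumption, simplified).
    static class RejectedException extends Exception {
        RejectedException(String msg) { super(msg); }
    }

    private boolean somethingFailed = false; // field name taken from the report
    private final List<String> accepted = new ArrayList<>();

    // Simulates a background send failure, e.g. during a tserver restart.
    void simulateBackgroundFailure() { somethingFailed = true; }

    void addMutation(String mutation) throws RejectedException {
        checkForFailures();
        accepted.add(mutation);
    }

    private void checkForFailures() throws RejectedException {
        if (somethingFailed) {
            // The failure is reported but the flag is left set,
            // so every later call reports it again.
            throw new RejectedException("previous failure still recorded");
        }
    }

    int acceptedCount() { return accepted.size(); }
}

class StickyFailureDemo {
    // Counts how many of `attempts` writes fail after ONE simulated failure.
    static int failuresAfterOneError(int attempts) {
        StickyFailureWriter writer = new StickyFailureWriter();
        writer.simulateBackgroundFailure();
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("mutation-" + i);
            } catch (StickyFailureWriter.RejectedException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // A single failure poisons all five subsequent writes: prints 5.
        System.out.println(failuresAfterOneError(5));
    }
}
```

      In this model the only way for the caller to make progress is to discard the writer and build a new one, which matches what the WorkMaker thread would need to do if the analysis above is right.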

      If this is the case, it is a bug: the BatchWriter should be resilient in this regard and not force the client to create a new instance. If that's infeasible, we should make the BatchWriter fail fast with a distinct exception when it is used after such a failure, rather than reporting the same failure over and over again.
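
      As a rough illustration of the first option (a writer that recovers), the sketch below clears the failure state at the moment it is reported, so the caller sees the error once and later writes can succeed. It is a single-threaded toy under stated assumptions, not a proposed patch: RecoveringWriter, RecoveringWriterDemo, RejectedException, and simulateBackgroundFailure() are hypothetical names, and the real class would also have to reset its other error fields under proper synchronization.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Accumulo code: the failure state is cleared once
// it has been reported to the caller, so later writes can succeed again.
class RecoveringWriter {
    // Stand-in for MutationsRejectedException (assumption, simplified).
    static class RejectedException extends Exception {
        RejectedException(String msg) { super(msg); }
    }

    private boolean somethingFailed = false;
    private final List<String> accepted = new ArrayList<>();

    // Simulates a background send failure, e.g. during a tserver restart.
    void simulateBackgroundFailure() { somethingFailed = true; }

    void addMutation(String mutation) throws RejectedException {
        checkForFailures();
        accepted.add(mutation);
    }

    private void checkForFailures() throws RejectedException {
        if (somethingFailed) {
            somethingFailed = false; // reset BEFORE reporting the failure
            throw new RejectedException("a previous write failed");
        }
    }

    int acceptedCount() { return accepted.size(); }
}

class RecoveringWriterDemo {
    // Returns {failures, accepted} for `attempts` writes after one simulated failure.
    static int[] run(int attempts) {
        RecoveringWriter writer = new RecoveringWriter();
        writer.simulateBackgroundFailure();
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("mutation-" + i);
            } catch (RecoveringWriter.RejectedException e) {
                failures++; // caller is told once, then may retry
            }
        }
        return new int[] { failures, writer.acceptedCount() };
    }

    public static void main(String[] args) {
        int[] result = run(5);
        // The first write reports the failure, the remaining four go through.
        System.out.println(result[0] + " failed, " + result[1] + " accepted");
    }
}
```

      The fail-fast alternative would keep the flag set but throw a different, clearly named exception on reuse, so the caller knows the writer must be replaced.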

            People

              Assignee: Unassigned
              Reporter: Josh Elser (elserj)
              Votes: 1
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m