Accumulo
  1. Accumulo
  2. ACCUMULO-2990

BatchWriter never recovers from failure(s)

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.5.1, 1.6.0
    • Fix Version/s: 1.7.1, 1.8.0
    • Component/s: client
    • Labels:
      None

      Description

      In trying to understand what's happening in ACCUMULO-2964, I noticed that I had similar exceptions from two different threads. One of the threads starting working after the unexplained thrift exceptions from a tserver restart, and the other continued to repeatedly fail for the lifetime of the test.

      I repeatedly saw this exception:

      2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
      org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
              at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
              at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
              at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
              at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
              at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
              at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
      

      The part that struck me as odd was that the BatchWriter wasn't against the metadata table, but the replication table.

      I looked into the TabletServerBatchWriter. It appears that once the client sees a MutationsRejectedException, that BatchWriter becomes useless as the internal member somethingFailed is never reset back to false after the failure is reported. Same goes for serverSideErrors, unknownErrors, lastUnknownErrors, too.

      If this is the case, this is a bug because the BatchWriter should be resilient in this regard and not force the client to create a new Instance. If that's infeasible to do, we should add exceptions to the BatchWriter that fail fast when a BatchWriter is used that will report repeatedly report the same failure over and over again.

        Issue Links

          Activity

          Hide
          Josh Elser added a comment -

          Here's a little test case that shows this happening against 1.5.2-SNAPSHOT (at 609c7267c33368ff9cfbb459153546d22bc5d2ce) https://gist.github.com/joshelser/ca98b28f80d32600cb57

          Show
          Josh Elser added a comment - Here's a little test case that shows this happening against 1.5.2-SNAPSHOT (at 609c7267c33368ff9cfbb459153546d22bc5d2ce) https://gist.github.com/joshelser/ca98b28f80d32600cb57
          Hide
          Josh Elser added a comment -

          Eric Newton just noticed that there's a hardcoded reference to accumulo.metadata in the BatchWriter for authorization errors:

          updateAuthorizationFailures(Collections.singletonMap(new KeyExtent(new Text(Constants.METADATA_TABLE_ID), null, null),
                      SecurityErrorCode.valueOf(e.getSecurityErrorCode().name())));
          

          This probably explains why I was seeing the inexplicable metadata table name in the error message in the description.

          Show
          Josh Elser added a comment - Eric Newton just noticed that there's a hardcoded reference to accumulo.metadata in the BatchWriter for authorization errors: updateAuthorizationFailures(Collections.singletonMap(new KeyExtent(new Text(Constants.METADATA_TABLE_ID), null, null), SecurityErrorCode.valueOf(e.getSecurityErrorCode().name()))); This probably explains why I was seeing the inexplicable metadata table name in the error message in the description.
          Hide
          Josh Elser added a comment -

          There are two paths here that I see we can take.

          1. Make the BatchWriter resilient to server-side errors in all branches
          2. Only change in HEAD

          I think the only worry about doing this in older versions is getting the error handling correct (which can hopefully be thoroughly tested via mocking). In the case where some users have code that expect the BatchWriter to become unusable after a MutationsRejectedException is thrown, having the BatchWriter still be usable would not break their code. It does still have the possibility of confusion about when this was fixed; however, I think this risk is small given the BatchWriter never making assertions that it's unusable (looks like a bug, smells like a bug, etc).

          Please make your opinions aware if anyone disagrees with this or if I missed some case.

          Show
          Josh Elser added a comment - There are two paths here that I see we can take. 1. Make the BatchWriter resilient to server-side errors in all branches 2. Only change in HEAD I think the only worry about doing this in older versions is getting the error handling correct (which can hopefully be thoroughly tested via mocking). In the case where some users have code that expect the BatchWriter to become unusable after a MutationsRejectedException is thrown, having the BatchWriter still be usable would not break their code. It does still have the possibility of confusion about when this was fixed; however, I think this risk is small given the BatchWriter never making assertions that it's unusable (looks like a bug, smells like a bug, etc). Please make your opinions aware if anyone disagrees with this or if I missed some case.
          Hide
          Keith Turner added a comment -

          Need to correctly handle a situation like the following.

          1. Thread 1 adds 5 mutations to batch writer 1
          2. Thread 2 adds 6 mutations to batch writer 1
          3. Thread 2 calls flush() and receives an error, none of the 11 mutations queued were successfully written
          4. Thread 1 calls flush() and receives an error
          Show
          Keith Turner added a comment - Need to correctly handle a situation like the following. Thread 1 adds 5 mutations to batch writer 1 Thread 2 adds 6 mutations to batch writer 1 Thread 2 calls flush() and receives an error, none of the 11 mutations queued were successfully written Thread 1 calls flush() and receives an error
          Hide
          Josh Elser added a comment -

          It sounds like we would need to introduce some sort of flush ID/counter semantics on the client side. Thread 1 and 2 would both have some local ID less than the ID generated when those 11 mutations were flushed. We would then know that both of those threads need to receive the exception while any new thread that uses the BatchWriter after that flush in step 3 would have a new ID and would know that the exception should not be passed to the new thread.

          Is there an easier approach that the above? I can think of a couple of odd cases already that would be tricky.

          Show
          Josh Elser added a comment - It sounds like we would need to introduce some sort of flush ID/counter semantics on the client side. Thread 1 and 2 would both have some local ID less than the ID generated when those 11 mutations were flushed. We would then know that both of those threads need to receive the exception while any new thread that uses the BatchWriter after that flush in step 3 would have a new ID and would know that the exception should not be passed to the new thread. Is there an easier approach that the above? I can think of a couple of odd cases already that would be tricky.
          Hide
          Josh Elser added a comment -

          I just had a user ask me how to recover a BatchWriter after it errors out (this issue). We should also update the javadoc to inform users that there currently is no way (and then change it if/when this ever gets fixed).

          Show
          Josh Elser added a comment - I just had a user ask me how to recover a BatchWriter after it errors out (this issue). We should also update the javadoc to inform users that there currently is no way (and then change it if/when this ever gets fixed).

            People

            • Assignee:
              Unassigned
              Reporter:
              Josh Elser
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 10m
                10m

                  Development