ACCUMULO-2963

ReplicationDriver daemon dies from RTE thrown out of BatchScanner

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.0
    • Component/s: replication
    • Labels:
      None

      Description

      Saw failure on build server where replication didn't happen in an integration test. A tablet server was restarted as a part of this test.

      As the tablet server was starting back up, the Master was trying to scan the ReplicationTable. Before the tserver came up "completely" (not sure of the details), the Master started getting repeated RuntimeExceptions:

      Exception in thread "Replication Driver" java.lang.RuntimeException: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.hasNext(TabletServerBatchReaderIterator.java:182)
              at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.removeCompleteRecords(RemoveCompleteReplicationRecords.java:124)
              at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.run(RemoveCompleteReplicationRecords.java:88)
              at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:94)
      Caused by: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:690)
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
              at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
              at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
              at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
              at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:660)
              at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
              at org.apache.accumulo.core.client.impl.TimeoutTabletLocator.binRanges(TimeoutTabletLocator.java:104)
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:230)
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.processFailures(TabletServerBatchReaderIterator.java:302)
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.access$1400(TabletServerBatchReaderIterator.java:76)
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:386)
              at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
              at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: ThriftSecurityException(user:!SYSTEM, code:null)
              at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
              at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
              at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
              at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
              at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
              at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
              at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
              ... 17 more
      

      The TabletServer was still in the process of starting, but must have already obtained its lock (otherwise we couldn't have talked to it). It appears that the exceptions started printing repeatedly in the Master log before the tserver hit its main loop (lines 2414-2471 at f4024930).

      I think there may be a separate issue with the client receiving those Exceptions before a tserver is "fully" up, but the Master thread needs to be resilient against these exceptions bubbling up.

          Activity

          ASF subversion and git services logged work - 01/Jul/14 06:31
          • Time Spent:
            10m
             
            Commit 73fc496a5474528d9a5a6de0e4027b506473f6e1 in accumulo's branch refs/heads/master from [~elserj]
            [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=73fc496 ]

            ACCUMULO-2963 Update ReplicationDriver to try/catch each step in the main-loop.

            An RTE bubbling up from any step inside the ReplicationDriver, for example one
            coming from the BatchScanner on Thrift exception, will inadvertently kill the
            entire Daemon thread that runs replication. Try/catch the exception, log it,
            and then retry the operation on the next cycle.
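
            The fix described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual ReplicationDriver source: the class name ResilientDriverSketch, the runCycle helper, and the step names in main are hypothetical stand-ins, and a simulated RuntimeException takes the place of the one thrown by the BatchScanner.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a daemon main loop; not the Accumulo implementation.
final class ResilientDriverSketch {

  /**
   * Runs every step once, in insertion order. A RuntimeException thrown by one
   * step is logged and recorded, and the remaining steps still run.
   */
  static List<String> runCycle(Map<String, Runnable> steps) {
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, Runnable> step : steps.entrySet()) {
      try {
        step.getValue().run();
      } catch (RuntimeException e) {
        // Swallow the exception so the daemon thread survives; the step is
        // simply attempted again on the next cycle.
        System.err.println("Step " + step.getKey() + " failed, retrying next cycle: " + e);
        failed.add(step.getKey());
      }
    }
    return failed;
  }

  public static void main(String[] args) throws InterruptedException {
    int[] attempts = {0};
    Map<String, Runnable> steps = new LinkedHashMap<>();
    // Simulated step that fails only on its first invocation.
    steps.put("removeCompleteRecords", () -> {
      if (attempts[0]++ == 0) {
        throw new RuntimeException("simulated scan failure");
      }
    });
    steps.put("otherStep", () -> { });

    for (int cycle = 0; cycle < 2; cycle++) {
      System.out.println("cycle " + cycle + " failed steps: " + runCycle(steps));
      Thread.sleep(10); // stand-in for the delay between cycles
    }
  }
}
```

            Wrapping each step separately, rather than the whole loop body, means a failure in one step does not prevent the later steps from running in the same cycle.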

            People

            • Assignee: Josh Elser
            • Reporter: Josh Elser
            • Votes: 0
            • Watchers: 0

            Time Tracking

            • Estimated: Not Specified
            • Remaining: 0h
            • Logged: 10m
