[ACCUMULO-2963] ReplicationDriver daemon dies from RTE thrown out of BatchScanner - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.7.0
Component/s: replication
Labels: None

Description

Saw a failure on the build server where replication did not happen in an integration test. A tablet server was restarted as part of this test.

As the tablet server was starting back up, the Master was trying to scan the ReplicationTable. Before the tserver came up "completely" (not sure on the details), the Master started getting repeated RuntimeExceptions:

Exception in thread "Replication Driver" java.lang.RuntimeException: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.hasNext(TabletServerBatchReaderIterator.java:182)
        at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.removeCompleteRecords(RemoveCompleteReplicationRecords.java:124)
        at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.run(RemoveCompleteReplicationRecords.java:88)
        at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:94)
Caused by: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:690)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
        at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:660)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
        at org.apache.accumulo.core.client.impl.TimeoutTabletLocator.binRanges(TimeoutTabletLocator.java:104)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:230)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.processFailures(TabletServerBatchReaderIterator.java:302)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.access$1400(TabletServerBatchReaderIterator.java:76)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:386)
        at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
        at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
        at java.lang.Thread.run(Thread.java:745)
Caused by: ThriftSecurityException(user:!SYSTEM, code:null)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
        ... 17 more
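The trace shows the failure surfacing from BatchScanner iteration (`TabletServerBatchReaderIterator.hasNext`) as an unchecked RuntimeException that wraps the checked AccumuloSecurityException, which in turn wraps the ThriftSecurityException. A caller that wants to recognize the security error specifically has to walk the cause chain. The sketch below uses only JDK classes; `CauseChain`, `findCause`, and `StandInSecurityException` are hypothetical names, and the stand-in class replaces the real Accumulo exception so the example has no dependencies.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical helper, not an Accumulo API: returns the first Throwable in the
// cause chain of t (including t itself) that is an instance of type, or null.
final class CauseChain {
  private CauseChain() {}

  static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
    // Track visited throwables by identity so a cyclic cause chain cannot loop forever.
    Map<Throwable, Boolean> seen = new IdentityHashMap<>();
    for (Throwable cur = t; cur != null && seen.put(cur, Boolean.TRUE) == null; cur = cur.getCause()) {
      if (type.isInstance(cur)) {
        return type.cast(cur);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Mimic the shape of the trace: a RuntimeException wrapping a checked security exception.
    RuntimeException fromHasNext =
        new RuntimeException(new StandInSecurityException("Unknown security exception"));
    StandInSecurityException cause =
        CauseChain.findCause(fromHasNext, StandInSecurityException.class);
    System.out.println(cause != null ? "security error: " + cause.getMessage() : "other failure");
  }
}

// Hypothetical stand-in for Accumulo's checked AccumuloSecurityException.
class StandInSecurityException extends Exception {
  StandInSecurityException(String message) {
    super(message);
  }
}
```

Running `main` prints `security error: Unknown security exception`, because the outer RuntimeException is skipped and its direct cause matches.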

The TabletServer was still in the process of starting, but it must have already obtained its lock (otherwise the Master could not have talked to it). The exceptions appear to have started printing repeatedly in the Master log before the tserver reached its main loop (lines 2414-2471 at f4024930).

I think there may be a separate issue with the client receiving these exceptions before a tserver is "fully" up, but regardless, the Master thread needs to be resilient against such exceptions bubbling up and killing it.
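A minimal sketch of the resilience pattern described above, assuming a daemon whose body is one pass of work repeated with a pause in between. This is not the change committed for this issue and not Accumulo's ReplicationDriver; `ResilientDaemon`, `work`, `sleepMillis`, and `maxIterations` are hypothetical names. Each pass runs inside a try/catch for RuntimeException, so a failed scan is logged and retried on the next pass instead of terminating the thread.

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not Accumulo's ReplicationDriver: runs one pass of work per
// iteration and logs RuntimeExceptions instead of letting them kill the thread.
final class ResilientDaemon implements Runnable {
  private static final Logger LOG = Logger.getLogger(ResilientDaemon.class.getName());

  private final Runnable work;       // one pass of the daemon's job (e.g. a table scan)
  private final long sleepMillis;    // pause between passes
  private final int maxIterations;   // bounded for demos; a real daemon loops until shutdown
  private int failures;              // number of passes that threw

  ResilientDaemon(Runnable work, long sleepMillis, int maxIterations) {
    this.work = work;
    this.sleepMillis = sleepMillis;
    this.maxIterations = maxIterations;
  }

  @Override
  public void run() {
    for (int i = 0; i < maxIterations && !Thread.currentThread().isInterrupted(); i++) {
      try {
        work.run();
      } catch (RuntimeException e) {
        // Log and retry on the next pass rather than propagating out of run().
        // Errors (e.g. OutOfMemoryError) are deliberately not caught.
        failures++;
        LOG.log(Level.WARNING, "Pass " + i + " failed; retrying after sleep", e);
      }
      try {
        TimeUnit.MILLISECONDS.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and stop
        return;
      }
    }
  }

  int failures() {
    return failures;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulate two failed scans while a tserver is starting, then successes.
    ResilientDaemon daemon = new ResilientDaemon(() -> {
      calls[0]++;
      if (calls[0] <= 2) {
        throw new RuntimeException("simulated scan failure while tserver starts");
      }
    }, 10L, 5);
    daemon.run();
    System.out.println("passes=" + calls[0] + " failures=" + daemon.failures());
  }
}
```

The demo prints `passes=5 failures=2`: the first two passes throw and are logged, and the loop keeps going. Catching only RuntimeException keeps transient scan failures from killing the thread while still letting interrupts stop it cleanly.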

Issue Links

relates to: ACCUMULO-2964 Unexpected ThriftSecurityException from BatchScanner (Resolved)

People

Assignee: Josh Elser
Reporter: Josh Elser
Votes: 0
Watchers: 0

Dates

Created: 01/Jul/14 04:45
Updated: 01/Jul/14 05:31
Resolved: 01/Jul/14 05:31

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 10m