[ACCUMULO-2053] Slow reassignment after failure and recovery - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: master
Labels: None
Environment: 5bb28edb with Hadoop 2.2.0

Description

While running CI, I noticed the following situation. Agitation killed a tablet server. Recovery ran, but the monitor showed that the tablets had not yet been reassigned. After a minute, a significant number of tablets (~15 out of 150) for a single table were still offline. The tablets then went from unassigned to assigned one at a time.

Tailing the master log confirmed this: I saw the following lines repeated for every offline tablet:

2013-12-17 21:10:52,615 [recovery.RecoveryManager] DEBUG: Recovering hdfs://nameservice/accumulo/wal/tserver1+9997/0a60966c-b72d-4643-bf39-3fbfec342cc0 to hdfs://namenode/accumulo/recovery/0a60966c-b72d-4643-bf39-3fbfec342cc0
2013-12-17 21:10:52,624 [recovery.RecoveryManager] DEBUG: Recovering hdfs://nameservice/accumulo/wal/tserver1+9997/327e38cb-9f96-41a4-baff-a97d89d523e9 to hdfs://nameservice/accumulo/recovery/327e38cb-9f96-41a4-baff-a97d89d523e9

It seems like we should be able to bring all of these tablets back online at once (or at least faster than one every 10 seconds, as the log showed), because the recovery file had already been created. As a result, the complete recovery process took longer than it should have: we waited 150s before the last tablet was reassigned.
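
The 150s figure follows from roughly 15 offline tablets being assigned one at a time at about 10 seconds each. The sketch below only illustrates that arithmetic, and what a larger assignment thread pool (as discussed in ACCUMULO-1085) might buy under an idealized model. It is not Accumulo code: the function name, its parameters, and the assumption that every tablet takes the same time and that threads work in lock-step rounds are all hypothetical.

```python
import math


def seconds_until_last_assigned(offline_tablets, assignment_threads, seconds_per_tablet):
    """Estimate when the last tablet comes online (hypothetical, idealized model).

    Assumes each assignment takes the same time and that up to
    `assignment_threads` tablets are assigned concurrently in rounds.
    """
    if offline_tablets < 0 or seconds_per_tablet < 0:
        raise ValueError("tablet count and per-tablet time must be non-negative")
    if assignment_threads < 1:
        raise ValueError("at least one assignment thread is required")
    # Number of rounds needed when each round handles `assignment_threads` tablets.
    rounds = math.ceil(offline_tablets / assignment_threads)
    return rounds * seconds_per_tablet


if __name__ == "__main__":
    # Figures from the report: ~15 offline tablets, about one assignment every 10 seconds.
    for threads in (1, 4, 15):
        wait = seconds_until_last_assigned(15, threads, 10)
        print(f"{threads} thread(s): last tablet assigned after ~{wait}s")
```

With a single thread the model reproduces the observed 150s wait; whether additional threads would actually shorten it depends on where the master spends those 10 seconds, which the report does not establish.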

Issue Links

relates to: ACCUMULO-1085 make the number of threads for assignment configurable (Resolved)

People

Assignee: Unassigned

Reporter: Josh Elser

Votes: 1

Watchers: 2

Dates

Created: 18/Dec/13 05:19

Updated: 03/Sep/21 22:15

Resolved: 03/Sep/21 22:15