Description
Noticed the random walk test was hung. There was a tablet offline. Listed the Fate operations:
$ ./bin/accumulo org.apache.accumulo.server.fate.Admin print
txid: 59c0403614dc0c39 status: IN_PROGRESS op: RenameTable locked: [] locking: [W:cz] top: RenameTable
txid: 37539f8d61548764 status: IN_PROGRESS op: ChangeTableState locked: [] locking: [W:cz] top: ChangeTableState
txid: 02f8323a3136e60d status: IN_PROGRESS op: TableRangeOp locked: [] locking: [W:cz] top: TableRangeOp
txid: 044015732e97eec1 status: IN_PROGRESS op: CompactRange locked: [] locking: [R:cz] top: CompactRange
txid: 6ce9dd63f9d51448 status: IN_PROGRESS op: CompactRange locked: [] locking: [R:cz] top: CompactRange
txid: 417cb9b60e44ecd9 status: IN_PROGRESS op: TableRangeOp locked: [] locking: [W:cz] top: TableRangeOp
txid: 5e7c5284a4677d6c status: IN_PROGRESS op: DeleteTable locked: [] locking: [W:cz] top: DeleteTable
txid: 6633d3d841d66995 status: IN_PROGRESS op: TableRangeOp locked: [W:cz] locking: [] top: TableRangeOpWait
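Reading the locked/locking columns: 6633d3d841d66995 is the only transaction actually holding the write lock on table cz (its top op is TableRangeOpWait), and every other transaction is queued behind it. A throwaway triage sketch that just parses the output format above to separate the holder from the waiters (not an Accumulo API, purely illustrative):

import java.util.ArrayList;
import java.util.List;

public class FateTriage {
    public static void main(String[] args) {
        // Two of the lines from the Admin print output above
        String[] lines = {
            "txid: 59c0403614dc0c39 status: IN_PROGRESS op: RenameTable locked: [] locking: [W:cz] top: RenameTable",
            "txid: 6633d3d841d66995 status: IN_PROGRESS op: TableRangeOp locked: [W:cz] locking: [] top: TableRangeOpWait"
        };
        List<String> holders = new ArrayList<>();
        List<String> waiters = new ArrayList<>();
        for (String line : lines) {
            String txid = line.split("\\s+")[1];
            // locked: what the tx holds; locking: what it is still waiting for
            if (line.contains("locked: [W:cz]") || line.contains("locked: [R:cz]"))
                holders.add(txid);
            else if (line.contains("locking: [W:cz]") || line.contains("locking: [R:cz]"))
                waiters.add(txid);
        }
        System.out.println("holding a lock on cz: " + holders); // [6633d3d841d66995]
        System.out.println("queued for cz:        " + waiters); // everything else
    }
}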
A quick check of the master logs shows that Fate tx 6633d3d841d66995 did not complete. Here's what the master saw for this transaction:
The transaction is seeded and a write lock is put into the queue, but there are other operations in progress:
30 20:52:29,387 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 4 userData 6633d3d841d66995 lockType WRITE
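For context, these lock entries form a FIFO queue: a read entry may proceed when only reads are ahead of it, and a write entry only when it reaches the head of the queue. A toy in-memory model of that discipline (this is not Accumulo's DistributedReadWriteLock, and the four entries ahead of ours are invented):

import java.util.ArrayList;
import java.util.List;

public class FifoRwLockModel {
    enum Type { READ, WRITE }
    record Entry(long seq, String txid, Type type) {}

    private final List<Entry> queue = new ArrayList<>();
    private long nextSeq = 0;

    long add(String txid, Type type) {
        queue.add(new Entry(nextSeq, txid, type));
        return nextSeq++;
    }

    void release(long seq) {
        queue.removeIf(e -> e.seq() == seq);
    }

    boolean mayProceed(long seq) {
        Entry self = queue.stream().filter(e -> e.seq() == seq).findFirst().orElseThrow();
        for (Entry ahead : queue) {
            if (ahead.seq() == seq) return true;          // nothing ahead blocks us
            if (ahead.type() == Type.WRITE) return false; // a writer ahead blocks everyone
            if (self.type() == Type.WRITE) return false;  // a writer waits for readers ahead too
        }
        return false; // unreachable when seq is in the queue
    }

    public static void main(String[] args) {
        FifoRwLockModel lock = new FifoRwLockModel();
        // Entries 0-3: whatever was already queued (the types here are made up)
        lock.add("txA", Type.READ);
        lock.add("txB", Type.WRITE);
        lock.add("txC", Type.READ);
        lock.add("txD", Type.WRITE);
        // "Added lock entry 4 userData 6633d3d841d66995 lockType WRITE"
        long us = lock.add("6633d3d841d66995", Type.WRITE);
        System.out.println(lock.mayProceed(us)); // false: four entries ahead of us
        for (long seq = 0; seq < 4; seq++) lock.release(seq);
        System.out.println(lock.mayProceed(us)); // true once the queue drains down to us
    }
}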
For example, there was a read lock ahead of us for transaction 56fada3359bfb4c7:
30 20:52:29,391 [tableOps.Utils] INFO : table cz (56fada3359bfb4c7) unlocked for read
And yet another write lock ahead of us, for transaction 62c3cecaf387398e:
30 20:52:29,486 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) locked for write operation: ONLINE
30 20:52:29,488 [tables.TableManager] DEBUG: Transitioning state for table cz from OFFLINE to ONLINE
30 20:52:29,491 [tables.TableManager] DEBUG: State transition to ONLINE @ WatchedEvent state:SyncConnected type:NodeDataChanged path:/accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/tables/cz/state
30 20:52:29,491 [master.EventCoordinator] INFO : Table state in zookeeper changed for cz to ONLINE
30 20:52:29,494 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) unlocked for write
So now our table is in the online state... but the tablets may not be (yet). Our Fate operation gets the write lock and starts to execute:
30 20:52:29,494 [tableOps.ChangeTableState] DEBUG: Changed table state cz ONLINE
30 20:52:29,494 [master.EventCoordinator] INFO : Set table state of cz to ONLINE
30 20:52:29,495 [tableOps.Utils] INFO : table cz (6633d3d841d66995) locked for write operation: MERGE
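So ChangeTableState flipped the flag in ZooKeeper and released its lock, and our merge took the write lock before any tablet server had actually loaded a tablet. A minimal model of that window (hypothetical names, not Accumulo code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TableStateRace {
    enum TableState { OFFLINE, ONLINE }

    static volatile TableState state = TableState.OFFLINE;               // the flag in ZooKeeper
    static final Map<String, Boolean> hosted = new ConcurrentHashMap<>(); // tablet -> hosted?

    public static void main(String[] args) throws InterruptedException {
        hosted.put("cz;050754dd3cc35c16", false);
        hosted.put("cz;08498ad358a45bb0", false);
        hosted.put("cz;137d382d52b8fbd9", false);

        // ChangeTableState just flips the flag and returns
        state = TableState.ONLINE;

        // Tablet servers load tablets some time later, asynchronously
        Thread loader = new Thread(() -> hosted.replaceAll((t, h) -> true));

        // The merge runs as soon as it gets the write lock: the flag says
        // ONLINE, but nothing has been loaded yet
        long hostedNow = hosted.values().stream().filter(h -> h).count();
        System.out.println(state + ", hosted " + hostedNow + "/" + hosted.size()); // ONLINE, hosted 0/3

        loader.start();
        loader.join();
    }
}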
Hmm... this seems strange... we went from STARTED directly to MERGING: no OFFLINE, no CHOPPING. Those are very strange MergeStats, too. I'm not 100% sure these merge stats go with this operation, but the timing suggests they do (need to fix that log message).
30 20:52:29,500 [master.EventCoordinator] INFO : Merge state of cz;7c9131203e34514c;39ceaf91dd1db6ef set to STARTED
30 20:52:29,516 [state.MergeStats] INFO : 0 are hosted, total 0
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped, 0 are offline
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are unassigned
30 20:52:29,535 [master.EventCoordinator] INFO : Merge state of cz;7c9131203e34514c;39ceaf91dd1db6ef set to MERGING
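One way those 0/0 counts could send the state machine straight to MERGING: if the metadata scan for the merge range sees no tablets at all, every "all tablets are X" check passes vacuously. A sketch of that failure mode, with hypothetical state names and logic (the real MergeStats decision is more involved):

public class MergeStatsSketch {
    enum MergeState { STARTED, WAITING_FOR_OFFLINE, WAITING_FOR_CHOPPED, MERGING }

    static MergeState nextState(int total, int chopped, int unassigned) {
        if (chopped < total) return MergeState.WAITING_FOR_CHOPPED;    // some tablets still need chopping
        if (unassigned < total) return MergeState.WAITING_FOR_OFFLINE; // some tablets still hosted
        return MergeState.MERGING;                                     // total == 0 lands here immediately
    }

    public static void main(String[] args) {
        // The counts from the log: total 0, 0 chopped, 0 unassigned
        System.out.println(nextState(0, 0, 0)); // MERGING, without ever chopping anything
        // With the 4 tablets that actually exist, the checks would hold the merge back
        System.out.println(nextState(4, 1, 0)); // WAITING_FOR_CHOPPED
    }
}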
Looks like the tablets were offline, too, because the main master loop is trying to assign them:
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;08498ad358a45bb0;050754dd3cc35c16=192.168.117.4:9997[134d7425fc8959e]
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;050754dd3cc35c16<=192.168.117.5:9997[334c843791cb78b]
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;137d382d52b8fbd9;08498ad358a45bb0=192.168.117.6:9997[134d7425fc8959f]
30 20:52:29,615 [master.EventCoordinator] INFO : tablet cz;08498ad358a45bb0;050754dd3cc35c16 was loaded
30 20:52:29,616 [master.EventCoordinator] INFO : tablet cz;137d382d52b8fbd9;08498ad358a45bb0 was loaded
And, just to make this more interesting, a tablet splits shortly after being loaded:
30 20:52:29,603 [state.MergeStats] INFO : 1 tablets are unassigned
30 20:52:29,688 [master.EventCoordinator] INFO : tablet cz<;137d382d52b8fbd9 was unloaded
30 20:52:30,081 [master.EventCoordinator] INFO : 192.168.117.9:9997 reported split cz;08498ad358a45bb0;050754dd3cc35c16, cz;137d382d52b8fbd9;08498ad358a45bb0
The master takes some of the tablets offline:
30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16< was unloaded
30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;08498ad358a45bb0;050754dd3cc35c16 was unloaded
30 20:52:30,084 [master.EventCoordinator] INFO : tablet cz;137d382d52b8fbd9;08498ad358a45bb0 was unloaded
But not all of them:
30 20:52:30,445 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16< was loaded
The !METADATA status:
!METADATA> scan -b cz; -e d
cz;050754dd3cc35c16 file:/b-00006eb/I00006ec.rf [] 50848,0
cz;050754dd3cc35c16 file:/t-00006p2/F0000qra.rf [] 2936,660
cz;050754dd3cc35c16 file:/t-00006p2/F00023p9.rf [] 3176,720
cz;050754dd3cc35c16 last:334c843791cb78b [] 192.168.117.5:9997
cz;050754dd3cc35c16 loc:334c843791cb78b [] 192.168.117.5:9997
cz;050754dd3cc35c16 srv:dir [] /t-00006p2
cz;050754dd3cc35c16 srv:flush [] 0
cz;050754dd3cc35c16 srv:lock [] tservers/192.168.117.5:9997/zlock-0000000000$334c843791cb78b
cz;050754dd3cc35c16 srv:time [] M1327993277526
cz;050754dd3cc35c16 ~tab:~pr [] \x00
cz;08498ad358a45bb0 file:/b-00006eb/I00006ec.rf [] 47216,0
cz;08498ad358a45bb0 file:/t-00006p3/F0000qj9.rf [] 1883,400
cz;08498ad358a45bb0 file:/t-00006p3/F00022u1.rf [] 1717,360
cz;08498ad358a45bb0 last:134d7425fc8959e [] 192.168.117.4:9997
cz;08498ad358a45bb0 loc:134d7425fc8959e [] 192.168.117.4:9997
cz;08498ad358a45bb0 srv:dir [] /t-00006p3
cz;08498ad358a45bb0 srv:flush [] 0
cz;08498ad358a45bb0 srv:lock [] tservers/192.168.117.4:9997/zlock-0000000000$134d7425fc8959e
cz;08498ad358a45bb0 srv:time [] M1327993277506
cz;08498ad358a45bb0 ~tab:~pr [] \x01050754dd3cc35c16
cz;137d382d52b8fbd9 file:/b-00006eb/I00006ec.rf [] 118041,0
cz;137d382d52b8fbd9 file:/t-00006os/F0000quc.rf [] 6017,1430
cz;137d382d52b8fbd9 file:/t-00006os/F00023gh.rf [] 5409,1270
cz;137d382d52b8fbd9 last:134d7425fc8959f [] 192.168.117.6:9997
cz;137d382d52b8fbd9 loc:134d7425fc8959f [] 192.168.117.6:9997
cz;137d382d52b8fbd9 srv:dir [] /t-00006os
cz;137d382d52b8fbd9 srv:flush [] 0
cz;137d382d52b8fbd9 srv:lock [] tservers/192.168.117.6:9997/zlock-0000000000$134d7425fc8959f
cz;137d382d52b8fbd9 srv:time [] M1327993277515
cz;137d382d52b8fbd9 ~tab:~pr [] \x0108498ad358a45bb0
cz< chopped:chopped [] chopped
cz< file:/b-00006eb/I00006ec.rf [] 1210190,0
cz< srv:compact [] 1
cz< srv:dir [] /default_tablet
cz< srv:lock [] tservers/192.168.117.10:9997/zlock-0000000000$334c843791cb78f
cz< srv:time [] M0
cz< ~tab:~pr [] \x01137d382d52b8fbd9
Four tablets with consistent splits; three are online, and the offline tablet (cz<) is chopped. The offline tablet doesn't have a last location, which is strange.
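For reference, the "consistent splits" claim can be checked mechanically: ~tab:~pr holds each tablet's previous end row (\x00 is the null marker for the first tablet; \x01 prefixes a real row), so walking the tablets in row order, each prev row must equal the end row of the tablet before it. A sketch against the four rows above (hypothetical helper, not Accumulo code):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class SplitChainCheck {
    public static void main(String[] args) {
        // end row -> prev row, in row order; null marks "no boundary"
        Map<String, String> tablets = new LinkedHashMap<>();
        tablets.put("050754dd3cc35c16", null);                 // ~tab:~pr [] \x00
        tablets.put("08498ad358a45bb0", "050754dd3cc35c16");
        tablets.put("137d382d52b8fbd9", "08498ad358a45bb0");
        tablets.put(null, "137d382d52b8fbd9");                 // cz< = last tablet, no end row

        String prevEnd = null;
        boolean consistent = true;
        for (Map.Entry<String, String> t : tablets.entrySet()) {
            if (!Objects.equals(t.getValue(), prevEnd))
                consistent = false; // gap or overlap in the split chain
            prevEnd = t.getKey();
        }
        System.out.println("splits consistent: " + consistent); // true for the scan above
    }
}

It prints true for this scan; what stands out instead is the cz< tablet: chopped, srv:time M0, and no loc: or last: entry.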