Description
After 14 hours of randomwalk testing, a merge operation appeared to be stuck.
The garbage collector was also stuck, and some tablets were offline:
| # Online Tablet Servers | # Total Tablet Servers | Loggers | Last GC | # Tablets | # Unassigned Tablets | Entries | Ingest | Query | Hold Time | OS Load |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | Running 2/29/12 12:14 PM | 299 | 4 | 277.50M | 311 | 5.53K | — | 0.50 |
The garbage collector could not get a consistent scan of the !METADATA table:
```
29 13:04:10,808 [util.TabletIterator] INFO : Resetting !METADATA scanner to [24q;5f83b8f927c41c9d%00; : [] 9223372036854775807 false,~ : [] 9223372036854775807 false)
29 13:04:11,071 [util.TabletIterator] INFO : Metadata inconsistency : 1419e44259517c51 != 5f83b8f927c41c9d metadataKey = 24q< ~tab:~pr [] 724883 false
```
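The check that fails here walks the tablet prev-row chain: each tablet's ~tab:~pr entry must match the end row of the tablet that precedes it in the !METADATA scan, and on a mismatch the scanner resets and retries, which is why the collector loops forever. A minimal sketch of that walk, with an illustrative map layout rather than the actual TabletIterator code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class PrevRowCheck {

    // Returns the end row of the first tablet whose prev-row pointer does not
    // match the end row of the tablet before it, or null if the chain is intact.
    static String findInconsistency(Map<String, String> endRowToPrevRow) {
        String prevEndRow = null;
        boolean first = true;
        for (Map.Entry<String, String> e : endRowToPrevRow.entrySet()) {
            if (!first && !Objects.equals(prevEndRow, e.getValue()))
                return e.getKey(); // chain broken at this tablet
            prevEndRow = e.getKey();
            first = false;
        }
        return null;
    }

    public static void main(String[] args) {
        // The tail of the table 24q scan shown further below: tablet 5f83...
        // still exists, but the default tablet (end row "<") already claims
        // 1419... as its previous row.
        Map<String, String> tablets = new LinkedHashMap<>();
        tablets.put("5f83b8f927c41c9d", "5e65b844f2c7f868");
        tablets.put("<", "1419e44259517c51"); // should be 5f83b8f927c41c9d
        System.out.println("inconsistency at end row: " + findInconsistency(tablets)); // prints "<"
    }
}
```

Run against the tail of the scan below, this reports the same mismatch the garbage collector logs at metadataKey 24q<.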
Table (id 24q) had a merge in progress:
```
./bin/accumulo org.apache.accumulo.server.fate.Admin print
txid: 7bea12fa46c40a72 status: IN_PROGRESS op: BulkImport   locked: []       locking: [R:24q] top: BulkImport
txid: 08db6105a25c0788 status: IN_PROGRESS op: CloneTable   locked: []       locking: [R:24q] top: CloneTable
txid: 5f798db1cab5fdea status: IN_PROGRESS op: BulkImport   locked: []       locking: [R:24q] top: BulkImport
txid: 6aa9a8a9b36a4f4d status: IN_PROGRESS op: TableRangeOp locked: []       locking: [W:24q] top: TableRangeOp
txid: 5c6e82e235ec3855 status: IN_PROGRESS op: TableRangeOp locked: []       locking: [W:24q] top: TableRangeOp
txid: 653a9293ba9f1cdc status: IN_PROGRESS op: RenameTable  locked: []       locking: [W:24q] top: RenameTable
txid: 651c62eb37136b6e status: IN_PROGRESS op: TableRangeOp locked: [W:24q]  locking: []      top: TableRangeOpWait
```
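The last transaction above holds the write lock on table 24q and is parked in TableRangeOpWait, so every other operation, reader or writer, queues behind the stuck merge. A small demonstration of that queueing behavior, using a fair JDK read/write lock as a stand-in for FATE's ZooKeeper-backed table locks:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TableLockDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock(true); // fair = FIFO queue

        tableLock.writeLock().lock(); // the merge's TableRangeOp: locked [W:24q]
        System.out.println("TableRangeOp holds W:24q and is waiting on the merge");

        Thread bulkImport = new Thread(() -> {
            tableLock.readLock().lock(); // BulkImport: locking [R:24q]
            try {
                System.out.println("BulkImport acquired R:24q");
            } finally {
                tableLock.readLock().unlock();
            }
        });
        bulkImport.start();

        Thread.sleep(1000); // BulkImport stays queued as long as the merge never finishes
        System.out.println("BulkImport still blocked: " + bulkImport.isAlive()); // true

        tableLock.writeLock().unlock(); // once the merge completes, queued ops proceed
        bulkImport.join();
    }
}
```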
Scan of table 24q:
```
scan -b 24q; -e 24q<
24q;073b220b74a75059 loc:135396fb191d4b6 []    192.168.117.6:9997
24q;073b220b74a75059 srv:compact []    3
24q;073b220b74a75059 srv:dir []    /t-00031y0
24q;073b220b74a75059 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;073b220b74a75059 srv:time []    M0
24q;073b220b74a75059 ~tab:~pr []    \x00
24q;1419e44259517c51 loc:235396fb184b5cd []    192.168.117.12:9997
24q;1419e44259517c51 srv:compact []    3
24q;1419e44259517c51 srv:dir []    /t-00031y1
24q;1419e44259517c51 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;1419e44259517c51 srv:time []    M0
24q;1419e44259517c51 ~tab:~pr []    \x01073b220b74a75059
24q;51fc3e7faea2b7e9 chopped:chopped []    chopped
24q;51fc3e7faea2b7e9 srv:compact []    3
24q;51fc3e7faea2b7e9 srv:dir []    /t-00031y2
24q;51fc3e7faea2b7e9 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;51fc3e7faea2b7e9 srv:time []    M0
24q;51fc3e7faea2b7e9 ~tab:~pr []    \x011419e44259517c51
24q;5e65b844f2c7f868 chopped:chopped []    chopped
24q;5e65b844f2c7f868 srv:compact []    3
24q;5e65b844f2c7f868 srv:dir []    /t-00031e1
24q;5e65b844f2c7f868 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;5e65b844f2c7f868 srv:time []    M0
24q;5e65b844f2c7f868 ~tab:~pr []    \x0151fc3e7faea2b7e9
24q;5f83b8f927c41c9d chopped:chopped []    chopped
24q;5f83b8f927c41c9d srv:compact []    3
24q;5f83b8f927c41c9d srv:dir []    /t-000329w
24q;5f83b8f927c41c9d srv:lock []    tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q;5f83b8f927c41c9d srv:time []    M0
24q;5f83b8f927c41c9d ~tab:~pr []    \x015e65b844f2c7f868
24q< chopped:chopped []    chopped
24q< srv:compact []    3
24q< srv:dir []    /default_tablet
24q< srv:lock []    tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q< srv:time []    M0
24q< ~tab:~pr []    \x011419e44259517c51
```

Note that the default tablet 24q< already lists \x011419e44259517c51 as its previous row, even though the chopped tablets 24q;51fc3e7faea2b7e9, 24q;5e65b844f2c7f868, and 24q;5f83b8f927c41c9d are still present; this is exactly the inconsistency the garbage collector reports at metadataKey 24q<.
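When reading the ~tab:~pr values above: assuming the usual metadata encoding, a leading \x00 byte means the tablet has no previous row (the first tablet in the table) and a leading \x01 byte means the previous tablet's end row follows. A tiny decoder sketch, illustrative rather than Accumulo's actual KeyExtent code:

```java
import java.nio.charset.StandardCharsets;

public class PrevRowDecode {
    // Decode a ~tab:~pr value: \x00 means "no previous row", as on
    // 24q;073b220b74a75059 above; \x01 means the previous tablet's
    // end row follows, as on every other tablet in the scan.
    static String decodePrevRow(byte[] value) {
        if (value.length == 0 || value[0] == 0)
            return null; // first tablet in the table
        return new String(value, 1, value.length - 1, StandardCharsets.UTF_8);
    }
}
```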
Master Logs
```
29 13:11:49,903 [state.MergeStats] INFO : Computing next merge state for 24q;6badf28df1d8ece7;37f3488aa92ac056 which is presently MERGING isDelete : false
29 13:11:49,903 [state.MergeStats] INFO : 4 tablets are unassigned 24q;6badf28df1d8ece7;37f3488aa92ac056
```

The merge stays in the MERGING state because four tablets in its range are unassigned, matching the unassigned count on the monitor page above.
The final consistency check is failing because the merge is only partially complete. The final step of the merge is not idempotent: partial execution leaves the Repo in a state from which it cannot continue after a restart.
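For illustration, a hypothetical sketch of what an idempotent final step could look like; MetadataOps, MergeInfo, and all method names below are invented stand-ins, not Accumulo's real API. The idea is that every action re-checks persisted state before acting, so the step can be replayed from the top after a restart:

```java
import java.util.List;

// Hypothetical sketch of an idempotent final merge step. Every action
// re-checks persisted metadata before acting, so replaying the whole
// step after a master restart is harmless.
public class FinishMergeStep {

    interface MergeInfo { }

    interface MetadataOps {
        boolean prevRowAlreadyUpdated(MergeInfo merge); // read current !METADATA state
        void updatePrevRow(MergeInfo merge);            // extend the surviving tablet
        List<String> tabletsInRange(MergeInfo merge);   // merged-away tablets still present
        void deleteTabletEntry(String endRow);          // deleting an absent entry is a no-op
        void clearMergeState(MergeInfo merge);          // safe to repeat once tablets are gone
    }

    void call(MetadataOps metadata, MergeInfo merge) {
        // 1. Update the surviving tablet's prev-row pointer only if a previous
        //    (partial) execution has not already done it.
        if (!metadata.prevRowAlreadyUpdated(merge))
            metadata.updatePrevRow(merge);

        // 2. Delete the merged-away tablet entries. Re-running this after a
        //    crash simply finds fewer (or no) entries left to delete.
        for (String endRow : metadata.tabletsInRange(merge))
            metadata.deleteTabletEntry(endRow);

        // 3. Only after the metadata is consistent again, clear the merge
        //    marker so the consistency check can pass and the lock is released.
        metadata.clearMergeState(merge);
    }
}
```

Another option is to split the work into smaller Repos, so that FATE itself records progress between the prev-row update and the tablet deletions.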