Uploaded image for project: 'Accumulo'
  1. Accumulo
  2. ACCUMULO-356

merge failed to complete

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.4.0
    • master
    • random walk test of "All" without "Security" on a 10-node test cluster

    Description

      Noticed the random walk test was generally hung. There was a tablet offline. Listed the Fate operations:

      $ ./bin/accumulo org.apache.accumulo.server.fate.Admin print
      txid: 59c0403614dc0c39  status: IN_PROGRESS         op: RenameTable      locked: []              locking: [W:cz]          top: RenameTable
      txid: 37539f8d61548764  status: IN_PROGRESS         op: ChangeTableState  locked: []              locking: [W:cz]          top: ChangeTableState
      txid: 02f8323a3136e60d  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:cz]          top: TableRangeOp
      txid: 044015732e97eec1  status: IN_PROGRESS         op: CompactRange     locked: []              locking: [R:cz]          top: CompactRange
      txid: 6ce9dd63f9d51448  status: IN_PROGRESS         op: CompactRange     locked: []              locking: [R:cz]          top: CompactRange
      txid: 417cb9b60e44ecd9  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:cz]          top: TableRangeOp
      txid: 5e7c5284a4677d6c  status: IN_PROGRESS         op: DeleteTable      locked: []              locking: [W:cz]          top: DeleteTable
      txid: 6633d3d841d66995  status: IN_PROGRESS         op: TableRangeOp     locked: [W:cz]          locking: []              top: TableRangeOpWait
      

      Quick check of master logs finds that Fate tx 6633d3d841d66995 did not complete. Here's what the master saw for this transaction:

      The transaction is seeded, and a write lock is put into the queue; but there are other operations going on

      30 20:52:29,387 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 4 userData 6633d3d841d66995 lockType WRITE
      

      For example, there was a read lock for transaction 56fada3359bfb4c7

      30 20:52:29,391 [tableOps.Utils] INFO : table cz (56fada3359bfb4c7) unlocked for read
      

      And yet another write lock ahead of us for transaction 62c3cecaf387398e

      30 20:52:29,486 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) locked for write operation: ONLINE
      30 20:52:29,488 [tables.TableManager] DEBUG: Transitioning state for table cz from OFFLINE to ONLINE
      30 20:52:29,491 [tables.TableManager] DEBUG: State transition to ONLINE @ WatchedEvent state:SyncConnected type:NodeDataChanged path:/accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/tables/cz/state
      30 20:52:29,491 [master.EventCoordinator] INFO : Table state in zookeeper changed for cz to ONLINE
      30 20:52:29,494 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) unlocked for write
      

      So now our table is in the online state... but the tablets may not be (yet). Our Fate operation gets the write lock and starts to execute

      30 20:52:29,494 [tableOps.ChangeTableState] DEBUG: Changed table state cz ONLINE
      30 20:52:29,494 [master.EventCoordinator] INFO : Set table state of cz to ONLINE
      30 20:52:29,495 [tableOps.Utils] INFO : table cz (6633d3d841d66995) locked for write operation: MERGE
      

      Hmm... this seems strange... we went from STARTED directly to MERGING: no OFFLINE, no CHOPPING. Those are very strange MergeStats, too. I'm not 100% sure that these merge stats go with this operation, but the timing suggests that it does (need to fix that log message).

      30 20:52:29,500 [master.EventCoordinator] INFO : Merge state of cz;7c9131203e34514c;39ceaf91dd1db6ef set to STARTED
      30 20:52:29,516 [state.MergeStats] INFO : 0 are hosted, total 0
      30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped
      30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped, 0 are offline
      30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are unassigned
      30 20:52:29,535 [master.EventCoordinator] INFO : Merge state of cz;7c9131203e34514c;39ceaf91dd1db6ef set to MERGING
      

      Looks like the tablets were offline, too, because the main master loop is trying to assign them:

      30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;08498ad358a45bb0;050754dd3cc35c16=192.168.117.4:9997[134d7425fc8959e]
      30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;050754dd3cc35c16<=192.168.117.5:9997[334c843791cb78b]
      30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;137d382d52b8fbd9;08498ad358a45bb0=192.168.117.6:9997[134d7425fc8959f]
      30 20:52:29,615 [master.EventCoordinator] INFO : tablet cz;08498ad358a45bb0;050754dd3cc35c16 was loaded
      30 20:52:29,616 [master.EventCoordinator] INFO : tablet cz;137d382d52b8fbd9;08498ad358a45bb0 was loaded
      

      And, just to make this more interesting, a tablet splits shortly after being loaded:

      30 20:52:29,603 [state.MergeStats] INFO : 1 tablets are unassigned
      30 20:52:29,688 [master.EventCoordinator] INFO : tablet cz<;137d382d52b8fbd9 was unloaded
      30 20:52:30,081 [master.EventCoordinator] INFO : 192.168.117.9:9997 reported split cz;08498ad358a45bb0;050754dd3cc35c16, cz;137d382d52b8fbd9;08498ad358a45bb0
      

      Master takes some of the tablets offline

      30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16< was unloaded
      30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;08498ad358a45bb0;050754dd3cc35c16 was unloaded
      30 20:52:30,084 [master.EventCoordinator] INFO : tablet cz;137d382d52b8fbd9;08498ad358a45bb0 was unloaded
      

      But not all of them

      30 20:52:30,445 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16< was loaded
      

      The !METADATA status

      !METADATA> scan -b cz; -e d
      cz;050754dd3cc35c16 file:/b-00006eb/I00006ec.rf []    50848,0
      cz;050754dd3cc35c16 file:/t-00006p2/F0000qra.rf []    2936,660
      cz;050754dd3cc35c16 file:/t-00006p2/F00023p9.rf []    3176,720
      cz;050754dd3cc35c16 last:334c843791cb78b []    192.168.117.5:9997
      cz;050754dd3cc35c16 loc:334c843791cb78b []    192.168.117.5:9997
      cz;050754dd3cc35c16 srv:dir []    /t-00006p2
      cz;050754dd3cc35c16 srv:flush []    0
      cz;050754dd3cc35c16 srv:lock []    tservers/192.168.117.5:9997/zlock-0000000000$334c843791cb78b
      cz;050754dd3cc35c16 srv:time []    M1327993277526
      cz;050754dd3cc35c16 ~tab:~pr []    \x00
      cz;08498ad358a45bb0 file:/b-00006eb/I00006ec.rf []    47216,0
      cz;08498ad358a45bb0 file:/t-00006p3/F0000qj9.rf []    1883,400
      cz;08498ad358a45bb0 file:/t-00006p3/F00022u1.rf []    1717,360
      cz;08498ad358a45bb0 last:134d7425fc8959e []    192.168.117.4:9997
      cz;08498ad358a45bb0 loc:134d7425fc8959e []    192.168.117.4:9997
      cz;08498ad358a45bb0 srv:dir []    /t-00006p3
      cz;08498ad358a45bb0 srv:flush []    0
      cz;08498ad358a45bb0 srv:lock []    tservers/192.168.117.4:9997/zlock-0000000000$134d7425fc8959e
      cz;08498ad358a45bb0 srv:time []    M1327993277506
      cz;08498ad358a45bb0 ~tab:~pr []    \x01050754dd3cc35c16
      cz;137d382d52b8fbd9 file:/b-00006eb/I00006ec.rf []    118041,0
      cz;137d382d52b8fbd9 file:/t-00006os/F0000quc.rf []    6017,1430
      cz;137d382d52b8fbd9 file:/t-00006os/F00023gh.rf []    5409,1270
      cz;137d382d52b8fbd9 last:134d7425fc8959f []    192.168.117.6:9997
      cz;137d382d52b8fbd9 loc:134d7425fc8959f []    192.168.117.6:9997
      cz;137d382d52b8fbd9 srv:dir []    /t-00006os
      cz;137d382d52b8fbd9 srv:flush []    0
      cz;137d382d52b8fbd9 srv:lock []    tservers/192.168.117.6:9997/zlock-0000000000$134d7425fc8959f
      cz;137d382d52b8fbd9 srv:time []    M1327993277515
      cz;137d382d52b8fbd9 ~tab:~pr []    \x0108498ad358a45bb0
      cz< chopped:chopped []    chopped
      cz< file:/b-00006eb/I00006ec.rf []    1210190,0
      cz< srv:compact []    1
      cz< srv:dir []    /default_tablet
      cz< srv:lock []    tservers/192.168.117.10:9997/zlock-0000000000$334c843791cb78f
      cz< srv:time []    M0
      cz< ~tab:~pr []    \x01137d382d52b8fbd9
      

      4 tablets, consistent splits, 3 are online, the offline tablet is chopped. The offline tablet doesn't have a last location, which is strange.

      Attachments

        Activity

          People

            ecn Eric C. Newton
            ecn Eric C. Newton
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: