[ACCUMULO-436] tablet merge stuck


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: master
    • Environment: randomwalk with agitation on 10-node test cluster

    Description

      After 14 hours of randomwalk, a merge operation appeared to be stuck.

      Garbage collector was stuck and some tablets were offline:

      # Online Tablet Servers    10
      # Total Tablet Servers     10
      Loggers                    10
      Last GC                    Running 2/29/12 12:14 PM
      # Tablets                  299
      # Unassigned Tablets       4
      Entries                    277.50M
      Ingest                     311
      Query                      5.53K
      Hold Time
      OS Load                    0.50

      Garbage collector could not get a consistent !METADATA table scan:

      29 13:04:10,808 [util.TabletIterator] INFO : Resetting !METADATA scanner to [24q;5f83b8f927c41c9d%00; : [] 9223372036854775807 false,~ : [] 9223372036854775807 false)
      29 13:04:11,071 [util.TabletIterator] INFO : Metadata inconsistency : 1419e44259517c51 != 5f83b8f927c41c9d metadataKey = 24q< ~tab:~pr [] 724883 false
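
      The check behind these messages: util.TabletIterator walks a table's tablets in !METADATA and requires each tablet's prev-end-row (~tab:~pr) to equal the end row of the tablet it just read; on a mismatch it resets the scanner and retries. A simplified sketch of that loop (not the real TabletIterator; readTabletsFrom() is a hypothetical stand-in for the !METADATA scan) shows why a persistent inconsistency leaves the GC spinning rather than failing:

      import java.util.List;
      import java.util.Map.Entry;

      // Simplified sketch only -- not org.apache.accumulo.server.util.TabletIterator.
      abstract class MetadataWalkSketch {

        // Hypothetical: scan tablet entries for one table starting at startRow,
        // returning (metadata row, ~tab:~pr value) pairs in row order.
        protected abstract List<Entry<String,String>> readTabletsFrom(String startRow);

        void walk(String firstRow) throws InterruptedException {
          String resumeRow = firstRow;
          while (true) {
            String lastEndRow = null;                 // end row of the previous tablet in this pass
            boolean consistent = true;

            for (Entry<String,String> tablet : readTabletsFrom(resumeRow)) {
              String metadataRow = tablet.getKey();   // e.g. "24q;5f83b8f927c41c9d" or "24q<"
              String prevEndRow = tablet.getValue();  // the ~tab:~pr value

              if (lastEndRow != null && !lastEndRow.equals(prevEndRow)) {
                // Matches "Metadata inconsistency : <seen> != <expected>" above.
                System.out.printf("Metadata inconsistency : %s != %s metadataKey = %s%n",
                    prevEndRow, lastEndRow, metadataRow);
                consistent = false;
                break;                                // abandon this pass
              }
              lastEndRow = endRowOf(metadataRow);
              resumeRow = metadataRow;                // "Resetting !METADATA scanner to [<resumeRow> ..."
            }

            if (consistent)
              return;                                 // read the whole table consistently
            Thread.sleep(250);                        // retry forever while the inconsistency
                                                      // persists -- hence the stuck GC
          }
        }

        private static String endRowOf(String metadataRow) {
          int i = metadataRow.indexOf(';');           // "<tableId>;<endRow>"; the last tablet has none
          return i < 0 ? null : metadataRow.substring(i + 1);
        }
      }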
      

      Table (id 24q) had a merge in progress:

      ./bin/accumulo org.apache.accumulo.server.fate.Admin print
      txid: 7bea12fa46c40a72  status: IN_PROGRESS         op: BulkImport       locked: []              locking: [R:24q]         top: BulkImport
      txid: 08db6105a25c0788  status: IN_PROGRESS         op: CloneTable       locked: []              locking: [R:24q]         top: CloneTable
      txid: 5f798db1cab5fdea  status: IN_PROGRESS         op: BulkImport       locked: []              locking: [R:24q]         top: BulkImport
      txid: 6aa9a8a9b36a4f4d  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:24q]         top: TableRangeOp
      txid: 5c6e82e235ec3855  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:24q]         top: TableRangeOp
      txid: 653a9293ba9f1cdc  status: IN_PROGRESS         op: RenameTable      locked: []              locking: [W:24q]         top: RenameTable
      txid: 651c62eb37136b6e  status: IN_PROGRESS         op: TableRangeOp     locked: [W:24q]         locking: []              top: TableRangeOpWait
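
      In this output "locked" is what a transaction holds and "locking" is what it is still waiting for, so txid 651c62eb37136b6e holds the table write lock (W:24q) while every other transaction on 24q, readers and writers alike, is queued behind it. The effect is that of a fair read/write lock keyed on the table id; a minimal, purely illustrative Java analogue (not Accumulo's lock implementation):

      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.locks.ReentrantReadWriteLock;

      public class TableLockDemo {
        public static void main(String[] args) throws InterruptedException {
          final ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock(true); // fair: FIFO-ish queueing

          tableLock.writeLock().lock();           // the merge transaction holds W:24q

          Thread bulkImport = new Thread(new Runnable() {
            public void run() {
              tableLock.readLock().lock();        // a queued BulkImport wants R:24q ...
              try {
                System.out.println("bulk import runs only after the merge releases W:24q");
              } finally {
                tableLock.readLock().unlock();
              }
            }
          });
          bulkImport.start();

          System.out.println("merge holds W:24q; everything else on 24q waits");
          TimeUnit.SECONDS.sleep(2);              // while the merge is wedged this release never
          tableLock.writeLock().unlock();         // happens, so the queued transactions never run
          bulkImport.join();
        }
      }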
      

      Scan of table 24q:

      scan -b 24q; -e 24q<
      24q;073b220b74a75059 loc:135396fb191d4b6 []    192.168.117.6:9997
      24q;073b220b74a75059 srv:compact []    3
      24q;073b220b74a75059 srv:dir []    /t-00031y0
      24q;073b220b74a75059 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
      24q;073b220b74a75059 srv:time []    M0
      24q;073b220b74a75059 ~tab:~pr []    \x00
      24q;1419e44259517c51 loc:235396fb184b5cd []    192.168.117.12:9997
      24q;1419e44259517c51 srv:compact []    3
      24q;1419e44259517c51 srv:dir []    /t-00031y1
      24q;1419e44259517c51 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
      24q;1419e44259517c51 srv:time []    M0
      24q;1419e44259517c51 ~tab:~pr []    \x01073b220b74a75059
      24q;51fc3e7faea2b7e9 chopped:chopped []    chopped
      24q;51fc3e7faea2b7e9 srv:compact []    3
      24q;51fc3e7faea2b7e9 srv:dir []    /t-00031y2
      24q;51fc3e7faea2b7e9 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
      24q;51fc3e7faea2b7e9 srv:time []    M0
      24q;51fc3e7faea2b7e9 ~tab:~pr []    \x011419e44259517c51
      24q;5e65b844f2c7f868 chopped:chopped []    chopped
      24q;5e65b844f2c7f868 srv:compact []    3
      24q;5e65b844f2c7f868 srv:dir []    /t-00031e1
      24q;5e65b844f2c7f868 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
      24q;5e65b844f2c7f868 srv:time []    M0
      24q;5e65b844f2c7f868 ~tab:~pr []    \x0151fc3e7faea2b7e9
      24q;5f83b8f927c41c9d chopped:chopped []    chopped
      24q;5f83b8f927c41c9d srv:compact []    3
      24q;5f83b8f927c41c9d srv:dir []    /t-000329w
      24q;5f83b8f927c41c9d srv:lock []    tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
      24q;5f83b8f927c41c9d srv:time []    M0
      24q;5f83b8f927c41c9d ~tab:~pr []    \x015e65b844f2c7f868
      24q< chopped:chopped []    chopped
      24q< srv:compact []    3
      24q< srv:dir []    /default_tablet
      24q< srv:lock []    tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
      24q< srv:time []    M0
      24q< ~tab:~pr []    \x011419e44259517c51
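
      The ~tab:~pr column above no longer forms a chain: the default tablet 24q< records a prev row of 1419e44259517c51 even though the chopped tablets ending at 51fc3e7faea2b7e9, 5e65b844f2c7f868, and 5f83b8f927c41c9d still have rows, which is exactly the "1419e44259517c51 != 5f83b8f927c41c9d" inconsistency the GC logged. A small stand-alone check over the values above (with the \x00/\x01 prefixes of ~tab:~pr dropped) makes the break explicit:

      import java.util.LinkedHashMap;
      import java.util.Map;

      public class ChainCheck {
        // Formats an end row the way the scan prints it.
        static String row(String endRow) {
          return endRow == null ? "24q<" : "24q;" + endRow;
        }

        public static void main(String[] args) {
          // endRow -> prevEndRow (~tab:~pr), in scan order; a null end row is the default tablet
          Map<String,String> prevRow = new LinkedHashMap<String,String>();
          prevRow.put("073b220b74a75059", null);
          prevRow.put("1419e44259517c51", "073b220b74a75059");
          prevRow.put("51fc3e7faea2b7e9", "1419e44259517c51");
          prevRow.put("5e65b844f2c7f868", "51fc3e7faea2b7e9");
          prevRow.put("5f83b8f927c41c9d", "5e65b844f2c7f868");
          prevRow.put(null,               "1419e44259517c51"); // 24q<

          String expected = null;                              // end row of the previous tablet
          for (Map.Entry<String,String> e : prevRow.entrySet()) {
            String prev = e.getValue();
            boolean ok = (expected == null) ? prev == null : expected.equals(prev);
            if (!ok)
              System.out.printf("break at %s: ~pr is %s but the previous end row is %s%n",
                  row(e.getKey()), prev, expected);
            expected = e.getKey();
          }
          // Prints: break at 24q<: ~pr is 1419e44259517c51 but the previous end row is 5f83b8f927c41c9d
        }
      }

      This looks like the merge got as far as repointing the surviving tablet's prev row, while the rows for the absorbed tablets were never removed, which fits the partially completed merge described below.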
      

      Master Logs

      29 13:11:49,903 [state.MergeStats] INFO : Computing next merge state for 24q;6badf28df1d8ece7;37f3488aa92ac056 which is presently MERGING isDelete : false
      29 13:11:49,903 [state.MergeStats] INFO : 4 tablets are unassigned 24q;6badf28df1d8ece7;37f3488aa92ac056
      

      The final consistency check is failing because the merge is partially complete. The final step is not idempotent enough: partial execution leaves the Repo in a state in which it cannot continue after a restart.
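
      A FATE Repo can be re-executed after a failure or master restart, so each of its steps has to converge when re-run from any intermediate state. A minimal sketch of that check-before-act shape, with hypothetical method names (this is not the actual merge code; per the scan above, the prev-row update had apparently happened while the absorbed tablets' rows had not yet been removed):

      // Hypothetical sketch -- the method names are stand-ins, not real Accumulo APIs.
      // Every sub-step first asks whether its effect is already present, so re-running
      // the step after a crash or restart completes the remainder instead of wedging.
      abstract class IdempotentMergeFinish {

        abstract boolean survivingPrevRowUpdated();   // does the surviving tablet already point past the merged ones?
        abstract void updateSurvivingPrevRow();

        abstract boolean mergedTabletsDeleted();      // are the absorbed tablets' !METADATA rows already gone?
        abstract void deleteMergedTablets();

        // Safe to call any number of times, from any partially completed state.
        void finishMerge() {
          if (!survivingPrevRowUpdated())
            updateSurvivingPrevRow();
          if (!mergedTabletsDeleted())
            deleteMergedTablets();
        }
      }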

          People

            Assignee: Eric C. Newton (ecn)
            Reporter: Eric C. Newton (ecn)
