Accumulo
  1. Accumulo
  2. ACCUMULO-444

Data loss possible when tablet killed immediately after recovery

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.5
    • Fix Version/s: 1.3.6, 1.4.0
    • Component/s: tserver
    • Labels:
    • Environment:

      Running random walk, continuous ingest, and agitator on 10 node cluster.

      Description

      Came in after a weekend of running test to find the Shard random walk test had lost data in its index table. After debugging I found the following sequence of events occurred.

      • Mutation X was written to shard index on Tablet T1
      • X was minor compacted to file F1
      • Tablet server serving T1 was killed
      • When T1 came up on another tablet server, it did not know about F1

      The above sequence of events indicate that the !METADATA table lost data. So I started looking into that, and found the following sequence of events.

      • Tablet server T1 serving METADATA tablet MT was killed
      • MT comes up on another tablet server T2
      • Mutation Y is written to MT about file F1 for tablet T1
      • Tablet server T2 is killed.
      • MT comes up in tablet server T3
      • The mutations for MT from T1 are recovered, but not from T2.. therefore Y is lost

      There is code that supposed to handle this situation, but its not working... I think this issue exist in 1.3

      Data loss is not certain in this situation. In the scenario above, when MT is loaded on T2 a minor compaction is started. If the server is killed before this minor compaction completes then data loss will likely occur.

        Activity

        Hide
        Keith Turner added a comment -

        I think ACCUMULO-388 would have made this bug show up sooner in testing.

        Show
        Keith Turner added a comment - I think ACCUMULO-388 would have made this bug show up sooner in testing.
        Hide
        Keith Turner added a comment -

        Posting the tablet server logs that show how this scenario played out.

        Below are the logs for tablet server T2

        04 04:46:32,613 [log.RemoteLogger] DEBUG: Got new write-ahead log: xxx.xxx.xxx.10:11224/3f729699-6a7b-4ed0-8e41-cd767056a62e
        04 04:46:32,615 [log.RemoteLogger] DEBUG: Got new write-ahead log: xxx.xxx.xxx.11:11224/6042d841-0f4f-40a0-b480-1e7d51047982
        
        04 04:50:59,873 [tabletserver.TabletServer] DEBUG: Loading extent: !0;~;!0<
        04 04:50:59,877 [util.MetadataTable] INFO : Returning logs [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257)] for extent !0;~;!0<
        04 04:50:59,877 [tabletserver.Tablet] DEBUG: got [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257)] for logs for !0;~;!0<
        04 04:50:59,901 [tabletserver.Tablet] INFO : Starting Write-Ahead Log recovery for !0;~;!0<
        04 04:50:59,901 [tabletserver.TabletServer] INFO : Looking for /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered/finished
        04 04:50:59,902 [log.SortedLogRecovery] INFO : Looking at mutations from /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered for !0;~;!0<
        04 04:50:59,914 [log.SortedLogRecovery] DEBUG: Found tid, seq 257 9
        04 04:50:59,928 [log.SortedLogRecovery] INFO : Scanning for mutations starting at sequence number 10 for tid 257
        04 04:51:00,047 [log.SortedLogRecovery] INFO : Recovery complete for /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered
        04 04:51:00,047 [tabletserver.Tablet] INFO : Write-Ahead Log recovery complete for !0;~;!0< (4721 mutations applied, 21111 entries created)
        04 04:51:00,047 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< opened 
        
        04 04:51:00,066 [tabletserver.TabletServer] INFO : Adding 2 logs for extent !0;~;!0< as alias 353
        
        04 05:01:25,326 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.
        04 05:08:48,307 [server.Accumulo] INFO : tserver starting
        

        Below is the lost mutation that was written to walog 3f729699-6a7b-4ed0-8e41-cd767056a62e on T2.

        MUTATION 353 3
        1 mutations:
          7kb;000054
              file:/t-000h69l/F000h8ae.rf [system]:2175897 [] 64826,16582
              srv:time [system]:2175897 [] M1330836309355
              last:335d51be2cb3cc6 [system]:2175897 [] xxx.xxx.xxx.11:9997
              log:xxx.xxx.xxx.10:11224/d1699ae9-04c7-4128-8b9d-a6355af0393a [system]:2175897 [] <deleted>
              log:xxx.xxx.xxx.13:11224/4b28c8e0-8729-4d52-a0a7-e080e03601a0 [system]:2175897 [] <deleted>
              srv:flush [system]:2175897 [] 0
              srv:lock [system]:2175897 [] tservers/xxx.xxx.xxx.11:9997/zlock-0000000001$335d51be2cb3cc6
        

        Below shows that the metadata table was loaded on tablet server T3, and even though it should recover more data is recovers the same amount as when it recovered on T2. This shows that all data written while the tablet was hosted on T2 was lost.

        04 05:01:49,458 [tabletserver.TabletServer] DEBUG: Loading extent: !0;~;!0<
        04 05:01:49,460 [util.MetadataTable] INFO : Returning logs [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257), !0;~; 3f729699-6a7b-4ed0-8e41-cd767056a62e (353)] for extent !0;~;!0<
        04 05:01:49,461 [tabletserver.Tablet] DEBUG: got [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257), !0;~; 3f729699-6a7b-4ed0-8e41-cd767056a62e (353)] for logs for !0;~;!0<
        04 05:01:49,474 [tabletserver.Tablet] INFO : Starting Write-Ahead Log recovery for !0;~;!0<
        04 05:01:49,489 [log.SortedLogRecovery] DEBUG: Found tid, seq 257 9
        04 05:01:49,495 [log.SortedLogRecovery] INFO : Looking at mutations from /accumulo/recovery/3f729699-6a7b-4ed0-8e41-cd767056a62e.recovered for !0;~;!0<
        04 05:01:49,500 [log.SortedLogRecovery] DEBUG: Found tid, seq 353 3
        04 05:01:49,510 [log.SortedLogRecovery] INFO : Scanning for mutations starting at sequence number 10 for tid 257
        04 05:01:49,624 [log.SortedLogRecovery] INFO : Recovery complete for /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered
        04 05:01:49,629 [log.SortedLogRecovery] INFO : Scanning for mutations starting at sequence number 10 for tid 353
        04 05:01:49,674 [log.SortedLogRecovery] INFO : Recovery complete for /accumulo/recovery/3f729699-6a7b-4ed0-8e41-cd767056a62e.recovered
        04 05:01:49,674 [tabletserver.Tablet] INFO : Write-Ahead Log recovery complete for !0;~;!0< (4721 mutations applied, 21111 entries created)
        04 05:01:49,674 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< opened 
        
        Show
        Keith Turner added a comment - Posting the tablet server logs that show how this scenario played out. Below are the logs for tablet server T2 04 04:46:32,613 [log.RemoteLogger] DEBUG: Got new write-ahead log: xxx.xxx.xxx.10:11224/3f729699-6a7b-4ed0-8e41-cd767056a62e 04 04:46:32,615 [log.RemoteLogger] DEBUG: Got new write-ahead log: xxx.xxx.xxx.11:11224/6042d841-0f4f-40a0-b480-1e7d51047982 04 04:50:59,873 [tabletserver.TabletServer] DEBUG: Loading extent: !0;~;!0< 04 04:50:59,877 [util.MetadataTable] INFO : Returning logs [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257)] for extent !0;~;!0< 04 04:50:59,877 [tabletserver.Tablet] DEBUG: got [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257)] for logs for !0;~;!0< 04 04:50:59,901 [tabletserver.Tablet] INFO : Starting Write-Ahead Log recovery for !0;~;!0< 04 04:50:59,901 [tabletserver.TabletServer] INFO : Looking for /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered/finished 04 04:50:59,902 [log.SortedLogRecovery] INFO : Looking at mutations from /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered for !0;~;!0< 04 04:50:59,914 [log.SortedLogRecovery] DEBUG: Found tid, seq 257 9 04 04:50:59,928 [log.SortedLogRecovery] INFO : Scanning for mutations starting at sequence number 10 for tid 257 04 04:51:00,047 [log.SortedLogRecovery] INFO : Recovery complete for /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered 04 04:51:00,047 [tabletserver.Tablet] INFO : Write-Ahead Log recovery complete for !0;~;!0< (4721 mutations applied, 21111 entries created) 04 04:51:00,047 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< opened 04 04:51:00,066 [tabletserver.TabletServer] INFO : Adding 2 logs for extent !0;~;!0< as alias 353 04 05:01:25,326 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting. 04 05:08:48,307 [server.Accumulo] INFO : tserver starting Below is the lost mutation that was written to walog 3f729699-6a7b-4ed0-8e41-cd767056a62e on T2. MUTATION 353 3 1 mutations: 7kb;000054 file:/t-000h69l/F000h8ae.rf [system]:2175897 [] 64826,16582 srv:time [system]:2175897 [] M1330836309355 last:335d51be2cb3cc6 [system]:2175897 [] xxx.xxx.xxx.11:9997 log:xxx.xxx.xxx.10:11224/d1699ae9-04c7-4128-8b9d-a6355af0393a [system]:2175897 [] <deleted> log:xxx.xxx.xxx.13:11224/4b28c8e0-8729-4d52-a0a7-e080e03601a0 [system]:2175897 [] <deleted> srv:flush [system]:2175897 [] 0 srv:lock [system]:2175897 [] tservers/xxx.xxx.xxx.11:9997/zlock-0000000001$335d51be2cb3cc6 Below shows that the metadata table was loaded on tablet server T3, and even though it should recover more data is recovers the same amount as when it recovered on T2. This shows that all data written while the tablet was hosted on T2 was lost. 04 05:01:49,458 [tabletserver.TabletServer] DEBUG: Loading extent: !0;~;!0< 04 05:01:49,460 [util.MetadataTable] INFO : Returning logs [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257), !0;~; 3f729699-6a7b-4ed0-8e41-cd767056a62e (353)] for extent !0;~;!0< 04 05:01:49,461 [tabletserver.Tablet] DEBUG: got [!0;~; 661acb95-8fb9-4ac7-8545-83032f05eb2b (257), !0;~; 3f729699-6a7b-4ed0-8e41-cd767056a62e (353)] for logs for !0;~;!0< 04 05:01:49,474 [tabletserver.Tablet] INFO : Starting Write-Ahead Log recovery for !0;~;!0< 04 05:01:49,489 [log.SortedLogRecovery] DEBUG: Found tid, seq 257 9 04 05:01:49,495 [log.SortedLogRecovery] INFO : Looking at mutations from /accumulo/recovery/3f729699-6a7b-4ed0-8e41-cd767056a62e.recovered for !0;~;!0< 04 05:01:49,500 [log.SortedLogRecovery] DEBUG: Found tid, seq 353 3 04 05:01:49,510 [log.SortedLogRecovery] INFO : Scanning for mutations starting at sequence number 10 for tid 257 04 05:01:49,624 [log.SortedLogRecovery] INFO : Recovery complete for /accumulo/recovery/661acb95-8fb9-4ac7-8545-83032f05eb2b.recovered 04 05:01:49,629 [log.SortedLogRecovery] INFO : Scanning for mutations starting at sequence number 10 for tid 353 04 05:01:49,674 [log.SortedLogRecovery] INFO : Recovery complete for /accumulo/recovery/3f729699-6a7b-4ed0-8e41-cd767056a62e.recovered 04 05:01:49,674 [tabletserver.Tablet] INFO : Write-Ahead Log recovery complete for !0;~;!0< (4721 mutations applied, 21111 entries created) 04 05:01:49,674 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< opened

          People

          • Assignee:
            Keith Turner
            Reporter:
            Keith Turner
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development