Accumulo / ACCUMULO-444

Data loss possible when tablet killed immediately after recovery

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.5-incubating
    • Fix Version/s: 1.3.6, 1.4.0
    • Component/s: tserver
    • Environment: Running random walk, continuous ingest, and agitator on a 10-node cluster.

    Description

      Came in after a weekend of running tests to find that the Shard random walk test had lost data in its index table. After debugging, I found that the following sequence of events had occurred.

      • Mutation X was written to the shard index on tablet T1
      • X was minor compacted to file F1
      • Tablet server serving T1 was killed
      • When T1 came up on another tablet server, it did not know about F1

      The above sequence of events indicates that the !METADATA table lost data. So I started looking into that and found the following sequence of events.

      • Tablet server T1, serving !METADATA tablet MT, was killed (T1, T2, and T3 in this list are tablet servers; "tablet T1" is still the tablet from the first list)
      • MT comes up on another tablet server, T2
      • Mutation Y, recording file F1 for tablet T1, is written to MT
      • Tablet server T2 is killed
      • MT comes up on tablet server T3
      • The mutations for MT from T1 are recovered, but not those from T2; therefore Y is lost
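
The second sequence can be pictured with a small toy model. This is only a sketch and not Accumulo code: the class WalRecoverySketch, its methods, and the mutation strings are all hypothetical, and it assumes nothing about why the logs from T2 were skipped. It only shows the consequence: if recovery replays the write-ahead log of T1 but not that of T2, the entry Y recording file F1 never reappears in MT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the sequence in this report. NOT Accumulo code: every class,
// method, and string here is hypothetical and exists only for illustration.
final class WalRecoverySketch {

    // Hypothetical label for mutation Y, the !METADATA entry recording F1.
    static final String Y = "Y: tablet T1 has file F1";

    // Write-ahead logs for MT, keyed by the tablet server that wrote them.
    static Map<String, List<String>> scenarioLogs() {
        Map<String, List<String>> logs = new LinkedHashMap<>();
        logs.put("T1", Arrays.asList("older MT mutation")); // written before T1 was killed
        logs.put("T2", Arrays.asList(Y));                   // written before T2 was killed
        return logs;
    }

    // Rebuild MT's unflushed mutations by replaying the logs of the
    // given servers, in the order the servers are listed.
    static List<String> recover(Map<String, List<String>> logs,
                                List<String> serversToReplay) {
        List<String> recovered = new ArrayList<>();
        for (String server : serversToReplay) {
            recovered.addAll(
                logs.getOrDefault(server, Collections.<String>emptyList()));
        }
        return recovered;
    }

    public static void main(String[] args) {
        Map<String, List<String>> logs = scenarioLogs();
        // What the report describes: only T1's mutations are recovered on T3.
        List<String> observed = recover(logs, Arrays.asList("T1"));
        // What would be needed for Y to survive: T2's mutations as well.
        List<String> complete = recover(logs, Arrays.asList("T1", "T2"));
        System.out.println("Y present after replaying T1 only: "
            + observed.contains(Y));
        System.out.println("Y present after replaying T1 and T2: "
            + complete.contains(Y));
    }
}
```

In this model the only difference between the two recoveries is the list of servers whose logs are replayed, which matches the symptom described above: tablet T1 later loads without knowing about F1.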

      There is code that is supposed to handle this situation, but it's not working. I think this issue exists in 1.3.

      Data loss is not certain in this situation. In the scenario above, when MT is loaded on T2, a minor compaction is started. If T2 is killed before this minor compaction completes, data loss will likely occur.

          People

            Assignee: kturner Keith Turner
            Reporter: kturner Keith Turner