Came in after a weekend of running tests to find that the Shard random walk test had lost data in its index table. After debugging, I found that the following sequence of events had occurred.
- Mutation X was written to the shard index on tablet T1
- X was minor compacted to file F1
- Tablet server serving T1 was killed
- When T1 came up on another tablet server, it did not know about F1
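The above sequence of events indicates that the !METADATA table lost data. So I started looking into that and found the following sequence of events. (In this second list, T1, T2, and T3 are tablet servers; "tablet T1" still means the shard index tablet above.)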
- Tablet server T1 serving METADATA tablet MT was killed
- MT comes up on another tablet server T2
- Mutation Y is written to MT about file F1 for tablet T1
- Tablet server T2 is killed
- MT comes up on tablet server T3
- The mutations for MT from T1 are recovered, but not those from T2; therefore Y is lost
There is code that is supposed to handle this situation, but it's not working. I think this issue exists in 1.3.
Data loss is not certain in this situation. In the scenario above, when MT is loaded on T2, a minor compaction is started. If the server is killed before this minor compaction completes, data loss will likely occur.
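To make the metadata recovery sequence above concrete, here is a minimal toy model in Python. It is only a sketch and not Accumulo's recovery code: the `recover` function, the `logs` dictionary, and the earlier mutation `W` are hypothetical names introduced for illustration, and the timing of the minor compaction is not modeled. It shows only that replaying the write-ahead log from T1 but not from T2 drops Y.

```python
def recover(logs, servers_to_replay):
    """Replay the write-ahead logs of the listed servers, in order (toy model)."""
    recovered = []
    for server in servers_to_replay:
        recovered.extend(logs.get(server, []))
    return recovered


# Unflushed mutations for metadata tablet MT, keyed by the tablet server
# that was hosting MT when each mutation was written.
logs = {
    "T1": ["W"],  # hypothetical earlier mutation, written while MT was on T1
    "T2": ["Y"],  # Y records file F1 for tablet T1
}

# What the report describes: only T1's log is replayed when MT loads on T3.
observed = recover(logs, ["T1"])

# What recovery would need to replay so that Y survives.
expected = recover(logs, ["T1", "T2"])

print("Y recovered (observed):", "Y" in observed)
print("Y recovered (expected):", "Y" in expected)
```

In this model, once Y is missing from MT, tablet T1 has no record of F1 when it is next loaded, which matches the first sequence of events.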
|Field||Original Value||New Value|
|Workflow||no-reopen-closed, patch-avail [ 12656698 ]||patch-available, re-open possible [ 12671773 ]|
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Resolution|| ||Fixed [ 1 ]|