Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- 0.7.0
- None
Description
I spent a while trying to debug a failure of alter_table-randomized-test and found the following interesting logs:
- We have two operations in the WAL which arrived in short succession (about 4ms apart) just before an alter table. I've renumbered the txids for readability here:
1.13@2 REPLICATE WRITE_OP op 0: MUTATE (int32 key=1643562) SET c6=1107303203
1.14@4 REPLICATE WRITE_OP op 0: MUTATE (int32 key=1643562) DELETE
- and the flush that was triggered by the alter table has the following snapshots:
... Phase 1 snapshot: MvccSnapshot[committed={T|T < 2 or (T in (4))] ...
... Phase 2 snapshot: MvccSnapshot[committed={T|T < 2 or (T in (4, 2))]
Note that the first snapshot considers the 'DELETE' (txid 4) committed but not the 'UPDATE' (txid 2, the SET mutation above). The second snapshot then picks up the 'UPDATE'. The end result is that we flush REDO deltas as follows:
REDO file 1 (flushed in phase 1): includes only the DELETE
REDO file 2 (flushed after ReupdateMissedDeltas): includes only the UPDATE
When we later proceed to compact this rowset, we get "Check failed: !is_deleted Got UPDATE for deleted row."
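The sequence above can be modelled with a small sketch. This is not Kudu code: the `MvccSnapshot` class below is a hypothetical stand-in that only mirrors the "T < N or T in (...)" form printed in the logs, and `first_update_after_delete` is an invented helper that imitates the check that fails during compaction. It assumes that phase 1 flushes whatever its snapshot sees as committed, and that the second REDO file receives only what became committed in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MvccSnapshot:
    """Hypothetical model: committed = {T | T < all_committed_before or T in extra}."""
    all_committed_before: int
    extra: frozenset

    def is_committed(self, txid: int) -> bool:
        return txid < self.all_committed_before or txid in self.extra


# The two WAL operations on key 1643562, as (txid, kind), in WAL order.
ops = [(2, "UPDATE"), (4, "DELETE")]

# Snapshots as printed in the flush logs.
phase1 = MvccSnapshot(2, frozenset({4}))
phase2 = MvccSnapshot(2, frozenset({4, 2}))

# Assumption: phase 1 writes everything its snapshot considers committed.
redo_file_1 = [op for op in ops if phase1.is_committed(op[0])]

# Assumption: the second file gets only what phase 1 missed.
redo_file_2 = [
    op for op in ops
    if phase2.is_committed(op[0]) and not phase1.is_committed(op[0])
]


def first_update_after_delete(files):
    """Replay REDO files in file order; return the txid of the first UPDATE
    applied to an already-deleted row, or None if there is none."""
    deleted = False
    for redo_file in files:
        for txid, kind in redo_file:
            if kind == "DELETE":
                deleted = True
            elif kind == "UPDATE" and deleted:
                return txid
    return None


print(redo_file_1)
print(redo_file_2)
print(first_update_after_delete([redo_file_1, redo_file_2]))
```

Under these assumptions the DELETE lands in the first file and the UPDATE in the second, so replaying the files in order applies an UPDATE to a row that is already deleted. Replaying the same two operations in txid order from a single file would not trip the check.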
Scenarios like this seem to reproduce a few tenths of a percent of the time in this stress test.
Issue Links
- relates to KUDU-1341 Out of order UNDO in delta file, possibly related to REINSERT (Resolved)