Details
- Type: Bug
- Status: Resolved
- Priority: Urgent
- Resolution: Fixed
- Severity: Critical
Description
While postFlushExecutor ensures it never expires commit log (CL) entries out-of-order, on restart we simply take the maximum replay position of any sstable on disk and ignore anything prior.
It is quite possible for two flushes to be triggered for a given table, and for the second to finish first by virtue of containing a much smaller quantity of live data (or perhaps because the disk is simply under less pressure). If we crash before the first sstable has been written, then on restart the data it would have represented disappears, since we will not replay the CL records it covered. A sketch of this failure mode follows below.
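To make the ordering hazard concrete, here is a minimal, self-contained sketch. The class and field names (Flush, sstableOnDisk, replay-position ranges) are purely illustrative assumptions, not Cassandra's actual code; the point is only that taking the maximum replay position of the sstables on disk skips records covered solely by an unfinished earlier flush.

```java
// Illustrative sketch of the failure mode (hypothetical names, not Cassandra's real classes).
public class ReplayPositionLoss
{
    // Each flush covers commit log records in the replay-position range [start, end).
    static class Flush
    {
        final long start, end;
        boolean sstableOnDisk;
        Flush(long start, long end) { this.start = start; this.end = end; }
    }

    public static void main(String[] args)
    {
        // Flush A is triggered first but is large and slow; flush B is triggered
        // later, is small, and completes first.
        Flush a = new Flush(0, 100);
        Flush b = new Flush(100, 110);
        b.sstableOnDisk = true;      // only B's sstable reaches disk
        // ... crash before A's sstable is written ...

        // On restart, recovery starts at the maximum replay position recorded by
        // any sstable on disk, ignoring everything prior.
        long recoveryStart = 0;
        for (Flush f : new Flush[]{ a, b })
            if (f.sstableOnDisk)
                recoveryStart = Math.max(recoveryStart, f.end);

        // Commit log records before position 110 are not replayed, so the records
        // in [0, 100) covered by the unfinished flush A are silently lost.
        System.out.println("replay starts at position " + recoveryStart);
    }
}
```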
This looks to be a bug present since time immemorial, and also seems pretty serious.
Issue Links
- is related to: CASSANDRA-9840 global_row_key_cache_test.py fails; loses mutations on cluster restart (Resolved)
- relates to: CASSANDRA-9806 some TTL test are failing on trunk: losing data after restart? (Resolved)