[CASSANDRA-5210] DB is randomly and undetectably corrupted during high traffic column family flushes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Duplicate
Fix Version/s: None
Component/s: None
Labels: None
Environment: Cassandra 0.8+, OS/X, java version "1.6.0_37"
Severity: Normal

Description

Writes during high traffic column family flushes corrupt the DB and make slice queries return incorrect data.

Any multi-column write on any version of Cassandra can put the DB in a state where some columns cannot be read alongside other columns.

For example:

{{
// *** for any NON-NULL column (eg. col_a=>AAA)
cqlsh> SELECT 'col_a' FROM test WHERE KEY='row_a';
returns: 'AAA'

// *** it can disappear when queried alongside another column
cqlsh> SELECT 'col_a', 'col_b' FROM test WHERE KEY='row_a';
returns: null, 'BBB' // *** col_a is MISSING

// *** but it depends on the other columns
cqlsh> SELECT 'col_a', 'col_b', 'col_c' FROM test WHERE KEY='row_a';
returns: 'AAA', 'BBB', 'CCC' // *** col_a is BACK
}}

Once in this state the database is corrupt and essentially returning random data depending on what columns you query. Single column queries always return correct results so there is no way to verify the data. No errors are logged during corruption and it is impossible to detect without querying all combinations of all columns.
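The check described above (single-column reads as ground truth, compared against multi-column reads) can be sketched as follows. This is a minimal illustration, not the reporter's script: `read` is a hypothetical callable standing in for whatever client performs the slice query, and `find_inconsistent_reads` is a name invented here. Checking every group of columns is combinatorial, which is why this is impractical for wide rows.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical read function (an assumption, not a real driver API):
# (row_key, column_names) -> {column_name: value, or None if missing}.
ReadFn = Callable[[str, Sequence[str]], Dict[str, Optional[str]]]


def find_inconsistent_reads(
    read: ReadFn,
    row_key: str,
    columns: Sequence[str],
    group_size: int = 2,
) -> List[Tuple[Tuple[str, ...], str]]:
    """Return (group, column) pairs where a multi-column read of `group`
    disagrees with the single-column read of `column`."""
    # Per the report, single-column reads always return correct results,
    # so they are used as the reference values.
    truth = {c: read(row_key, [c]).get(c) for c in columns}

    mismatches: List[Tuple[Tuple[str, ...], str]] = []
    # Every combination of `group_size` columns is queried together.
    for group in combinations(columns, group_size):
        result = read(row_key, list(group))
        for c in group:
            if result.get(c) != truth[c]:
                mismatches.append((group, c))
    return mismatches
```

Against the example above, a call with columns `col_a`, `col_b`, `col_c` and `group_size=2` would flag `col_a` in the group `(col_a, col_b)`.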

To reproduce:

1. Unzip a distribution of Cassandra and create a test.test column family.
2. In a loop, alternate between updating row 'a' and updating a random row. On each update, write a random value to four random columns (out of 10000). Keep track of all columns set in row 'a'.
3. On each pass through the loop, query four random columns (out of 10000) from row 'a'. If a column that is known to be set comes back null, print the columns that were requested in that query.
4. The DB is now corrupt and will return the column if queried by itself but will return null if queried alongside the columns that triggered the error. This is a permanent condition.
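The loop in steps 2 and 3 can be sketched roughly as below. This is not the Groovy script mentioned in this report; it is an illustrative outline in which `write` and `read` are hypothetical callables wrapping whatever client is used, and the names `run_repro`, `col_N`, and `row_N` are invented for the sketch.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Set

# Hypothetical client hooks (assumptions, not a real driver API):
#   write(row_key, {column: value})  stores the given columns in the row
#   read(row_key, [columns])         returns {column: value or None}
WriteFn = Callable[[str, Dict[str, str]], None]
ReadFn = Callable[[str, Sequence[str]], Dict[str, Optional[str]]]


def run_repro(
    write: WriteFn,
    read: ReadFn,
    iterations: int,
    rng: random.Random,
    num_columns: int = 10000,
    columns_per_op: int = 4,
) -> List[List[str]]:
    """Run the write/read loop and return every query that lost a known column."""
    known: Set[str] = set()          # columns known to be set in row 'a'
    failures: List[List[str]] = []

    def pick_columns() -> List[str]:
        return ["col_%d" % n for n in rng.sample(range(num_columns), columns_per_op)]

    for i in range(iterations):
        # Step 2: alternate between row 'a' and a random row.
        row = "a" if i % 2 == 0 else "row_%d" % rng.randrange(1_000_000)
        cols = pick_columns()
        write(row, {c: str(rng.random()) for c in cols})
        if row == "a":
            known.update(cols)

        # Step 3: query random columns from row 'a' and look for a column
        # that is known to be set but comes back null.
        query = pick_columns()
        result = read("a", query)
        if any(c in known and result.get(c) is None for c in query):
            print("missing known column; requested:", query)
            failures.append(query)
    return failures
```

Each entry in the returned list is a set of columns that, per step 4, would be expected to keep reproducing the null result when queried together.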

Observations: This bug manifests only directly after a high-traffic column family flush appears in the log. This is only a correlation, based on watching the log. There are no errors or warnings of any kind.

Workaround: Any multi-column read is potentially invalid and corruption is virtually undetectable. The only workaround is never writing or reading more than a single column in a query.

I have a simple Groovy script that can trigger the error. I have verified the behavior on Cassandra versions as old as 0.8.1.

Issue Links

is duplicated by

CASSANDRA-5225 Missing columns, errors when requesting specific columns from wide rows

Resolved

People

Assignee: Unassigned

Reporter: Elden Bishop

Votes: 0

Watchers: 2

Dates

Created: 31/Jan/13 21:16

Updated: 16/Apr/19 09:32

Resolved: 12/Feb/13 19:19