  1. Apache Cassandra
  2. CASSANDRA-5210

DB is randomly and undetectably corrupted during high traffic column family flushes


Details

    • Bug
    • Status: Resolved
    • Normal
    • Resolution: Duplicate
    • Environment: Cassandra 0.8+, OS X, java version "1.6.0_37"

    • Normal

    Description

      Writes during high traffic column family flushes corrupt the DB and make slice queries return incorrect data.

      Any multi-column write on any version of Cassandra can put the DB in a state where some columns cannot be read alongside other columns.

      For example:

      {{
      // *** for any NON-NULL column (eg. col_a=>AAA)
      cqlsh> SELECT 'col_a' FROM test WHERE KEY='row_a';
      returns: 'AAA'

      // *** it can disappear when queried alongside another column
      cqlsh> SELECT 'col_a', 'col_b' FROM test WHERE KEY='row_a';
      returns: null, 'BBB' // *** col_a is MISSING

      // *** but it depends on the other columns
      cqlsh> SELECT 'col_a', 'col_b', 'col_c' FROM test WHERE KEY='row_a';
      returns: 'AAA', 'BBB', 'CCC' // *** col_a is BACK
      }}

      Once in this state, the database is corrupt and returns effectively random data depending on which columns you query. Single-column queries always return correct results, so checking columns one at a time cannot reveal the problem. No errors are logged during corruption, and it is impossible to detect without querying all combinations of all columns.
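
      To give a rough sense of scale for "all combinations of all columns", the short sketch below counts the distinct multi-column queries over a 10000-column row. It assumes the 10000-column pool from the reproduction steps and considers only sets of 2, 3, or 4 columns. The function name is hypothetical and the figures are plain combinatorics, not measurements from Cassandra.

```python
import math

NUM_COLUMNS = 10000  # assumption: the column pool used in the reproduction steps


def multi_column_queries(num_columns, query_size):
    """Count the distinct unordered column sets of the given size."""
    return math.comb(num_columns, query_size)


for size in (2, 3, 4):
    # Each of these column sets would need its own query to rule out a missing column.
    print(size, multi_column_queries(NUM_COLUMNS, size))
```

      Even at four columns per query the count is in the hundreds of trillions, which is consistent with the report's point that exhaustive verification is impractical.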

      To reproduce:

      1. Unzip a distribution of Cassandra and create a test.test column family.
      2. In a loop, alternate between updating row 'a' and updating a random row. On each update, write a random value to four random columns (out of 10000). Keep track of all columns set in row 'a'.
      3. On each pass through the loop, query four random columns (out of 10000) from row 'a'. If a column that is known to be set comes back null, print out the columns that were requested in that query.
      4. The DB is now corrupt and will return the column if queried by itself but will return null if queried alongside the columns that triggered the error. This is a permanent condition.
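
      The loop in steps 2 to 4 can be sketched roughly as follows. This is not the reporter's Groovy script, which is not attached here. It is a minimal Python sketch of the bookkeeping only, and every name in it (InMemoryStore, run, random_columns, the col_N and row_N naming) is hypothetical. InMemoryStore is a stand-in that never loses data; to exercise Cassandra, it would have to be replaced by an object whose write and read methods call a real client.

```python
import random

NUM_COLUMNS = 10000   # column pool size, from the reproduction steps
COLUMNS_PER_OP = 4    # columns written or read per operation


class InMemoryStore:
    """Hypothetical stand-in for a Cassandra client; swap in real reads and writes."""

    def __init__(self):
        self.rows = {}

    def write(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)

    def read(self, row_key, column_names):
        row = self.rows.get(row_key, {})
        return {name: row.get(name) for name in column_names}


def random_columns(rng):
    """Pick four distinct column names out of the 10000-column pool."""
    return ["col_%d" % i for i in rng.sample(range(NUM_COLUMNS), COLUMNS_PER_OP)]


def run(store, passes, seed=0):
    rng = random.Random(seed)
    known = {}     # columns known to be set in row 'a'
    failures = []  # (requested, missing, confirmed) for each bad multi-column read
    for i in range(passes):
        # Step 2: alternate between row 'a' and a random row.
        row_key = "a" if i % 2 == 0 else "row_%d" % rng.randrange(1000000)
        values = {name: str(rng.random()) for name in random_columns(rng)}
        store.write(row_key, values)
        if row_key == "a":
            known.update(values)
        # Step 3: query four random columns from row 'a'.
        requested = random_columns(rng)
        result = store.read("a", requested)
        missing = [c for c in requested if c in known and result.get(c) is None]
        if missing:
            # Step 4: check whether each missing column is readable by itself.
            confirmed = [c for c in missing if store.read("a", [c]).get(c) is not None]
            print("requested:", requested, "missing:", missing)
            failures.append((requested, missing, confirmed))
    return failures


if __name__ == "__main__":
    print(len(run(InMemoryStore(), passes=1000)), "suspect reads")
```

      With the in-memory stand-in the failure list stays empty. Against a store showing the behavior described above, each entry would record the requested column set, the columns that came back null, and which of those are still readable on their own.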

      Observations: This bug only manifests directly after a high-traffic column family flush is reported in the log. This is a correlation based solely on watching the log; there are no errors or warnings of any kind.

      Workaround: Because any multi-column read is potentially invalid and the corruption is virtually undetectable, the only workaround is to never write or read more than a single column per query.

      I have a simple Groovy script that can trigger the error. I have verified the behavior on Cassandra versions as old as 0.8.1.

            People

              Assignee: Unassigned
              Reporter: Elden Bishop (eldenbishop)