Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-6531

Failure to start after unclean shutdown - java.lang.IllegalArgumentException: bufferSize must be positive

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.14, 2.0.5
    • Component/s: None
    • Labels:
    • Environment:

      Description

      We had a severe power outage in the lab that resulted in unclean shutdown of the Cassandra servers. After the power was back I tried to start the cluster. Two out of 6 nodes cannot start because of this exception:

       INFO 20:47:11,003 Initializing system.local
       INFO [main] 2013-12-27 20:47:11,003 ColumnFamilyStore.java (line 251) Initializing system.local
       INFO 20:47:11,006 Opening /hadoop/disk1/cassandra/data/system/local/system-local-jb-2478 (5836 bytes)
       INFO [SSTableBatchOpen:1] 2013-12-27 20:47:11,006 SSTableReader.java (line 223) Opening /hadoop/disk1/cassandra/data/system/local/system-local-jb-2478 (5836 bytes)
       INFO 20:47:11,006 Opening /hadoop/disk4/cassandra/data/system/local/system-local-jb-2479 (144 bytes)
       INFO [SSTableBatchOpen:2] 2013-12-27 20:47:11,006 SSTableReader.java (line 223) Opening /hadoop/disk4/cassandra/data/system/local/system-local-jb-2479 (144 bytes)
      ERROR 20:47:12,366 Exception encountered during startup
      java.lang.IllegalArgumentException: bufferSize must be positive
              at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:67)
              at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
              at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:55)
              at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1363)
              at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
              at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1147)
              at org.apache.cassandra.db.RowIteratorFactory.getIterator(RowIteratorFactory.java:69)
              at org.apache.cassandra.db.ColumnFamilyStore.getSequentialIterator(ColumnFamilyStore.java:1526)
              at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1645)
              at org.apache.cassandra.db.RangeSliceCommand.executeLocally(RangeSliceCommand.java:137)
              at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:236)
              at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:1)
              at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:255)
              at org.apache.cassandra.db.SystemKeyspace.getUnfinishedCompactions(SystemKeyspace.java:206)
              at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:261)
              at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
              at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
      

      Collecting the logs now, will attach to the issue in a moment.

      1. 6531.txt
        2 kB
        Jonathan Ellis
      2. cassandra_jstack.txt
        24 kB
        Nikolai Grigoriev
      3. system.log.gz
        11 kB
        Nikolai Grigoriev
      4. system.log.gz
        3 kB
        Nikolai Grigoriev

        Activity

        Hide
        ngrigoriev Nikolai Grigoriev added a comment -

        system log with default logging levels

        Show
        ngrigoriev Nikolai Grigoriev added a comment - system log with default logging levels
        Hide
        ngrigoriev Nikolai Grigoriev added a comment -

        Startup log with default logging level set to TRACE (log4j.rootLogger=TRACE,stdout,R). Interestingly enough I do not see the original exception in the log when I change it to TRACE and Cassandra process does not terminate. So I include the thread dump just in case.

        Show
        ngrigoriev Nikolai Grigoriev added a comment - Startup log with default logging level set to TRACE (log4j.rootLogger=TRACE,stdout,R). Interestingly enough I do not see the original exception in the log when I change it to TRACE and Cassandra process does not terminate. So I include the thread dump just in case.
        Hide
        jbellis Jonathan Ellis added a comment -

        Patch to fsync compression metadata on close. As near as I can tell this is a problem back to the introduction of compression in 1.0 (CASSANDRA-47). /cc Pavel Yaskevich T Jake Luciani

        Show
        jbellis Jonathan Ellis added a comment - Patch to fsync compression metadata on close. As near as I can tell this is a problem back to the introduction of compression in 1.0 ( CASSANDRA-47 ). /cc Pavel Yaskevich T Jake Luciani
        Hide
        ngrigoriev Nikolai Grigoriev added a comment -

        Do I understand correctly that this patch should eliminate/reduce the possibility of this to happen but won't help to start the node in my case? If so, I am going to try to abuse this JIRA a bit and ask if anyone could recommend a method of bringing this node back to life without wiping it? If such a method exists, of course. Thanks!

        Show
        ngrigoriev Nikolai Grigoriev added a comment - Do I understand correctly that this patch should eliminate/reduce the possibility of this to happen but won't help to start the node in my case? If so, I am going to try to abuse this JIRA a bit and ask if anyone could recommend a method of bringing this node back to life without wiping it? If such a method exists, of course. Thanks!
        Hide
        jbellis Jonathan Ellis added a comment -

        The only straightforward approach (i.e.: doesn't involve writing code to rebuild the corrupt offsets from the data component) is to remove the affected sstables and repair.

        Show
        jbellis Jonathan Ellis added a comment - The only straightforward approach (i.e.: doesn't involve writing code to rebuild the corrupt offsets from the data component) is to remove the affected sstables and repair.
        Hide
        xedin Pavel Yaskevich added a comment -

        Jonathan Ellis I think we should fsync twice in this case - first time whe the header is finalized and second time on close.

        Show
        xedin Pavel Yaskevich added a comment - Jonathan Ellis I think we should fsync twice in this case - first time whe the header is finalized and second time on close.
        Hide
        jbellis Jonathan Ellis added a comment -

        It's still -tmp at this point so a single sync is sufficient.

        Show
        jbellis Jonathan Ellis added a comment - It's still -tmp at this point so a single sync is sufficient.
        Hide
        ngrigoriev Nikolai Grigoriev added a comment -

        I believe the log messages were somewhat misleading. After messing with debug logs for a while I have found that the problem was, apparently, not with system KS. I have removed all stuff *compactions_in_progress* and that helped one of these two nodes in question to start. Another one seems to have an identity crisis as it believes that another node with the same address has already joined the cluster...but that seems to be another story.

        Show
        ngrigoriev Nikolai Grigoriev added a comment - I believe the log messages were somewhat misleading. After messing with debug logs for a while I have found that the problem was, apparently, not with system KS. I have removed all stuff *compactions_in_progress* and that helped one of these two nodes in question to start. Another one seems to have an identity crisis as it believes that another node with the same address has already joined the cluster...but that seems to be another story.
        Hide
        xedin Pavel Yaskevich added a comment -

        Ah, I completely forgot about that, sure.

        Show
        xedin Pavel Yaskevich added a comment - Ah, I completely forgot about that, sure.
        Hide
        jbellis Jonathan Ellis added a comment -

        committed

        Show
        jbellis Jonathan Ellis added a comment - committed
        Hide
        jbellis Jonathan Ellis added a comment -

        (Incidentally, the fix for the compactions in progress bug will be committed soon for CASSANDRA-6086.)

        Show
        jbellis Jonathan Ellis added a comment - (Incidentally, the fix for the compactions in progress bug will be committed soon for CASSANDRA-6086 .)
        Hide
        ngrigoriev Nikolai Grigoriev added a comment -

        Just one more suggestion for the fix. I think it is important to log the name of the problematic file every time an exception is thrown. This helps to correctly associate the exception with the file. In this case RandomAccessReader.java:67 throws an exception without this information. I have replaced that line with:

                    throw new IllegalArgumentException("bufferSize must be positive while reading " + file.getAbsolutePath());
        
        Show
        ngrigoriev Nikolai Grigoriev added a comment - Just one more suggestion for the fix. I think it is important to log the name of the problematic file every time an exception is thrown. This helps to correctly associate the exception with the file. In this case RandomAccessReader.java:67 throws an exception without this information. I have replaced that line with: throw new IllegalArgumentException( "bufferSize must be positive while reading " + file.getAbsolutePath());
        Show
        mshuler Michael Shuler added a comment - committed patch causes: http://cassci.datastax.com/job/cassandra-2.0_test/76/testReport/junit/org.apache.cassandra.io.compress/CompressedRandomAccessReaderTest/testResetAndTruncateCompressed/
        Hide
        jbellis Jonathan Ellis added a comment -

        Updated close method to be idempotent.

        Show
        jbellis Jonathan Ellis added a comment - Updated close method to be idempotent.

          People

          • Assignee:
            jbellis Jonathan Ellis
            Reporter:
            ngrigoriev Nikolai Grigoriev
            Reviewer:
            Pavel Yaskevich
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development