  Apache Cassandra
  CASSANDRA-4601

Ensure unique commit log file names


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Urgent
    • Resolution: Fixed
    • Fix Version/s: 1.1.5
    • Environment: Sun JVM 1.6.33 / Ubuntu 10.04.4 LTS
    • Severity: Critical

    Description

      The commit log segment name uses System.nanoTime() as part of the file name. There is no guarantee that successive calls to nanoTime() will return different values, and on less-than-optimal hypervisors identical values occur frequently.
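
      For illustration, a quick check like the following counts how often two back-to-back nanoTime() calls return the same value on a given box; any such duplicate translates directly into a duplicate CommitLog-<id>.log name:

      public class NanoTimeCollisionCheck
      {
          public static void main(String[] args)
          {
              long previous = System.nanoTime();
              int duplicates = 0;
              for (int i = 0; i < 10000000; i++)
              {
                  long current = System.nanoTime();
                  // With a coarse clock source (common on some hypervisors),
                  // two successive calls can return exactly the same value.
                  if (current == previous)
                      duplicates++;
                  previous = current;
              }
              System.out.println("identical successive nanoTime() values: " + duplicates);
          }
      }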

      I observed the following in the wild:

      ERROR [COMMIT-LOG-ALLOCATOR] 2012-08-31 15:56:49,815 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]
      java.lang.AssertionError: attempted to delete non-existing file CommitLog-13926764209796414.log
              at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:68)
              at org.apache.cassandra.db.commitlog.CommitLogSegment.discard(CommitLogSegment.java:172)
              at org.apache.cassandra.db.commitlog.CommitLogAllocator$4.run(CommitLogAllocator.java:223)
              at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
              at java.lang.Thread.run(Unknown Source)
      

      My assumption is that the failure was caused by duplicate file names, since this node runs on a less-than-optimal hypervisor.

      After a while (about 30 minutes) mutations stopped being processed and the pending count skyrocketed. I think this was because log writing was blocked waiting for a new segment, so writers could not submit to the commit log queue. The only way to stop the affected nodes was kill -9.

      Over about 24 hours this happened 5 times. I have deployed a patch that has been running for 12 hours without incident; I will attach it.

      The affected nodes could still serve reads, and I'm checking logs to see how the other nodes handled the situation.
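
      As a minimal sketch of one way to guarantee unique names (a hypothetical SegmentIdGenerator helper for illustration, not the attached patch), the segment id can come from an atomic counter seeded from the wall clock, so every id handed out within the process is strictly increasing:

      import java.util.concurrent.atomic.AtomicLong;

      public final class SegmentIdGenerator
      {
          // Seed from wall-clock time so ids stay roughly time-ordered across
          // restarts, then hand out strictly increasing values so that no two
          // segments created by this process can ever share a file name.
          private static final AtomicLong nextId = new AtomicLong(System.currentTimeMillis());

          public static String nextSegmentFileName()
          {
              return "CommitLog-" + nextId.incrementAndGet() + ".log";
          }
      }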

      Attachments

        Activity


          People

            Assignee: Aaron Morton
            Reporter: Aaron Morton
            Authors: Aaron Morton
            Reviewers: Jonathan Ellis
            Votes: 0
            Watchers: 0

