[CASSANDRA-4601] Ensure unique commit log file names - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Urgent
Resolution: Fixed
Fix Version/s: 1.1.5
Component/s: None
Labels: None
Environment: Sun JVM 1.6.33 / Ubuntu 10.04.4 LTS
Severity: Critical

Description

The commit log segment file name includes the value of System.nanoTime(). Successive calls to nanoTime() are not guaranteed to return different values, and on less-than-optimal hypervisors they frequently return the same value.
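One way to avoid the collision described above is to take the id from a counter that is only seeded from the clock, so that every call returns a distinct value within a single JVM. The following is a minimal sketch under that assumption, not necessarily what the attached patch does. The class and method names (SegmentIdSketch, nextId, nextFileName) are hypothetical and do not come from the Cassandra code base.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the actual Cassandra implementation.
final class SegmentIdSketch {
    // Seeded once from the wall clock when the class is loaded, so ids from a
    // new process normally sort after ids from earlier runs (assumes the clock
    // has not moved backwards and segments are not created faster than one
    // per millisecond on average).
    private static final AtomicLong NEXT_ID = new AtomicLong(System.currentTimeMillis());

    // Atomically increments the counter, so two calls in the same JVM can
    // never return the same value, regardless of timer resolution.
    static long nextId() {
        return NEXT_ID.incrementAndGet();
    }

    // Builds a file name in the same style as the one shown in the report.
    static String nextFileName() {
        return "CommitLog-" + nextId() + ".log";
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            if (!names.add(nextFileName())) {
                throw new IllegalStateException("duplicate segment name");
            }
        }
        System.out.println(names.size() + " unique names generated");
    }
}
```

The counter guarantees uniqueness only within one process. Uniqueness across restarts rests on the clock-based seed and is therefore a weaker guarantee.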

I observed the following in the wild:

ERROR [COMMIT-LOG-ALLOCATOR] 2012-08-31 15:56:49,815 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]
java.lang.AssertionError: attempted to delete non-existing file CommitLog-13926764209796414.log
        at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:68)
        at org.apache.cassandra.db.commitlog.CommitLogSegment.discard(CommitLogSegment.java:172)
        at org.apache.cassandra.db.commitlog.CommitLogAllocator$4.run(CommitLogAllocator.java:223)
        at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.lang.Thread.run(Unknown Source)

My assumption is that this was caused by duplicate file names, since these nodes run on a less-than-optimal hypervisor.

After a while (about 30 minutes) mutations stopped being processed and the pending count skyrocketed. I think this was because log writing was blocked waiting for a new segment, so writers could not submit to the commit log queue. The only way to stop the affected nodes was kill -9.

Over about 24 hours this happened 5 times. I have deployed a patch that has been running for 12 hours without incident and will attach it.

The affected nodes could still read, and I'm checking logs to see how the other nodes handled the situation.

Attachments

cassandra-1.1-4601.patch, 3 kB, 02/Sep/12 23:13, Aaron Morton

People

Assignee: Aaron Morton

Reporter: Aaron Morton

Authors: Aaron Morton

Reviewers: Jonathan Ellis

Votes: 0

Watchers: 0

Dates

Created: 02/Sep/12 22:37

Updated: 16/Apr/19 09:32

Resolved: 04/Sep/12 20:09
