[CASSANDRA-19564] MemtablePostFlush deadlock leads to stuck nodes and crashes - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Urgent
Resolution: Unresolved
Fix Version/s: 4.1.x
Component/s: Local/Compaction, Local/Memtable
Labels: None
Bug Category: Availability - Process Crash
Severity: Critical
Complexity: Normal
Discovered By: User Report
Platform: All
Impacts: None
Since Version: 4.1.4

Description

I've run into an issue on a 4.1.4 cluster where an entire node has locked up due to what I believe is a deadlock in memtable flushing. Here's what I know so far. I've stitched together what happened based on conversations, logs, and some flame graphs.

Log reports memtable flushing

The last successful flush happens at 12:19.

INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', ColumnFamily='version') to free up room. Used total: 0.24/0.33, live: 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15
INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB (13%) on-heap, 790.606MiB (15%) off-heap

MemtablePostFlush appears to be blocked

At this point, the MemtablePostFlush completed task count stops incrementing, active stays at 1, and pending starts to rise.

MemtablePostFlush   1    1   3446   0   0
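
To illustrate why a single stuck task produces exactly this pattern, here is a minimal sketch using a plain JDK ThreadPoolExecutor with one thread. It is not Cassandra's executor; the class name, the latch standing in for whatever PostFlush is waiting on, and the task counts are all hypothetical. One blocked task pins active at 1, every later submission lands in the queue, and the completed count never moves.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: a generic single-threaded pool, not Cassandra code.
public class PostFlushStallSketch {

    /** Returns {active, pending, completed} after one task blocks a single-threaded pool. */
    public static long[] simulate(int queuedTasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1); // stands in for the unknown blocker

        // The first task starts running and then waits on something that never arrives.
        pool.execute(() -> {
            started.countDown();
            try {
                neverReleased.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        try {
            started.await(); // make sure the only worker thread has picked up the blocked task
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Later tasks can only queue up behind the blocked one.
        for (int i = 0; i < queuedTasks; i++) {
            pool.execute(() -> { });
        }

        long[] stats = {
            pool.getActiveCount(),
            pool.getQueue().size(),
            pool.getCompletedTaskCount()
        };
        pool.shutdownNow(); // interrupt the stuck task so the JVM can exit
        return stats;
    }

    public static void main(String[] args) {
        long[] s = simulate(5);
        System.out.println("active=" + s[0] + " pending=" + s[1] + " completed=" + s[2]);
        // prints "active=1 pending=5 completed=0"
    }
}
```

In the sketch the only way out is to interrupt the worker, which loosely mirrors the need to restart the node.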

The flame graph (see attachments) shows that PostFlush.call is stuck. I don't have the line number, but I know we're stuck in org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call.

Memtable flushing is now blocked.

All MemtableFlushWriter threads are parked waiting on OpOrder.Barrier.await. A 30-second wall-clock profile shows that all of their time is spent there. Presumably they're waiting on the single-threaded MemtablePostFlush executor.

Memtable allocations start to block

Eventually it looks like the NativeAllocator stops successfully allocating memory. I assume it's waiting on memory to be freed, but since memtable flushes are blocked, we wait indefinitely.

Looking at a wall-clock flame graph, all writer threads have reached the allocation failure path of MemtableAllocator.allocate(). I believe we're waiting on signal.awaitThrowUncheckedOnInterrupt().

 MutationStage    48    828425      980253369      0    0
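
The dependency described above (writers can only allocate once a flush hands memory back) can be sketched with a bounded pool. This is a simplified model under stated assumptions, not the NativeAllocator implementation: the class, its method names, and the byte counts are hypothetical, and a timeout is used only so the example terminates, whereas the behavior reported here is an indefinite wait.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a memtable memory pool; not Cassandra's allocator.
public class AllocationStallSketch {
    private final Semaphore freeBytes;

    public AllocationStallSketch(int limitBytes) {
        this.freeBytes = new Semaphore(limitBytes);
    }

    /** Tries to reserve memory, waiting up to timeoutMillis for a flush to free space. */
    public boolean allocate(int bytes, long timeoutMillis) {
        try {
            return freeBytes.tryAcquire(bytes, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** What a completed flush would do: return the flushed memtable's memory to the pool. */
    public void releaseAfterFlush(int bytes) {
        freeBytes.release(bytes);
    }

    public static void main(String[] args) {
        AllocationStallSketch pool = new AllocationStallSketch(100);

        // Writes fill the pool.
        System.out.println("fill pool: " + pool.allocate(100, 10));           // true

        // Flushing is stalled, so nothing is released and the next write cannot proceed.
        System.out.println("write while stalled: " + pool.allocate(10, 50));  // false

        // Once a flush completes and frees memory, writes proceed again.
        pool.releaseAfterFlush(100);
        System.out.println("write after flush: " + pool.allocate(10, 10));    // true
    }
}
```

With no timeout and no flush ever completing, the second call would never return, which matches every MutationStage thread sitting in the allocation wait.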

Compaction Stops

Because compactions write to the compaction history table, and that write requires memtable space, compactions are now blocked as well.

The node is now doing basically nothing and must be restarted.

Attachments

- screenshot-1.png, 19/Apr/24 17:03, 183 kB, Jon Haddad
- image-2024-04-17-19-14-34-344.png, 18/Apr/24 02:14, 403 kB, Jon Haddad
- image-2024-04-17-19-13-06-769.png, 18/Apr/24 02:13, 157 kB, Jon Haddad
- image-2024-04-17-18-46-29-474.png, 18/Apr/24 01:46, 217 kB, Jon Haddad
- image-2024-04-16-13-53-24-455.png, 16/Apr/24 20:53, 245 kB, Jon Haddad
- image-2024-04-16-13-43-11-064.png, 16/Apr/24 20:43, 96 kB, Jon Haddad
- image-2024-04-16-12-29-15-386.png, 16/Apr/24 19:29, 98 kB, Jon Haddad
- image-2024-04-16-11-55-54-750.png, 16/Apr/24 18:55, 427 kB, Jon Haddad

People

Assignee: Runtian Liu

Reporter: Jon Haddad

Authors: Runtian Liu

Votes: 0

Watchers: 13

Dates

Created: 16/Apr/24 20:59

Updated: 22/Nov/24 13:44