CASSANDRA-14898: Key cache loading is very slow when there are many SSTables


Details

    • Priority: Low
    • Test and Documentation Plan: Pre-commit tests on 3.0, 3.11, 4.0 and trunk. The included microbench shows key cache loading dropping from ~140 ms to ~20 ms per load with 1000 sstables.

    Description

      While dealing with a production issue today where some 3.0.17 nodes had close to 8k sstables on disk due to excessive write pressure, we had a few nodes crash due to OOM and then take close to 17 minutes to load the key cache and recover. This slow key cache load significantly extended the outage (to mitigate, we just removed the saved key cache files). For example, here is one node taking 17 minutes to load 10k keys, or about 10 keys per second (which is ... very slow):

      INFO  [pool-3-thread-1] 2018-11-15 21:50:21,885 AutoSavingCache.java:190 - reading saved cache /mnt/data/cassandra/saved_caches/KeyCache-d.db
      INFO  [pool-3-thread-1] 2018-11-15 22:07:16,490 AutoSavingCache.java:166 - Completed loading (1014606 ms; 10103 keys) KeyCache cache
      

      I've witnessed similar behavior in the past with large LCS clusters, and indeed it appears that any time the number of sstables is large, KeyCache loading takes a really long time. Today I got a flame graph, and I believe I found the issue and that it's reasonably easy to fix. From what I can tell, the KeyCacheSerializer::deserialize method, which is called for every key, is linear in the number of sstables due to the call to ColumnFamilyStore::getSSTables, which ends up calling View::select. The View::select call is linear in the number of sstables and causes a lot of HashSet resizing when the number of sstables is much greater than 16 (the default capacity of the backing HashMap).
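      A self-contained sketch modeling that pattern (the class and its select() stand-in are illustrative, not the actual Cassandra source) makes the O(keys * sstables) blowup easy to reproduce:

          // Model of the per-key hot path: every deserialized key triggers a
          // fresh copy of all live sstables into a default-sized HashSet.
          import java.util.ArrayList;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          public class KeyCacheLoadModel
          {
              // Stand-in for View::select: copies every live sstable into a
              // new HashSet that starts at the default capacity of 16.
              static Set<Integer> select(List<Integer> liveGenerations)
              {
                  Set<Integer> sstables = new HashSet<>();   // no pre-sizing
                  for (Integer generation : liveGenerations)
                      sstables.add(generation);              // O(#sstables) plus rehashing
                  return sstables;
              }

              public static void main(String[] args)
              {
                  int sstableCount = 8_000, keyCount = 10_000;
                  List<Integer> live = new ArrayList<>();
                  for (int g = 0; g < sstableCount; g++)
                      live.add(g);

                  long start = System.nanoTime();
                  // One select() per key => O(keys * sstables) total work.
                  for (int key = 0; key < keyCount; key++)
                      select(live);
                  System.out.printf("%d keys x %d sstables: %d ms%n",
                                    keyCount, sstableCount,
                                    (System.nanoTime() - start) / 1_000_000);
              }
          }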

      As we see in the attached flamegraph, we spend 50% of our CPU time in these getSSTables calls, of which 36% is spent adding sstables to the HashSet in View::select and 17% is spent just iterating the sstables in the first place. A full 16% of CPU time is spent just resizing the HashMap. Another 4% is spent calling CacheService::findDesc, which does a linear search for the sstable generation.
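      As a back-of-the-envelope check on that resizing cost (assuming HashMap's default initial capacity of 16 and load factor of 0.75), growing a set to 8k entries forces ten table doublings, and every doubling re-buckets each element inserted so far:

          // Counts the table doublings a default HashSet/HashMap performs
          // while growing from capacity 16 to hold n entries (load factor 0.75).
          public class ResizeCount
          {
              public static void main(String[] args)
              {
                  int n = 8_000, capacity = 16, resizes = 0;
                  while (n > capacity * 0.75)
                  {
                      capacity <<= 1;   // double the table and rehash everything
                      resizes++;
                  }
                  // Prints: capacity 16384 after 10 resizes for 8000 entries
                  System.out.printf("capacity %d after %d resizes for %d entries%n",
                                    capacity, resizes, n);
              }
          }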

      I believe this affects at least Cassandra 3.0.17 and trunk, and it could be fixed pretty easily by either caching the getSSTables call or, at the very least, pre-sizing the HashSet in View::select to the size of the sstables map.
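      A minimal sketch of the pre-sizing half of that fix (the capacity math mirrors HashMap's 0.75 load factor; Guava, which Cassandra already ships, does the same computation in Sets.newHashSetWithExpectedSize):

          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          public class PreSizedSelect
          {
              // Sized so that `expected` insertions never trigger a rehash
              // under HashMap's default 0.75 load factor.
              static <T> Set<T> newSetWithExpectedSize(int expected)
              {
                  return new HashSet<>((int) (expected / 0.75f) + 1);
              }

              // Pre-sized variant of the select() copy loop: the iteration is
              // still O(#sstables), but with zero rehashes instead of ~10 at 8k.
              static <T> Set<T> select(List<T> liveSSTables)
              {
                  Set<T> sstables = newSetWithExpectedSize(liveSSTables.size());
                  sstables.addAll(liveSSTables);
                  return sstables;
              }
          }

      Caching the getSSTables result for the duration of the load would be the bigger win, since it removes the per-key copy entirely rather than just making each copy cheaper.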

      Attachments

        1. key_cache_load_slow.svg (435 kB, Joey Lynch)

      Issue Links

      Activity

      People

        Assignee: n.v.harikrishna
        Reporter: Joey Lynch
        Authors: n.v.harikrishna
        Reviewers: Joey Lynch, Marcus Eriksson
        Votes: 1
        Watchers: 13

      Dates

        Created:
        Updated:
        Resolved:

      Time Tracking

        Estimated: Not Specified
        Remaining: 0h
        Logged: 1.5h