CASSANDRA-14611

Quorum reads of wide partitions are expensive due to inefficient hashing in AbstractCell


Details

    • Type: Improvement
    • Status: Open
    • Priority: Low
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Legacy/Coordination
    • Environment: AWS, i3.4xlarge (very fast NVMe drives), Linux 4.13; Cassandra 3.0.16, Java 8u162

    Description

      While analyzing flame graphs as part of investigating CASSANDRA-14605, I noticed that the digest overhead for quorum reads of partitions with many small columns is much higher than I would expect.

      In this particular use case, LOCAL_QUORUM reads of partitions with (p50, p95, p99) = (60, 5k, 20k) small columns (a set<int>) spend 45%+ of Cassandra's CPU time on digest requests. About 20% goes to reading the columns into the iterator abstraction (bummer), but surprisingly about 20% goes to calculating the digests themselves. I would expect digest calculation itself to be closer to 2% of the total. I've attached a flame graph showing this behavior.

      The part that is particularly concerning to me is that ~17% of CPU time (~1/3 of the overall digest time) is spent in AbstractCell::digest, yet only ~2% is spent digesting the actual cell values. Looking deeper, the implementations of FBUtilities::updateWithInt and the related helpers (updateWithBoolean, updateWithLong, etc.) are very inefficient: they call update on the underlying MessageDigest one byte at a time. When you have lots of small cells, the overhead of this approach is significant. In particular, the underlying MD5 implementation is not as efficient as it should be for many single-byte updates; it does buffer them into a 64-byte block, but that is too small to reach maximum throughput. In my local benchmarking, just grouping the per-cell metadata updates into a single 13-byte array yields a 3x performance improvement, at the cost of allocating 13 bytes per cell during the hash that we previously did not allocate.
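
      To make the difference concrete, here is a rough sketch (not the actual Cassandra code) contrasting the per-byte style of the FBUtilities helpers with the grouping I benchmarked. It assumes the per-cell metadata being digested is the 8-byte timestamp, the 4-byte ttl and a 1-byte counter flag (which is where the 13 bytes come from); the class and method names are made up for illustration.

      import java.nio.ByteBuffer;
      import java.security.MessageDigest;

      final class CellDigestSketch
      {
          // Roughly what the current helpers do: each metadata field is fed to the
          // MessageDigest one byte at a time, so one cell's metadata alone costs
          // thirteen separate update() calls.
          static void perByteUpdateWithInt(MessageDigest digest, int val)
          {
              digest.update((byte) ((val >>> 24) & 0xFF));
              digest.update((byte) ((val >>> 16) & 0xFF));
              digest.update((byte) ((val >>> 8) & 0xFF));
              digest.update((byte) (val & 0xFF));
          }

          // The grouping idea: pack timestamp (8) + ttl (4) + counter flag (1) into
          // one 13-byte scratch array and hand the digest a single update() call.
          // This is the extra 13-byte allocation per cell mentioned above.
          static void groupedCellMetadataUpdate(MessageDigest digest, long timestamp, int ttl, boolean isCounterCell)
          {
              byte[] scratch = new byte[13];
              ByteBuffer.wrap(scratch)
                        .putLong(timestamp)
                        .putInt(ttl)
                        .put((byte) (isCounterCell ? 1 : 0));
              digest.update(scratch);
          }
      }

      The scratch array could of course be reused per thread rather than allocated per cell, which is essentially what the buffered digest described below does at a coarser granularity.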

      I've been playing around with a patch that swaps the normal MD5 digest for a buffered one (e.g. a fixed 1kb byte array per read thread that rolls up digest updates and performs a single underlying update whenever the buffer fills). So far such a buffered MessageDigest shows 2x better performance when hashing the metadata of 10k columns in a jmh microbenchmark I wrote. If I combine that with grouping the metadata into a 13-byte array, I get 10x more throughput. I need to do more profiling, but I think buffering our digest calls could really help for wide partitions of small columns (as opposed to fat single columns, where the metadata hashing overhead is probably negligible).
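
      For illustration, a minimal sketch of the buffered-digest idea, assuming a plain wrapper class; the name BufferingMessageDigest, the 1kb size and the ThreadLocal wiring are assumptions for this sketch, not the actual patch.

      import java.security.MessageDigest;
      import java.security.NoSuchAlgorithmException;

      // Accumulates small updates in a fixed buffer and only pushes them to the real
      // MessageDigest when the buffer fills (or when the digest is finalized), so the
      // underlying MD5 sees a handful of large updates instead of thousands of tiny ones.
      final class BufferingMessageDigest
      {
          private final MessageDigest delegate;
          private final byte[] buffer;
          private int position;

          BufferingMessageDigest(MessageDigest delegate, int bufferSize)
          {
              this.delegate = delegate;
              this.buffer = new byte[bufferSize]; // e.g. 1024 bytes, as described above
          }

          void update(byte b)
          {
              if (position == buffer.length)
                  flush();
              buffer[position++] = b;
          }

          void update(byte[] bytes, int offset, int length)
          {
              // Large inputs (e.g. big cell values) bypass the buffer entirely.
              if (length > buffer.length)
              {
                  flush();
                  delegate.update(bytes, offset, length);
                  return;
              }
              if (length > buffer.length - position)
                  flush();
              System.arraycopy(bytes, offset, buffer, position, length);
              position += length;
          }

          byte[] digest()
          {
              flush();
              return delegate.digest(); // also resets the delegate for the next read
          }

          private void flush()
          {
              if (position > 0)
              {
                  delegate.update(buffer, 0, position);
                  position = 0;
              }
          }

          // One instance per read thread, roughly how the "per read thread" reuse could look.
          static final ThreadLocal<BufferingMessageDigest> PER_THREAD = ThreadLocal.withInitial(() -> {
              try
              {
                  return new BufferingMessageDigest(MessageDigest.getInstance("MD5"), 1024);
              }
              catch (NoSuchAlgorithmException e)
              {
                  throw new AssertionError(e);
              }
          });
      }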

Attachments

People

Assignee: Unassigned
Reporter: Joey Lynch (jolynch)
Votes: 1
Watchers: 9
