While I was analyzing flame graphs as part of investigating CASSANDRA-14605 I noticed that digest overhead for quorum reads of partitions with many small columns is much higher than I would expect.
In this particular use case of (LOCAL_QUORUM) reads of partitions with (p50, p95, p99) = (60, 5k, 20k) small columns (a set<int>), digest requests make up 45%+ of the CPU time that Cassandra is using. About 20% is reading the columns into the iterator abstraction (bummer) but surprisingly about 20% is calculating the digests themselves. I would expect digest calculation itself to be close to 2% total ... I've attached a flame graph showing this behavior.
The part that is particularly concerning to me is that ~17% of CPU time (~1/3 of the overall digest time) is spent in calculating AbstractCell::digest but only ~2% is digesting the actual cell values. Looking deeper the implementation of FBUtilities::updateWithInt and other related functions (updateWithBoolean, updateWithLong, etc...) are very inefficient as they call update on the underlying MessageDigest on each byte individually. When you have lots of small cells the overhead of this approach is apparently significant. In particular it looks like the underlying MD5 implementation is not as efficient as it should be for lots of single byte updates as it does attempt to buffer them into a 64 byte buffer, but that's too small to get maximum throughput. In my local benchmarking I'm seeing just grouping the metadata updates into a 13 byte array yields a 3x performance improvement but it has 13 bytes of allocated memory per cell during the hash that previously we didn't allocate.
I've been playing around with a patch which changes out the normal MD5 digest for a buffered one (e.g. has a fixed 1kb byte array per read thread which is used to roll up digest updates and then whenever the buffer fills it does an underlying update). So far such a buffered MessageDigest is showing 2x better performance on hashing 10k column metadata in a jmh microbenchmark I wrote. If I combine that with metadata grouping into a 13 byte array I get 10x more throughput. I think I need to do more profiling but I think that buffering our digest calls could really help for wide partitions of smaller columns (as opposed to fat single columns, where the metadata hashing overhead is probably pretty negligible).