Description
Cassandra 4 exhibits a severe performance drop on count operations.
We created a reproduction workflow that inserts 100k rows, each holding a 10 kB random string.
After this data is inserted into a 3-node cluster at RF=3 and queried at LQ (LOCAL_QUORUM), a count on said table takes:
- circa 2s on 3.11
- consistently more than 10s on 4.0 and 4.1 (around 12 to 13s); tested on 4.0.10 and 4.1.5
Output of the same program/query run against each environment:
3.11
# COUNT # 61a5bcb0-75ca-11ef-9cff-55d571fe1347 Row count:100000 Count timing with fetch 5000: 0:00:01.846531 Average row size: 10000.0
4.1
# COUNT # 55d79f60-75cb-11ef-a8be-399c3e257132 Row count:100000 Count timing with fetch 5000: 0:00:13.408626 Average row size: 10000.0
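To put the two timings above side by side, the short sketch below parses the elapsed values exactly as printed and computes the slowdown factor and the raw payload volume. The helper name `parse_elapsed` is illustrative and not part of objcount.py; the numbers are copied from the output above.

```python
from datetime import timedelta


def parse_elapsed(text: str) -> timedelta:
    """Parse a str(timedelta)-style value such as '0:00:01.846531'."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Timings copied from the 3.11 and 4.1 output lines in this report.
t311 = parse_elapsed("0:00:01.846531")
t41 = parse_elapsed("0:00:13.408626")

# Dividing one timedelta by another yields a plain float ratio.
slowdown = t41 / t311

# 100000 rows at an average row size of 10000, as reported by the program.
raw_bytes = 100_000 * 10_000

print(f"slowdown 4.1 vs 3.11: {slowdown:.2f}x")
print(f"raw payload read: {raw_bytes / 1e9:.1f} GB")
```

For this pair of runs the ratio comes out at about 7.26x for roughly 1 GB of raw payload.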
The UUID shown in the output above is the trace ID of the query execution. The trace was then exported from each cluster with the command below, producing the cassXXtrace.txt files:
cqlsh -e "show session [trace_id]" | tee cassXXtrace.txt
Attached are cass311trace.txt and cass41trace.txt, which show the events associated with the queries above.
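For readers without the attachment, here is a minimal sketch of how a traced, paged read of this kind could be timed with the Python cassandra-driver. It is not the code from objcount.py. It assumes the count is done client-side by paging through the rows (which the "fetch 5000" and "Average row size" fields suggest but the report does not state), that LQ means LOCAL_QUORUM, and that the table has a text column named `payload` (hypothetical). The helper names `build_statement` and `timed_count` are also illustrative.

```python
import time
from datetime import timedelta


def build_statement(table: str, fetch_size: int = 5000):
    """Build a paged SELECT at LOCAL_QUORUM (column name 'payload' is assumed)."""
    # Imported lazily so timed_count() can be exercised without the driver installed.
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    return SimpleStatement(
        f"SELECT payload FROM {table}",
        fetch_size=fetch_size,
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )


def timed_count(session, statement):
    """Run the statement with tracing on and count rows while paging through them."""
    start = time.perf_counter()
    result = session.execute(statement, trace=True)

    rows = 0
    total_chars = 0
    for row in result:  # the driver fetches subsequent pages transparently
        rows += 1
        total_chars += len(row.payload)

    elapsed = timedelta(seconds=time.perf_counter() - start)
    trace_id = result.get_query_trace().trace_id
    average = total_chars / rows if rows else 0.0
    return trace_id, rows, elapsed, average
```

With a connected session, `timed_count(session, build_statement("ks.tbl"))` would return the trace ID that can then be passed to the cqlsh command above; `ks.tbl` is a placeholder.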
Note that the issue is far more pronounced in a 3-node cluster (I also tested a single node on Docker, where it is less visible).
Also attaching objcount.py, which contains two functions to insert and read the data. The insert is fairly slow because it generates random 10 kB junk objects, but it allows the issue to be reproduced. Just uncomment the gateway_insert call to trigger the data insert:
# gateway_insert(session, ks, tbl)
gateway_query(session, ks, tbl, fetch)
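As a rough idea of what the insert side of such a workload involves, the sketch below generates 10,000-character random strings and writes them through a prepared statement. It is an illustration, not the attached gateway_insert: the schema (`id uuid`, `payload text`) and the function names `random_payload` and `insert_rows` are assumptions.

```python
import random
import string
import uuid

ALPHABET = string.ascii_letters + string.digits


def random_payload(size: int = 10_000) -> str:
    """Return a random alphanumeric string of the given length."""
    return "".join(random.choices(ALPHABET, k=size))


def insert_rows(session, keyspace: str, table: str,
                rows: int = 100_000, size: int = 10_000) -> int:
    """Insert `rows` random payloads; assumes columns (id uuid, payload text)."""
    prepared = session.prepare(
        f"INSERT INTO {keyspace}.{table} (id, payload) VALUES (?, ?)"
    )
    for _ in range(rows):
        # One synchronous write per row: simple, but slow for 100k rows.
        session.execute(prepared, (uuid.uuid4(), random_payload(size)))
    return rows
```

Generating each payload character by character in Python is part of why a workload like this is slow; reusing a small pool of pre-generated payloads would speed it up, at the cost of less random data.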
Requires argparse (part of the Python standard library) and the Python cassandra-driver package.
To use, run the following command. Consider uncommenting lines 40 and 41 for keyspace/table creation and line 155 for the insert workload.
python3 ./objcount.py -i <ip> -k <ks> -t <table>