This ticket introduces Harry, a component for fuzz testing and verification of the Apache Cassandra clusters at scale.
Current testing tooling largely tests for common- and edge-cases, and most of the tests use predefined datasets. Property-based tests can help explore a broader range of states, but often require either a complex model or a large state to test against.
Harry allows to run tests that are able to validate state of both dense nodes (to test local read-write path) and large clusters (to test distributed read-write path), and do it efficiently. Main goals, and what sets it apart from the other testing tools is:
- The state required for verification should remain as compact as possible.
- The verification process itself should be as performant as possible.
- Ideally, we'd want a way to verify database state while continuing running state change queries against it.
To achieve this, Harry defines a model that holds the state of the database, generators that produce reproducible, pseudo-random schemas, mutations, and queries, and a validator that asserts the correctness of the model following execution of generated traffic.
- Generator library: how to create a library of invertible, order-preserving generators for simple and composite data types.
- Model and checker: how to use the properties of generators to validate the output of an eventually-consistent database in a linear time.
- Runner library: how to create a scheme for reproducible runs, despite the concurrent nature of database and fuzzer itself.
Generation and validation define strict mathematical relations between the generated values and pseudorandom numbers they were generated from. Using these properties, we can store minimal state and check if these properties hold during validation.
Since Cassandra stores data in rows, we should be able to "inflate" data to insert a row into the database from a single number we call descriptor. Each value in the row read from the database can be "deflated" back to the descriptor it was generated from. This way, to precisely verify the state of the row, we only need to know the descriptor it was generated from and a timestamp at which it was inserted.
Similarly, keys for the inserted row can be "inflated" from a single 64-bit integer, and then "deflated" back to it. To efficiently search for keys, while allowing range scans, our generation scheme preserves the order of the original 64-bit integer. Every pair of keys generated from two 64-bit integers would sort the same way as these integers.
This way, in order to validate a state of the range of rows queried from the database, it is sufficient to "deflate" its key and data values, use deflated 64-bit key representation to find all descriptors these rows were generated from, and ensure that the given sequence of descriptors could have resulted in the state that database has responded with.
Using this scheme, we keep a minimum possible amount of data per row, can efficiently generate the data, and backtrack values to the numbers they were generated from. Most of the time, we operate on 64-bit integer values and only use "inflated" objects when running queries against database state, minimizing the amount of required memory.
According to Marriam-Webster:
- to torment by or as if by constant attack
- persistently carry out attacks on (an enemy or an enemy's territory)