[CASSANDRA-15536] 4.0 Quality: Components and Test Plans - ASF JIRA

Details

Type: Epic
Status: In Progress
Priority: High
Resolution: Unresolved
Fix Version/s: 4.0.x
Component/s: Test/benchmark, Test/dtest/python, Test/fuzz, Test/unit
Labels: None

Epic Name: 40_Quality_Test
Change Category: Quality Assurance
Complexity: Challenging
Platform: All
Impacts: None

Description

Jira migrated from cwiki

The overarching goal of the 4.0 release is that Cassandra 4.0 should be in a state where major users would run it in production when it is cut. To gain this confidence, there are various ongoing testing efforts covering correctness, performance, and ease of use. On this page we try to coordinate and identify blockers for each subsystem before we can release 4.0.

For each component we strive to have shepherds and contributors involved. Shepherds should be committers or knowledgeable component owners. They are responsible for driving their blocking tickets to completion, and ideally they also help set testing standards and ensure a high standard of quality in their claimed area. Contributors have signed up to help verify that subsystem by running tests or contributing fixes.

If you are interested in contributing to testing 4.0, please add your name as assignee if you want to drive things, or as reviewer if you just want to participate and review, and get involved in the tracking ticket and in the dev list/IRC discussions involving that component.

Targeted Components / Subsystems

We've tried to collect the major components or subsystems that we want to ensure work properly on the way to a great 4.0 release. If you think something is missing, please add it. Better yet, volunteer to contribute to testing it!

Internode Messaging

In 4.0 we're getting a new Netty-based internode communication system (~~CASSANDRA-8457~~). As internode messaging is vital to the correctness and performance of the database, we should make sure that all forms of internode messaging (TLS, compressed, low latency, high latency, etc.) function correctly.

Test Infrastructure / Automation: Diff Testing

Diff testing is a form of model-based testing in which two clusters are exhaustively compared to assert that their contents are identical. To support Apache Cassandra 4.0 validation, contributors have developed cassandra-diff. This is a Spark application that distributes the token range over a configurable number of Spark executors, then parallelizes randomized forward and reverse reads with varying paging sizes to read and compare every row present in the cluster, persisting a record of mismatches for investigation. This methodology has been instrumental in identifying data loss, data corruption, and incorrect response issues introduced in early Cassandra 3.0 releases.
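The comparison described above can be illustrated with a minimal, single-process sketch. This is not cassandra-diff's code or API: each cluster is modeled as a plain dict keyed by token, and the names `split_token_range`, `read_slice`, and `diff_clusters` are hypothetical. It assumes a Murmur3-style token space of [-2**63, 2**63) and only mimics the idea of splitting the range, reading each slice with a randomized page size and direction, and recording mismatches.

```python
import random

MIN_TOKEN = -2**63
END_TOKEN = 2**63  # exclusive upper bound of the assumed token space


def split_token_range(splits):
    """Divide the token space into `splits` contiguous [start, end) slices."""
    width = 2**64 // splits
    bounds = [MIN_TOKEN + i * width for i in range(splits)] + [END_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def read_slice(cluster, start, end, page_size, reverse):
    """Stand-in for a paged read: collect rows with start <= token < end."""
    tokens = sorted((t for t in cluster if start <= t < end), reverse=reverse)
    rows = {}
    for i in range(0, len(tokens), page_size):  # one iteration per "page"
        for token in tokens[i:i + page_size]:
            rows[token] = cluster[token]
    return rows


def diff_clusters(source, target, splits=4, seed=0):
    """Compare every row in both toy clusters; return a list of mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for start, end in split_token_range(splits):
        page_size = rng.choice([1, 10, 100])  # vary paging per slice
        reverse = rng.random() < 0.5          # forward or reverse read
        a = read_slice(source, start, end, page_size, reverse)
        b = read_slice(target, start, end, page_size, reverse)
        for token in sorted(a.keys() | b.keys()):
            if a.get(token) != b.get(token):
                mismatches.append((token, a.get(token), b.get(token)))
    return mismatches


source = {-5: "a", 0: "b", 7: "c"}
target = {-5: "a", 0: "x"}
print(diff_clusters(source, target))  # [(0, 'b', 'x'), (7, 'c', None)]
```

In the real tool each slice would be handled by a separate Spark executor and the mismatches persisted; here the slices are processed in a simple loop to keep the sketch self-contained.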

cassandra-diff and associated documentation can be found at: https://github.com/apache/cassandra-diff. Contributors are encouraged to run diff tests against clusters they manage and report issues to ensure workload diversity across the project.

System Tables and Internal Schema

This task covers a review of and minor bug fixes to local and distributed system keyspaces. Planned work in this area is now complete.

Source Audit and Performance Testing: Streaming

This task covers an audit of the Streaming implementation in Apache Cassandra 4.0. In this release, contributors have implemented full-SSTable streaming to improve performance and reduce memory pressure. Internode messaging changes implemented in ~~CASSANDRA-15066~~ adjacent to streaming suggested that review of the streaming implementation itself may be desirable. Prior work also covered performance testing of full-SSTable streaming.

Test Infrastructure / Automation: "Harry"

~~CASSANDRA-15348~~ - Harry: generator library and extensible framework for fuzz testing Apache Cassandra TRIAGE NEEDED

Harry is a component for fuzz testing and verification of Apache Cassandra clusters at scale. Harry makes it possible to efficiently run tests that validate the state of both dense nodes (to test the local read-write path) and large clusters (to test the distributed read-write path). Harry defines a model that holds the state of the database, generators that produce reproducible, pseudo-random schemas, mutations, and queries, and a validator that asserts the correctness of the model following execution of generated traffic. See ~~CASSANDRA-15348~~ for additional details.
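The model / generator / validator split can be sketched in a few lines. The following is a toy illustration only, not Harry's API: `generate_ops`, `Model`, `ToyStore`, and `validate` are hypothetical names, and the "database" is an in-memory log standing in for a real cluster. The point it shows is that a seed fully determines the generated traffic, so any failure can be replayed.

```python
import random


def generate_ops(seed, count, keys=8):
    """Generator: reproducibly derive (kind, key, value) operations from a seed."""
    rng = random.Random(seed)
    ops = []
    for _ in range(count):
        key = rng.randrange(keys)
        if rng.random() < 0.2:
            ops.append(("delete", key, None))
        else:
            ops.append(("write", key, rng.randrange(1000)))
    return ops


class Model:
    """Model: the state the database is expected to hold."""

    def __init__(self):
        self.expected = {}

    def apply(self, op):
        kind, key, value = op
        if kind == "write":
            self.expected[key] = value
        else:
            self.expected.pop(key, None)


class ToyStore:
    """Hypothetical system under test: an append-only log replayed on read."""

    def __init__(self):
        self.log = []

    def apply(self, op):
        self.log.append(op)

    def read(self, key):
        value = None
        for kind, k, v in self.log:
            if k == key:
                value = v if kind == "write" else None
        return value


def validate(seed, count=200, keys=8):
    """Validator: run generated traffic, return keys where store and model disagree."""
    model, store = Model(), ToyStore()
    for op in generate_ops(seed, count, keys):
        model.apply(op)
        store.apply(op)
    return [k for k in range(keys) if store.read(k) != model.expected.get(k)]


print(validate(seed=7))  # [] when the store agrees with the model
```

Because the same seed always yields the same operations, a failing run can be reported as just a seed and an operation count.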

Local Read/Write Path: IndexInfo (CASSANDRA-11206)

Users upgrading from Cassandra 3.0.x to trunk will pick up ~~CASSANDRA-11206~~ in the process. Contributors to 4.0 testing have allocated time to validating these changes via a source audit and the implementation of property-based tests (currently underway). The majority of planned work here is complete, with a final set of perf tests in progress. No correctness issues were identified via the source audit and randomized testing. Minor cleanup and refactoring may follow, but any such changes are expected to be small in scope.

Local Read/Write Path: Upgrade and Diff Test

Execution of upgrade and diff tests via cassandra-diff has proven to be one of the most effective approaches to identifying issues with the local read/write path. These include instances of data loss, data corruption, data resurrection, incorrect responses to queries, incomplete responses, and others. Upgrade and diff tests can be executed concurrently with fault injection (such as host or network failure), as well as during mixed-version scenarios (such as upgrading half of the instances in a cluster and running upgradesstables on only half of the upgraded instances).

Upgrade and diff tests are expected to continue through the release cycle, and are a great way for contributors to gain confidence in the correctness of the database under their own workloads.

Local Read/Write Path: Other Areas

Testing in this area refers to the local read/write path (StorageProxy, ColumnFamilyStore, Memtable, SSTable reading/writing, etc.). We are still finding numerous bugs and issues with the 3.0 storage engine rewrite (~~CASSANDRA-8099~~). For 4.0 we want to ensure that we thoroughly cover the local read/write path with techniques such as property-based testing, fuzzing, and a source audit.

Distributed Read/Write Path: Coordination, Replication, and Read Repair

Testing in this area focuses on non-node-local aspects of the read-write path: coordination, replication, read repair, etc.

Repair

We aim for 4.0 to have the first fully functioning incremental repair solution (~~CASSANDRA-9143~~)! Furthermore, we aim to verify that all types of repair (full range, sub range, incremental) function as expected, and to ensure that community tools such as Reaper work. ~~CASSANDRA-3200~~ adds an experimental option to reduce the amount of data streamed during repair; we should write more tests and see how it works with big nodes.

Compaction

Alongside the local and distributed read/write paths, we'll also want to validate compaction. ~~CASSANDRA-6696~~ introduced substantial changes/improvements that require testing (esp. JBOD).

Metrics

In past releases we've unknowingly broken metrics integrations and introduced performance regressions in metrics collection and reporting. We strive in 4.0 to not do that. Metrics should work well!

Tooling: Bundled / First-Party

Test plans should cover bundled first-party tooling and CLIs such as nodetool, cqlsh, and new tools supporting full query and audit logging (~~CASSANDRA-13983~~, ~~CASSANDRA-12151~~).

Tooling: External Ecosystem

Many users of Apache Cassandra employ open source tooling to automate Cassandra configuration, runtime management, and repair scheduling. Prior to release, we need to confirm that popular third-party tools such as Reaper, Priam, etc. function properly.

Test Frameworks, Tooling, Infrastructure / Automation

This area refers to contributions to test frameworks/tooling (e.g., dtests, QuickTheories, ~~CASSANDRA-14821~~), and automation enabling those tools to be applied at scale (e.g., replay testing via Spark-based replay of captured FQL logs).

Cluster Setup and Maintenance

We want 4.0 to be easy for users to set up out of the box and just work. This means having low friction when users download the Cassandra package and start running it. For example, users should be able to easily configure and start new 4.0 clusters and have tokens distributed evenly. Another example is packaging: it should be easy to install Cassandra on all supported platforms and have Cassandra use standard platform integrations.
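As a point of reference for what "distributed evenly" means, the sketch below computes evenly spaced tokens for the simplest case. It assumes the Murmur3Partitioner token space of [-2**63, 2**63) and a single token per node (for example, values that could be assigned via `initial_token`); clusters using multiple tokens per node allocate tokens differently. The function name `evenly_spaced_tokens` is hypothetical.

```python
def evenly_spaced_tokens(node_count):
    """Return one token per node, evenly spaced over [-2**63, 2**63)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = 2**64 // node_count
    return [step * i - 2**63 for i in range(node_count)]


for token in evenly_spaced_tokens(4):
    print(token)
# -9223372036854775808
# -4611686018427387904
# 0
# 4611686018427387904
```

A test plan for cluster setup could compare the ownership a freshly started cluster reports against this kind of ideal spacing.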

Platforms / Runtimes

~~CASSANDRA-9608~~ introduces support for Java 11. We'll want to verify that Cassandra under Java 11 meets expectations of stability.

Cluster Upgrade

We've historically had numerous bugs concerning upgrading clusters from one version to another. Let's establish the supported upgrade paths and ensure that users can safely perform those upgrades in production.

Documentation

Many sections of our documentation are incomplete or wrong. Let's deliver a functional but also well documented 4.0 release.

Features / Substantial Changes

Transient Replication

Transient Replication is an experimental implementation of witness replicas included in Apache Cassandra 4.0 (CASSANDRA-14697). As this feature is experimental, the focus of testing and validation in this release will be toward ensuring that its implementation doesn't negatively impact non-transient use cases.
