Running COPY FROM on a large dataset (20 GB split into 20M records) revealed two issues:
- The progress report is incorrect: it advances very slowly until almost the end of the test, at which point it catches up extremely quickly.
- The throughput in rows per second is similar to that of smaller tests run locally against a smaller cluster (approx. 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, which makes it roughly 1.4 times faster.
See attached file copy_from_large_benchmark.txt for the benchmark details.
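To put the two throughput figures above in perspective, here is a small arithmetic sketch. It uses only the numbers quoted in this ticket (20M rows, approx. 35,000 and 50,000 rows per second) and assumes a constant rate over the whole run, which the actual benchmark may not show; the function and constant names are illustrative.

```python
# Figures quoted in the ticket; rates are approximate.
TOTAL_ROWS = 20_000_000
COPY_FROM_RATE = 35_000   # rows/s observed with COPY FROM
STRESS_RATE = 50_000      # rows/s observed with cassandra-stress


def load_seconds(rows, rows_per_second):
    """Time to load `rows` at a constant rate (a simplifying assumption)."""
    return rows / rows_per_second


speedup = STRESS_RATE / COPY_FROM_RATE
copy_secs = load_seconds(TOTAL_ROWS, COPY_FROM_RATE)
stress_secs = load_seconds(TOTAL_ROWS, STRESS_RATE)

print(f"cassandra-stress speedup: {speedup:.2f}x")
print(f"COPY FROM: {copy_secs:.0f} s, cassandra-stress: {stress_secs:.0f} s")
```

At these rates the full load would take a little under ten minutes with COPY FROM versus a little under seven with cassandra-stress.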
- A new option was added, PREPAREDSTATEMENTS, which indicates whether prepared statements should be used; it defaults to true.
- The default value of CHUNKSIZE changed from 1000 to 5000.
- The default value of MINBATCHSIZE changed from 2 to 10.
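As an illustration of how the options listed above can be set explicitly, the following sketch builds a COPY FROM statement that could be pasted into cqlsh. The helper `build_copy_from`, the table name `ks1.table1`, and the file name `data.csv` are hypothetical; only the option names and default values come from this ticket.

```python
def build_copy_from(table, csv_path, chunk_size=5000, min_batch_size=10,
                    prepared_statements=True):
    """Return a cqlsh COPY FROM statement with the options spelled out.

    Hypothetical helper: the defaults mirror the new values listed in the ticket.
    """
    options = {
        "CHUNKSIZE": chunk_size,
        "MINBATCHSIZE": min_batch_size,
        # cqlsh expects true/false rather than Python's True/False
        "PREPAREDSTATEMENTS": str(prepared_statements).lower(),
    }
    with_clause = " AND ".join(f"{name}={value}" for name, value in options.items())
    return f"COPY {table} FROM '{csv_path}' WITH {with_clause};"


# Table and file names are placeholders for illustration only.
statement = build_copy_from("ks1.table1", "data.csv")
print(statement)
```

Passing `prepared_statements=False` or the previous values (`chunk_size=1000`, `min_batch_size=2`) produces a statement for comparing the old and new behaviour on the same dataset.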