[CASSANDRA-11053] COPY FROM on large datasets: fix progress report and optimize performance part 4 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 2.1.14, 2.2.6, 3.0.5, 3.5
Component/s: Legacy/Tools
Labels: doc-impacting
Severity: Normal

Description

Running COPY FROM on a large dataset (20 GB split into 20M records) revealed two issues:

The progress report is incorrect: it advances very slowly until almost the end of the test, at which point it catches up extremely quickly.

The performance in rows per second is similar to that of smaller tests run locally against a smaller cluster (approx 35,000 rows per second). By comparison, cassandra-stress manages approx 50,000 rows per second under the same set-up, which makes it roughly 1.4 times faster.
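
As a rough illustration of what these rates mean for the 20M-record dataset, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate from the approximate figures quoted above, not output from the attached benchmark; the helper name `import_seconds` is made up for this example.

```python
# Back-of-the-envelope estimate from the approximate rates quoted above.
# These are rounded figures, not measurements taken from the benchmark files.
TOTAL_ROWS = 20000000        # 20M records
COPY_FROM_RATE = 35000.0     # approx rows/second observed for COPY FROM
STRESS_RATE = 50000.0        # approx rows/second observed for cassandra-stress


def import_seconds(total_rows, rows_per_second):
    """Time needed to load total_rows at a constant rows_per_second rate."""
    return total_rows / rows_per_second


copy_from_secs = import_seconds(TOTAL_ROWS, COPY_FROM_RATE)
stress_secs = import_seconds(TOTAL_ROWS, STRESS_RATE)
speedup = copy_from_secs / stress_secs

print("COPY FROM:        {:.0f} s".format(copy_from_secs))
print("cassandra-stress: {:.0f} s".format(stress_secs))
print("ratio:            {:.2f}x".format(speedup))
```

With these rounded rates the load takes a little under ten minutes with COPY FROM and a little under seven with cassandra-stress, a ratio of about 1.43.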

See attached file copy_from_large_benchmark.txt for the benchmark details.

Doc-impacting changes to COPY FROM options

- A new option, PREPAREDSTATEMENTS, was added. It indicates whether prepared statements should be used and defaults to true.
- The default value of CHUNKSIZE changed from 1000 to 5000.
- The default value of MINBATCHSIZE changed from 2 to 10.
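
A minimal sketch of how these options might be spelled out in a COPY FROM statement, for example to pin the previous CHUNKSIZE and MINBATCHSIZE values explicitly. The helper `build_copy_from`, the table `ks.users` and the file `users.csv` are hypothetical names used only for illustration; the option names and values are the ones listed above.

```python
# Hypothetical helper that assembles a cqlsh COPY FROM statement as a string.
# Only the option names and default values come from this ticket; the table
# and file names used below are placeholders.
OLD_DEFAULTS = {"CHUNKSIZE": 1000, "MINBATCHSIZE": 2}
NEW_DEFAULTS = {"PREPAREDSTATEMENTS": "true", "CHUNKSIZE": 5000, "MINBATCHSIZE": 10}


def build_copy_from(table, csv_path, **options):
    """Return a COPY FROM statement with options joined by AND, sorted by name."""
    stmt = "COPY {} FROM '{}'".format(table, csv_path)
    clauses = " AND ".join(
        "{} = {}".format(name, value) for name, value in sorted(options.items())
    )
    if clauses:
        stmt += " WITH " + clauses
    return stmt


# Pin the pre-change chunk and batch sizes explicitly.
old_style = build_copy_from("ks.users", "users.csv", **OLD_DEFAULTS)

# Spell out the new defaults, but turn prepared statements off.
no_prepared = build_copy_from(
    "ks.users", "users.csv", **dict(NEW_DEFAULTS, PREPAREDSTATEMENTS="false")
)

print(old_style)
print(no_prepared)
```

The resulting strings are meant to be pasted into a cqlsh session; options that are left out fall back to the defaults of the cqlsh version in use.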

Attachments

- bisect_test.py, 22/Mar/16 03:01, 1 kB, Stefania Alborghetti
- copy_from_large_benchmark_2.txt, 03/Feb/16 08:27, 5 kB, Stefania Alborghetti
- parent_profile_2.txt, 03/Feb/16 07:57, 9 kB, Stefania Alborghetti
- worker_profiles_2.txt, 03/Feb/16 07:57, 61 kB, Stefania Alborghetti
- copy_from_large_benchmark.txt, 02/Feb/16 06:50, 3 kB, Stefania Alborghetti
- parent_profile.txt, 02/Feb/16 06:49, 9 kB, Stefania Alborghetti
- worker_profiles.txt, 02/Feb/16 06:49, 193 kB, Stefania Alborghetti

Issue Links

blocks

CASSANDRA-11274 cqlsh: interpret CQL type for formatting blob types

Resolved

breaks

CASSANDRA-11549 cqlsh: COPY FROM ignores NULL values in conversion

Resolved

CASSANDRA-11574 cqlsh: COPY FROM throws TypeError with Cython extensions enabled

Resolved

is related to

CASSANDRA-9302 Optimize cqlsh COPY FROM, part 3

Resolved

relates to

CASSANDRA-11630 Make cython optional in pylib/setup.py

Resolved

CASSANDRA-11255 COPY TO should have higher double precision

Resolved

CASSANDRA-11274 cqlsh: interpret CQL type for formatting blob types

Resolved

People

Assignee: Stefania Alborghetti

Reporter: Stefania Alborghetti

Authors: Stefania Alborghetti

Reviewers: Adam Holmberg

Votes: 0

Watchers: 10

Dates

Created: 21/Jan/16 01:56

Updated: 16/Apr/19 09:30

Resolved: 28/Mar/16 18:03