[CASSANDRA-9259] Bulk Reading from Cassandra - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Urgent
Resolution: Unresolved
Fix Version/s: 5.x
Component/s: Legacy/CQL, Legacy/Local Write-Read Paths, Legacy/Streaming and Messaging, Legacy/Testing, Local/Compaction
Labels: None

Description

This ticket follows on from the 2015 NGCC and is intended as a place to discuss and design an approach to bulk reading.

The goal is to have a bulk reading path for Cassandra, that is, a path optimized to grab a large portion of the data for a table (potentially all of it). This is a core element in the Spark integration with Cassandra, and the speed at which Cassandra can deliver bulk data to Spark is limiting the performance of Spark-plus-Cassandra operations. This is especially important as Cassandra will (likely) leverage Spark for internal operations (for example CASSANDRA-8234).

The core CQL to consider is the following:
SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND Token(partitionKey) <= Y

There are a few approaches that could be considered. First, we consider a new "Streaming Compaction" approach. The key observation here is that a bulk read from Cassandra is a lot like a major compaction, though instead of outputting a new SSTable we would output CQL rows to a stream/socket/etc. This would be similar to a CompactionTask, but would strip out some unnecessary things in there (e.g., some of the indexing, etc). Predicates and projections could also be encapsulated in this new "StreamingCompactionTask", for example.
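
To make the "Streaming Compaction" idea a bit more concrete, here is a toy sketch in Python. It is not Cassandra code: SSTables are modelled as in-memory lists of (key, write_timestamp, row) tuples sorted by key, reconciliation is simplified to whole-row last-write-wins (no per-cell merge, no tombstones), and the name streaming_compaction is hypothetical. It only shows the shape of the approach: merge the sorted runs as a compaction would, then apply the predicate and projection and emit rows to a stream instead of writing a new SSTable.

```python
import heapq
from itertools import groupby


def streaming_compaction(sstables, projection, predicate=lambda row: True):
    """Merge sorted runs and yield reconciled, filtered, projected rows.

    Each run is an iterable of (key, write_timestamp, row_dict) tuples
    sorted by key. This is a simplified model, not Cassandra's real format.
    """
    # k-way merge of the sorted runs, ordered by key only.
    merged = heapq.merge(*sstables, key=lambda entry: entry[0])
    for key, versions in groupby(merged, key=lambda entry: entry[0]):
        # Simplification: the newest whole row wins.
        _, _, newest = max(versions, key=lambda entry: entry[1])
        if predicate(newest):
            # Emit only the requested columns instead of writing an SSTable.
            yield key, {col: newest[col] for col in projection}


sstable_1 = [(1, 10, {"a": 1, "b": "old", "c": 0}),
             (3, 10, {"a": 3, "b": "three", "c": 0})]
sstable_2 = [(1, 20, {"a": 1, "b": "new", "c": 0}),
             (2, 15, {"a": -2, "b": "two", "c": 0})]

rows = list(streaming_compaction([sstable_1, sstable_2],
                                 projection=("a", "b"),
                                 predicate=lambda row: row["a"] > 0))
```

Because the function is a generator, rows could be written to a socket as they are produced rather than being materialized first, which is the point of the streaming variant.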

In the query above, we choose X and Y to be contained within one token range (perhaps considering the primary range of a node without vnodes, for example). This query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk operations via Spark (or other processing frameworks - ETL, etc). There are several causes, for example inefficient paging.
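
As a rough illustration of how a client might pick X and Y, the sketch below splits the Murmur3Partitioner token space (signed 64-bit integers) into equal (X, Y] slices and renders the token-range query from this ticket for each slice. The helper names split_token_ring and range_query are hypothetical, and equal-width slices are an assumption: a real client would presumably align the slices with the token ranges reported by the cluster so that each query stays within one range.

```python
# Murmur3Partitioner tokens are signed 64-bit integers.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Return num_splits contiguous (start, end] slices covering the token space.

    Assumption: equal-width slices, ignoring actual node ownership.
    A key hashing exactly to MIN_TOKEN would fall outside the first
    slice; this sketch ignores that edge case.
    """
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(start, end, keyspace="myKs", table="myTable",
                columns=("a", "b", "c"), partition_key="partitionKey"):
    """Render the token-range SELECT for one (start, end] slice."""
    return (f"SELECT {', '.join(columns)} FROM {keyspace}.{table} "
            f"WHERE Token({partition_key}) > {start} "
            f"AND Token({partition_key}) <= {end}")


queries = [range_query(x, y) for x, y in split_token_ring(8)]
```

Each generated query can then be issued independently, for example one per Spark task.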

Another approach would be an alternate storage format. For example, we might employ Parquet (just as an example) to store the same data as in the primary Cassandra storage (aka SSTables). This is akin to Global Indexes (an alternate storage of the same data optimized for a particular query). Then, Cassandra can choose to leverage this alternate storage for particular CQL queries (e.g., range scans).

These are just two suggestions to get the conversation going.

One thing to note is that it will be useful to have this storage segregated by token range, so that when you extract via these mechanisms you do not get replication-factor copies of the data. That will certainly be an issue for some Spark operations (e.g., counting). Thus, we will want per-token-range storage (even for single disks), so this will likely leverage CASSANDRA-6696 (though we'll also want to consider the single-disk case).
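
The duplication problem can be shown with a small model. The sketch below is an illustration under stated assumptions, not Cassandra's replication code: a four-node ring without vnodes, a node owning the range ending at its own token, and replicas placed on the next nodes clockwise (similar in spirit to SimpleStrategy). The ring layout, node names, and function names are all hypothetical.

```python
from bisect import bisect_left


def replicas(token, ring, rf):
    """Return the rf nodes storing a key with this token.

    ring is a sorted list of (node_token, node_name). The primary owner is
    the first node whose token is >= the key's token, wrapping around.
    """
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


ring = [(0, "n1"), (100, "n2"), (200, "n3"), (300, "n4")]
rf = 3
row_tokens = [-10, 5, 50, 150, 250, 350]

# What each node holds on disk when every row is replicated rf times.
stored = {node: [t for t in row_tokens if node in replicas(t, ring, rf)]
          for _, node in ring}

# Scanning everything each node stores counts every row rf times.
naive_count = sum(len(tokens) for tokens in stored.values())

# Scanning only rows in each node's primary range counts each row once.
primary_count = sum(1 for node, tokens in stored.items()
                    for t in tokens if replicas(t, ring, 1)[0] == node)
```

In this model naive_count is rf times primary_count, which is why a local bulk scan needs a cheap way to restrict itself to one replica's share of each token range.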

It is also worth discussing what the success criteria are here. It is unlikely to be as fast as EDW or HDFS performance (though that is still a good goal), but being within some percentage of that performance should count as success. For example, taking at most 2x as long as the same bulk operation on HDFS with similar node count/size/etc.
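
As a back-of-the-envelope aid for this discussion, the sketch below turns a scan rate into an elapsed time and checks it against a "within 2x of HDFS" target. The function names, the table size, and the HDFS timings used in the usage lines are hypothetical, and the linear scaling with the number of parallel scans is an assumption rather than a measured property.

```python
def scan_seconds(total_rows, rows_per_sec, parallel_scans=1):
    """Estimated time to scan total_rows, assuming linear scaling across scans."""
    return total_rows / (rows_per_sec * parallel_scans)


def meets_target(cassandra_seconds, hdfs_seconds, max_slowdown=2.0):
    """True if the Cassandra bulk read is within max_slowdown times the HDFS time."""
    return cassandra_seconds <= max_slowdown * hdfs_seconds


# Hypothetical table of one billion rows at the two rates quoted in this ticket.
slow = scan_seconds(1_000_000_000, 50_000)    # one scan at 50K rows/sec
fast = scan_seconds(1_000_000_000, 100_000)   # one scan at 100K rows/sec
```

With these made-up inputs, a single scan of the table would take several hours, which gives a sense of why the per-query rate matters for bulk operations.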

Attachments

256_vnodes.jpg	30/Jul/16 02:54	48 kB	Stefania Alborghetti
before_after.jpg	30/Jul/16 02:50	58 kB	Stefania Alborghetti
bulk-read-benchmark.1.html	16/Mar/16 11:31	791 kB	Stefania Alborghetti
bulk-read-jfr-profiles.1.tar.gz	16/Mar/16 11:31	14.37 MB	Stefania Alborghetti
bulk-read-jfr-profiles.2.tar.gz	16/Mar/16 11:31	14.80 MB	Stefania Alborghetti
no_vnodes.jpg	30/Jul/16 02:54	48 kB	Stefania Alborghetti
spark_benchmark_raw_data.zip	30/Jul/16 02:25	674 kB	Stefania Alborghetti

Issue Links

is related to

CASSANDRA-16222 CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Resolved

CASSANDRA-11697 Improve Compaction Throughput

Open

CASSANDRA-10384 Create automated performance workloads and metrics collection for improving compaction performance

Open

links to

patch for streaming and local read POC

Sub-Tasks

1.	Establish and implement canonical bulk reading workload(s)	Resolved	Stefania Alborghetti
2.	Run canonical bulk reading workload nightly in CI and provide a dashboard of the result	Open	Unassigned
3.	Collect SAR metrics from CI jobs running canonical bulk reading workload	Open	Ryan McGuire
4.	Generate flame graphs from canonical bulk reading workload running in CI	Resolved	Alan Boudreault
5.	Collect flight recordings of canonical bulk read workload in CI	Resolved	Ryan McGuire
6.	Implement optimized local read path for CL.ONE	Resolved	Stefania Alborghetti
7.	Implement streaming for bulk read requests	Resolved	Stefania Alborghetti
8.	Create a benchmark to compare HDFS and Cassandra bulk read times	Resolved	Stefania Alborghetti

People

Assignee: Unassigned

Reporter: Brian Hess

Votes: 22

Watchers: 58

Dates

Created: 28/Apr/15 22:30

Updated: 07/Mar/23 10:54
