[CASSANDRA-16776] modify SecondaryIndexManager#indexPartition() to retrieve only columns for which indexes are actually being built - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.1-alpha1, 4.1
Component/s: Feature/2i Index
Labels: None
Change Category: Performance
Complexity: Normal
Platform: All
Impacts: None
Source Control Link: https://github.com/apache/cassandra/commit/6a1b20e58d493925439cc9a67bc6b51bb0be631a
Test and Documentation Plan:

The correctness of this improvement relies on the existing 2i tests. Proof that it's actually an improvement is illustrated by two new tests in CompactionAllocationTest.


Description

Secondary indexes are (for the moment) built as special compaction tasks via SecondaryIndexBuilder. From a profiling perspective, the fun begins in SecondaryIndexManager.indexPartition(). The work above it in SecondaryIndexBuilder is just key iteration.

Two basic things happen in indexPartition(). First, we read a single partition in its entirety, and then we send individual rows to the Indexer. When we read these partitions, we use ColumnFilter.all(), which ends up materializing full rows, even when we’re indexing a single column (or, more generally, when the indexes participating in the build together need only a subset of the table's columns). If we narrowed this to fetch only the necessary columns, we might be able to create less garbage in AbstractBTreePartition#searchIterator() when we create a copy of the underlying full row from disk.
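
To make the idea concrete, here is a minimal, self-contained Java sketch. It is a toy model under stated assumptions, not Cassandra code: NarrowedIndexRead, columnsToFetch and project are hypothetical names, and a plain Map stands in for a stored row. It computes the union of the columns needed by the indexes in a build and copies only those cells.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Cassandra classes or methods.
final class NarrowedIndexRead
{
    /** Union of the columns that the indexes taking part in the build depend on. */
    static Set<String> columnsToFetch(List<Set<String>> indexedColumnsPerIndex)
    {
        Set<String> needed = new LinkedHashSet<>();
        for (Set<String> columns : indexedColumnsPerIndex)
            needed.addAll(columns);
        return needed;
    }

    /** Copies only the requested cells out of a stored row, standing in for a filtered read. */
    static Map<String, Object> project(Map<String, Object> storedRow, Set<String> columns)
    {
        Map<String, Object> copy = new LinkedHashMap<>();
        for (String column : columns)
            if (storedRow.containsKey(column))
                copy.put(column, storedRow.get(column));
        return copy;
    }

    public static void main(String[] args)
    {
        Map<String, Object> storedRow = new LinkedHashMap<>();
        storedRow.put("payload", "a long text value that no index needs");
        storedRow.put("value", 42);

        // A build with a single index on "value" only needs that one column.
        Set<String> needed = columnsToFetch(List.of(Set.of("value")));
        System.out.println(project(storedRow, needed));
    }
}
```

For a row with one large text column and one indexed int column, the copy made for indexing then holds only the int cell, which is where the reduction in garbage would be expected to come from.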

In some initial testing, I’ve been using a simple schema with fairly narrow rows.

CREATE TABLE tlp_stress.allow_filtering (
    partition_id text,
    row_id int,
    payload text,
    value int,
    PRIMARY KEY (partition_id, row_id)
) WITH CLUSTERING ORDER BY (row_id ASC);

The price of deserializing these rows is still visible, however, in the results of some basic sampling profiling.

The possible optimization above to avoid unnecessary copying of a row’s columns would also narrow cell deserialization to indexed cells only, which would probably be very beneficial for index builds with very wide rows. One minor wrinkle in all of this is that since 3.0, it has been possible to create indexes on entire rows, rather than single columns, so we’d have to keep that case in mind.
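
One way to handle the whole-row wrinkle can be sketched as follows. This is a hypothetical illustration, not the committed change: IndexSpec, onColumns, onWholeRow and selection are invented names, and an empty Optional is used here to mean "fetch every column". The assumption is that if any index in the build needs the entire row, the read falls back to the unfiltered behavior.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Toy model only: IndexSpec is a hypothetical stand-in, not a Cassandra type.
final class RowIndexFallback
{
    /** An index that either names the columns it reads or needs the whole row. */
    static final class IndexSpec
    {
        final Set<String> columns;
        final boolean needsWholeRow;

        private IndexSpec(Set<String> columns, boolean needsWholeRow)
        {
            this.columns = columns;
            this.needsWholeRow = needsWholeRow;
        }

        static IndexSpec onColumns(String... columns)
        {
            return new IndexSpec(new LinkedHashSet<>(List.of(columns)), false);
        }

        static IndexSpec onWholeRow()
        {
            return new IndexSpec(Set.of(), true);
        }
    }

    /** Empty result means "fetch everything"; otherwise the union of the named columns. */
    static Optional<Set<String>> selection(List<IndexSpec> building)
    {
        Set<String> needed = new LinkedHashSet<>();
        for (IndexSpec index : building)
        {
            // A single row-level index forces the full read for the whole build.
            if (index.needsWholeRow)
                return Optional.empty();
            needed.addAll(index.columns);
        }
        return Optional.of(needed);
    }

    public static void main(String[] args)
    {
        System.out.println(selection(List.of(IndexSpec.onColumns("value"))));
        System.out.println(selection(List.of(IndexSpec.onColumns("value"), IndexSpec.onWholeRow())));
    }
}
```

The fallback is deliberately conservative: a row-level index may read any cell, so narrowing the read for that build could silently drop data the index needs.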

Attachments

- index1.png (30/Jun/21 19:45, 335 kB, Caleb Rackliffe)
- index2.png (30/Jun/21 19:45, 524 kB, Caleb Rackliffe)

People

Assignee: Caleb Rackliffe

Reporter: Caleb Rackliffe

Authors: Caleb Rackliffe

Reviewers: Aleksei Zotov, Benedict Elliott Smith

Votes: 0

Watchers: 4

Dates

Created: 30/Jun/21 19:46

Updated: 27/May/22 19:25

Resolved: 10/Aug/21 20:45