[CASSANDRA-8550] Internal pagination in CQL3 index queries creating substantial overhead - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 2.0.12, 2.1.3
Component/s: None
Labels:
None

Severity:
Normal
Since Version:

1.2.0 beta 1

Description

While benchmarking CQL3 secondary indexes in 2.1.2, we've noticed substantial performance degradation as the volume of indexed data increases. In trying to figure out what's going on, we found that a major factor contributing to this degradation appears to be logic in o.a.c.db.index.composites.CompositesSearcher used to paginate scans of index tables. In particular, in the use cases we've explored, this short algorithm used to select a page size appears to be the culprit:

            private int meanColumns = Math.max(index.getIndexCfs().getMeanColumns(), 1);
            // We shouldn't fetch only 1 row as this provides buggy paging in case the first row doesn't satisfy all clauses
            private int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);

In indexes where the cardinality doesn't scale linearly with the volume of data indexed, it seems likely that the value of meanColumns will steadily rise in write-heavy workloads. In the cases we've explored, filter.maxColumns() returns a small enough number (related to the lesser of the native-protocol page size or the user-specified limit for the query) that, after meanColumns reaches a few thousand, rowsPerQuery (the page size) is consistently set to 2.

The resulting overhead is severe. In our environment, if we fix rowsPerQuery to some reasonably large constant (e.g., 5,000), queries that with the existing logic would require over two minutes to complete can run in under ten seconds.

Using a constant clearly seems like the wrong answer. But the overhead the existing algorithm seems to introduce suggests that it isn't the right answer either. An intuitive solution might be to use the minimum of filter.maxRows() and filter.maxColumns() (or 2 if both of those are 1), but it's not immediately clear that there aren't safety considerations the algorithm is attempting to account for that this strategy does not.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

8550-2.0.txt
08/Jan/15 21:15
3 kB
Tom Hobbs

Activity

People

Assignee:: Tom Hobbs

Reporter:: Samuel Klock

Authors:: Tom Hobbs

Reviewers:: Benjamin Lerer

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 31/Dec/14 17:23

Updated:: 16/Apr/19 09:31

Resolved:: 14/Jan/15 17:36