[CASSANDRA-8131] Short-circuited query results from collection index query - ASF JIRA

Agile Board

Attach files

Attach Screenshot

Bulk Copy Attachments

Bulk Move Attachments

Voters

Watch issue

Watchers

Create sub-task

Convert to sub-task

Move

Link

Clone

Labels

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 2.1.1
Component/s: Legacy/CQL
Labels:
- collections
- cql3
- cqlsh
- query
- queryparser
- triaged
Environment:

Debian Wheezy, Oracle JDK, Cassandra 2.1

Severity:
Normal
Since Version:

2.1.0

Description

After watching Jonathan's 2014 summit video, I wanted to give collection indexes a try as they seem to be a fit for a "search by key/values" usage pattern we have in our setup. Doing some test queries that I expect users would do against the table, a short-circuit behavior came up:

Here's the whole transcript:

CREATE TABLE by_sets (id int PRIMARY KEY, datakeys set<text>, datavars set<text>);
CREATE INDEX by_sets_datakeys ON by_sets (datakeys);
CREATE INDEX by_sets_datavars ON by_sets (datavars);
INSERT INTO by_sets (id, datakeys, datavars) VALUES (1, {'a'}, {'b'});
INSERT INTO by_sets (id, datakeys, datavars) VALUES (2, {'c'}, {'d'});
INSERT INTO by_sets (id, datakeys, datavars) VALUES (3, {'e'}, {'f'});
INSERT INTO by_sets (id, datakeys, datavars) VALUES (4, {'a'}, {'z'});
SELECT * FROM by_sets;

 id | datakeys | datavars
----+----------+----------
  1 |    {'a'} |    {'b'}
  2 |    {'c'} |    {'d'}
  4 |    {'a'} |    {'z'}
  3 |    {'e'} |    {'f'}

We then tried this query which short-circuited:

SELECT * FROM by_sets WHERE datakeys CONTAINS 'a' AND datakeys CONTAINS 'c';

 id | datakeys | datavars
----+----------+----------
  1 |    {'a'} |    {'b'}
  4 |    {'a'} |    {'z'}

(2 rows)

Instead of receveing 3 rows, which match the datakeys CONTAINS 'a' AND datakeys CONTAINS 'c' we only got the first.

Doing the same, but with CONTAINS 'c' first, ignores the second AND.

SELECT * FROM by_sets WHERE datakeys CONTAINS 'c' AND datakeys CONTAINS 'a' ;

 id | datakeys | datavars
----+----------+----------
  2 |    {'c'} |    {'d'}

(1 rows)

Also, on a side-note, I have two indexes on both datakeys and datavars. But when trying to run a query such as:

select * from by_sets WHERE datakeys CONTAINS 'a' AND datavars CONTAINS 'z';
code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. 
If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

The second column, after AND (even if I inverse the order) requires an "allow filtering" clause yet the column is indexed an an in-memory "join" of the primary keys of these sets on the coordinator could build up the result.

Could anyone explain the short-circuit behavior?
And the requirement for "allow-filtering" on a secondly indexed column?

If they're not bugs but intended they should be documented better, at least their limitations.

Attachments

CASSANDRA-8131-V2.txt
21/Oct/14 12:15
8 kB
Benjamin Lerer
CASSANDRA-8131.txt
20/Oct/14 13:54
8 kB
Benjamin Lerer

Activity

Comment

This comment will be Viewable by All Users Viewable by All Users

Cancel

People

Assignee:: Benjamin Lerer Assign to me

Reporter:: Catalin Alexandru Zamfir

Authors:: Benjamin Lerer

Reviewers:: Sylvain Lebresne

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 16/Oct/14 21:34

Updated:: 16/Apr/19 09:31

Resolved:: 21/Oct/14 14:41

Agile

View on Board

Short-circuited query results from collection index query

Details

Description

Attachments

Attachments

Activity

People

Dates

Agile

Slack

Issue deployment