Cassandra / CASSANDRA-13379

SASI index returns duplicate rows


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Feature/SASI
    • Labels: None
    • Severity: Normal

    Description

      CREATE TABLE bulks_recipients (
          bulk_id uuid,
          recipient text,
          bulk_id_idx uuid,
          PRIMARY KEY ((bulk_id, recipient))
      );
      

      bulk_id_idx is just a copy of bulk_id, because for some reason SASI cannot index partition key components at all.

      CREATE CUSTOM INDEX bulks_recipients_bulk_id ON bulks_recipients (bulk_id_idx) USING 'org.apache.cassandra.index.sasi.SASIIndex';
      

      Then I insert 1 million rows with the same bulk_id and a different recipient in each.
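      The insert step itself isn't shown in the report; presumably it is
      something like this, executed a million times with a varying
      recipient (the recipient literal here is an assumption, the UUID is
      the one queried below):

      INSERT INTO bulks_recipients (bulk_id, recipient, bulk_id_idx)
      VALUES (fedd95ec-2cc8-4040-8619-baf69647700b,
              'recipient-000001',  -- varies per row
              fedd95ec-2cc8-4040-8619-baf69647700b);
      

      Then: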

      > select count(*) from bulks_recipients ;
      
       count
      ---------
       1000000
      
      (1 rows)
      

      OK, it's fine here. Now let's query via the SASI index:

      > select count(*) from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;
      
       count
      ---------
       1010101
      
      (1 rows)
      

      Hmm, a very strange count: 10101 extra rows.
      OK, I've dumped the query result into a text file:
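      (The report doesn't show how the dump was made; one plausible way,
      with the keyspace name ks and the stripping of cqlsh's header and
      footer lines left as assumptions:)

      # cqlsh -e "select recipient from ks.bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;" > sasi.txt
      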

      # cat sasi.txt | wc -l
      1000200
      

      Here we have 200 extra rows for some reason (oddly, not even matching the 10101 extra reported by count(*)).

      Let's check if these are duplicates:

      # cat sasi.txt | sort | uniq | wc -l
      1000000
      

      Yep, looks like it: deduplicated, we're back to exactly 1000000 rows, so the 200 extra lines are duplicates.
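      (Not shown in the report, but uniq -d would list the rows that
      occur more than once:)

      # sort sasi.txt | uniq -d
      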

      Recreating the index does not help. If I issue the very same query against the partition key column bulk_id instead of bulk_id_idx, I get correct results (see the sketch below).
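      Presumably that reference query is the same count with bulk_id in
      place of bulk_id_idx; ALLOW FILTERING is added here because it is
      generally required when restricting only one component of a
      composite partition key. Per the report, this returns the expected
      1000000:

      > select count(*) from bulks_recipients where bulk_id = fedd95ec-2cc8-4040-8619-baf69647700b allow filtering;
      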

People

    Assignee: Unassigned
    Reporter: Igor Novgorodov (blind_oracle)
    Votes: 0
    Watchers: 2
