[CASSANDRA-17811] Fix CQL aggregation functions for collections, tuples and UDTs - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 5.0-alpha1, 5.0
Component/s: CQL/Semantics
Labels: None
Epic Link: CEP-20: Dynamic Data Masking
Complexity: Normal
Platform: All
Impacts: None
Source Control Link: https://github.com/apache/cassandra/commit/6da9e33602fad4b8bf9466dc0e9a73665469a195
Test and Documentation Plan: New unit tests are included.

Description

While working on CASSANDRA-8877, it was found that CQL's aggregation functions max, min and count can be applied to collections, but the result is returned as a blob. For example:

CREATE TABLE t (k int PRIMARY KEY, l list<int>);
INSERT INTO t(k, l) VALUES (0, [1, 2, 3]);
INSERT INTO t(k, l) VALUES (1, [10, 20, 30]);
SELECT max(l) FROM t;

 system.max(l)
------------------------------------------------------------
 0x00000003000000040000000a0000000400000014000000040000001e

This happens on 3.0, 3.11, 4.0, 4.1 and trunk.

I'm not sure whether these functions shouldn't be supported for collections at all, or whether they should be supported and the current result is wrong.

In the example above, the returned blob is the serialized value of [10, 20, 30], which is the right one according to the list comparator. I think this happens because the matched version of the function is the one for (blob) -> blob. We would need a (list<int>) -> list<int> function instead, but this function doesn't exist.
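To make the claim about the blob concrete, here is a minimal Python sketch that decodes it by hand. It assumes the collection layout used by native protocol v3 and later (a 4-byte big-endian element count, then for each element a 4-byte length followed by the payload) and that every element is a 4-byte int. The helper name decode_int_list is hypothetical and not part of Cassandra or any driver.

```python
import struct


def decode_int_list(blob: bytes) -> list:
    """Decode a serialized list<int> (hypothetical helper, for illustration only).

    Assumed layout: 4-byte big-endian element count, then for each element
    a 4-byte big-endian length followed by that many payload bytes.
    """
    (count,) = struct.unpack_from(">i", blob, 0)
    offset = 4
    values = []
    for _ in range(count):
        (size,) = struct.unpack_from(">i", blob, offset)
        offset += 4
        if size != 4:
            raise ValueError("expected a 4-byte int element, got %d bytes" % size)
        (value,) = struct.unpack_from(">i", blob, offset)
        offset += size
        values.append(value)
    return values


# The blob returned by SELECT max(l) FROM t, without the leading 0x.
blob = bytes.fromhex("00000003000000040000000a0000000400000014000000040000001e")
print(decode_int_list(blob))  # [10, 20, 30]
```

The decoded value matches the second row, so the aggregate picked the expected list and only the return type is off.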

It would be quite easy to add versions of the max, min and count functions for every type of collection (list<int>, list<text>, map<int, int>, map<int, text>, etc.). The downside of this approach is that it would increase the number of aggregation functions kept in memory from 82 to 2722, if my maths are right. This is quite an increase, mainly due to the many possible combinations of the map type. Here is a quick, incomplete prototype of the approach.

Also, I'm not sure that applying those aggregation functions to collections is very useful in practice. Thus, an alternative approach would be just forbidding them, considering them not supported. I don't think it would be a problem for backward compatibility since no one has complained about the current behaviour, and we might well consider that the original intent was not to allow aggregation on collections. At least, there aren't any tests for it, and I can't find any documentation about it either.

Another idea that comes to mind is that we could change the meaning of those functions to aggregate the values within the collection, instead of aggregating the rows. In that case, the behaviour would be:

CREATE TABLE t (k int PRIMARY KEY, l list<int>);
INSERT INTO t(k, l) VALUES (0, [1, 2, 3]);
INSERT INTO t(k, l) VALUES (1, [10, 20, 30]);
SELECT max(l) FROM t;

 k | system.max(l)
---+---------------
 1 |            30
 0 |             3

Of course we could have separate function names for that type of collection aggregations, like collectionMax, maxItem, or something like that.
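The two possible meanings of max(l) discussed above can be contrasted with a small Python sketch. It only models the semantics and is not Cassandra code: the table is a plain dict, Python's built-in list ordering is used as a stand-in for the list comparator (an assumption that holds for this example), and the function names max_across_rows and max_within_each_row are hypothetical.

```python
# Table t from the example, modelled as {k: l}.
rows = {0: [1, 2, 3], 1: [10, 20, 30]}


def max_across_rows(table):
    """Row aggregation: one result for the whole table, the greatest list.

    Python's lexicographic list ordering stands in for the list comparator.
    """
    return max(table.values())


def max_within_each_row(table):
    """Collection aggregation: one result per row, the greatest element.

    Empty lists yield None here; that choice is an assumption of this sketch.
    """
    return {k: max(l, default=None) for k, l in table.items()}


print(max_across_rows(rows))      # [10, 20, 30]
print(max_within_each_row(rows))  # {0: 3, 1: 30}
```

Since the two readings return different shapes (one list for the table versus one scalar per row), giving the per-collection variant its own function name avoids overloading max with both.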


Issue Links

Discovered while testing

CASSANDRA-8877 Ability to read the TTL and WRITE TIME of an element in a collection

Resolved

is depended upon by

CASSANDRA-17941 CQL data masking functions

Resolved

Parent Feature

CASSANDRA-18060 Add aggregation scalar functions on collections

Resolved

links to

PR trunk


People

Assignee: Andres de la Peña

Reporter: Andres de la Peña

Authors: Andres de la Peña

Reviewers: Benjamin Lerer

Votes: 0

Watchers: 4

Dates

Created: 11/Aug/22 16:37

Updated: 12/Sep/23 13:02

Resolved: 18/Nov/22 10:39