[CASSANDRA-11871] Allow to aggregate by time intervals - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.1-alpha1, 4.1
Component/s: Legacy/CQL
Labels:
None

Source Control Link:

https://github.com/apache/cassandra/commit/1ad8bf67a9c82cbb5ff38e5cf785f9fe2516d009
Test and Documentation Plan:

The patch adds new unit tests and DTests.

Description

For time series data it can be useful to aggregate by time intervals.

The idea would be to add support for one or several functions in the GROUP BY clause.

Regarding the implementation, even if in general I also prefer to follow the SQL syntax, I do not believe it will be a good fit for Cassandra.

If we have a table like:

CREATE TABLE trades
(
    symbol text,
    date date,
    time time,
    priceMantissa int,
    priceExponent tinyint,
    volume int,
    PRIMARY KEY ((symbol, date), time)
);

The trades will be inserted with an increasing time and sorted in the same order. As we may have to process a large amount of data, we want to limit ourselves to the cases where we can build the groups on the fly (which is not a requirement in the SQL world).

If we want to get the number of trades per minute with the SQL syntax we will have to write:

SELECT hour(time), minute(time), count(*) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY hour(time), minute(time);
which is fine. The problem is that if the user inverts the functions by mistake, like this:
SELECT hour(time), minute(time), count(*) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' GROUP BY minute(time), hour(time);
the query will return weird results.
The only way to prevent that would be to check the function order and make sure that functions cannot be skipped (e.g. GROUP BY hour(time), second(time)).
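To illustrate the skipped-function case, here is a minimal Python sketch. It assumes that "on the fly" means each group is formed from consecutive rows sharing the same key, which itertools.groupby mimics. The sample times and the helper group_on_the_fly are hypothetical and are not part of the patch.

```python
from datetime import time
from itertools import groupby

# Hypothetical sample: trade times already sorted, as the clustering order would return them.
trade_times = [time(1, 0, 5), time(1, 0, 30), time(1, 1, 5), time(1, 1, 30)]

def group_on_the_fly(times, key):
    """Count consecutive rows that share the same key (single pass, no hash table)."""
    return [(k, sum(1 for _ in rows)) for k, rows in groupby(times, key=key)]

# Grouping by hour then minute: every key appears exactly once.
by_minute = group_on_the_fly(trade_times, lambda t: (t.hour, t.minute))

# Skipping minute: rows with equal keys are no longer adjacent, so keys repeat.
by_second = group_on_the_fly(trade_times, lambda t: (t.hour, t.second))

print(by_minute)
print(by_second)
```

With the minute skipped, the key (1, 5) shows up in two separate groups. A hash-based SQL engine would merge them, but a single-pass grouping cannot.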

In my opinion a function like floor(<columnName>, <time range>) would be much better, as it does not allow this type of mistake and is much more flexible (you can create 5-minute buckets if you want to).

SELECT floor(time, m), count(*) FROM Trades
WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY floor(time, m);
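The bucketing that such a floor function would perform can be sketched in Python as follows. This is only an illustration of the idea, not the Cassandra implementation. The names floor_time and count_per_bucket and the sample times are made up, and buckets are assumed to be aligned on midnight.

```python
from datetime import time
from itertools import groupby

def floor_time(t, bucket_seconds):
    """Round a time of day down to a multiple of bucket_seconds since midnight."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    floored = seconds - seconds % bucket_seconds
    return time(floored // 3600, floored % 3600 // 60, floored % 60)

def count_per_bucket(times, bucket_seconds):
    """Count consecutive (already sorted) times that fall into the same bucket."""
    key = lambda t: floor_time(t, bucket_seconds)
    return [(bucket, sum(1 for _ in rows)) for bucket, rows in groupby(times, key=key)]

# Hypothetical sorted trade times for one partition.
trade_times = [time(9, 30, 1), time(9, 30, 59), time(9, 31, 0),
               time(9, 34, 59), time(9, 35, 0)]

per_minute = count_per_bucket(trade_times, 60)         # 1-minute buckets
per_five_minutes = count_per_bucket(trade_times, 300)  # 5-minute buckets

print(per_minute)
print(per_five_minutes)
```

Because a single key expresses the whole bucket, there is no ordering between several functions to get wrong, and changing the bucket size only changes one argument.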

An important aspect to keep in mind with a function like floor is the starting point. For a query like:

SELECT floor(time, 2h), count(*) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11' AND time >= '01:30:00' AND time <= '07:30:00' GROUP BY floor(time, 2h);

I think that ideally the result should return 3 groups: 01:30:00, 03:30:00 and 05:30:00.
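The effect of the starting point can be shown with a small Python sketch that compares buckets aligned on midnight with buckets aligned on the lower bound of the time restriction. The helper floor_time_from and the sample times are hypothetical, and every sample time is assumed to be at or after the starting point.

```python
from datetime import time

def to_seconds(t):
    """Seconds elapsed since midnight."""
    return t.hour * 3600 + t.minute * 60 + t.second

def floor_time_from(t, bucket_seconds, start=time(0, 0, 0)):
    """Round t down to the nearest bucket boundary, counting buckets from start."""
    offset = to_seconds(t) - to_seconds(start)
    floored = to_seconds(start) + offset - offset % bucket_seconds
    return time(floored // 3600, floored % 3600 // 60, floored % 60)

# Hypothetical sorted trade times between 01:30:00 and 07:30:00.
trade_times = [time(1, 30, 0), time(2, 45, 0), time(3, 29, 59), time(3, 30, 0),
               time(5, 0, 0), time(6, 10, 0), time(7, 29, 59)]

two_hours = 2 * 3600

# Buckets aligned on midnight: 00:00, 02:00, 04:00, 06:00.
midnight_groups = list(dict.fromkeys(
    floor_time_from(t, two_hours) for t in trade_times))

# Buckets aligned on the lower bound of the restriction: 01:30, 03:30, 05:30.
start_groups = list(dict.fromkeys(
    floor_time_from(t, two_hours, start=time(1, 30, 0)) for t in trade_times))

print(midnight_groups)
print(start_groups)
```

For these sample times, aligning on midnight gives four groups, two of them only partly covered by the restriction, while aligning on 01:30:00 gives the three groups described above.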

Issue Links

depends upon

CASSANDRA-10707 Add support for Group By to Select statement

Resolved

CASSANDRA-11873 Add duration type

Resolved

CASSANDRA-10783 Allow literal value as parameter of UDF & UDA

Resolved

is related to

CASSANDRA-18867 Document GROUP BY time interval feature

Triage Needed

People

Assignee: Benjamin Lerer

Reporter: Benjamin Lerer

Authors: Benjamin Lerer

Reviewers: Andres de la Peña, Yifan Cai

Votes: 3

Watchers: 24

Dates

Created: 23/May/16 08:51

Updated: 19/Sep/23 18:41

Resolved: 22/Apr/22 09:05

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

3.5h