Cassandra
  1. Cassandra
  2. CASSANDRA-5514

Track min/max clustered values per sstable to optimize reads

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 2.0 beta 1
    • Component/s: API, Core
    • Labels:
      None

      Description

      Slice queries can't optimize based on timestamp except for rare cases (CASSANDRA-4116). However, many common queries involve an implicit time component, where the application author knows that he is only interested in data more recent than X, or older than Y.

      We could use the per-sstable max and min timestamps we track to avoid touching cold data if we could pass a hint to Cassandra about the time range we care about.

      1. 0001-CASSANDRA-5514-v1.patch
        42 kB
        Marcus Eriksson
      2. 0001-CASSANDRA-5514-v2.patch
        63 kB
        Marcus Eriksson

        Activity

        Jonathan Ellis created issue -
        Hide
        Marcus Eriksson added a comment -

        i like it, it would go very nice with CASSANDRA-5228 and an "age tiered" compaction strategy

        Show
        Marcus Eriksson added a comment - i like it, it would go very nice with CASSANDRA-5228 and an "age tiered" compaction strategy
        Hide
        Jonathan Ellis added a comment -

        Syntax could be something like

        SELECT *
        FROM mytimeseries
        WHERE key = foo
        USING TIMESTAMP > X
        

        Alternatives could include

        1. Adding this to the WHERE clause. However, this would be incompatible with any existing columns named "timestamp," and implementation would likely be messier than just letting the parser pull this out for us. Additionally, we've already decided to not pretend timestamp is a pseudo-column for insert/update.
        2. Using the WITH keyword or a new keyword entirely. This may be somewhere more natural than USING, but again, having gone with USING on insert/update, it seems like a good idea to be consistent here.
        Show
        Jonathan Ellis added a comment - Syntax could be something like SELECT * FROM mytimeseries WHERE key = foo USING TIMESTAMP > X Alternatives could include Adding this to the WHERE clause. However, this would be incompatible with any existing columns named "timestamp," and implementation would likely be messier than just letting the parser pull this out for us. Additionally, we've already decided to not pretend timestamp is a pseudo-column for insert/update. Using the WITH keyword or a new keyword entirely. This may be somewhere more natural than USING , but again, having gone with USING on insert/update, it seems like a good idea to be consistent here.
        Jonathan Ellis made changes -
        Field Original Value New Value
        Assignee Marcus Eriksson [ krummas ]
        Hide
        Jonathan Ellis added a comment - - edited

        Sylvain suggests that instead of adding extra syntax, we track min and max column values (cell name components) on a per-sstable basis. The query then just becomes

        SELECT *
        FROM mytimeseries
        WHERE key = foo
          AND mytimecolumn > X
        

        This requires less moving pieces to implement (don't need to touch QueryProcessor) and I'm generally a fan of less special cases.

        This also generalizes to non-time-based clustering. I can't think of any examples off the top of my head where that's critical, but it does seem like a good option to support.

        A small downside is that we'll need to make SSTableMetadata.Collector cql-aware. Simplest approach IMO is to just track all the "clustering" components per-sstable, it's not going to be a whole lot of extra data consumed.

        Show
        Jonathan Ellis added a comment - - edited Sylvain suggests that instead of adding extra syntax, we track min and max column values (cell name components) on a per-sstable basis. The query then just becomes SELECT * FROM mytimeseries WHERE key = foo AND mytimecolumn > X This requires less moving pieces to implement (don't need to touch QueryProcessor) and I'm generally a fan of less special cases. This also generalizes to non-time-based clustering. I can't think of any examples off the top of my head where that's critical, but it does seem like a good option to support. A small downside is that we'll need to make SSTableMetadata.Collector cql-aware. Simplest approach IMO is to just track all the "clustering" components per-sstable, it's not going to be a whole lot of extra data consumed.
        Hide
        Tupshin Harper added a comment - - edited

        See CASSANDRA-3581 for a previous ticket along similar lines.

        +1 on no additional syntax and Sylvain's general approach.

        +1000 on a time-series aware compaction strategy. That is where the big win will be.

        Show
        Tupshin Harper added a comment - - edited See CASSANDRA-3581 for a previous ticket along similar lines. +1 on no additional syntax and Sylvain's general approach. +1000 on a time-series aware compaction strategy. That is where the big win will be.
        Hide
        Marcus Eriksson added a comment -

        Tupshin Harper what ticket did you mean? 5518 looks quite unrelated

        Show
        Marcus Eriksson added a comment - Tupshin Harper what ticket did you mean? 5518 looks quite unrelated
        Hide
        Tupshin Harper added a comment -

        Markus, fixed to 3581.

        Show
        Tupshin Harper added a comment - Markus, fixed to 3581.
        Hide
        Jonathan Ellis added a comment -

        I made a good point in 3581 that I forgot about.

        Currently we only promote row tombstones if there is enough data in the row to be worth indexing, under the reasoning that if it's not worth indexing we might as well read the tombstone from the data file as well as any columns involved.

        So, we'll need to change that to always promote (CASSANDRA-5526). As it happens, I wrote this code once already for CASSANDRA-5487, so I'll see about dusting it off.

        Show
        Jonathan Ellis added a comment - I made a good point in 3581 that I forgot about. Currently we only promote row tombstones if there is enough data in the row to be worth indexing, under the reasoning that if it's not worth indexing we might as well read the tombstone from the data file as well as any columns involved. So, we'll need to change that to always promote ( CASSANDRA-5526 ). As it happens, I wrote this code once already for CASSANDRA-5487 , so I'll see about dusting it off.
        Hide
        Sylvain Lebresne added a comment -

        So, we'll need to change that to always promote

        I'm not sure that's necessary for that issue. The goal here would be to ignore sstable based on basically their metadata, so promoting stuff from the data file to the index file shouldn't have any impact here.

        But, it's true that row tombstones complicate stuff because in theory, as soon as a sstable has a row tombstone, then the min/max column name should be empty (i.e. cover allthethings). But I think we can probably get away by involving the min/max timestamps. More precisely, we can ignore row tombstones in the min/max column computation and initially ignore sstable based on this min/max column name. But then, we would need to do a pass over the result to make sure no column from the result could be overridden by a row tombstone in one of the ignored sstable, which can be done using the sstable max timestamp.

        Or was that something else you had in mind?

        Show
        Sylvain Lebresne added a comment - So, we'll need to change that to always promote I'm not sure that's necessary for that issue. The goal here would be to ignore sstable based on basically their metadata, so promoting stuff from the data file to the index file shouldn't have any impact here. But , it's true that row tombstones complicate stuff because in theory, as soon as a sstable has a row tombstone, then the min/max column name should be empty (i.e. cover allthethings). But I think we can probably get away by involving the min/max timestamps. More precisely, we can ignore row tombstones in the min/max column computation and initially ignore sstable based on this min/max column name. But then, we would need to do a pass over the result to make sure no column from the result could be overridden by a row tombstone in one of the ignored sstable, which can be done using the sstable max timestamp. Or was that something else you had in mind?
        Hide
        Jonathan Ellis added a comment -

        You're right, that's a simpler solution.

        Show
        Jonathan Ellis added a comment - You're right, that's a simpler solution.
        Marcus Eriksson made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Marcus Eriksson added a comment -
        • tracks min/max column names in sstables (according to comparator)
        • uses the min/max column names no eliminate sstables during slices (and maxLocalDeletionTime to figure out if an sstable has a tombstone)

        Jonathan Ellis - i have not made SSTM.Collector cql-aware, i couldnt really see a reason for it, could you elaborate on that comment? We should be able to eliminate sstables based on the entire column name, even if it is a composite one right?

        Show
        Marcus Eriksson added a comment - tracks min/max column names in sstables (according to comparator) uses the min/max column names no eliminate sstables during slices (and maxLocalDeletionTime to figure out if an sstable has a tombstone) Jonathan Ellis - i have not made SSTM.Collector cql-aware, i couldnt really see a reason for it, could you elaborate on that comment? We should be able to eliminate sstables based on the entire column name, even if it is a composite one right?
        Marcus Eriksson made changes -
        Attachment 0001-CASSANDRA-5514-v1.patch [ 12583105 ]
        Hide
        Sylvain Lebresne added a comment -

        We should be able to eliminate sstables based on the entire column name, even if it is a composite one right?

        That won't work as well as we want unfortunately. Say your table is something like:

        CREATE TABLE timeline (
          key int,
          category string,
          time timestamp,
          value1 int,
          value2 text,
          PRIMARY KEY (key, category, time);
        )
        

        i.e. the (internal) column names are a composite with a category first and then a timeline for each category (not saying that model is smart, it's just an example to illustrate ).

        Now say in one sstable you only have entries whose time is <= 10, but for category "a" through "z". So the min/max column names might be "a:0"/"z:10". So if you do a query like:

        SELECT * FROM timeline WHERE key = 3 AND category = "c" AND time > 100
        

        then we cannot eliminate the sstable above, because "a:0" < "c:100" < "z:10".

        However, if we keep each component separately, we would keep: for category, min/max = "a"/"z" and for time, min/max = 0/10. And from that we can eliminate the sstable for the query above since the time queried is not in the min/max range for time.

        Show
        Sylvain Lebresne added a comment - We should be able to eliminate sstables based on the entire column name, even if it is a composite one right? That won't work as well as we want unfortunately. Say your table is something like: CREATE TABLE timeline ( key int, category string, time timestamp, value1 int, value2 text, PRIMARY KEY (key, category, time); ) i.e. the (internal) column names are a composite with a category first and then a timeline for each category (not saying that model is smart, it's just an example to illustrate ). Now say in one sstable you only have entries whose time is <= 10, but for category "a" through "z". So the min/max column names might be "a:0"/"z:10". So if you do a query like: SELECT * FROM timeline WHERE key = 3 AND category = "c" AND time > 100 then we cannot eliminate the sstable above, because "a:0" < "c:100" < "z:10". However, if we keep each component separately, we would keep: for category , min/max = "a"/"z" and for time , min/max = 0/10. And from that we can eliminate the sstable for the query above since the time queried is not in the min/max range for time .
        Hide
        Marcus Eriksson added a comment -

        track and use CompositeType components

        Show
        Marcus Eriksson added a comment - track and use CompositeType components
        Marcus Eriksson made changes -
        Attachment 0001-CASSANDRA-5514-v2.patch [ 12583829 ]
        Jonathan Ellis made changes -
        Reviewer slebresne
        Jonathan Ellis made changes -
        Reviewer slebresne jbellis
        Hide
        Jonathan Ellis added a comment -

        LGTM in principle. Made some tweaks in https://github.com/jbellis/cassandra/commits/5514. Still a bit unclear about DCT – added a hack in the last commit to always return true, but do we need to do anything on the collector side?

        Show
        Jonathan Ellis added a comment - LGTM in principle. Made some tweaks in https://github.com/jbellis/cassandra/commits/5514 . Still a bit unclear about DCT – added a hack in the last commit to always return true, but do we need to do anything on the collector side?
        Hide
        Marcus Eriksson added a comment -

        the changes look good to me

        regarding DCT, i dunno, i guess we could simply not collect the max/min columns when comparator is DCT?

        Show
        Marcus Eriksson added a comment - the changes look good to me regarding DCT, i dunno, i guess we could simply not collect the max/min columns when comparator is DCT?
        Hide
        Sylvain Lebresne added a comment -

        regarding DCT

        I haven't looked at the patch details, but my intuition would be do handle DCT as any non-composite type. That's what CQL3 does really.

        Show
        Sylvain Lebresne added a comment - regarding DCT I haven't looked at the patch details, but my intuition would be do handle DCT as any non-composite type. That's what CQL3 does really.
        Hide
        Marcus Eriksson added a comment -

        ah, ok, that's what the patch does

        Show
        Marcus Eriksson added a comment - ah, ok, that's what the patch does
        Hide
        Jonathan Ellis added a comment -

        Ship it!

        Show
        Jonathan Ellis added a comment - Ship it!
        Hide
        Marcus Eriksson added a comment -

        committed as 0ad499e9a82afa23f17672e25678d065b421e0d8 - thanks!

        Show
        Marcus Eriksson added a comment - committed as 0ad499e9a82afa23f17672e25678d065b421e0d8 - thanks!
        Marcus Eriksson made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Jonathan Ellis added a comment -

        Is there any reason we can't extend this to non-clustered columns as well?

        Show
        Jonathan Ellis added a comment - Is there any reason we can't extend this to non-clustered columns as well?
        Hide
        Sylvain Lebresne added a comment -

        Is there any reason we can't extend this to non-clustered columns as well?

        I don't a reason why it wouldn't work technically. I'm slightly less sure how useful it will be in practice. For the partition key columns, I suspect the bloom filter makes it almost always useless. For non primary-key columns, we only allow conditions on them with either ALLOW FILTERING or 2ndary indexes. Even for 2ndary indexes, we basically only allow them without ALLOW FILTERING if there is just one equal condition (on an indexed column). In that latter case, the index gives us directly rows that does match the condition so the BF should again remove useless sstables. So if that reasoning is right, it means it would only help with ALLOW FILTERING. Which doesn't mean we shouldn't do it btw, just trying to think out loud if that's worth the trouble.

        Show
        Sylvain Lebresne added a comment - Is there any reason we can't extend this to non-clustered columns as well? I don't a reason why it wouldn't work technically. I'm slightly less sure how useful it will be in practice. For the partition key columns, I suspect the bloom filter makes it almost always useless. For non primary-key columns, we only allow conditions on them with either ALLOW FILTERING or 2ndary indexes. Even for 2ndary indexes, we basically only allow them without ALLOW FILTERING if there is just one equal condition (on an indexed column). In that latter case, the index gives us directly rows that does match the condition so the BF should again remove useless sstables. So if that reasoning is right, it means it would only help with ALLOW FILTERING. Which doesn't mean we shouldn't do it btw, just trying to think out loud if that's worth the trouble.
        Jonathan Ellis made changes -
        Summary Allow timestamp hints Track min/max clustered values per sstable to optimize reads
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        18d 17h 40m 1 Marcus Eriksson 14/May/13 10:26
        Patch Available Patch Available Resolved Resolved
        10d 6h 17m 1 Marcus Eriksson 24/May/13 16:43

          People

          • Assignee:
            Marcus Eriksson
            Reporter:
            Jonathan Ellis
            Reviewer:
            Jonathan Ellis
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development