Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels:

      Description

      We currently support expiring columns by time-to-live; we've also had requests for keeping the most recent N columns in a row.

      1. 3929.txt
        6 kB
        Dave Brosius
      2. 3929_g.txt
        7 kB
        Dave Brosius
      3. 3929_g_tests.txt
        6 kB
        Dave Brosius
      4. 3929_f.txt
        7 kB
        Dave Brosius
      5. 3929_e.txt
        6 kB
        Fabien Rousseau
      6. 3929_d.txt
        4 kB
        Dave Brosius
      7. 3929_c.txt
        5 kB
        Dave Brosius
      8. 3929_b.txt
        7 kB
        Dave Brosius

        Activity

        Jonathan Ellis added a comment -

        This is difficult to do efficiently, since it implies checking the entire row's contents on each update. (Skipping this, and only checking/deleting obsolete columns at read time, means you could blow your column budget by arbitrarily large amounts during write intensive workloads.) Even checking randomly on say 1% of writes could dramatically affect write performance for larger-than-memory datasets.

        Sylvain Lebresne added a comment -

        Agreed, it's hard to do efficiently. What is easy to do is to write a compaction strategy (or have a strategy option) that only keeps the first N columns on each compaction. Of course, that doesn't guarantee that you will only get the N most recent columns, but in practice that would fairly efficiently get rid of the excess data, which I believe is mostly what people care about. Basically it would really just be "we'll discard everything we know is outside the first N columns". I suspect that in practice that may be a good trade-off, but given it's not perfect, I've always thought that it probably makes more sense as an externally contributed compaction strategy.
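
        To make that idea concrete, here is a minimal sketch of the per-row "keep only the first N columns" step such a compaction strategy could apply. It is illustration only: the StoredColumn and RowTrimmer types below are hypothetical stand-ins, not Cassandra classes, and a real strategy would still have to deal with tombstones and with rows spread across several SSTables (which is exactly why the result is only approximate).

        import java.util.List;

        // Hypothetical stand-in for an internal column; not Cassandra's Column class.
        final class StoredColumn {
            final String name;
            final byte[] value;
            StoredColumn(String name, byte[] value) { this.name = name; this.value = value; }
        }

        final class RowTrimmer {
            // Columns arrive already sorted in comparator order during compaction,
            // so capping a row is just "stop writing after maxColumns entries".
            static List<StoredColumn> keepFirstN(List<StoredColumn> sortedColumns, int maxColumns) {
                if (sortedColumns.size() <= maxColumns)
                    return sortedColumns;
                return sortedColumns.subList(0, maxColumns);
            }
        }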

        Drew Kutcharian added a comment -

        I agree with Sylvain. Most of the time all you care about is having a capped collection, for example keeping a history or an audit log for something.

        Colin Taylor added a comment - - edited

        Another vote for the compaction strategy. We're required to keep at least N days' worth of logs, so we'd like to bound our usage without needing precision.

        Ahmet AKYOL added a comment -

        Here are some example hypothetical column family storage parameters for this feature:

        max_column_number_hint : 1000 // meaning: try to keep around 1000 columns. Since it's a hint, we (users) are OK with tombstones or an 800-1200 range

        or

        max_row_size_hint : 1MB

        Rick Branson added a comment -

        Would love to see this as well, as a way to keep data sizes for wide rows under control, for use cases where old data at the tail of the row becomes more or less useless and time is not a dependable dimension to use as a truncation method.

        Clearly it doesn't have to be perfect as far as how much data it actually keeps around, but I'd like to see the CF configuration be a lower bound on the number of columns kept. Basically a way to communicate to Cassandra what your requirement is as far as retention, and it takes care of meeting that target. An acceptable edge case (at least from my perspective) where this might "break" is if the user does their own deletion of some columns.

        Dave Brosius added a comment -

        Here's an attempt. It's likely this patch is naively wrong, as I may not understand the full consequences of what's going on, but basically before writing out the precompacted row, limit the number of live columns to the value set in the cf's compaction_options. It converts the first liveColumns - N columns to tombstones.

        Jonathan Ellis added a comment - - edited

        I think we also need to limit

        • on flush, since there's no need to knowingly save data we don't want
        • on LCR as well as PCR, since enough small rows can still overflow to LCR mode

        Also: if I'm understanding correctly, we're tombstoning the beginning of the row here? ISTM tombstoning the end of the row will be more in keeping with our advice that "querying from the start of the row in comparator order is fastest."

        Dave Brosius added a comment -

        Also: if I'm understanding correctly, we're tombstoning the beginning of the row here? ISTM tombstoning the end of the row will be more in keeping with our advice that "querying from the start of the row in comparator order is fastest."

        It seems to me you would want this feature only when you have some sort of time-based column name scheme, and thus you only want to save the most recent N samples, tossing out the old ones.

        Rick Branson added a comment -

        +1 for tombstoning the tail of the row and not the head.

        If you want the most recent data at the head of the row, use a ReversedType(TimeUUIDType) comparator. Grabbing the tail on every query will kill performance.

        Jonathan Ellis added a comment - - edited

        That's what Reversed comparator is for.

        (Non-facetiously, that is what we'd recommend in that case since reading from start of row can skip index deserialization for a decent speedup. Basically you only want to be reading from end of row if that's a once-in-a-while query. If it's your main query, reverse it at the comparator level, not the query level.)

        Edit: Rick typed faster than I did.

        Dave Brosius added a comment -

        Retain the columns at the front of the row. This patch needs to add tombstoning of columns on flush as well, as suggested by jbellis. (in progress)

        Dave Brosius added a comment -

        3929_c.txt tombstones columns on compaction and flushing; the setting is still in compaction_options.

        Jonathan Ellis added a comment -

        Good idea putting the code in the index Builder!

        It looks, though, like build() is only used when we can fit the row in memory; otherwise LazilyCompactedRow calls add directly (which is also called by build). So I think you're going to need to move the retained row count into the Builder instance to maintain state across invocations.
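
        Schematically, the point is just that the counter has to be instance state on the Builder so it survives every add() call, whether add() is driven by build() (in-memory rows) or called directly by the lazy path. The classes below are hypothetical stand-ins, not the real ColumnIndex.Builder:

        import java.util.List;

        // Hypothetical sketch of the shape of the change, not Cassandra's actual Builder.
        final class Builder {
            private final int maxRetained;
            private int retained; // instance state, maintained across add() invocations

            Builder(int maxRetained) { this.maxRetained = maxRetained; }

            // Called once per column, either from build() or directly by the lazy compaction path.
            boolean add(Object column) {
                retained++;
                return retained <= maxRetained; // false would mean "tombstone/skip this column"
            }

            // In-memory path: just drives add() for every column of the row.
            void build(List<Object> columns) {
                for (Object c : columns)
                    add(c);
            }
        }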

        Dave Brosius added a comment -

        store state in Builder, and push logic to add

        Jonathan Ellis added a comment -

        The good news is, this looks good. (Nit: getRetainedColumnCount would be a bit cleaner as a method on CFMetaData.)

        The bad news is, I think we need to scope creep – the right unit of retention is the cql3 row. For COMPACT STORAGE there is one row per cell, but otherwise it gets complicated... there's a "this row exists" marker cell, and collection columns become one cell per entry. Dealing with partial (cql3) rows is not something we want to inflict on users, so we should enable column tombstoning only on cql3 row boundaries.

        cfmetadata.cqlCfDef will have the information we need to do this, in particular isCompact and keys. (See www.datastax.com/dev/blog/thrift-to-cql3.)

        I suspect you're going to want a unit test or two here. QueryProcessor.processInternal is probably the easiest way to do cql from a test.

        Fabien Rousseau added a comment -

        Hum, the current patch works if no deletes are done...

        Let's have an example with deletes :
        Suppose that we want to keep 3 columns, and have standard comparator.
        Let's insert 4 column names : E, F, G, H
        Then flush (on the SSTable, we will have : E, F, G, tombstone(H) ).
        Let's insert another 4 column names : A, B, C, D
        Then delete column B.
        Then flush (on the SSTable, we will have : A, tombstone(B), C, tombstone(D) )

        With the current patch (which excludes tombstones in the count on the read path) :
        reading the first 3 columns would return : A,C,E
        By including the tombstones in the count in the read path :
        reading the first 3 columns would return : A,C

        I think returning A,C,E is incorrect because the last inserted columns were A,C,D.

        So, to support delete, there is also something to do on the read path (include tombstones in columns count, so it never goes after "maxColumns").

        I propose the patch 3929_e.txt.

        Dave Brosius added a comment -

        Rebase to latest trunk, fix ACS validations, and only do auto-tombstoning if using compact storage.

        Doing non-compact storage would mean (I think) deserializing each column's name to find where the first component changes in order to count rows, which seems painful performance-wise.
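
        For illustration, a rough sketch of that per-cell counting (hypothetical types below, not the real CompositeType code): walk the cells in comparator order and bump the row count whenever the first composite component changes.

        import java.util.List;
        import java.util.Objects;

        // Hypothetical stand-in for a cell whose composite name is already decomposed.
        final class Cell {
            final List<String> nameComponents; // e.g. [clusteringValue, cqlColumnName]
            Cell(List<String> nameComponents) { this.nameComponents = nameComponents; }
        }

        final class CqlRowCounter {
            // Count CQL3 rows by watching the first composite component change.
            static int countRows(List<Cell> cellsInComparatorOrder) {
                int rows = 0;
                String currentPrefix = null;
                for (Cell cell : cellsInComparatorOrder) {
                    String prefix = cell.nameComponents.get(0);
                    if (!Objects.equals(prefix, currentPrefix)) {
                        rows++;
                        currentPrefix = prefix;
                    }
                }
                return rows;
            }
        }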

        Dave Brosius added a comment - - edited

        Support removal for non-compact storage (as well), by removing whole 'cql rows' at a time.

        patch against trunk

        3929_g.txt

        Dave Brosius added a comment -

        add tests for compact storage and composite cases

        3929_g_tests.txt

        Sylvain Lebresne added a comment -

        I have to say that I'm a bit uncomfortable with this patch/ticket.

        My problem is, it is not very easy to understand what this feature actually does for an end user, and provided said user does deletes, the behavior becomes pretty much random.

        Let's ignore deletions first and put ourselves in the shoes of a user.

        That option is supposed to impose a row size limit. So say N = 2 and I insert (not at the same time, nor necessarily in that order) columns A, B and C. Since I cap the row at 2, if I do a full row read that's what I get: [A, B]. So the row contains only A and B, right? But what if I do a slice(B, "")? Then it depends: I may get [B], but I can also get [B, C] (because maybe flush happens so that [A, B] ends up in one sstable and [C] in another, so that C is still there internally, and the slice will have no way to know that it shouldn't return C because C is over the row size limit). And that heavily depends on internal timing: maybe I'll get [B, C] but if I try one second later I'll get [B] because compaction has kicked in. So, what gives?

        Adding deletion makes that even worse. If you start doing deletes, depending on the timing of flush/compaction, you may or may not even get the first N columns you've inserted in the row (typically, in Fabien's example above, if you change when flush occurs, even with the last patch attached, you may either get [A, C] (which is somewhat wrong really) or [A, C, D]).

        I also want to mention that because compaction/flush don't happen synchronously on all replicas, there is a high chance that even if replicas are consistent, their actual sstable content differs, meaning that this probably breaks repair fairly badly.

        Let's be clear: I'm not saying that feature cannot be useful. But I'm saying this is a bit of a hack whose semantics depend on internal timing of operations, not a feature with cleanly defined semantics. That's why I said earlier that I always thought this would make a good externally contributed compaction strategy, but a priori it feels a bit too hacky for core Cassandra imo. I haven't made up my mind completely yet, but I wanted to voice my concern first and see what others think. And I have to say that if we do go ahead with this feature in core Cassandra, I'd be in favor of disabling deletes on CFs that have that option set, because imo throwing deletes into the mix makes things too unpredictable to be really useful.

        Jonathan Ellis added a comment - - edited

        Let's ignore deletions first and put ourselves in the shoes of a user.

        I think if we make users think about columns, we have lost. It should really be defined in terms of cql3 rows per partition.

        I also think that it's confusing for a CF defined with a limit of, say, 50 to have SELECT * FROM cf WHERE key = ? sometimes return 50 results, but sometimes more, depending on compaction state. We should have logic on the query path to "pretend that compaction is perfect." Put another way, we shouldn't leak implementation details. (And it should be fairly easy to do this with the LIMIT logic.)

        With this design I don't think we'd need any unusual restrictions on DELETE.

        Ahmet AKYOL added a comment -

        I see Sylvain Lebresne's point, and even without knowing C* internals, it really sounds like an impossible task. The real problem here is that during compaction, nodes have to deal with many rows; it's like a thread synchronization nightmare at the node level (a.k.a. distributed systems). So why not give responsibility to users and provide something like this:

        delete from recentuseractivities where userid=1 AFTER COLUMN [50]; 
        

        Users may add this statement after every insert in a batch, or they can find a way to call it less often. It should skip tombstones and may not throw an exception for small sizes, of course...
        Since it's row-based, it seems doable to me.

        or you may add this kind of feature to lists:

        DELETE top_places [>50] FROM users WHERE user_id = 'frodo';
        

        something like [>N] could be added ...

        just my two cents, no intention to interrupt your development process

        Rick Branson added a comment - - edited

        I see the arguments in general with how it's difficult to clearly communicate to the end user what exactly is going to happen to their data. At this point I'm looking at implementing this as a compaction strategy. I've also not done extensive testing on exactly how expensive it'd be to read the Nth column in the row for a sample of inserts and delete the unneeded data, which will probably come first. This is a blocker for us moving some storage for a few features from a very manually managed Redis cluster to C*.

        Ahmet AKYOL: something like that, while probably slightly more "grokable" by the end user, would actually require reading the entire row for each operation unless some fancy enhancements to tombstones were made. If the data is time-ordered, this can be emulated by reading the N+1th column and deleting the row with a timestamp of that column+1. The idea with implementing this the way we have in this ticket is that we'd get it for "free" by making it part of the compaction process.
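
        For what it's worth, a rough sketch of that client-side emulation (the client interface below is a hypothetical stand-in, not a real Cassandra client API): read one column past the cap, take its timestamp, and issue a row-level deletion at that timestamp so the excess column and everything older disappear while the newer N columns survive.

        import java.util.List;

        // Hypothetical client interface, used only for illustration.
        interface WideRowClient {
            // Read up to 'count' columns of 'rowKey', newest first.
            List<TimestampedColumn> readNewestFirst(String rowKey, int count);
            // Row-level deletion: removes every column with timestamp <= deleteAt.
            void deleteRowAtTimestamp(String rowKey, long deleteAt);
        }

        final class TimestampedColumn {
            final String name;
            final long timestamp;
            TimestampedColumn(String name, long timestamp) { this.name = name; this.timestamp = timestamp; }
        }

        final class RowCapper {
            // Keep roughly the newest maxColumns columns of a time-ordered row.
            static void capRow(WideRowClient client, String rowKey, int maxColumns) {
                List<TimestampedColumn> columns = client.readNewestFirst(rowKey, maxColumns + 1);
                if (columns.size() <= maxColumns)
                    return; // nothing past the cap yet
                TimestampedColumn firstExcess = columns.get(maxColumns);
                // Deleting at this timestamp drops the excess column and everything older.
                client.deleteRowAtTimestamp(rowKey, firstExcess.timestamp);
            }
        }

        As noted above, this would only have to run on a sample of inserts, which amortizes the extra read.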

        Sylvain Lebresne added a comment -

        I think if we make users think about columns, we have lost. It should really be defined in terms of cql3 rows per partition.

        I agree with that and I didn't mean to imply that. I was just talking of columns and rows as in "the internal storage" to explain my concern. And technically, talking of the number of internal columns inside internal rows or of cql3 rows inside partitions doesn't really change the problem or how to solve it (after all, for some layouts, cql3 rows correspond one-to-one to internal columns).

        Put another way, we shouldn't leak implementation details

        I fully agree, that was my point.

        We should have logic on the query path to "pretend that compaction is perfect."

        I agree we should. What logic exactly is another matter. Again, if we do ignore deletes, then I suppose we could fix the "slice" problem I've described above if we make it so that a read on a CF with a max_cql_rows setting always reads the row from the beginning (up until the data really queried). That way, we could say "that column is here but we should pretend it's not". But doing that would be pretty painful in practice, and would have a non-negligible performance cost.

        But if you throw in deletes, I'm honestly not sure it's possible to implement the "pretend that compaction is perfect" at all. The problem is, say your max_cql_rows is N, and you insert N+1 cql3 rows. And then you delete one of the first N columns. Now you are dependent on whether the N+1th row had already been deleted by compaction or not. The only 2 ways I can see to deal with that are either:

        1. not deleting any column internally (but do the pretending client side) in case some deletes come in. But that defeats the whole purpose of the ticket.
        2. apply the truncation at write time. I.e., you say that as soon as you insert the N+1th column in a row, whatever is past the Nth at the tail of that row disappears right away. But that means a full row read on every write, which is not an option either.
        Sylvain Lebresne added a comment -

        At this point I'm looking at implementing this as a compaction strategy

        Just wanted to note that when I say that this would "make a good externally contributed compaction strategy", I also mean that I'm fine generalizing a bit our current compaction strategy API to make that easier to do (because currently it's a bit of a pain).

        Rick Branson added a comment -

        Just wanted to note that when I say that this would "make a good externally contributed compaction strategy", I also mean that I'm fine generalizing a bit our current compaction strategy API to make that easier to do (because currently it's a bit of a pain).

        I definitely think making compaction more extensible is quite a useful goal, and could expand use cases for C*.

        It's probably going to be something very simple because our requirements are pretty loose. We don't need (or even want) deletes and data should get pushed out pretty quickly. I'm leaning towards simply putting an upper bound on column count in any given SSTable file produced by compaction. Given that LCS limits the sprawl of columns across SSTables, I figure the real world p99 is that we'll have 2x the limit. The target would be to have a mean bloat of 1.5x.

        Jonathan Ellis added a comment -

        The problem is, say your max_cql_rows is N, and you insert N+1 cql3 rows. And then you delete one of the first N columns. Now you are dependent on whether the N+1th row had already been deleted by compaction or not.

        I see what you mean now; you're right.

        Ahmet AKYOL added a comment -

        Rick Branson: thanks for the explanation. I also want to "get it for free", but what I tried to say is "as a user, I am OK with extra cql if it's necessary". I was thinking of something similar to a redis pipeline which starts by adding data with zadd and then limits the data with zremrangeByRank, as in your words "if the data is time-ordered ...".

        About the requirement of "reading the entire row", let's first revisit our use cases for these "limited row size" type tables. Why do we want them? Most probably we already have tables for our "big data" (that's why we use and love C*), but we need a special cache for "hot data"; that's why it's a blocker to move some storage from Redis to C*. So what about C*'s row cache? Unfortunately, not an option, because we may need data from many tables, or we need only the (most recent) portion of the wide rows, not all of them. So, once again, what we really want from this issue is indeed a "special cache table", and that's why "reading the entire row" is not a problem, because we want the entire row in memory when it's hot.

        Once more, just my two cents, no intention to interrupt your development process. Just see me as the business (user) side and remember the principle "business people and developers must work together daily throughout the project" from the agile manifesto.

        Rick Branson added a comment -

        Ahmet AKYOL: What I mean is that in order to DELETE only the tail, Cassandra will have to read the entire row. For instance, if your minimum retention requirement is ~500 columns, then in order to find any columns after the 500th, the following operations must be performed:

        • All of the columns are read from the SSTable files that contain columns for that row
        • These row fragments are "merged" (re-sorting by Comparator, tombstone removal, etc)
        • Tombstones must be inserted for each column "after" the 500th.
        • As time goes on and tombstones build up (before GC grace), this operation gets more and more expensive and compaction perf also suffers.

        What I mean by "free" is not actually the need to perform the DELETE operation, but that it doesn't add extra cost burden to support this feature.

        As far as use case, it varies quite a bit. There are many use cases I can imagine for persistent storage with a quota for each user that auto-evicts old data over time for a low cost. Even for "big data" scenarios, the cost of computing still goes up as the data size grows. For instance, a database used to store objects a user interacted with for performing collaborative filtering only needs a sample. In real world use cases, these types of algorithms really need a relatively bounded set of data, and user taste might change over time, so only taking into consideration the most recent 90 objects makes sense. TTL'ing this data also doesn't make sense, because there are a wide range of frequencies at which users might generate this data.

        Sylvain Lebresne: I spent a few hours digging thru the compaction source and it's going to be messy to do this, probably involving a lot of copy+paste, so I'm even more +1 on disaggregating that massive Runnable method in CompactionTask into something more pluggable / extensible.

        Fabien Rousseau added a comment -

        We also had this requirement and did a similar patch (which allowed deletes, but as Sylvain said, it is not correct and this should be forbidden).

        At the time I wrote this patch, I also tried to create a new compaction strategy but came to the same conclusion as Rick (a lot of copy+paste).

        Sylvain Lebresne added a comment -

        I note that if we're going to go the route of "we don't know how to do this correctly, so we'll make it easy for people to implement their own incorrect, but good enough for them, solution", then there may be a simpler solution than improving the compaction strategy API.

        Typically, we could take inspiration from Dave's idea of changing the indexer. That is, we could provide an "SSTableWriteFilter" interface for which users could provide custom implementations. That interface could look something like:

        public interface SSTableWriteFilter {
            public Column filter(ByteBuffer rowKey, Column column);
        }
        

        and the way it would work is that in SSTableWriter, each column would first go through this filter (and then be indexed/written). A simple filter to do the row-size capping would then just count columns for each rowKey and start returning tombstones once the limit is reached (we may even allow returning null from filter() to just mean "skip that column").
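
        For example, here is a minimal sketch of such a capping filter against the proposed interface (restated below for completeness; everything here is hypothetical, Column is just a stub standing in for whatever internal column type the real interface would carry, and this variant uses the "return null to skip" option rather than building tombstones):

        import java.nio.ByteBuffer;

        // Stub type for illustration only; not Cassandra's internal Column class.
        class Column {}

        interface SSTableWriteFilter {
            Column filter(ByteBuffer rowKey, Column column);
        }

        final class RowCappingFilter implements SSTableWriteFilter {
            private final int maxColumnsPerRow;
            private ByteBuffer currentRow;
            private int seenInCurrentRow;

            RowCappingFilter(int maxColumnsPerRow) {
                this.maxColumnsPerRow = maxColumnsPerRow;
            }

            @Override
            public Column filter(ByteBuffer rowKey, Column column) {
                // An sstable is written one row at a time, so a key change means a new row.
                if (!rowKey.equals(currentRow)) {
                    currentRow = rowKey;
                    seenInCurrentRow = 0;
                }
                seenInCurrentRow++;
                // Keep the first N columns of the row; null means "skip this column".
                return seenInCurrentRow <= maxColumnsPerRow ? column : null;
            }
        }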

        I'm suggesting that because:

        1. It's not clearly obvious to me how to generalize the compaction strategy API to make the row capping easy without leaking too much implementation detail.
        2. I suspect there could be other uses for such a filter. You could have a (custom) filter that just collects statistics (in fact, we may even be able to rewrite our current statistics collector to use this interface). Or say you want to remove all the TTLs (or extend them) from all your data for some reason (maybe your client code messed up and inserted data with a TTL that was too short). Then you could write a trivial filter, call upgradesstables, and voilà.

        Just a suggestion.

        Jonathan Ellis added a comment -

        How would you use this filter interface to implement something CQL-row-aware?

        Sylvain Lebresne added a comment -

        How would you use this filter interface to implement something CQL-row-aware?

        I'm not sure, honestly. I was suggesting it as a simpler alternative to modifying the compaction strategy API, which wouldn't be easily CQL-aware either, because both solutions involve dealing with the internal storage engine. Now, it would be possible to write something CQL-aware with the filter interface above (except that we might want to provide the CFMetaData object too), but that would have to be done manually (typically, using the ColumnCounter class, you could reasonably easily start "dropping" columns once you reach the nth CQL row).

        Jonathan Ellis added a comment -

        Wontfixing for now, although still open to a pluggable solution as above.

        Rick Branson added a comment -

        +1. After implementing this on top of stock Cassandra, I think pushing this down to storage is probably not all that advantageous. Would still like to see compaction more extensible, however.

        Jonathan Ellis added a comment -

        Fine in principle w/ making compaction more extensible. Just need someone w/ a compaction extension to come along and show what extension points he needs.


          People

          • Assignee:
            Unassigned
            Reporter:
            Jonathan Ellis
          • Votes:
            4
            Watchers:
            13

            Dates

            • Created:
              Updated:
              Resolved:
