Apache Cassandra / CASSANDRA-11721

Have a per operation truncate ddl "no snapshot" option

Details

    Description

      Right now, truncate always creates a snapshot. That is the right thing to do most of the time. 'auto_snapshot' exists as an option to disable that, but it is server-wide and requires a restart to change. There are data models, however, that require rotating through a handful of tables and periodically truncating them. Currently you either have to operate with no safety net (some actually do this) or manually clear those snapshots out periodically. Both are less than optimal.

      In HDFS, when you delete something it generally goes to the trash. If you don't want that safety net, you can do something like 'hdfs dfs -rm -r -skipTrash /jeremy/stuff' in one command.

      It would be nice to have something in the truncate ddl to skip the snapshot on a per operation basis. Perhaps 'TRUNCATE solarsystem.earth NO SNAPSHOT'.

      This might also be useful in those situations where you're just playing with data and you don't want something to take a snapshot in a development system. If that's the case, this would also be useful for the DROP operation, but that convenience is not the main reason for this option.

      Additional information for newcomers:

      This ticket is a bit more complex than normal LHF tickets but is still reasonably easy.

      The idea is to support disabling snapshots when performing a truncate, as follows:

      TRUNCATE x WITH OPTIONS = { 'snapshot' : false }
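
      To make the intended semantics of the option map concrete, here is a minimal, self-contained Java sketch. The class name TruncateOptionsSketch and its methods are hypothetical and do not use Cassandra's PropertyDefinitions API. It also assumes that snapshots stay enabled unless 'snapshot' is explicitly set to false, and that unknown option keys are rejected; the ticket does not spell out either behaviour.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the proposed TruncateAttributes; not Cassandra code. */
final class TruncateOptionsSketch {
    private static final String SNAPSHOT = "snapshot";
    // The only option key this sketch recognises.
    private static final Set<String> KEYWORDS = Collections.singleton(SNAPSHOT);

    private final Map<String, String> options;

    TruncateOptionsSketch(Map<String, String> options) {
        // Defensive copy so later changes to the caller's map are not visible here.
        this.options = new HashMap<>(options);
    }

    /** Rejects unknown keys and values other than true/false. */
    void validate() {
        for (String key : options.keySet()) {
            if (!KEYWORDS.contains(key))
                throw new IllegalArgumentException("Unknown TRUNCATE option: " + key);
        }
        String raw = options.get(SNAPSHOT);
        if (raw != null && !raw.equalsIgnoreCase("true") && !raw.equalsIgnoreCase("false"))
            throw new IllegalArgumentException("Invalid value for 'snapshot': " + raw);
    }

    /** Assumption: a snapshot is taken unless the option explicitly says false. */
    boolean takeSnapshot() {
        String raw = options.get(SNAPSHOT);
        return raw == null || Boolean.parseBoolean(raw);
    }
}
```

      With an empty option map, takeSnapshot() returns true; with { 'snapshot' : 'false' } it returns false; a misspelled key such as 'snapshots' fails validation instead of being silently ignored.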

      In order to implement that feature several changes are required:

      • A new class TruncateAttributes inheriting from PropertyDefinitions must be created in a similar way to KeyspaceAttributes or TableAttributes
      • This class should be passed to the TruncateStatement constructor and stored as a field
      • The ANTLR parser logic should be changed to retrieve the options and pass them to the constructor (see createKeyspaceStatement for an example)
      • The TruncateStatement will then need to be modified to take into account the new option. Locally it will need to call ColumnFamilyStore#truncateBlockingWithoutSnapshot if no snapshot should be done instead of ColumnFamilyStore#truncateBlocking. For non-local calls it will need to pass a new parameter to StorageProxy#truncateBlocking. That parameter will then need to be passed to the other nodes through the TruncateRequest.
      • As a new field needs to be added to TruncateRequest, this field will need to be serialized and deserialized, and a new MessagingService.Version will need to be created and set as the current version. The new version should be 50 (and yes, it means that the next release will be a major one, 5.0)
      • In TruncateVerbHandler the new field should be used to determine if ColumnFamilyStore#truncateBlockingWithoutSnapshot or ColumnFamilyStore#truncateBlocking should be called.
      • An in-jvm test should be added in test/distributed/org/apache/cassandra/distributed/test to test that truncate does not generate snapshots when the new option is specified.
        Do not hesitate to ping the mentor for more information.
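
      The serialization and handler steps above can be sketched as follows. This is a simplified, self-contained illustration, not Cassandra's real TruncateRequest or its serializer interface: the class name TruncateRequestSketch, the OLD_VERSION value, and the storeMethodToCall helper are all hypothetical. It assumes the new flag is written only when the messaging version is at least 50, and that a request read at an older version falls back to taking a snapshot.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical, simplified stand-in for TruncateRequest; not Cassandra's real class. */
final class TruncateRequestSketch {
    static final int OLD_VERSION = 12; // illustrative placeholder for any pre-change version
    static final int VERSION_50 = 50;  // the new version number proposed in the ticket

    final String keyspace;
    final String table;
    final boolean snapshot;

    TruncateRequestSketch(String keyspace, String table, boolean snapshot) {
        this.keyspace = keyspace;
        this.table = table;
        this.snapshot = snapshot;
    }

    /** Writes the new field only for peers that understand it. */
    void serialize(DataOutputStream out, int version) throws IOException {
        out.writeUTF(keyspace);
        out.writeUTF(table);
        if (version >= VERSION_50)
            out.writeBoolean(snapshot);
    }

    /** Assumption: at older versions the flag is absent, so default to taking a snapshot. */
    static TruncateRequestSketch deserialize(DataInputStream in, int version) throws IOException {
        String keyspace = in.readUTF();
        String table = in.readUTF();
        boolean snapshot = version < VERSION_50 || in.readBoolean();
        return new TruncateRequestSketch(keyspace, table, snapshot);
    }

    /** Mirrors the handler decision: names the ColumnFamilyStore method that would be invoked. */
    String storeMethodToCall() {
        return snapshot ? "truncateBlocking" : "truncateBlockingWithoutSnapshot";
    }

    /** Serializes and deserializes in memory at the given version. */
    TruncateRequestSketch roundTrip(int version) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            serialize(new DataOutputStream(bytes), version);
            return deserialize(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), version);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      Note the consequence for mixed-version clusters in this sketch: a request with snapshot = false sent at an older version arrives with snapshot = true, because the flag cannot be expressed on the wire. Whether the coordinator should instead reject the statement in that situation is a design decision this sketch leaves open.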

          Activity

            slebresne Sylvain Lebresne added a comment -

            Not necessarily saying I prefer it, but just mentioning that if I understand the motivating use case correctly, an alternative could be to have a table option to override the yaml one.
            In any case, it's worth mentioning that:

            1. adding the option to TRUNCATE would require a change to the internal truncate VERB, which means a change to the intra-node protocol, which kind of means 4.0
            2. adding it as a table option requires adding it to the schema table and that's currently also problematic on minors (CASSANDRA-11382)

            tl;dr, that's a reasonable option to add but it might not happen right away.
            jeromatron Jeremy Hanna added a comment -

            It's fine if it's not right away and it's understandable that at those levels, it takes a major version to make the change. People have been living with the limited options for this long. If we could do the NO SNAPSHOT syntax for everything with a snapshot to be consistent in the DDL and do that as a per-cf setting (auto_snapshot), I think both would be nice options. If only one option is considered, then the per-operation one would be preferable because it gives functionality that the per-cf one does not.

            What do you think rssvihla weideng?

            rssvihla Ryan Svihla added a comment -

            I've thought about it a bit:

            1. NO SNAPSHOT is probably the most pure and clean and satisfies even the most pedantic user who wants their temporary data backed up in C* when a drop or typical truncate is called, but comes at the cost of changing truncate and having driver dependencies.
            2. table based is easy to implement and satisfies a lot of people even if a couple of people will be sad. They probably can just log their data in another table before they truncate if they're that determined to have it backed up.

            weideng Wei Deng added a comment -

            Option 1 (DDL NO SNAPSHOT) looks good to me and will cause the least amount of confusion to developers and operators.

            slebresne Sylvain Lebresne added a comment - - edited

            As said above, it's probably not gonna happen too soon, but for the record, if we do go with a DDL syntax, my preference would be to add some WITH OPTIONS rather than some specific NO SNAPSHOT. So something like:

            TRUNCATE x WITH OPTIONS = { 'snapshot' : false }
            

            so that it's somewhat more consistent with other statements and can be easily extended to other options without requiring new syntax every time.

            smiklosovic Stefan Miklosovic added a comment - - edited

            This was effectively achieved in CASSANDRA-10383

            Is this ticket still relevant?

            The difference between 10383 and this one is that there is no option in CQL for TRUNCATE; it is driven by a table property. I think that is the better approach because, in practice, I just can't see a scenario where I do not want to take a snapshot on truncated tables right now but am otherwise happy to have them. What is the reasoning behind that? I think it is either yes, we want them, or no, we do not. Some tables are not interesting enough to keep holding their data forever after truncation. That decision is mostly made upon table creation. If somebody wants to have snapshots anyway, they can just enable that flag again.

            With TRUNCATE + option for CQL, if somebody does not want these snapshots taken upon truncation at all, they would be forced to explicitly mention it every single time.

            jeromatron Jeremy Hanna added a comment - - edited

            I think CASSANDRA-10383 solves the production use cases for this and I'm very happy that it got implemented there.  There are cases in test and dev environments where I could still see a per operation setting being useful, but the majority of the use cases are covered by a table level setting.  I'm happy to "won't do" this one as updating CQL is a pain for just those use cases.

            jeromatron Jeremy Hanna added a comment -

            As discussed previously,  CASSANDRA-10383 solves the majority of what this covers.


            People

              Unassigned Unassigned
              jeromatron Jeremy Hanna
              Votes: 1
              Watchers: 7
