Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Fix Version/s: None
    • Component/s: Core
    • Labels:

      Description

      Compaction merges older sstables into newer versions of the data.

      When snapshotting sstables (especially incrementally), it would be very useful to know which older sstables are no longer needed because their data is now represented in a newer sstable.

      This patch should add the list of sstables that made up each new sstable and store this info in the -Statistics file.
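      As a rough illustration of how such metadata could be used, the sketch below assumes each backed-up sstable exposes the names of the sstables it was compacted from. The `ancestors` mapping, the sstable names, and both function names are hypothetical and are not part of Cassandra.

```python
from typing import Dict, Set


def superseded_sstables(ancestors: Dict[str, Set[str]]) -> Set[str]:
    """Return backed-up sstables that another backed-up sstable lists as an ancestor.

    `ancestors` maps each sstable present in the backup to the set of
    sstables it was compacted from (empty for a freshly flushed sstable).
    """
    present = set(ancestors)
    obsolete: Set[str] = set()
    for parents in ancestors.values():
        # Only count ancestors that are actually in the backup directory.
        obsolete |= parents & present
    return obsolete


def live_sstables(ancestors: Dict[str, Set[str]]) -> Set[str]:
    """Return the sstables that are not represented in any newer sstable."""
    return set(ancestors) - superseded_sstables(ancestors)


if __name__ == "__main__":
    backup = {
        "cf-1": set(),
        "cf-2": set(),
        "cf-3": {"cf-1", "cf-2"},   # compaction of 1 and 2
        "cf-4": set(),
        "cf-5": {"cf-3", "cf-4"},   # compaction of 3 and 4
        "cf-6": set(),              # flushed after the last compaction
    }
    print(sorted(live_sstables(backup)))
```

      Note that this only identifies the smallest set for restoring the most recent state; an older sstable may still be needed to restore to an earlier point in time.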

        Activity

        Jonathan Ellis added a comment -

        "You are saying I should cron snapshot the cluster then keep the incremental between"

        Yes. This is standard backup procedure. Doing periodic full snapshots both gives you an upper bound on how many incrementals you have to apply (which, as you point out, can certainly contain information that is obsoleted later) and gives you extra redundancy in case of corruption of one of the incrementals (which otherwise becomes increasingly likely as time goes by).

        "I think with the feature I'm suggesting this wouldn't be necessary and IMO be less data to backup in the end"

        I don't think it's worth it. It would only be useful for the "restore to most recent possible time" and nothing earlier, because otherwise you have the "data mixed in from newer sstables in the compacted version" problem.

        Additionally, at least one person has implemented map/reduce against snapshots, which is another point in favor of a "periodic full + incrementals" approach. (I'll go bug him again about contributing a patch, now that I remember it...)
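        The "periodic full + incrementals" procedure can be sketched as a selection problem: pick the newest full snapshot at or before the target time, then apply every incremental taken after it up to that time. The `Backup` record, its fields, and `restore_plan` are assumptions made for illustration and do not correspond to any Cassandra tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Backup:
    name: str
    taken_at: int   # seconds since the epoch (hypothetical field)
    full: bool      # True for a full snapshot, False for an incremental


def restore_plan(backups: List[Backup], target: int) -> Tuple[Backup, List[Backup]]:
    """Return (base snapshot, incrementals to apply) for a restore to `target`."""
    fulls = [b for b in backups if b.full and b.taken_at <= target]
    if not fulls:
        raise ValueError("no full snapshot at or before the target time")
    base = max(fulls, key=lambda b: b.taken_at)
    # Only incrementals strictly newer than the base and no newer than the target.
    incrementals = sorted(
        (b for b in backups
         if not b.full and base.taken_at < b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    return base, incrementals


if __name__ == "__main__":
    history = [
        Backup("full-a", 100, True),
        Backup("inc-1", 110, False),
        Backup("inc-2", 120, False),
        Backup("full-b", 200, True),
        Backup("inc-3", 210, False),
        Backup("inc-4", 220, False),
    ]
    base, incs = restore_plan(history, target=215)
    print(base.name, [b.name for b in incs])
```

        The full snapshot bounds how many incrementals have to be applied, which is the upper bound mentioned above.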

        T Jake Luciani added a comment -

        "the snapshot files plus incrementals from after the last full snapshot (up to point-in-time, if desired) give you exactly what you want, no more, no less."

        Maybe I'm thinking about this wrong, but if I were going to back up data in Cassandra I would never run nodetool snapshot. I would only enable incremental backups, copy the sstables to remote storage, and remove what has been backed up.
        I could then get to any point in time.

        You are saying I should snapshot the cluster from cron and then keep the incrementals in between. I think that with the feature I'm suggesting this wouldn't be necessary and, IMO, there would be less data to back up in the end.

        Jonathan Ellis added a comment -

        "If you want to restore from a backup in this scenario you need to load all the sstables then compact"

        I'm still confused: the snapshot files plus incrementals from after the last full snapshot (up to point-in-time, if desired) give you exactly what you want, no more, no less. None of the incrementals can be compacted into sstables in the snapshot because by construction we've said the snapshot is older. (And if we have a newer snapshot... use that one instead.)

        If you're trying to do a "partial" snapshot restore (i.e. not removing all the existing sstable files first) that won't work in the general case because you're unlikely to end up with sstables containing exactly the set of incremental sstables you want with no other data mixed in.

        T Jake Luciani added a comment -

        That's not what I'm saying.

        When "incremental_backups: true" is set, sstables are hard linked, and you end up with a directory full of sstables, including ones that have been compacted into newer versions of the data.

        If you want to restore from a backup in this scenario, you need to load all the sstables and then compact.
        If each sstable recorded which sstables were used to create it, you could programmatically work out which sstables are needed to get a complete, optimal snapshot.

        It would also be handy to track this information anyway: if an sstable is corrupted, you could inspect the metadata and get the list of sstables to retrieve from backup to fix just the corrupt file.
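        A minimal sketch of the corruption case, assuming the ancestor lists are available as a plain mapping: walk up from the corrupt sstable until every branch reaches a file that exists in the backup. The names `sources_for`, `ancestors`, and `in_backup` are hypothetical, and the sketch says nothing about how the retrieved files would then be recompacted.

```python
from typing import Dict, Set


def sources_for(corrupt: str,
                ancestors: Dict[str, Set[str]],
                in_backup: Set[str]) -> Set[str]:
    """Return the backed-up sstables needed to rebuild `corrupt`.

    If the corrupt sstable itself is in the backup, it is the only file
    needed. Otherwise its recorded ancestors are followed transitively.
    """
    needed: Set[str] = set()
    seen: Set[str] = set()
    stack = [corrupt]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        if name in in_backup:
            needed.add(name)
            continue
        parents = ancestors.get(name)
        if not parents:
            # Not backed up and no recorded ancestors: cannot be rebuilt.
            raise LookupError(f"no backup or ancestor list for {name}")
        stack.extend(parents)
    return needed


if __name__ == "__main__":
    ancestors = {"cf-5": {"cf-3", "cf-4"}, "cf-3": {"cf-1", "cf-2"}}
    in_backup = {"cf-1", "cf-2", "cf-4"}
    print(sorted(sources_for("cf-5", ancestors, in_backup)))
```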

        Jonathan Ellis added a comment -

        What is an "incremental snapshot?" If you're trying to make things more complicated by not linking already-linked sstables in new snapshots, don't. Hard links are close enough to free as not to matter.

        T Jake Luciani added a comment -

        But you will for incremental snapshots. How do you know which versions of the sstables to load? Right now you must load all previous versions.

        Jonathan Ellis added a comment -

        I'm not sure where you're going with this. The old -> new replace in DataTracker is done atomically; you will never have both old and new sstables present in the same View.
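        The idea of an atomic old -> new replace can be illustrated with an immutable view that is swapped in a single step. This is a simplified sketch of the general technique, not Cassandra's DataTracker code; the `Tracker` class and its methods are hypothetical.

```python
import threading
from typing import FrozenSet, Iterable


class Tracker:
    """Holds an immutable set of live sstables and replaces it in one step."""

    def __init__(self, sstables: Iterable[str] = ()):
        self._lock = threading.Lock()
        self._view: FrozenSet[str] = frozenset(sstables)

    def view(self) -> FrozenSet[str]:
        # Readers get an immutable snapshot that later replaces cannot change.
        return self._view

    def replace(self, old: Iterable[str], new: Iterable[str]) -> None:
        old_set, new_set = frozenset(old), frozenset(new)
        with self._lock:
            if not old_set <= self._view:
                raise ValueError("cannot replace sstables that are not in the view")
            # Build the new view fully, then publish it with one assignment,
            # so no reader observes both the old and the new sstables.
            self._view = (self._view - old_set) | new_set


if __name__ == "__main__":
    tracker = Tracker(["cf-1", "cf-2", "cf-3"])
    before = tracker.view()
    tracker.replace(old=["cf-1", "cf-2"], new=["cf-4"])
    print(sorted(before), sorted(tracker.view()))
```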


          People

          • Assignee:
            Unassigned
            Reporter:
            T Jake Luciani