Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: Core
    • Labels:

      Description

      Under size-tiered compaction, you can generate large sstables that compact infrequently. With expiring columns mixed in, we could waste a lot of space in this situation.

      If we kept a TTL EstimatedHistogram in the sstable metadata, we could do a single-sstable compaction aginst sstables with over 20% expired data.

      1. cassandra-1.1-3442.txt
        35 kB
        Yuki Morishita
      2. 3442-v3.txt
        42 kB
        Jonathan Ellis
      3. 3442-track-tombstones.txt
        51 kB
        Yuki Morishita

        Activity

        Hide
        Jonathan Ellis added a comment -

        Sylvain, what do you think of this idea? (If it has merit, we can assign to Yuki for implementation.)

        Show
        Jonathan Ellis added a comment - Sylvain, what do you think of this idea? (If it has merit, we can assign to Yuki for implementation.)
        Hide
        Sylvain Lebresne added a comment -

        I think the idea has merits.
        We have to keep in mind that unless the expired data is also gcable, this would only transform expiring columns to tombstones, so for those that have very small columns (~4 bytes) this would be useless. I would also maybe bump that 20% up to be sure we don't rewrite too ofen sstables that are good candidate for compaction anyway. But on principle this is a good idea I believe.

        Show
        Sylvain Lebresne added a comment - I think the idea has merits. We have to keep in mind that unless the expired data is also gcable, this would only transform expiring columns to tombstones, so for those that have very small columns (~4 bytes) this would be useless. I would also maybe bump that 20% up to be sure we don't rewrite too ofen sstables that are good candidate for compaction anyway. But on principle this is a good idea I believe.
        Hide
        Jonathan Ellis added a comment -

        We can get average column size from row size / column count, so we could make it "If column size > N bytes or sstable is older than gc_grace_period."

        Show
        Jonathan Ellis added a comment - We can get average column size from row size / column count, so we could make it "If column size > N bytes or sstable is older than gc_grace_period."
        Hide
        Yuki Morishita added a comment -

        Patch is against 1.0 branch. Added TTL histogram and logic to determine whether to perform compaction over sstables that have expiring columns more than threshold.
        Threshold is default to 20% (0.2) but you can specify via compaction_strategy_options.

        Show
        Yuki Morishita added a comment - Patch is against 1.0 branch. Added TTL histogram and logic to determine whether to perform compaction over sstables that have expiring columns more than threshold. Threshold is default to 20% (0.2) but you can specify via compaction_strategy_options.
        Hide
        Jonathan Ellis added a comment -

        Should we do single-sstable compactions after the bucket compactions? Doing them first means we might compact them twice, when the bucket-based compaction would have been adequate.

        It looks like this will never stop recompacting sstables with high expiring column counts, until they finally expire and are expunged. I think we need to address this somehow, possibly by waiting until some fraction of gc_grace_seconds has elapsed since sstable creation (which we can just get from mtime).

        If we can reasonably test this in CompactionsTest I'd like to add that.

        Show
        Jonathan Ellis added a comment - Should we do single-sstable compactions after the bucket compactions? Doing them first means we might compact them twice, when the bucket-based compaction would have been adequate. It looks like this will never stop recompacting sstables with high expiring column counts, until they finally expire and are expunged. I think we need to address this somehow, possibly by waiting until some fraction of gc_grace_seconds has elapsed since sstable creation (which we can just get from mtime). If we can reasonably test this in CompactionsTest I'd like to add that.
        Hide
        Yuki Morishita added a comment -

        I modified the patch to perform single sstable compaction when nothing else can be compacted.

        I added new single_compaction_interval option which defaults to 24 hours in order to prevent recursive compaction on the same sstable.

        I also added unit test to CompactionsTest.

        Show
        Yuki Morishita added a comment - I modified the patch to perform single sstable compaction when nothing else can be compacted. I added new single_compaction_interval option which defaults to 24 hours in order to prevent recursive compaction on the same sstable. I also added unit test to CompactionsTest.
        Hide
        Yuki Morishita added a comment -

        Patch also can apply to trunk at this moment.

        Show
        Yuki Morishita added a comment - Patch also can apply to trunk at this moment.
        Hide
        Jonathan Ellis added a comment -

        I'm going to wait to apply this because Sylvain asked us to use trunk for 1.1.1 for now, and IMO this should go in 1.2.

        Show
        Jonathan Ellis added a comment - I'm going to wait to apply this because Sylvain asked us to use trunk for 1.1.1 for now, and IMO this should go in 1.2.
        Hide
        Sylvain Lebresne added a comment -

        Agreed on targeting for 1.2. There is now a branch for it though if that's ready (the patches probably need rebase though).

        Show
        Sylvain Lebresne added a comment - Agreed on targeting for 1.2. There is now a branch for it though if that's ready (the patches probably need rebase though).
        Hide
        Jonathan Ellis added a comment -

        Yes, I'm working on an updated patch.

        Show
        Jonathan Ellis added a comment - Yes, I'm working on an updated patch.
        Hide
        Jonathan Ellis added a comment -

        v3 attached. It's a bit of two steps forward, one step back:

        • renames RowStats -> ColumnStats; separates row size computation from column/tombstone counts, moves ColumnStats computation out of serializer and into AbstractCompactedRow
        • I switched from checking instanceof ExipiringColumn, to instanceof DeletedColumn, since an ExpiringColumn just means it will expire eventually (at which point it turns into a DeletedColumn), whereas a DeletedColumn is a tombstone that will be eligible for dropping after gc_grace_seconds. A common use case for TTL is to expire all data in a row after N days; if we're just going by "this column has a TTL" we'll compact these sstables daily even if none of the data has actually expired yet. Switching to checking for DC instead mitigates this a little.

        However, the more I think about it, the more I think what we really want to track is a histogram of when tombstones are eligible to be dropped, relative to the sstable creation time. So, if I had a column that expired after 30 days, and a gc_grace_seconds of 10 days, I'd add an entry for 40 days to the histogram. If I had a new manual delete, I'd add an entry for 10 days.

        This would allow us to have a good estimate of how much of the sstable could actually be cleaned out by compaction, and we could drop the single_compaction_interval code entirely.

        What do you think?

        Minor note: the new test seems fairly involved – what would we lose by just testing compaction of a single sstable w/ tombstones?

        Show
        Jonathan Ellis added a comment - v3 attached. It's a bit of two steps forward, one step back: renames RowStats -> ColumnStats; separates row size computation from column/tombstone counts, moves ColumnStats computation out of serializer and into AbstractCompactedRow I switched from checking instanceof ExipiringColumn, to instanceof DeletedColumn, since an ExpiringColumn just means it will expire eventually (at which point it turns into a DeletedColumn), whereas a DeletedColumn is a tombstone that will be eligible for dropping after gc_grace_seconds. A common use case for TTL is to expire all data in a row after N days; if we're just going by "this column has a TTL" we'll compact these sstables daily even if none of the data has actually expired yet. Switching to checking for DC instead mitigates this a little. However, the more I think about it, the more I think what we really want to track is a histogram of when tombstones are eligible to be dropped , relative to the sstable creation time. So, if I had a column that expired after 30 days, and a gc_grace_seconds of 10 days, I'd add an entry for 40 days to the histogram. If I had a new manual delete, I'd add an entry for 10 days. This would allow us to have a good estimate of how much of the sstable could actually be cleaned out by compaction , and we could drop the single_compaction_interval code entirely. What do you think? Minor note: the new test seems fairly involved – what would we lose by just testing compaction of a single sstable w/ tombstones?
        Hide
        Yuki Morishita added a comment -

        I switched from checking instanceof ExipiringColumn, to instanceof DeletedColumn, ...

        I don't see this switch in your patch, but as you described, tracking tombstones makes more sense.

        I think it's doable to create histogram of dropping time when writing sstable and use that for single sstable compaction.
        So we can trigger compaction on single sstable which contains say 20% of ExpiringColumns and DeletedColumns and 50% of them can be dropped.

        Minor note: the new test seems fairly involved – what would we lose by just testing compaction of a single sstable w/ tombstones?

        Well, nothing
        single sstable compaction test is fine.

        Show
        Yuki Morishita added a comment - I switched from checking instanceof ExipiringColumn, to instanceof DeletedColumn, ... I don't see this switch in your patch, but as you described, tracking tombstones makes more sense. I think it's doable to create histogram of dropping time when writing sstable and use that for single sstable compaction. So we can trigger compaction on single sstable which contains say 20% of ExpiringColumns and DeletedColumns and 50% of them can be dropped. Minor note: the new test seems fairly involved – what would we lose by just testing compaction of a single sstable w/ tombstones? Well, nothing single sstable compaction test is fine.
        Hide
        Yuki Morishita added a comment -

        Patch attached to track tombstones by creating its drop time histogram.
        Size tiered compaction strategy uses this to calculate fraction of droppable tombstones at compaction and perform single sstable compaction if the fraction exceeds threshold.

        Note that original patch overcounted ExpiringColumn inside SuperColumn. Overall column count is done at SuperColumn level, so tombstone count should be done at the same level. Newer patch counts tombstones by simply checking its local deletion time < Integer.MAX_VALUE.

        I also rewrite unit test to simply create one sstable with tombstones and let it get compacted.

        Show
        Yuki Morishita added a comment - Patch attached to track tombstones by creating its drop time histogram. Size tiered compaction strategy uses this to calculate fraction of droppable tombstones at compaction and perform single sstable compaction if the fraction exceeds threshold. Note that original patch overcounted ExpiringColumn inside SuperColumn. Overall column count is done at SuperColumn level, so tombstone count should be done at the same level. Newer patch counts tombstones by simply checking its local deletion time < Integer.MAX_VALUE. I also rewrite unit test to simply create one sstable with tombstones and let it get compacted.
        Hide
        Jonathan Ellis added a comment -

        Nice work. Committed!

        Show
        Jonathan Ellis added a comment - Nice work. Committed!
        Hide
        B. Todd Burruss added a comment -

        Will this patch force clean up any tombstoned data, or only because of columns that have expired because of their TTL?

        Show
        B. Todd Burruss added a comment - Will this patch force clean up any tombstoned data, or only because of columns that have expired because of their TTL?
        Hide
        Yuki Morishita added a comment -

        the former.

        Show
        Yuki Morishita added a comment - the former.

          People

          • Assignee:
            Yuki Morishita
            Reporter:
            Jonathan Ellis
            Reviewer:
            Jonathan Ellis
          • Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development