ACCUMULO-1802

Create a compaction strategy for aging off data

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: tserver
    • Labels: None

      Description

      The default compaction strategy has a tendency to put the oldest data in the largest files. This leads to a lot of work when it is time to age off data.

      One could imagine a compaction strategy that would split data into separate files based on timestamp. Additionally, if the min/max timestamps for each file were known, old data could be aged off by deleting whole files.

      To accomplish this, we will need to augment the configurable compaction strategy to support multiple output files, and to save and use extra metadata in each file.
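
      For illustration, a minimal self-contained sketch of the idea follows. The FileMeta record, outputBucket, and filesToDelete helpers are hypothetical and not Accumulo APIs; they only show the two halves of the proposal: route entries to output files by time bucket during compaction, record each file's min/max timestamps, and then age off by deleting any file whose newest entry predates the retention cutoff.

      {code:java}
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.TimeUnit;

      // Hypothetical sketch only: these types and helpers are not part of Accumulo.
      public class AgeOffByFileSketch {

        // Assumed per-file metadata recorded when the file is written.
        record FileMeta(String path, long minTs, long maxTs) {}

        // Assumed routing policy: during compaction, send each entry to an output
        // file chosen by its timestamp bucket, so files cover disjoint time ranges.
        static int outputBucket(long ts, long oldestTs, long bucketMillis) {
          return (int) ((ts - oldestTs) / bucketMillis);
        }

        // Files whose newest entry is older than the cutoff can be deleted outright,
        // with no rewrite of live data.
        static List<FileMeta> filesToDelete(List<FileMeta> files, long cutoffTs) {
          List<FileMeta> doomed = new ArrayList<>();
          for (FileMeta f : files) {
            if (f.maxTs() < cutoffTs) {
              doomed.add(f);
            }
          }
          return doomed;
        }

        public static void main(String[] args) {
          long day = TimeUnit.DAYS.toMillis(1);
          long now = System.currentTimeMillis();
          long cutoff = now - 30 * day; // e.g. a 30-day retention window

          // With one-day buckets, an entry written today and one written 45 days ago
          // would be routed to different output files (buckets 90 and 45 here).
          long oldest = now - 90 * day;
          System.out.println(outputBucket(now, oldest, day) + " vs " + outputBucket(now - 45 * day, oldest, day));

          // F0001.rf lies entirely before the cutoff and can be removed as a whole file;
          // F0002.rf still holds live data and must be kept (or filtered in a compaction).
          List<FileMeta> files = List.of(
              new FileMeta("F0001.rf", now - 90 * day, now - 60 * day),
              new FileMeta("F0002.rf", now - 40 * day, now - 10 * day));
          filesToDelete(files, cutoff).forEach(f -> System.out.println("delete " + f.path()));
        }
      }
      {code}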

        Activity

        Josh Elser added a comment -

        Can reopen when someone wants to do this.

        Sean Busbey added a comment -

        Adam Fuchs, could you break those into separate tickets?

        David Medinets added a comment -

        Reduced priority because this ticket does not directly impact functionality. Also the ticket suffers from scope creep.

        Adam Fuchs added a comment -

        I have seen several use cases lately that lead me to agree that we should consider other compaction strategies. Some of the factors you might want a compaction strategy to optimize for are:
        1. Number of blocks read concurrently for a single query
        2. Number of times a key/value pair is written to disk
        3. Total number of files stored in HDFS
        4. Efficiency of deleting data

        Some of the additional use cases I've seen that would lead to different optimal compaction algorithms are:
        1. Time-series data and log data that is stored in roughly temporal order. In these cases, once a record is written its "neighborhood" (things that sort close by) is not updated. We can't help factor 1 by compacting frequently, since the ranges of files generated by minor compaction are mostly distinct.
        2. Use of one locality group at a time. This could be done to add features to existing rows as the result of an ML process or something like it. With our current strategy, we are compacting files together that have completely distinct locality groups. This doesn't help with factors 1 and 4, and hurts factor 2.
        3. Inverted indexing and graph storage with an expiration date or age-off. I think this is part of the use case Eric refers to. In this case, data is written in essentially random order, but is deleted in temporal order. We could get tricky and optimize factor 4 at some cost to factors 1, 2, and 3.
        4. Document-partitioned indexing with really big tablets. In this case, we end up relying more on the log-structured merge tree to sort data than the bucket sorting that comes with organic tablet splits. Non-uniform updates across the tablet space could be optimized by having multiple files output by the big major compactions, such that the files' ranges are non-overlapping. Basically, when we do a major compaction to include lots of small files in a narrower range than the whole tablet we don't want to have to rewrite the data from the entire tablet. This potential optimization is augmented by frequent updates, deletions, and aggregation in a sub-range of a tablet.
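
        A minimal sketch of the non-overlapping-output idea in use case 4 above, assuming the compaction can be given fixed split points (none of these names come from Accumulo): the merged, sorted stream of a big major compaction is routed into one output file per key-range bucket, so a later compaction over a narrow range only has to rewrite the files covering that range.

        {code:java}
        import java.util.Collections;
        import java.util.List;

        // Hypothetical sketch only, not the Accumulo compaction API.
        public class RangePartitionedOutput {

          // Choose the output file ("bucket") for a row, treating split points as
          // inclusive end rows: bucket 0 is (-inf, "g"], bucket 1 is ("g", "p"], etc.
          static int bucketFor(String row, List<String> sortedSplits) {
            int idx = Collections.binarySearch(sortedSplits, row);
            // binarySearch returns -(insertionPoint) - 1 when the row is not an exact split
            return idx >= 0 ? idx : -(idx + 1);
          }

          public static void main(String[] args) {
            List<String> splits = List.of("g", "p"); // three non-overlapping output files
            for (String row : List.of("apple", "grape", "melon", "zebra")) {
              System.out.println(row + " -> output file " + bucketFor(row, splits));
            }
          }
        }
        {code}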

        Billie Rinaldi added a comment -

        One could also imagine matching a reading strategy with a compaction strategy, to allow skipping of entire files based on file metadata when reading.
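
        For illustration, a minimal sketch of that read-side check, reusing the assumed per-file min/max timestamp metadata from the description (these names are hypothetical, not Accumulo APIs): a scan restricted to a time range can skip any file whose timestamp interval does not intersect the query's interval.

        {code:java}
        // Hypothetical sketch only, not part of the Accumulo read path.
        public class TimeRangeFileFilter {

          // True if the file's [fileMinTs, fileMaxTs] interval overlaps the query's
          // [queryMinTs, queryMaxTs] interval; false means the whole file can be
          // skipped without opening it.
          static boolean mightContain(long fileMinTs, long fileMaxTs,
                                      long queryMinTs, long queryMaxTs) {
            return fileMaxTs >= queryMinTs && fileMinTs <= queryMaxTs;
          }

          public static void main(String[] args) {
            System.out.println(mightContain(100, 200, 300, 400)); // false -> skip the file
            System.out.println(mightContain(100, 200, 150, 400)); // true  -> must read the file
          }
        }
        {code}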


          People

          • Assignee: Unassigned
          • Reporter: Eric Newton
          • Votes: 0
          • Watchers: 5
