HBase / HBASE-6093

Flatten timestamps during flush and compaction

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io, Performance, regionserver
    • Labels:
      None

      Description

      Many applications run with maxVersions=1 and do not care about timestamps, or they store one timestamp per row as a normal KeyValue rather than relying on per-cell timestamps.

      DataBlockEncoders like those in HBASE-4218 and HBASE-4676 often encode each timestamp as a diff from the previous timestamp or from the minimum timestamp in the block. If all timestamps in a block are the same, they compress to roughly 8 bytes or fewer in total per block. This can save 10% to 25% of space for some schemas, and the savings apply both on disk and in the block cache.
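
To make the space argument concrete, here is a minimal sketch of a diff-from-minimum layout. It is not the actual format used by the HBASE-4218 or HBASE-4676 encoders; the class name, the 8-byte minimum plus 1-byte width header, and the fixed-width diffs are all assumptions made for illustration. It also assumes non-negative timestamps.

```java
final class TimestampDiffSketch {
    /** Bytes needed to hold the unsigned value v (0 bytes when v == 0). */
    static int byteWidth(long v) {
        return (64 - Long.numberOfLeadingZeros(v) + 7) / 8;
    }

    /**
     * Size in bytes of one block's timestamps under a hypothetical layout:
     * 8 bytes for the block minimum, 1 byte for the diff width,
     * then one fixed-width diff per cell.
     */
    static long encodedSize(long[] timestamps) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long t : timestamps) {
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        int width = byteWidth(max - min); // 0 when every timestamp is identical
        return 8 + 1 + (long) width * timestamps.length;
    }

    public static void main(String[] args) {
        long[] flattened = {1000L, 1000L, 1000L, 1000L};
        long[] mixed = {1000L, 1001L, 1300L, 1000L};
        System.out.println("flattened: " + encodedSize(flattened) + " bytes");
        System.out.println("mixed:     " + encodedSize(mixed) + " bytes");
    }
}
```

Under this assumed layout the flattened block costs only the constant header no matter how many cells it holds, while the mixed block pays a per-cell cost that grows with the spread of the timestamps.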

      We could add a ColumnFamily setting flattenTimestamps=[true/false]. If true, all timestamps are rewritten during a flush/compaction to the value of currentTimeMillis() taken at the start of the flush/compaction. Once all timestamps in a file are identical, the encoder can eliminate them.

      The simplest use case is probably the one where all inserts are type=Put, there are no overwrites, and there are no deletes. As use cases get more complex, so does the implementation.
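
For that simplest case (Puts only, no overwrites, no deletes), the flush-time rewrite might look roughly like the sketch below. The Cell class is a hypothetical stand-in for a KeyValue, and the flush method and its flattenTimestamps parameter only model the proposed ColumnFamily setting; none of this is existing HBase API.

```java
import java.util.ArrayList;
import java.util.List;

final class FlattenOnFlushSketch {
    /** Hypothetical stand-in for a KeyValue; only the fields needed here. */
    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;

        Cell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /**
     * Copies the memstore cells into the flushed output. When the
     * (proposed) flattenTimestamps setting is on, every cell gets the
     * flush start time; otherwise timestamps are left untouched.
     */
    static List<Cell> flush(List<Cell> memstore, boolean flattenTimestamps, long flushStartTime) {
        List<Cell> flushed = new ArrayList<>(memstore.size());
        for (Cell c : memstore) {
            long ts = flattenTimestamps ? flushStartTime : c.timestamp;
            flushed.add(new Cell(c.row, c.qualifier, ts));
        }
        return flushed;
    }

    public static void main(String[] args) {
        List<Cell> memstore = new ArrayList<>();
        memstore.add(new Cell("row1", "a", 17L));
        memstore.add(new Cell("row2", "a", 42L));
        long flushStartTime = System.currentTimeMillis(); // taken once, at flush start
        for (Cell c : flush(memstore, true, flushStartTime)) {
            System.out.println(c.row + "/" + c.qualifier + " @ " + c.timestamp);
        }
    }
}
```

The flush start time is read once and passed in, so every cell in the resulting file carries exactly the same value.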

      For example, what happens when there is a Put and a Delete of the same cell in the same memstore? Maybe for a flush at t=flushStartTime, the Put gets timestamp=t, and the Delete gets timestamp=t+1. Or maybe HBASE-4241 could take care of this problem.
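
The t / t+1 idea can be sketched as follows. This only illustrates the tentative suggestion above, using a hypothetical Cell type and flatten method; it does not address a Put that arrives after the Delete, which the description leaves open.

```java
import java.util.ArrayList;
import java.util.List;

final class FlattenWithDeletesSketch {
    enum Type { PUT, DELETE }

    /** Hypothetical stand-in for a KeyValue; only the fields needed here. */
    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;
        final Type type;

        Cell(String row, String qualifier, long timestamp, Type type) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.type = type;
        }
    }

    /** Puts get the flush start time t; Deletes get t + 1. */
    static List<Cell> flatten(List<Cell> memstore, long flushStartTime) {
        List<Cell> out = new ArrayList<>(memstore.size());
        for (Cell c : memstore) {
            long ts = (c.type == Type.DELETE) ? flushStartTime + 1 : flushStartTime;
            out.add(new Cell(c.row, c.qualifier, ts, c.type));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> memstore = new ArrayList<>();
        memstore.add(new Cell("row1", "a", 5L, Type.PUT));
        memstore.add(new Cell("row1", "a", 9L, Type.DELETE));
        for (Cell c : flatten(memstore, 100L)) {
            System.out.println(c.type + " " + c.row + "/" + c.qualifier + " @ " + c.timestamp);
        }
    }
}
```

A file flattened this way holds two distinct timestamp values instead of one, so the diffs no longer all collapse to zero, although they stay very small.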

        Activity

        Matt Corgan added a comment -

        oops - for flushes you would set all timestamps to the flush start time like I said above. But for compactions you would set all timestamps to the earliest timestamp in the compaction, and ensure that only consecutive files get compacted together.
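
A rough sketch of the compaction rule, under stated assumptions: each input file is already flattened to a single timestamp, and files carry a hypothetical position index in flush order (0, 1, 2, ...) that is used to check that the selection is consecutive. FlushedFile and compactionTimestamp are illustrative names, not HBase classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

final class FlattenOnCompactionSketch {
    /** Hypothetical description of one already-flattened store file. */
    static final class FlushedFile {
        final long position;           // assumed index in flush order
        final long flattenedTimestamp; // the single timestamp used in the file

        FlushedFile(long position, long flattenedTimestamp) {
            this.position = position;
            this.flattenedTimestamp = flattenedTimestamp;
        }
    }

    /**
     * Returns the timestamp to stamp on every cell of the compacted file:
     * the earliest timestamp among the inputs. Rejects selections that
     * skip a file, since only consecutive files may be compacted together.
     */
    static long compactionTimestamp(List<FlushedFile> selected) {
        if (selected.isEmpty()) {
            throw new IllegalArgumentException("no files selected");
        }
        List<FlushedFile> sorted = new ArrayList<>(selected);
        sorted.sort((a, b) -> Long.compare(a.position, b.position));
        long earliest = Long.MAX_VALUE;
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0 && sorted.get(i).position != sorted.get(i - 1).position + 1) {
                throw new IllegalArgumentException("files are not consecutive");
            }
            earliest = Math.min(earliest, sorted.get(i).flattenedTimestamp);
        }
        return earliest;
    }

    public static void main(String[] args) {
        List<FlushedFile> files = new ArrayList<>();
        files.add(new FlushedFile(0, 100L));
        files.add(new FlushedFile(1, 250L));
        files.add(new FlushedFile(2, 400L));
        System.out.println("compacted timestamp: " + compactionTimestamp(files));
    }
}
```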


          People

          • Assignee:
            Unassigned
            Reporter:
            Matt Corgan
          • Votes:
            2
            Watchers:
            7
