[ACCUMULO-1266] Automatically determine when a full major compaction would benefit scans - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Abandoned
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

For the following situation, there is a tipping point where it becomes beneficial to do a full major compaction.

a tablet is frequently scanned
scan time iterators supress a lot of data
a full major compaction would also supress that data

Examples of this are tablets with lots of deletes, versions that are suppressed, data thats combined, and data thats filtered.

If tablet servers kept track of statistics about scans, could this be used to determine when its beneficial to automatically compact? In the following simple example, it seems obvious that a major compaction would be beneficial. In this example scans over the last hour have had to examine and throw away 20 million uneeded keys. Alot of scan work could have been saved by doing a major compaction.

all scans over tabletA within the last hour have read 30 million keys and returned 10 million keys
TabletA has 3 million keys
a major compaction would reduce tabletA to 1 million keys and result in future scans returning all keys read.

One complicating factor is that major compaction may have a different set of iterators configured. Therefore its possible that scan may filter a lot of data, and major compactions may not. Could possibly keep track of ratio of data dropped by compactions and the ratio of data dropped by scans. This could be used when deciding if a major compaction should be done to improve scan performance.

What other situation can cause unnecessary major compactions and need to be defended against?

In the case where a compaction of just the data in memory would benefit scans, ACCUMULO-519 may solve the problem that this ticket is looking to solve.

So what should the formula be?

  // k/v : key values
  // recentlyRead    : total number of k/v read before applying iterators by recent scans (recentlyRead - recentlyDropped equals # of k/v returned to users)
  // majcDropRatio   : ratio of k/v dropped by recent major compactions
  // totalKeyValues  : total # of k/v in tablet
  // R a user configurable ratio, like the current major compaction ratio that is based on files
  if((recentlyRead * majcDropRatio > R * totalKeyValues)){
     doFullMajorCompaction()
     resetScanStats()
  }

The example formula above has an issue, it may initiate a major compaction when scans are not reading a part of the tablet that drops data. The formula below tries to remedy this.

  // k/v : key values
  // recentlyDropped : number of k/v dropped by recent scans
  // recentlyRead    : total number of k/v read before applying iterators by recent scans (recentlyRead - recentlyDropped equals # of k/v returned to users)
  // majcDropRatio   : ratio of k/v dropped by recent major compactions
  // totalKeyValues  : total # of k/v in tablet
  // R a user configurable ratio, like the current major compaction ratio that is based on files
  if((recentlyDropped > R * totalKeyValues) && (recentlyRead * majcDropRatio > R * totalKeyValues)){
     doFullMajorCompaction()
     resetScanStats()
  }

An issue with the above is that the total # of key values for a tablet may not be accurate because of bulk import and splits.

Attachments

Issue Links

depends upon

ACCUMULO-1451 Make Compaction triggers extensible

Resolved

is related to

ACCUMULO-2232 Combiners can cause deleted data to come back

Resolved

supercedes

ACCUMULO-96 Need to handle case where tablet has lots of delete entries

Resolved

Sub-Tasks

Design infrastructure for passing scan stats to compaction strategies

Open

Unassigned

Activity

People

Assignee:: Unassigned

Reporter:: Keith Turner

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 11/Apr/13 17:16

Updated:: 28/Oct/20 23:04

Resolved:: 28/Oct/20 23:04