HBASE-5626: Compactions simulator tool for proofing algorithms

Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: beginner

    Description

      A tool to run compaction simulations would be nice to have. We could use it to see how well an algorithm runs under different circumstances: loaded with different value types, with different rates of flushes and splits, etc. HBASE-2462 had one (see the patch there). Or we could try doing it using something like this: http://en.wikipedia.org/wiki/Discrete_event_simulation
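
      A minimal sketch of what the discrete-event approach could look like (purely illustrative; the event kinds, intervals, and names below are assumptions, not anything specified in this issue):

import heapq

def simulate(flush_interval=60.0, split_interval=3600.0, end_time=4 * 3600.0):
    """Drive flushes and splits as timed events; a compaction policy would hook in below."""
    events = []                          # min-heap of (time, kind)
    heapq.heappush(events, (flush_interval, "flush"))
    heapq.heappush(events, (split_interval, "split"))
    store_files = []                     # store file sizes, 1.0 == one flush

    while events:
        t, kind = heapq.heappop(events)
        if t > end_time:
            break
        if kind == "flush":
            store_files.append(1.0)      # a flush adds one flush-sized file
            # <-- a pluggable compaction policy would be consulted here
            heapq.heappush(events, (t + flush_interval, "flush"))
        else:                            # "split": crude model, halve every file
            store_files = [s / 2 for s in store_files]
            heapq.heappush(events, (t + split_interval, "split"))
        print(f"t={t:8.1f}s {kind:5s} storefiles={len(store_files)}")

if __name__ == "__main__":
    simulate()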

      Attachments

        1. cf_compact.py
          3 kB
          Nicolas Spiegelberg


          Activity

            nspiegelberg Nicolas Spiegelberg added a comment -

            How is this different from the compaction simulation python script? The unit of measurement should be a flush, since we flush once the memstore reaches a certain size, regardless of write rate or KV length.
            stack Michael Stack added a comment -

            Where is the python simulation script? Is it uploaded anywhere? (Pardon me if I missed it)

            Simulator needs to also factor in splitting.


            nspiegelberg Nicolas Spiegelberg added a comment -

            Attached the current python script that I use to emulate compactions given different params.

            nspiegelberg Nicolas Spiegelberg added a comment -

            A little more explanation.

            Basic Concept:
            We wish to model the amount of compaction IO and the file dispersion. The unit of measurement for compactions is a flush, because a flush is always 64MB (or whatever you configure) regardless of other properties of the CF/KV. Column families might trigger flushes at different intervals, but they usually flush a consistent amount of data. You can understand the behavior of a compaction algorithm by how it behaves over X flushes. Does this test make a lot of assumptions and simplifications? Yes!

            Inputs:
            1. ratio = compaction.ratio between files. (same as the HBase config)
            2. min.files = minimum count of files that must be selected for a compaction to occur (same as HBase config)
            3. duplication = fraction of KVs within a file that are mutations and will be deduped on compaction (0 <= duplication <= 1)
            4. iterations = number of flushes to simulate

            Output:
            1. The StoreFile dispersion after every flush (and, possibly, compaction triggered by that flush)
            2. The average storefile count over <iterations> flushes
            3. The amount of IO consumed by compactions after those <iterations> flushes.

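            As a rough illustration of the loop described above, here is a minimal sketch in the same spirit as the attached script (this is not cf_compact.py itself; the ratio-based selection below is a simplified stand-in for the real HBase policy, and all defaults are assumptions):

def simulate(ratio=1.2, min_files=3, duplication=0.1, iterations=100):
    """Model compactions in flush units, per the inputs/outputs listed above."""
    files = []            # store file sizes, 1.0 == one flush
    total_io = 0.0        # compaction IO, measured in flushes
    file_count_sum = 0

    for i in range(iterations):
        files.append(1.0)             # each iteration is one flush-sized file
        files.sort(reverse=True)      # largest (oldest) first

        # Simplified selection: skip leading files larger than ratio * sum of the
        # smaller files, then compact the remaining suffix if it meets min_files.
        start = 0
        while start < len(files) and files[start] > ratio * sum(files[start + 1:]):
            start += 1
        selected = files[start:]
        if len(selected) >= min_files:
            total_io += sum(selected)
            merged = sum(selected) * (1 - duplication)   # dedupe mutated KVs
            files = files[:start] + [merged]

        file_count_sum += len(files)
        print(f"flush {i + 1:3d}: {[round(f, 2) for f in files]}")

    print(f"average storefile count: {file_count_sum / iterations:.2f}")
    print(f"total compaction IO:     {total_io:.1f} flushes")

if __name__ == "__main__":
    simulate()
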
            stack Michael Stack added a comment -

            Nice. Let me take a looksee...


            sershe Sergey Shelukhin added a comment -

            IMHO it may make sense to have it in Java... if we could have some sort of pluggability in the policy and compactor, which would cause them to not actually read or write the files but just report what they "read" and "wrote", we could use the normal compaction code without having to keep the script in sync.
            stack Michael Stack added a comment -

            sershe Sounds good (could do Jython and then you could play in both worlds). We will need something to run the scenarios: stripe, sigma, tiered, leveled (we have quite the menagerie of compaction algos now...)

            nitish Nitish Upreti added a comment -

            I am a newbie (student) learning about HBase and want to get started contributing to the project. I have been scanning through the HBase "noob" tag and found this issue interesting to work on. As this issue was last updated on 24/Jan/13 21:51, is the community still interested in this task?

            I understand the overall concepts of the log-structured merge tree and reducing the maximum number of disk seeks needed by compaction. I also understand how HBase has a pluggable compaction component that lets us exploit performance benefits by knowing our data and request patterns in depth.

            What are the relevant packages / source files / API references I should look into for this task? Any general pointers from the community for working on this task would be of great help.

            stack Michael Stack added a comment -

            This would be an excellent item to work on, nitish. A means of viewing write amplification would be sweet, and being able to compare compaction algorithms would be great.

            I see another issue filed recently where FB talks of adding metadata to the files we write, so there would be more data for analyzing provenance/history.

            The attached script might be a start.

            Folks have also used Excel spreadsheet functions, etc., to look at compactions in the past.


            People

              Assignee: Unassigned
              Reporter: stack Michael Stack
              Votes: 0
              Watchers: 9
