HBASE-5626: Compactions simulator tool for proofing algorithms

Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: beginner

    Description

      A tool to run compaction simulations would be nice to have. We could use it to see how well an algorithm runs under different circumstances: loaded with different value types, with different rates of flushes and splits, etc. HBASE-2462 had one (see the patch there). Or we could try doing it using something like this: http://en.wikipedia.org/wiki/Discrete_event_simulation
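
      A minimal sketch of what the discrete-event approach could look like (purely illustrative; the event kinds, intervals, and names below are assumptions, not anything specified in this issue):

import heapq

def simulate(flush_interval=60.0, split_interval=3600.0, end_time=4 * 3600.0):
    """Drive flushes and splits as timed events; a compaction policy would hook in below."""
    events = []                          # min-heap of (time, kind)
    heapq.heappush(events, (flush_interval, "flush"))
    heapq.heappush(events, (split_interval, "split"))
    store_files = []                     # store file sizes, 1.0 == one flush

    while events:
        t, kind = heapq.heappop(events)
        if t > end_time:
            break
        if kind == "flush":
            store_files.append(1.0)      # a flush adds one flush-sized file
            # <-- a pluggable compaction policy would be consulted here
            heapq.heappush(events, (t + flush_interval, "flush"))
        else:                            # "split": crude model, halve every file
            store_files = [s / 2 for s in store_files]
            heapq.heappush(events, (t + split_interval, "split"))
        print(f"t={t:8.1f}s {kind:5s} storefiles={len(store_files)}")

if __name__ == "__main__":
    simulate()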

      Attachments

        1. cf_compact.py
          3 kB
          Nicolas Spiegelberg


          Activity

            nspiegelberg Nicolas Spiegelberg added a comment -

            How is this different from the compaction simulation python script? The unit of measurement should be a flush, since we flush once the memstore reaches a certain size, regardless of write rate or KV length.
            stack Michael Stack added a comment -

            Where is the python simulation script? Is it uploaded anywhere? (Pardon me if I missed it)

            Simulator needs to also factor in splitting.


            nspiegelberg Nicolas Spiegelberg added a comment -

            Attached the current python script that I use to emulate compactions given different params.

            nspiegelberg Nicolas Spiegelberg added a comment -

            A little more explanation.

            Basic Concept:
            We wish to model the amount of compaction IO and the file dispersion. The unit of measurement for compactions is a flush, because a flush is always 64MB (or whatever you configure) regardless of other properties of the CF/KV. Column families might trigger flushes at different intervals, but they usually flush a consistent amount of data. You can understand the behavior of a compaction algorithm by how it behaves over X flushes. Does this test make a lot of assumptions and simplifications? Yes!

            Inputs:
            1. ratio = compaction.ratio between files. (same as the HBase config)
            2. min.files = minimum count of files that must be selected for a compaction to occur (same as HBase config)
            3. duplication = fraction of KVs within a file that are mutations and will be deduped on compaction (0 <= duplication <= 1)
            4. iterations = number of flushes to simulate

            Output:
            1. The StoreFile dispersion after every flush (and, possibly, compaction triggered by that flush)
            2. The average storefile count over <iterations> flushes
            3. The amount of IO consumed by compactions after those <iterations> flushes.

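            As a rough illustration of the loop described above, here is a minimal sketch in the same spirit as the attached script (this is not cf_compact.py itself; the ratio-based selection below is a simplified stand-in for the real HBase policy, and all defaults are assumptions):

def simulate(ratio=1.2, min_files=3, duplication=0.1, iterations=100):
    """Model compactions in flush units, per the inputs/outputs listed above."""
    files = []            # store file sizes, 1.0 == one flush
    total_io = 0.0        # compaction IO, measured in flushes
    file_count_sum = 0

    for i in range(iterations):
        files.append(1.0)             # each iteration is one flush-sized file
        files.sort(reverse=True)      # largest (oldest) first

        # Simplified selection: skip leading files larger than ratio * sum of the
        # smaller files, then compact the remaining suffix if it meets min_files.
        start = 0
        while start < len(files) and files[start] > ratio * sum(files[start + 1:]):
            start += 1
        selected = files[start:]
        if len(selected) >= min_files:
            total_io += sum(selected)
            merged = sum(selected) * (1 - duplication)   # dedupe mutated KVs
            files = files[:start] + [merged]

        file_count_sum += len(files)
        print(f"flush {i + 1:3d}: {[round(f, 2) for f in files]}")

    print(f"average storefile count: {file_count_sum / iterations:.2f}")
    print(f"total compaction IO:     {total_io:.1f} flushes")

if __name__ == "__main__":
    simulate()
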
            stack Michael Stack added a comment -

            Nice. Let me take a looksee...


            sershe Sergey Shelukhin added a comment -

            IMHO it may make sense to have it in Java... if we could have some sort of pluggability in the policy and compactor, which would cause them to not actually read or write the files but just report what they "read" and "wrote", we could use the normal compaction code without having to keep the script in sync.
            stack Michael Stack added a comment -

            sershe Sounds good (could do Jython and then you could play in both worlds). We will need something to run the scenarios: stripe, sigma, tiered, leveled (we have quite the menagerie of compaction algos now...)

            nitish Nitish Upreti added a comment -

            I am a newbie (student) learning about HBase and want to get started contributing to the project. I have been scanning through the HBase "noob" tag and found this issue interesting to work on. As this issue was last updated on 24/Jan/13 21:51, is the community still interested in this task?

            I understand the overall concepts of the log-structured merge tree and reducing the maximum number of disk seeks needed by compaction. I also understand how HBase has a pluggable compaction component that lets us exploit performance benefits by knowing our data and request patterns in depth.

            What are the relevant packages / source files / API references I should look into for this task? Any general pointers from the community for working on this task would be of great help.

            stack Michael Stack added a comment -

            This would be an excellent item to work on, nitish. A means of viewing write amplification would be sweet, and being able to compare compaction algorithms would be great.

            I see another issue filed recently where FB talks of adding metadata to the files we write, so there would be more data for analyzing provenance/history.

            The attached script might be a start.

            Folks have also used Excel spreadsheet functions, etc., to look at compactions in the past.


            People

              Assignee: Unassigned
              Reporter: stack Michael Stack
              Votes: 0
              Watchers: 9
