HBASE-3484

Replace memstore's ConcurrentSkipListMap with our own implementation

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.92.0
    • Fix Version/s: None
    • Component/s: Performance
    • Labels: None

      Description

      By copy-pasting ConcurrentSkipListMap into HBase we can make two improvements to it for our use case in MemStore:

      • add an iterator.replace() method which should allow us to do upsert much more cheaply
      • implement a Set directly without having to do Map<KeyValue,KeyValue> to save one reference per entry

      It turns out CSLM is in the public domain from its development as part of JSR 166, so we should be OK license-wise.
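
      A minimal sketch of what the second bullet is getting at, assuming (as in the current KeyValueSkipListSet) that the memstore's "set" is really a ConcurrentSkipListMap whose key and value are the same KeyValue; all names below are illustrative, not proposed APIs:

      import java.util.Comparator;
      import java.util.concurrent.ConcurrentSkipListMap;

      // Today: a Set facade over a Map<KeyValue, KeyValue>, which costs one extra
      // reference per entry just to satisfy the Map interface.
      class KeyValueSetViaMap<KV> {
          private final ConcurrentSkipListMap<KV, KV> delegate;

          KeyValueSetViaMap(Comparator<KV> comparator) {
              this.delegate = new ConcurrentSkipListMap<>(comparator);
          }

          boolean add(KV kv) {
              // The value is the key itself; this duplicated reference is what a
              // purpose-built concurrent skip-list *set* could drop.
              return delegate.put(kv, kv) == null;
          }

          KV get(KV kv) {
              return delegate.get(kv);
          }
      }

      The first bullet would additionally let upsert swap an entry through its iterator rather than performing separate remove/insert operations.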

      Attachments

      1. WIP_HBASE-3484.patch (51 kB, Anoop Sam John)
      2. memstore_drag.png (31 kB, Jean-Daniel Cryans)
      3. hierarchical-map.txt (40 kB, Todd Lipcon)


          Activity

          Todd Lipcon added a comment -

          Here's the link to the class in Apache Harmony: https://svn.apache.org/repos/asf/harmony/enhanced/java/branches/java6/classlib/modules/concurrent/src/main/java/java/util/concurrent/ConcurrentSkipListMap.java

          Joe Pallas added a comment -

          This issue was cited by jdcryans as related to unfortunate performance seen in the following case:

          A test program fills a single row of a family with tens of thousands of sequentially increasing qualifiers. Then it performs random gets (or exists) of those qualifiers. The response time seen is (on average) proportional to the ordinal position of the qualifier. If the table is flushed before the random tests begin, then the average response time is basically constant, independent of the qualifier's ordinal position.

          I'm not sure that either of the two points in the description actually covers this case, but I don't know enough to say.
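
          For reference, a rough sketch of the kind of test program described above, written against the modern Connection/Table client API; the table, family, and sizes here are made up:

          import java.util.Random;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.TableName;
          import org.apache.hadoop.hbase.client.Connection;
          import org.apache.hadoop.hbase.client.ConnectionFactory;
          import org.apache.hadoop.hbase.client.Get;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.client.Table;
          import org.apache.hadoop.hbase.util.Bytes;

          public class WideRowGetTest {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              byte[] row = Bytes.toBytes("wide-row");
              byte[] fam = Bytes.toBytes("f");
              int numQualifiers = 50_000;
              try (Connection conn = ConnectionFactory.createConnection(conf);
                   Table table = conn.getTable(TableName.valueOf("t"))) {
                // fill one row with sequentially increasing qualifiers
                for (int i = 0; i < numQualifiers; i++) {
                  Put put = new Put(row);
                  put.addColumn(fam, Bytes.toBytes(String.format("q%08d", i)), Bytes.toBytes(i));
                  table.put(put);
                }
                // random gets of those qualifiers; print latency against ordinal position
                Random rnd = new Random();
                for (int i = 0; i < 1000; i++) {
                  int ordinal = rnd.nextInt(numQualifiers);
                  Get get = new Get(row);
                  get.addColumn(fam, Bytes.toBytes(String.format("q%08d", ordinal)));
                  long begin = System.nanoTime();
                  table.get(get);
                  System.out.println(ordinal + "\t" + (System.nanoTime() - begin));
                }
              }
            }
          }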

          stack added a comment -

          Upping this to critical. It keeps coming up as an issue.

          Joe Pallas added a comment -

          I think the performance issue I mentioned above may actually be HBASE-3855.

          stack added a comment -

          Moving out of 0.92. Don't see it happening in time.

          Todd Lipcon added a comment -

          Here's something I hacked together tonight which makes the memstore maps hierarchical. It should save a bit of CPU, especially when doing wide puts, but I haven't done any serious benchmarking. It probably has negative memory effects in its current incarnation. Seems to kind-of work.

          stack added a comment -

          It probably has negative memory effects in its current incarnation.

          How you think Todd? Because of the tiering cost more or is it something to do w/ mslab allocations?

          What would you like to see test-wise proving this direction better than what we currently have? I could work up some tests?

          Todd Lipcon added a comment -

          How you think Todd? Because of the tiering cost more or is it something to do w/ mslab allocations?

          Extra container costs with the fact that we have extra CSLM objects for each row. I haven't measured but I bet there is some map-wide overhead that we're paying.

          There are some other things I noticed that could be improved, though. In particular, CSLM optimizes for Comparable keys, so if you specify a custom comparator, then it has to wrap every key you insert with a wrapper object. Specializing CSLM for our purposes would easily save 64 bytes per entry on this.

          Another thought I had was to do the following:

          • have the actual entries in rowMap be Object-typed, rather than CSLMs.
          • when the first insert happens, just insert the KeyValue itself (optimization for the case where each row has only one cell)
          • when more inserts happen, swap it out for a proper container type

          The proper container type's also interesting to consider here. We never have contention on update within a row, since the updates happen under a row lock, right? So, we can consider any map type that supports single-writer multiple-reader efficiently, which is a wider range of data structures than support multi-writer multi-reader. One possibility is snap trees or even copy-on-write sorted array lists.
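
          A rough sketch of that Object-typed entry idea, with a toy qualifier/value pair standing in for KeyValue and no attempt at matching real memstore semantics:

          import java.util.concurrent.ConcurrentSkipListMap;

          // Rough sketch only: the first cell of a row is stored directly in the row map;
          // a second insert upgrades the entry to a per-row sorted map of columns.
          class TieredMemstoreSketch {
            static final class Cell {
              final String qualifier;
              final byte[] value;
              Cell(String qualifier, byte[] value) { this.qualifier = qualifier; this.value = value; }
            }

            // row key -> either a single Cell or a ConcurrentSkipListMap<String, Cell>
            private final ConcurrentSkipListMap<String, Object> rowMap = new ConcurrentSkipListMap<>();

            // Assumes updates to a given row happen under that row's lock (single writer per row).
            @SuppressWarnings("unchecked")
            void add(String row, Cell cell) {
              Object entry = rowMap.get(row);
              if (entry == null) {
                // common case: first (often only) cell for the row, no per-row container allocated
                rowMap.put(row, cell);
              } else if (entry instanceof Cell) {
                // second cell arrives: upgrade to a proper container, then publish it
                ConcurrentSkipListMap<String, Cell> cols = new ConcurrentSkipListMap<>();
                Cell first = (Cell) entry;
                cols.put(first.qualifier, first);
                cols.put(cell.qualifier, cell);
                rowMap.put(row, cols);
              } else {
                ((ConcurrentSkipListMap<String, Cell>) entry).put(cell.qualifier, cell);
              }
            }
          }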

          What would you like to see test-wise proving this direction better than what we currently have? I could work up some tests?

          Would be great if we had a benchmark focused on memstore-only which allowed a mix of the following operations from different threads:

          • full scans
          • range scans
          • updates to existing rows which just touch 1 or a few columns
          • updates to existing rows which touch lots of columns
          • inserts of new rows (few or lots of columns)

          But it's a bit of work to do all that. So, a microbenchmark which just timed something like having 20 threads each do a bunch of inserts with multi-column rows would at least show whether there's promise here.
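
          As a starting point, a pure-JDK microbenchmark along those lines might look like the sketch below; it only exercises concurrent inserts of multi-column rows into a single CSLM (keys and sizes are arbitrary), not scans or upserts:

          import java.util.concurrent.ConcurrentSkipListMap;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.TimeUnit;

          public class MemstoreInsertBench {
            // Stand-in for the memstore: a CSLM keyed by "row/qualifier" strings.
            static final ConcurrentSkipListMap<String, byte[]> MAP = new ConcurrentSkipListMap<>();

            public static void main(String[] args) throws Exception {
              final int threads = 20, rowsPerThread = 10_000, colsPerRow = 20;
              ExecutorService pool = Executors.newFixedThreadPool(threads);
              long start = System.nanoTime();
              for (int t = 0; t < threads; t++) {
                final int id = t;
                pool.submit(() -> {
                  byte[] value = new byte[32];
                  for (int r = 0; r < rowsPerThread; r++) {
                    for (int c = 0; c < colsPerRow; c++) {
                      // one "cell" per put; a wide put would batch colsPerRow of these
                      MAP.put("row-" + id + "-" + r + "/col-" + c, value);
                    }
                  }
                });
              }
              pool.shutdown();
              pool.awaitTermination(1, TimeUnit.HOURS);
              System.out.printf("%d entries inserted in %.1f ms%n",
                  MAP.size(), (System.nanoTime() - start) / 1e6);
            }
          }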

          stack added a comment -

          Great stuff Todd.

          ...copy-on-write sorted array lists.

          Could we do this? We'd allocate a new array every time we did an insert? An array would be cheaper space-wise and more efficient to scan, etc., I'd think... It'd just be the insert and sort that would be 'expensive'.

          Let me have a go at your suggested microbenchmark.
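
          For what it's worth, a minimal sketch of such a copy-on-write sorted array row container, assuming a single writer (e.g. the holder of the row lock) and lock-free readers via a volatile array reference:

          import java.util.Arrays;

          // Every insert allocates a new array, so readers never lock; they just read the
          // volatile reference. One writer at a time is assumed.
          final class CowSortedArray<T extends Comparable<T>> {
            private volatile Object[] elements = new Object[0];

            void insert(T item) {
              Object[] cur = elements;
              int idx = Arrays.binarySearch(cur, item);
              if (idx < 0) {
                idx = -(idx + 1);               // insertion point for a new key
              }                                 // equal keys just land next to each other here
              Object[] next = new Object[cur.length + 1];
              System.arraycopy(cur, 0, next, 0, idx);
              next[idx] = item;
              System.arraycopy(cur, idx, next, idx + 1, cur.length - idx);
              elements = next;                  // publish atomically via the volatile write
            }

            @SuppressWarnings("unchecked")
            T get(int i) { return (T) elements[i]; }

            int size() { return elements.length; }
          }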

          Jean-Daniel Cryans added a comment -

          This is what the slow down currently looks like when using big MemStores. In this graph I flush at around 6GB and there's only 1 region per RS. It seems that the top speed is 100k/s which happens at the very beginning and it can go down to 40k/s.

          For those wondering, the dips are caused because we max out our links when the flushes happen all at the same time.

          Otis Gospodnetic added a comment -

          @JD - what would/should the ideal graph look like, roughly?

          Jean-Daniel Cryans added a comment - edited

          Something that's not log(n); a straight line would be ideal.


          Matt Corgan added a comment -

          I've been pondering how to better compact the data in the memstore. Sometimes we see a 100MB memstore flush that is really 10MB of KeyValues, which gzips to like 2MB, meaning there is a ton of pointer overhead.

          One thing that came to mind was splitting each memstore into "regions" of consecutive cell ranges and fronting these regions with an index of some sort. Instead of Set<KeyValue>, the memstore becomes Set<Set<KeyValue>>. When an internal region crosses a certain size we split it in half. With a good index structure in front of the memstore blocks, it might get closer to a linear performance/size curve. It's comparable to HBase splitting a table into regions.

          Then, to address the pointer overhead problem, you could use DataBlockEncoding to encode each memstore region individually. A memstore region could accumulate several blocks that get compacted periodically. Given a region size of ~64-256KB, the compaction could be very aggressive and could even be done by the thread writing the data. Again, very similar to how HBase manages the internals of a single region.

          This adds moving pieces and complexity but could be developed as a pluggable module that passes the same unit tests as the current memstore.
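
          A hand-wavy sketch of the shape being described, with Strings standing in for KeyValues, a coarse synchronized writer, and no DataBlockEncoding; the index is keyed by each sub-range's first key:

          import java.util.Iterator;
          import java.util.TreeSet;
          import java.util.concurrent.ConcurrentSkipListMap;

          class ShardedMemstoreSketch {
            private static final int MAX_PER_SHARD = 4096;
            // index: first key of each sub-range -> the sorted cells of that sub-range
            private final ConcurrentSkipListMap<String, TreeSet<String>> shards = new ConcurrentSkipListMap<>();

            ShardedMemstoreSketch() {
              shards.put("", new TreeSet<>());   // one shard covers the whole key space initially
            }

            synchronized void add(String kv) {
              TreeSet<String> shard = shards.floorEntry(kv).getValue();
              shard.add(kv);
              if (shard.size() > MAX_PER_SHARD) {
                // split in half: move the upper half into a new shard keyed by its first element
                String mid = nth(shard, shard.size() / 2);
                TreeSet<String> upper = new TreeSet<>(shard.tailSet(mid, true));
                shard.tailSet(mid, true).clear();
                shards.put(mid, upper);
              }
            }

            private static String nth(TreeSet<String> set, int n) {
              Iterator<String> it = set.iterator();
              for (int i = 0; i < n; i++) {
                it.next();
              }
              return it.next();
            }
          }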

          Ted Yu added a comment -

          +1 on the above suggestion.
          We can trade some complexity for better compression rate.

          Anoop Sam John added a comment -

          Trying out something like how there can be multiple HFiles within a store.
          Within a memstore there can be more than one KeyValueSkipListSet object (and so more than one CSLM) at a time.
          Each KeyValueSkipListSet slice has a configurable max size. Initially there is only one KeyValueSkipListSet in the memstore. Once its size reaches the threshold, we create another KeyValueSkipListSet (so a new CSLM) and new KVs are inserted into that; the old data structure no longer receives KVs, so within one KeyValueSkipListSet the KVs are sorted. This continues, and finally all these KeyValueSkipListSets are taken into the snapshot and written to the HFile. We will need changes in the MemStoreScanner so that it treats the slices as a heap and emits KVs in the correct order.
          Once the flush is over there is again only one KeyValueSkipListSet in the memstore, and this continues. Basically we are trying to avoid a single CSLM growing very large as the number of entries increases.

          By default there is no max size for a slice, so a single CSLM keeps growing as long as KVs are inserted into the memstore before a flush.

          I have done a POC and tested it. The initial test with LoadTestTool shows that we can avoid the decrease in throughput as the memstore grows. I will attach a patch with this change by this weekend.
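
          A loose sketch of the slicing idea as described (not the attached patch), again with Strings standing in for KeyValues:

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.Iterator;
          import java.util.List;
          import java.util.concurrent.ConcurrentSkipListSet;
          import java.util.concurrent.CopyOnWriteArrayList;

          class SlicedMemstoreSketch {
            private static final int MAX_ENTRIES_PER_SLICE = 100_000;
            private final CopyOnWriteArrayList<ConcurrentSkipListSet<String>> slices =
                new CopyOnWriteArrayList<>();

            SlicedMemstoreSketch() {
              slices.add(new ConcurrentSkipListSet<>());
            }

            void add(String kv) {
              ConcurrentSkipListSet<String> current = slices.get(slices.size() - 1);
              current.add(kv);
              // size() walks a CSLS; a real implementation would keep a counter instead
              if (current.size() > MAX_ENTRIES_PER_SLICE) {
                synchronized (this) {
                  // re-check so that only one writer rolls a new slice
                  if (slices.get(slices.size() - 1).size() > MAX_ENTRIES_PER_SLICE) {
                    slices.add(new ConcurrentSkipListSet<>());
                  }
                }
              }
            }

            // The real MemStoreScanner change would merge the slices lazily with a heap;
            // copying and sorting here just keeps the sketch short.
            Iterator<String> scan() {
              List<String> merged = new ArrayList<>();
              for (ConcurrentSkipListSet<String> slice : slices) {
                merged.addAll(slice);
              }
              Collections.sort(merged);
              return merged.iterator();
            }
          }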

          Anoop Sam John added a comment -

          Also seeing optimization possibilities in CSLM itself, toward having our own CSLM. Todd already mentioned some points above. I will be working on that as well as some other things.

          Anoop Sam John added a comment -

          Doing the memstore slicing

          Anoop Sam John added a comment -

          In the current patch the max size of a memstore slice is specified in terms of heap size, but what matters is the number of entries in the CSLM. I am thinking this can be the number of entries in the slice instead.

          Lars Hofhansl added a comment - edited

          From Matt Corgan...

          I've been pondering how to better compact the data in the memstore. Sometimes we see a 100MB memstore flush that is really 10MB of KeyValues, which gzips to like 2MB, meaning there is a ton of pointer overhead.

          This should be better now. In various patches I removed:

          • caching of the row key (HBASE-7279)
          • caching of the timestamp (HBASE-7279)
          • caching of the KV length (HBASE-9956)

          That saves 44 bytes + sizeOf(rowKey) for each KeyValue in the memstore.
          The KV in-memory overhead is now 56 bytes. (The memstoreTS is also stored in the HFiles.)

          haosdent added a comment -

          Anybody know about the Order Maintenance Tree (https://github.com/shuttler/omt), which is used in TokuDB? As we know, a skip list isn't CPU cache-efficient, whereas an OMT is.


            People

            • Assignee: Unassigned
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 25
