HBASE-3327

For increment workloads, retain memstores in memory after flushing them

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

      Description

      This is an improvement based on our observation of what happens in an increment workload. The working set is typically small and is contained in the memstores.
      1. The memstores get flushed because the limit on the number of WAL logs is hit.
      2. This in turn triggers compactions, which evict blocks from the block cache.
      3. Flushing the memstore and evicting the block cache cause disk reads for increments that arrive afterwards, because the data is no longer in memory.

      We could solve this elegantly by retaining the memstores AFTER they are flushed into files. This would mean we can quickly populate the new memstore with the working set of data from memory itself, without having to hit disk. We can throttle the number of such memstores we retain, or the memory allocated to them. In fact, allocating a percentage of the block cache to this would give us a huge boost.
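
      A minimal sketch of the retention scheme described above, assuming a hypothetical MemStoreSnapshot type and a budget taken as a fraction of the block cache; the class and method names here are illustrative only, not actual HBase code:

      import java.util.ArrayDeque;
      import java.util.Deque;

      // Hypothetical sketch: retain flushed memstore snapshots in memory, bounded by
      // a budget carved out of the block cache, evicting the oldest retained snapshot
      // once the budget is exceeded.
      public class RetainedSnapshots {

          // Stand-in for a flushed, immutable memstore snapshot.
          public interface MemStoreSnapshot {
              long heapSizeBytes();
          }

          private final long budgetBytes;  // e.g. a percentage of the block cache
          private final Deque<MemStoreSnapshot> retained = new ArrayDeque<>();
          private long usedBytes = 0;

          public RetainedSnapshots(long blockCacheBytes, double fractionOfBlockCache) {
              this.budgetBytes = (long) (blockCacheBytes * fractionOfBlockCache);
          }

          // Called after a memstore has been flushed to an HFile: keep the snapshot
          // around as a read cache instead of discarding it immediately.
          public synchronized void retainAfterFlush(MemStoreSnapshot snapshot) {
              retained.addFirst(snapshot);
              usedBytes += snapshot.heapSizeBytes();
              while (usedBytes > budgetBytes && retained.size() > 1) {
                  MemStoreSnapshot oldest = retained.removeLast();  // evict oldest first
                  usedBytes -= oldest.heapSizeBytes();
              }
          }

          // The read path would consult these snapshots before falling back to HFiles.
          public synchronized Iterable<MemStoreSnapshot> snapshotsNewestFirst() {
              return new ArrayDeque<>(retained);
          }
      }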


          Activity

          Jean-Daniel Cryans added a comment -

          I'm wondering... if those Memstores are flushed because of HLogs, wouldn't HLog compactions (HBASE-3242) solve the issue more elegantly than special casing ICVs?

          Karthik Ranganathan added a comment -

          True. I mentioned the HLog limit because that is what we observed triggering the flushes, but this would address the underlying issue regardless of the reason for the flush. Additionally, this also makes it resilient in the face of compactions, which HLog compactions would not help with.

          HLog compactions would also be most effective for the ICV kind of workload (frequent updates to existing data) right?

          Jean-Daniel Cryans added a comment -

          Additionally, this also makes it resilient in the face of compactions, which HLog compactions would not help with.

          Yes, but if you don't flush then you don't compact, meaning it won't screw up the BC.

          HLog compactions would also be most effective for the ICV kind of workload (frequent updates to existing data) right?

          I'm pretty sure we both agree on that, and this jira is also about helping that case as far as I understand it.

          Kannan Muthukkaruppan added a comment -

          I think this scheme helps more than the ICV case. For example, workloads that mostly tend to access recent data. You still bound your recovery time by flushing the memstores into HFiles, but now continue to keep them around as a "read cache". [This scheme provides some of the benefits (granted, not all) of doing a "scan cache" (as described in the Bigtable paper), but with much less implementation complexity.]

          Kannan Muthukkaruppan added a comment -

          Typo:
          I think this scheme helps more than the ICV case.
          meant to say:
          I think this scheme helps more than just the ICV case.

          ryan rawson added a comment -

          What about the write-to-block-cache-on-HFile-write patch? Does that not help?

          Jonathan Gray added a comment -

          It does help. For flushes, the difference between cacheOnWrite and this is not that big. This helps mostly in the face of compactions, I think.

          One potential downside of keeping stuff in MemStore vs. the block cache via cacheOnWrite is the relative space efficiency. With a full increment workload, what I see are major reductions in storage going from MemStore -> block cache -> compressed files: approximately 128MB -> 32MB -> 2-3MB (so the block cache is 4X more efficient than MemStore at storing the same data, and compressed files another 10X beyond that).

          There's also the suspicion that I think many of us have that reads out of MemStore are actually slower than reads out of the block cache.

          I still think this is a really interesting potential direction but w/ CacheOnWrite and the difference in space efficiency, I think other optimizations may be better to target first.

          Kannan Muthukkaruppan added a comment -

          Ryan: If this happened only for recent HFiles or compactions of recent files, and not for, say, bigger compactions, then yes, the two schemes start to have more similarities. The trouble with writing to the block cache on all HFile creations (i.e. not just flushes but also all compactions) is that too much old data could be rewritten, and you might have storms that fully clear out items in the block cache. Jonathan has suggested knobs to throttle how much "write through" happens, but they are size-based rather than based on the recency of the data.

          But I agree your suggestion sounds like a viable alternative with the right tweaks.

          Karthik Ranganathan added a comment -

          Ryan: I was talking to Kannan about this as well. The only thing is that writing into the block cache on flushes works only for flushes. For compactions, it gets a bit complicated, and any algorithm will become somewhat dependent on the compaction policy.

          Paul Tuckfield added a comment -

          I see the logic behind compact memory and cacheOnWrite, but still, for some distributions of keys being updated, the memory tradeoffs can favor the memstore in terms of RAM consumption. I suppose the tradeoff point exists somewhere in a reasonable tuning range. So it seems like this gives the user control to understand their data locality and make tuning tradeoffs.

          If memstore reads are slower (presumably because of contention with writers to the memstore), that seems like a global problem, especially if check-and-miss is slow (I'm ignorant as to whether checking the existence of a key is as expensive as checking and reading the value). That's the first check any read must do: block cache, snapshot, or physical IO, they all check the memstore first, I think.

          I'd very much like to test this just with a boolean setting allowing a snapshot to remain in RAM until the next memstore must be converted to a snapshot. I suspect one memstore plus one snapshot gives most of the benefit, and is tunable via the existing memstore-size parameters. But maybe this could be a memstore plus N snapshots.

          Jonathan Gray added a comment -

          I actually disagree that the biggest benefit is one memstore plus a snapshot. That would cover flushes but not compactions. As stated, flushing with cacheOnWrite would be virtually the same but consume 25% of the memory. So for this case, I don't see a clear benefit of retaining the snapshot vs. cacheOnWrite of the flushed file.

          This change is significant and would require a good bit of modifications to the tracking of aggregate MemStore sizes and the rules around eviction when under global heap pressure. I still do like this idea in general but not sure it's the best direction for effort to be spent right now.

          Todd Lipcon added a comment -

          Here's a brainstormy idea: we don't like the cacheOnWrite with compactions because it tries to cache everything instead of just the warm keys. So our goal should be to figure out how to only cache blocks that contain the warm ones.

          What if we maintained a counting Bloom filter, periodically cleared, to determine which keys in the region are potentially hot? Then, as we flush those keys, their HFile blocks are the ones that get pre-cached.
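
          A rough illustration of this idea as a standalone sketch: a counting Bloom filter tracks recent accesses, is periodically cleared, and is queried at flush time to decide which blocks deserve pre-caching. The CRC-based hashing and all names here are placeholders, not HBase code:

          import java.util.Arrays;
          import java.util.zip.CRC32;

          public class WarmKeyFilter {

              private final int[] counters;
              private final int numHashes;

              public WarmKeyFilter(int numCounters, int numHashes) {
                  this.counters = new int[numCounters];
                  this.numHashes = numHashes;
              }

              // Record an access (e.g. an increment) against a row key.
              public synchronized void recordAccess(byte[] rowKey) {
                  for (int i = 0; i < numHashes; i++) {
                      counters[bucket(rowKey, i)]++;
                  }
              }

              // At flush time: does this key look hot enough to pre-cache its block?
              // The minimum over the hash buckets is an upper bound on the true count.
              public synchronized boolean isWarm(byte[] rowKey, int threshold) {
                  int min = Integer.MAX_VALUE;
                  for (int i = 0; i < numHashes; i++) {
                      min = Math.min(min, counters[bucket(rowKey, i)]);
                  }
                  return min >= threshold;
              }

              // Cleared periodically so the filter only reflects recent activity.
              public synchronized void clear() {
                  Arrays.fill(counters, 0);
              }

              private int bucket(byte[] key, int seed) {
                  CRC32 crc = new CRC32();
                  crc.update(seed);   // mix in the hash index so the k hashes differ
                  crc.update(key);
                  return (int) (crc.getValue() % counters.length);
              }
          }

          At flush (or compaction) time, an output block would then be pre-cached only if it contains at least one key for which isWarm returns true.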

          Jonathan Gray added a comment -

          I looked into doing some kind of intelligent block selection for caching on cacheOnWrite. It was not going to be simple. To start, I was thinking that I would re-cache blocks if the originating block(s) were already cached. If the originating blocks were not cached, I would skip caching those block(s) on write.
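
          A minimal sketch of that originating-block heuristic, with plain strings as block identifiers and a Set standing in for the real block cache (purely illustrative, not HBase code):

          import java.util.List;
          import java.util.Set;

          public class CompactionCacheHeuristic {

              // Cache a compaction output block only if at least one of the input
              // blocks it was built from was already present in the block cache.
              public static boolean shouldCacheOnWrite(List<String> originatingBlockIds,
                                                       Set<String> cachedBlockIds) {
                  for (String blockId : originatingBlockIds) {
                      if (cachedBlockIds.contains(blockId)) {
                          return true;
                      }
                  }
                  return false;
              }
          }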

          Counting bloom filter sounds interesting.

          If we can ever get fast local fs reads, seems like we wouldn't need to do cacheOnWrite because the recently written file would be in the fs cache?

          Todd Lipcon added a comment -

          If we can ever get fast local fs reads, seems like we wouldn't need to do cacheOnWrite because the recently written file would be in the fs cache?

          That should help, but also keep in mind that our block cache is post-decompression, so we'd still pay the decompression tax even if we're reading from OS cache, right?

          Jonathan Gray added a comment -

          Yes, but just for the first read. We'd then load it into the block cache. But in this way, we'd have "intelligent" selection of which blocks to cache (those that actually get used).

          Karthick Sankarachary added a comment -

          Just out of curiosity, is this issue still open? In other words, when we read from an HFile right after it has been flushed (or compacted), will that strictly be an in-memory call? If not, would the following approach address this issue (at the risk of sounding uneducated):

          • Define a Map<Path, BlockCache> in StoreFile that captures the BlockCache objects used by writes, regardless of whether the write was triggered by a flush or a compaction.
          • Look up the BlockCache from that map based on the StoreFile's Path at the time we create a reader for it, and use that instead of an empty BlockCache.

          Correct me if I'm wrong, but when "hbase.rs.cacheblocksonwrite" is true, we seem to be caching blocks on writes regardless of whether we're flushing or compacting. If that's already the case, we might as well make those block caches visible in the read path.
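
          A rough sketch of the registry idea, with a generic cache type standing in for the real BlockCache and no attempt to mirror the actual StoreFile API (names here are illustrative only):

          import java.nio.file.Path;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          public class WriteCacheRegistry<C> {

              private final Map<Path, C> cachesByPath = new ConcurrentHashMap<>();

              // Write path (flush or compaction) publishes the cache it populated.
              public void publish(Path storeFilePath, C cache) {
                  cachesByPath.put(storeFilePath, cache);
              }

              // Read path: when opening a reader, reuse the writer's cache if any.
              public C lookupOrDefault(Path storeFilePath, C emptyCache) {
                  return cachesByPath.getOrDefault(storeFilePath, emptyCache);
              }

              // Remove the entry when the store file goes away (e.g. after compaction).
              public void remove(Path storeFilePath) {
                  cachesByPath.remove(storeFilePath);
              }
          }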

          Yi Liang added a comment -

          Is there a patch for 0.90.3?


            People

            • Assignee: Unassigned
            • Reporter: Karthik Ranganathan
            • Votes: 2
            • Watchers: 7
