Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels:
      None

      Description

      Currently, our memstore flush algorithm is pretty trivial. We let a memstore grow to the flush size and then flush its region, or we let the HLog count grow past a threshold and then flush everything below a seqid. In certain situations, we can get big wins by being more intelligent with our memstore flush algorithm. I suggest we look into algorithms to intelligently handle HLog compactions. By compaction, I mean replacing existing HLogs with new HLogs created from the contents of a memstore snapshot. Situations where we can get huge wins:

      1. In the incrementColumnValue case, N HLog entries often correspond to a single memstore entry. Although we may have large HLog files, our memstore could be relatively small.
      2. If we have a hot region, the majority of the HLog consists of that one region and other region edits would be minuscule.

      In both cases, we are forced to flush a bunch of very small stores. It's really hard for a compaction algorithm to be efficient when it has no guarantee of the approximate size of a new StoreFile, so it currently does unconditional, inefficient compactions. Additionally, compactions & flushes suck because they invalidate cache entries, be it in the memstore or the LRU cache. If we can limit flushes to cases where we will have significant HFile output on a per-Store basis, we can get improved performance, stability, and reduced failover time.
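      To make case #1 concrete, here is a toy sketch (plain Java, not actual HBase code; all names are illustrative) of why increments inflate the HLog while the memstore stays small: every increment appends a new WAL entry, but the memstore keeps only the latest value per cell.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the incrementColumnValue skew: the WAL grows by one
// entry per edit, while the memstore holds a single cell per key.
// Hypothetical types, not real HBase classes.
class IcvSkewDemo {
    final List<String> wal = new ArrayList<>();         // one entry per edit
    final Map<String, Long> memstore = new HashMap<>(); // latest value per cell

    void increment(String cell, long delta) {
        long next = memstore.getOrDefault(cell, 0L) + delta;
        wal.add(cell + "=" + next); // WAL keeps every intermediate edit
        memstore.put(cell, next);   // memstore keeps only the final value
    }
}
```

      After 1000 increments of one counter, the WAL holds 1000 entries but the memstore holds one cell, so the log-count trigger fires long before a flush would produce a usefully sized StoreFile.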

        Activity

        Nicolas Spiegelberg added a comment -

        Request for comments. If someone feels strongly about this, they can feel free to run with the idea. I have enough on my plate in the short term.

        Jean-Daniel Cryans added a comment -

        I've had something like that in mind for a while now; good on you for creating the jira. IMO there are huge gains in what you described. Hopefully someone will someday have the time to make it happen.

        Nicolas Spiegelberg added a comment -

        IRC communication below. Highlights:

        1. ICV case is highest pri for a bunch of use cases. It would be simpler to implement. We could just address that first
        2. Make sure we properly handle replication with any changes
        3. 'compacting flush' capability would be another plus while we're digging into compaction code
        ------------------------------------------------
        [5:21pm] tlipcon: hlog compaction?
        [5:22pm] nspiegelberg: it's an idea that's been floating around in my head the past couple weeks.
        [5:22pm] tlipcon: makes some sense
        [5:22pm] nspiegelberg: the BigTable paper actually mentions that it compacts logs. that got me started thinking about the idea
        [5:22pm] tlipcon: just gonna be tricky... all these heuristics
        [5:22pm] dj_ryan: being fully persisistent and high speed are hard to get
        [5:23pm] dj_ryan: i was thinking it might be possible to improve the speed of log reply
        [5:23pm] dj_ryan: in which case we could have more outstanding logs
        [5:23pm] nspiegelberg: well I think the purpose is to efficiently address edge cases
        [5:23pm] tlipcon: right, log replay and splitting are both kind of slow
        [5:23pm] nspiegelberg: if log entries were uniformly distributed, what we have now is perfect
        [5:23pm] tlipcon: nspiegelberg: I wonder how much the "compacting flush" would buy us
        [5:24pm] tlipcon: what BT calls minor compactions
        [5:24pm] nspiegelberg: that's another idea that jgray advocates
        [5:25pm] nspiegelberg: really, we can implement a trivial HLog compaction that is only usefull for ICV applications and it would be greatly beneficial for us
        [5:33pm] apurtell: "a trivial HLog compaction that is only usefull for ICV applications" – seems a good start, we'd find that useful and so would stumble i believe
        [5:35pm] nspiegelberg: we already have practical need both cases for HLog compaction, but the ICV application is definitely higher priority.
        [5:38pm] nspiegelberg: yeah, the only thing I haven't researched is replication impact. I imagine that we could handle HLog compactions independently on each cluster. Then, flag the compacted HLogs and just not send them to the replica cluster.
        [5:39pm] dj_ryan: we might want some jd input
        [5:39pm] dj_ryan: but basically there is a 'read point' in a hlog for the replication sender
        [5:39pm] dj_ryan: so if you are compacting stuff that was already sent, we'll be ok
        [5:39pm] dj_ryan: and at the target, they'd do similiar things i guess
        [5:40pm] nspiegelberg: definitely. RFC. I have migration woes right now, but I wanted to get the idea out there and have it running through ppls heads

        stack added a comment -

        What about doing the following (I thought this was what you were describing, Nicolas, but it seems like you are describing something else). When we hit N HLogs, we currently flush those MemStores that hold the oldest edits. What if instead we ran through the oldest M HLogs and rewrote them all into one new HLog file, dropping edits that have already made it into StoreFiles: i.e. those whose seqid is less than the oldest unflushed seqid on the regionserver. We'd then atomically swap out the M old HLogs, moving them aside for replication or snapshotting or whatever, but the HRS would now replace the M HLogs w/ the newly written HLog in its HLog accounting. This would not be a compaction, more a cleaning process.
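        The cleaning pass above could look roughly like this (a sketch against an imaginary in-memory WAL, not the real HLog API; `Edit` and `clean` are hypothetical names): walk the oldest M logs, keep only edits at or above the oldest unflushed seqid, and emit one consolidated log to swap in.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "rewrite oldest M HLogs into one" cleaning pass.
// An edit whose seqid is below the oldest unflushed seqid has already
// been persisted in a StoreFile and can be dropped. Hypothetical types.
class HLogCleaner {
    static class Edit {
        final long seqid;
        final String payload;
        Edit(long seqid, String payload) { this.seqid = seqid; this.payload = payload; }
    }

    // Merge the given logs, dropping edits already covered by StoreFiles.
    static List<Edit> clean(List<List<Edit>> oldestLogs, long oldestUnflushedSeqid) {
        List<Edit> merged = new ArrayList<>();
        for (List<Edit> log : oldestLogs) {
            for (Edit e : log) {
                if (e.seqid >= oldestUnflushedSeqid) {
                    merged.add(e); // still needed for recovery
                }
            }
        }
        return merged; // the single replacement log to swap in atomically
    }
}
```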

        Jonathan Gray added a comment -

        I think compaction and cleaning are the same thing, no? Or at least the primary point of these things is to make them smaller / evict edits rather than reducing the number of files. Both are about rewriting the files to end up with smaller hlogs in the end.

        I'd say both of these optimizations could be done at the same time and both seem like a good idea.

        Nicolas's idea is mostly beneficial for increment-type workloads where you have a lot of updates and only care about the latest version. Stack's idea should have some impact on nearly all use cases, but it's not clear to me whether it would be a net win because of the added overhead. On under-utilized clusters with extra IO available, I imagine it would be a win. If you're at all IO bound, maybe not.

        Jonathan Gray added a comment -

        Reading my own comment, I was a little confused. I'm not saying reducing the number of files is not good or wouldn't happen, just that the real goal is reducing the need to evict hlogs. That is done by reducing the aggregate size of the hlogs (at least in principle... we use # of logs today but might need to tweak that with this stuff, since the real trade-off is recovery time, and that's largely related to size, not # of files).

        Nicolas Spiegelberg added a comment -

        @stack: That's exactly what I was thinking for use case #2. By compaction, I really meant the abstract principles of unification + pruning, which is what our HFile compactions do.

        I disagree that this technique only would give benefits to clusters with extra IO. If an HLog compaction can delay 10 StoreFiles from being created, that's 10 IO-heavy compactions that we're also delaying. Also, we have the minCompactThreshold to keep StoreFile count low in exchange for IO-inefficient compactions, which would be mitigated by HLog compaction. The main IO problem with HLog compactions vs. just creating new HFiles is compression, which we should be able to easily add to offlined HLogs.

        Nicolas Spiegelberg added a comment -

        So, talking about this internally: I wasn't 100% thinking along Stack's lines. My assumption is that we could support per-CF flushing of the MemStore and would therefore have a different seqno per CF to prune on. This would prevent a slow-growing CF from being flushed until it reaches a significant size.

        another thing to keep in mind:

        HFile compaction = read HFiles + merge + write new HFiles
        HLog compaction = snapshot MemStore + prune + write aggregate HLog

        so HLog compaction only adds write IO, not read IO. All said, Karthik's HBASE-3327 suggestion would be much easier to implement in the short term, since HLog compactions would require merging the snapshot MemStore + current MemStore after compaction has finished.
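        A sketch of the per-CF pruning described above (hypothetical names, not the HBase API): track the last flushed seqid separately for each column family, so an edit can be dropped from a compacted HLog only if its own family has already flushed past it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Per-column-family pruning: each CF flushes independently, so each has
// its own "flushed up to" seqid. An edit survives HLog compaction unless
// its CF has already flushed past its seqid. Illustrative types only.
class PerCfPruner {
    static class Edit {
        final String cf;
        final long seqid;
        Edit(String cf, long seqid) { this.cf = cf; this.seqid = seqid; }
    }

    static List<Edit> prune(List<Edit> log, Map<String, Long> flushedSeqidByCf) {
        List<Edit> kept = new ArrayList<>();
        for (Edit e : log) {
            long flushed = flushedSeqidByCf.getOrDefault(e.cf, -1L);
            if (e.seqid > flushed) {
                kept.add(e); // this CF has not flushed this edit yet
            }
        }
        return kept;
    }
}
```

        Under this scheme, a hot CF that flushes often lets most of its log entries be pruned, while a slow-growing CF keeps its few entries in the compacted log instead of forcing a tiny flush.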


          People

          • Assignee: Unassigned
          • Reporter: Nicolas Spiegelberg