HBase / HBASE-2353

HBASE-2283 removed bulk sync optimization for multi-row puts

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Hadoop Flags: Reviewed

      Description

      Prior to HBASE-2283, we used to call flush/sync once per put(Put[]) call (i.e. once per batch of commits). Now we do it for every row.

      This makes bulk uploads slower if you are using the WAL. Is there an acceptable solution that achieves both safety and performance by bulk-sync'ing puts? Or would this not work in the face of atomicity guarantees?

      discuss!
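
      For context, here is a minimal client-side sketch of the kind of multi-row put whose per-row WAL sync is at issue. This is only an illustration: it assumes the 0.90-era client API, and the table/column names are made up.

      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.util.Bytes;

      public class BulkPutExample {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "testtable");  // hypothetical table name
          List<Put> puts = new ArrayList<Put>();
          for (int i = 0; i < 10000; i++) {
            Put p = new Put(Bytes.toBytes("row-" + i));
            p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
            puts.add(p);
          }
          // One client call, but post-HBASE-2283 the region server appends and
          // syncs the WAL once per row rather than once per batch.
          table.put(puts);
        }
      }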

      Attachments

      1. HBASE-2353_def_log_flush.patch
        2 kB
        Jean-Daniel Cryans
      2. hbase-2353.txt
        32 kB
        Todd Lipcon

        Activity

        Andrew Purtell added a comment -

        I believe this also affects the 0.20 branch, as both group commit and HBASE-2283 were backported, so HBASE-2283 would have clobbered group commit on the branch as well.

        Kannan Muthukkaruppan added a comment -

        The important thing is that memstore edits should happen after append+sync.

        Currently, batch put is simply a loop around append/sync/memstore-edit per put.

        If we tried to move to a model where we first do the append for each row, then a common sync, and then all the memstore changes, we would end up having to "lock" all the rows for the entire duration (rather than the current model, which locks one row at a time).

        Also, the code structure would get uglier, I think – right now batch put is pretty much a thin wrapper around single Puts.

        This was a conscious change in HBASE-2283 to restore correctness of semantics, but I should perhaps have called it out explicitly.

        @Ryan: Is the bulk upload case now noticeably slower?

        @Andrew: You are right that this affects 0.20 also. But you might be mixing up the "group commit" and "multi-row put" terminology. Group commit should not have been clobbered by HBASE-2283. But yes, HBASE-2283 does remove, for correctness, the "batch sync" optimization in multi-row puts.
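
        To make the trade-off concrete, here is a toy, self-contained sketch of the two orderings (stand-in methods only, not the real HRegion code):

        import java.util.List;

        public class WriteOrderingSketch {
          // Current model: lock, append, sync, memstore-edit, unlock -- once per row.
          void perRowBatch(List<String> rows) {
            for (String row : rows) {
              lockRow(row);
              try {
                appendToWal(row);
                syncWal();             // one sync per row: safe but slow
                applyToMemstore(row);  // edit becomes visible only after the sync
              } finally {
                unlockRow(row);
              }
            }
          }

          // Batched model: append everything, sync once, then apply to memstore.
          // All rows must stay locked for the whole duration so nothing becomes
          // visible before it is durable.
          void bulkSyncBatch(List<String> rows) {
            for (String row : rows) lockRow(row);
            try {
              for (String row : rows) appendToWal(row);
              syncWal();               // single sync amortized over the batch
              for (String row : rows) applyToMemstore(row);
            } finally {
              for (String row : rows) unlockRow(row);
            }
          }

          // Stand-ins for the real WAL/memstore/row-lock machinery.
          private void lockRow(String row) {}
          private void unlockRow(String row) {}
          private void appendToWal(String row) {}
          private void syncWal() {}
          private void applyToMemstore(String row) {}
        }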

        Kannan Muthukkaruppan added a comment -

        Updated the title of the bug to be more specific.

        stack added a comment -

        @Ryan: What if, for batch put, we first wrote everything in the batch out to the WAL and then, per edit, called the individual put method with the do-not-WAL flag set to true? Would that get you your speed back? But how to treat errors? What do you tell the client if you've written the WAL but you fail to update memstore? How were errors treated previously? Seems like you have to reason about this stuff on a row-by-row basis?

        Todd Lipcon added a comment -

        Under what scenarios would you fail to update memstore? It seems to me that those scenarios necessitate a full RS stop.

        ryan rawson added a comment -

        The existing code has a return code interpreted as 'failed past index i' and the client will retry a number of times, so that might work.

        stack added a comment -

        Under what scenarios would you fail to update memstore? It seems to me that those scenarios necessitate a full RS stop.

        I suppose I was thinking the memstore update would fail because the RS had crashed/stopped. Can't think of any reason we'd partially fail. The client wouldn't get a return code even though edits had gone in, because the bulk put had not completed (the client would see an exception).

        Then there is the case where we add N of the M edits to the WAL file before we hit some HDFS issue that forces us to return to the client. In this case, wouldn't you have to report that the bulk put had completely failed, since no edits had made it to the MemStore?

        It seems like you have to process the bulk put row by row.

        ryan rawson added a comment -

        I have new numbers; basically my bulk puts are now much slower than previously. This is a killer for us. Single-thread import performance is now down to 2,000-6,000 rows/sec, from 16,000+.

        The first fix to this is to bring back deferred log flush. I have a forthcoming patch.

        Here are my arguments:

        • There is no multi-row atomicity guarantee. Having other clients see the partial results of your batch put is acceptable because that is our consistency model - per row. That is the de facto situation right now anyway.
        • If the call succeeds, then we expect the puts to be durable. By ensuring the syncFs() call returns before returning to the client, we have this.
        • Partial failure by exception leaves the HLog in an uncertain state. The client will not know how many rows were successfully made durable, and thus would be required to redo the put.
        • Partial "failure" by return code means only part of the rows were made durable and available to other clients. This is normal and covered by the above cases, I think.

        Given this, what makes the most sense? It seems like hlog.append() then syncFs() of all the puts, THEN memstore mutate is the way to go. In HRS.put our protection from 'over memory' is this call:

        this.cacheFlusher.reclaimMemStoreMemory();

        which will synchronously flush until we aren't going to go over memory. If we somehow fail to add to memstore, it would be an OOME, which would kill the RS anyway. Considering the data for the Put is already in memory and we are just adjusting data structure nodes, it seems unlikely that we'd be in this case often/ever.

        Kannan Muthukkaruppan added a comment -

        <<< The first fix to this is to bring back deferred log flush. I have a forthcoming patch. >>>

        How are the numbers looking with the posted patch (of not syncing if deferred log flush is set)?

        <<<Given this, what makes the most sense? It seems like hlog.append() then syncFs() of all the puts, THEN memstore mutate is the way to go. In HRS.put our protection from 'over memory' is this call :>>>

        Are you talking about how to improve things for batch puts in the default case (i.e. when deferred log flush is not set)? If so, can you refer to my update of <22/Mar/10 05:47 PM> – do you plan to lock all the rows for the entire duration? Or re-get the locks before doing the memstore mutates? If you re-get the row locks, I think you'll introduce other correctness issues.

        Kannan Muthukkaruppan added a comment -

        Saw JD's update on another JIRA. Looks like the default in trunk is that deferred log flush is set.

        So my <<<Are you talking about how to improve things for batch puts in the default case (i.e. when deferred log flush is not set)?>>> should simply read <<<Are you talking about how to improve things for batch puts when deferred log flush is not set?>>>

        ryan rawson added a comment -

        We can't do things like hold row locks for any substantial length of time; that introduces opportunities to deadlock.

        My original code patch moved the sync to the top level at the end of the put(Put[]) call. Maybe for this particular use case that might be the solution. It isn't perfect, but we don't want to cripple this use case (hbase is great if you can wait the months to load your data...)

        On Tue, Mar 23, 2010 at 6:20 PM, Kannan Muthukkaruppan (JIRA)

        Todd Lipcon added a comment -

        We can't do things like hold row locks for any substantial length of time; that introduces opportunities to deadlock.

        How so? Time of locking doesn't introduce deadlock, just order of locking. We could sort the rows first to avoid internal deadlock. Of course, the fact that we expose locks to users does break this - if a user has a lock on row C, and we try to lock A, B, C, we'll block on that row while holding the others. If a user locks in the "wrong order" the problem's even worse because we'd deadlock against the client.

        So, I don't think we can hold multiple row locks at the same time, no matter how short a period we do it for, assuming row locks continue to be user-exposed.

        Unfortunately, the opposite problem is just as bad... if we do log(a) memstore(a) log(b) memstore(b) syncAll(), then edit A becomes visible before it's synced, and that's a no-no.

        I don't have any good solutions, but here's a bad solution: HBASE-2332 (remove user-exposed row locks from region server). That JIRA could be made a little bit less drastic and say that user-exposed row locks are advisory locks (e.g. like flock) and they don't block IO (just other lock operations). That is to say, decouple the user-exposed locks from the internal locks needed for consistency purposes.

        Kannan Muthukkaruppan added a comment -

        <<< Maybe for this particular use case that might be the solution. It isn't perfect, but we dont want to cripple this use case (hbase is great if you can wait the months to load your data...>>>

        But that has the issue of memstore edits happening before sync, which we don't want.

        ryan rawson added a comment -

        Are we sure we don't want memstore edits happening before sync in a bulk load case?

        Are we sure that holding hundreds (or thousands) of row locks open for like 10-20 seconds is a good idea?

        I know people point to disabling the WAL completely to be fast, but I don't think we have to choose between fast (enough) and reasonably safe.

        Kannan Muthukkaruppan added a comment -

        Ryan: So for your use case, if you are planning to go with deferredLogFlush set to true, is this still a serious issue?

        ryan rawson added a comment -

        With deferred log flush we are in a better position. Do you think that we should provide a bulk commit code path that (sort of) mimics the group commit code path?

        Another thing to watch out for is people running the default config and then writing nasty email/twitter notes because our out-of-the-box performance isn't good unless you do tweaks X, Y, Z. It'd be nice to be fully performant without tweaking values such as deferredLogFlush.

        ryan rawson added a comment -

        Are you willing to cripple the performance of hbase put speed to ensure certain log atomic order of operation guarantees in a bulk insert case?

        On Tue, Mar 23, 2010 at 6:52 PM, Kannan Muthukkaruppan (JIRA)

        Todd Lipcon added a comment -

        Are you willing to cripple the performance of hbase put speed to ensure certain log atomic order of operation guarantees in a bulk insert case?

        I think it's crucial that correctness is an option. That is to say, I am strongly against any design that precludes correct operation of bulk puts (where correct is the set of guarantees we're talking about in HBASE-2294 - visible edits must be durable, etc). I also am totally with you that for cases like mapreduce bulk loads there should be a "throw caution to the wind and shove that data in as fast as our little hard drives can spin" option.

        Todd Lipcon added a comment -

        Another thing to watch out for is people running the default config and then writing nasty email/twitter notes because our out-of-the-box performance isn't good unless you do tweaks X, Y, Z

        I think this is simply a matter of education. I personally care less about the opinions of people on twitter than I do about correct operation. Correct operation has its costs. HBase has been "cheating" for some time and hence had great performance. I am certain that it's going to be slower once it's correct, and absolutely OK with that.

        Andrew Purtell added a comment -

        @Todd: Unfortunately the project will suffer if HBase is not as fast as reasonable for the default config. That is not strictly a technical consideration but is an important point. We have a bit of a PR battle going on because another project wants to be the prom queen. It is not possible to pretend that is not happening.

        Todd Lipcon added a comment -

        If the "other project" isn't correct, make the PR battle should be about that If the "other project" is correct and still faster, we shouldn't compete by cheating, we should optimize the correct path!

        stack added a comment -

        /me sorry, late to the game

        Correctness must be an option, if not the way we ship by default. We can choose not to ship it as the default, but hbase has to be able to be correct (I'm thinking that by default we might ship with deferred logging if it improves our speed some, but we must be clear to the user about the cost to them of not syncing each row mutation).

        Bulk put is a relatively new feature. Its addition made for some nice upload numbers but our write speed before bulk put had been fine, at least compared to the competition (See Y! paper).

        What if you added flags to the bulk put, Ryan, that allowed you to ask for a "sloppy" bulk put behavior for those of us who are fine redoing the bulk put if it doesn't all go in. The sloppy bulk put would write all to the WAL first (you might look at making yourself a special version of WALEdit and HLogKey for this case... would need special handling at split time, etc.), sync, and then do the memstore update (the latter could be done by calling single-row put with WAL set to false) w/o locking (or write the WAL afterward... though I think writing it first is better). By default the bulk put would run row-by-row: WAL, sync, memstore-update.
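
        For reference, the existing per-Put "do not WAL" knob such a flag would build on looks roughly like this on the client side (assuming the setWriteToWAL method of the 0.20/0.90 client; family/qualifier names are made up):

        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class SloppyPutExample {
          public static Put sloppyPut(String row) {
            Put p = new Put(Bytes.toBytes(row));
            p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            // Skip the WAL entirely for this edit: fast, but the edit is lost if
            // the region server dies before the memstore is flushed.
            p.setWriteToWAL(false);
            return p;
          }
        }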

        Andrew Purtell added a comment -

        I'm thinking that default we might ship with deferred logging if it improves our speed

        I was thinking the same with my above comment, and/or other similar trade offs as they are available.

        Todd Lipcon added a comment -

        I like the idea of flags to decide on when to give up correctness.

        Here's a pseudocode idea that may be able to maintain correctness and speed... just brainstorming (there may be flaws!)

        def bulk_put(rows_to_write):
          while not rows_to_write.empty():
            minibatch = []
        
            # all minibatches must get at least one row
            row = rows_to_write.take()
            row.lock()
            minibatch.append(row)
        
            # try to grab as many more locks as we can
            # without blocking (prevents deadlock)
            while not rows_to_write.empty():
              row = rows_to_write.peek()
              if row.trylock():
                rows_to_write.take()
                minibatch.append(row)
              else:
                break
            # we now have locks on a number of rows
            write_to_hlog(minibatch)
            sync_hlog()
            write_to_memstore(minibatch)
            unlock_all_rows(minibatch)
        

        Essentially the thought here is that we try to lock as many rows together as we can without risking deadlock. This algorithm is deadlock free because we'll never block on a lock while holding another. So in the uncontended case, this algorithm turns into the "lock all rows, write all to hlog, sync, write all to memstore" but in the pathological contended case it turns into a sync per row.

        A couple of variations are certainly available (e.g. looping all the way through rows_to_write, making a minibatch of anything lockable, might be better), but this general idea might be worth exploring?
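
        A slightly more concrete, self-contained Java rendering of the same idea (java.util.concurrent locks standing in for HBase row locks; the WAL/memstore calls are hypothetical stubs, not region server code):

        import java.util.ArrayList;
        import java.util.Deque;
        import java.util.List;
        import java.util.concurrent.locks.ReentrantLock;

        public class MiniBatchSketch {
          static class Row {
            final String key;
            final ReentrantLock lock = new ReentrantLock();
            Row(String key) { this.key = key; }
          }

          void bulkPut(Deque<Row> rowsToWrite) {
            while (!rowsToWrite.isEmpty()) {
              List<Row> minibatch = new ArrayList<Row>();

              // Every minibatch gets at least one row: block on its lock.
              Row first = rowsToWrite.poll();
              first.lock.lock();
              minibatch.add(first);

              // Grab as many more locks as we can without blocking (deadlock-free:
              // we never block on a lock while holding another).
              while (!rowsToWrite.isEmpty()) {
                Row next = rowsToWrite.peek();
                if (next.lock.tryLock()) {
                  rowsToWrite.poll();
                  minibatch.add(next);
                } else {
                  break;
                }
              }

              try {
                writeToHlog(minibatch);
                syncHlog();                  // one sync per minibatch, not per row
                writeToMemstore(minibatch);  // edits become visible only after the sync
              } finally {
                for (Row r : minibatch) r.lock.unlock();
              }
            }
          }

          // Hypothetical stubs standing in for the real WAL/memstore paths.
          private void writeToHlog(List<Row> batch) {}
          private void syncHlog() {}
          private void writeToMemstore(List<Row> batch) {}
        }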

        Kannan Muthukkaruppan added a comment -

        Having "durable" semantics as the default, and providing an optional "sloppy" overload of Put (which does the deferredLogFlush optimization) seems better to me.

        With regards to making the basic multi-row Put case faster, I guess it would help the single-threaded client case the most. If the client was already multi-threaded, then the group commit benefits would already be kicking in to amortize the cost of the syncs. Todd's suggestion of optimistically acquiring locks in batches of rows sounds good to me.

        Andrew Purtell added a comment -

        I disagree. I think higher performing options should be the default. I want durability as much if not more than others. However, the only users of out-of-the-box configurations are prototypers, evaluators, and benchmarkers (and in this last case only the naïve ones), and it is good strategy to seek to avoid being unnecessarily labeled slow again by new users. Any move into production requires some attention paid to configuration changes and tuning. As long as we provide clear guidance and detail what deferred log flushing trades away, we get the best result here in my opinion.

        Jonathan Gray added a comment -

        I agree with Andrew. We were always burdened in the Postgres world by having horrid out-of-the-box performance, but awesome out-of-the-box durability.

        Being able to adjust these things for performance vs durability guarantees is awesome.

        Whichever the default is, there must be ample documentation explaining the various knobs and various trade-offs. People running HBase without diving into those docs are far more likely to be testing for performance, not durability. They will also likely not be in a production environment, or on clusters large enough, running for long enough, that node failure is likely.

        Todd Lipcon added a comment -

        We've derailed this JIRA a fair amount, but I'll add to the pile. One solution might be to ship with example configurations, one for speed, one for correctness, and let users pick at setup time which they want to base their -site off of.

        Jonathan Gray added a comment -

        There are probably some jiras lying around for some kind of CLI that would generate configuration after taking some user input about their requirements, cluster information, etc... Could boil that into there if it ever gets done. But since those things are always nice-to-haves that rarely get built, shipping with a few sample hbase-site's could be a good solution.

        Part of the initial setup in the docs would be to decide which to use and rename it to hbase-site.xml.

        Even still, there will need to be defaults, so can we still fight about that?

        Jonathan Gray added a comment -

        HBASE-1173 was what I was thinking of I guess. Has been closed.

        Andrew Purtell added a comment -

        Well-worked example configs or some kind of wizard would be great to have, but to get back on track a little, that means sufficient flexibility in the implementation to support those options. The proposal above to turn features on and off on the write path according to flags is a good one, a generic and extensible mechanism that can support many different policies. And, at no time should we accept a change that permanently reduces performance without providing a credible alternative, unless there is no such option.

        Kannan Muthukkaruppan added a comment -

        @Andrew/Jonathan: I see your point of view. The default isn't a major point of contention for me – I was just stating a personal preference, and it wouldn't bother me a whole lot either way assuming the choices are well documented.

        Todd Lipcon added a comment -

        Andrew, I agree with you in sentiment, but I also want to weigh in code complexity and QA. Specific to this issue (trying to drag this runaway train back!) I think it is very dangerous to have a configuration option that changes the order in which key operations occur. It makes it very tough to reason about how different processes will affect the system, especially if this sort of config option proliferates. And the matrix of configurations that we'll have to test explodes in a really bad way.

        Andrew Purtell added a comment -

        @Todd: At least in the scope of this jira the choice is between two sets of behaviors that are not difficult to explain or reason about in my opinion. Allowing some reordering (or even violation in consequence of failure) for the pretty specific use case of bulk importing, or more generally, high speed insertion of regeneratable data, I think is ok. We should have the option.

        Karthik Ranganathan added a comment -

        Jumping in late here... just wanted to throw in my opinion.

        I feel that having the option to configure the behavior is good. I also feel that we should make correctness the default - because it takes some amount of working knowledge to differentiate between the two. When I think of any DB (whose internals I do not know), I always assume that it preserves data. And I almost always expect to tweak some settings to get better performance if it does not meet my needs - but I do not expect to have to tweak something to get absolute data correctness.

        Another fallout of this "sloppy" option is that there is a possibility of data changing from underneath the application using it. The memstore may return a certain value when the application queries it, then the region server goes down, replays the log, and now the application may get a different answer (in this case the correct one). While OK for the most part, it may not play nice with some application not aware of this. And it's pretty hard to debug as well.

        Just my 2 cents.

        Andrew Purtell added a comment -

        I like Todd's suggestion to have multiple example configs tuned for different trade offs that might be useful for different use cases but also vetted to be sane. I think that can satisfy everyone.

        Jonathan Gray added a comment -

        +1 on shipping with multiple example configs. As long as the getting-started docs are clear, I don't care so much what the shipped-with defaults are and am fine w/ Karthik's take that we should ship w/ correctness/durability.

        Andrew Purtell added a comment -

        +1 on shipping with multiple example configs.

        I think that is settled.

        What about Ryan's original point that bulk importers now only have the option to turn off writes to the WAL? Can we get a performance option and correctness? What about the "minibatching" concept?

        Todd Lipcon added a comment -

        I think the minibatching should actually give us both. Just need to find some time to write the code and give it a benchmark!

        Jonathan Gray added a comment -

        I like the optimistic mini-batching concept as a potential performance+correctness solution.

        I think in many heavy-write scenarios, there is a lot of region concurrency, but often little or no per-row concurrency. Even for non-import situations like a web application with user activity writes, you don't expect much row contention.

        Jean-Daniel Cryans added a comment -

        Marking as blocker.

        Testing trunk with PE seqWrite, it's now a bit more than 5x slower (what took 55 secs now takes 320). Deferred log flushing would help here, but it would still be slower than the bulk sync optimization we had. This is a huge performance regression; even if they get durability, some of our users will see MR jobs that took maybe an hour now take more than 5... for a variety of reasons this is enough to make this issue a blocker.

        I think shipping with configs is good, but it won't solve this problem.

        This mini-batching solution sounds awesome, unsure how soon we can get it though.

        Like Ryan was initially saying, bringing deferred log flush to 0.20 would be an easy task since it's a few lines to fix. The issue then is to decide whether we want to ship with this turned on or off by default (we already had a vote on this issue for trunk in November; we decided to enable it by default for all tables). Also, if we turn this on, how big would the window be (currently 1 second in trunk)?

        I would like to point out that the MySQL binary log isn't flushed for every edit by default. See http://dev.mysql.com/doc/refman/5.0/en/binary-log.html, grep for "sync_binlog". We can't rely on HDFS to flush the HLog, so we do it with the polling timeout; also, we already force-flush catalog edits, and tables with deferred log flush disabled flush other edits. We could set a very small window, say 100ms?, and everyone is free to change it for their own tables.
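
        For illustration, opting a table into deferred log flush would look roughly like this; a hedged sketch assuming the table-level flag in trunk (HTableDescriptor API of the 0.90 era; the table name is made up):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.HColumnDescriptor;
        import org.apache.hadoop.hbase.HTableDescriptor;
        import org.apache.hadoop.hbase.client.HBaseAdmin;

        public class DeferredFlushExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTableDescriptor desc = new HTableDescriptor("bulk_load_table"); // hypothetical
            desc.addFamily(new HColumnDescriptor("f"));
            // Trade a small window of durability for throughput: the WAL is synced
            // on a timer rather than on every edit for this table.
            desc.setDeferredLogFlush(true);
            new HBaseAdmin(conf).createTable(desc);
          }
        }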

        Todd Lipcon added a comment -

        I would like to point out that the MySQL binary log isn't flushed for every edit by default.

        FWIW the binlog in MySQL is used for point-in-time recovery from backups, and for replication, but it's not the analogue of the HLog. The InnoDB transaction log is the thing like the HLog, and its default is to sync on every commit (but it also allows tuning to be periodic).

        Andrew Purtell added a comment -

        I am seeing similar performance-killing effects on the write path benching up on EC2. I was so concerned I have switched from m1.large to c1.xlarge and am currently getting a new baseline using the larger instance types, presumably with better i/o characteristics, with 0.20.3 and 0.20.4 with dfs.support.append=false. When I switch to dfs.support.append=true I expect to go off a cliff.

        Todd Lipcon added a comment -

        BTW, regarding defaults, I'm OK with deferred flush on or off, so long as it's very clearly documented in the "getting started" guide, and we provide an example config for "absolute durability" as well.

        ryan rawson added a comment -

        I think Todd was going to try implementing his algorithm above. Let's see how that looks.

        On Apr 5, 2010 11:22 AM, "Jean-Daniel Cryans (JIRA)" <jira@apache.org> wrote:


        Jean-Daniel Cryans added a comment -

        Patch to fix deferred log flush in trunk; HBASE-2283 stripped the calls. I also set it to false by default, so less durability would be an option.

        stack added a comment -

        Bulk move of 0.20.5 issues into 0.21.0 after vote that we merge branch into TRUNK up on list.

        Todd Lipcon added a comment -

        I'm going to take a stab at the optimistic mini-batching technique suggested above.

        Andrew: did you get any final numbers with your EC2 testing of my HDFS-side sync parallelization? In my tests here I saw similar performance on trunk vs 20 when my patches were included, but haven't done a real rigorous comparison.

        Andrew Purtell added a comment -

        Todd: So much has been in flux, we've been waiting for it to settle.

        HBase Review Board added a comment -

        Message from: "Todd Lipcon" <todd@cloudera.com>

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/
        -----------------------------------------------------------

        Review request for hbase, Kannan Muthukkaruppan and Ryan Rawson.

        Summary
        -------

        I implemented the "mini batching" idea we talked about on the JIRA.

        This currently breaks some of the error handling, so I don't intend to commit as-is, but everyone is busy so I wanted to put a review up now while I tidy up the rest.

        This addresses bug HBASE-2353.
        http://issues.apache.org/jira/browse/HBASE-2353

        Diffs


        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 6b6d098
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java a1baff4
        src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 034690e

        Diff: http://review.hbase.org/r/167/diff

        Testing
        -------

        Some PerformanceEvaluation (PE) runs on a real sync-enabled cluster; it seems faster, but I haven't done scientific benchmarking.

        Thanks,

        Todd

        HBase Review Board added a comment -

        Message from: stack@duboce.net

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review181
        -----------------------------------------------------------

        Ship it!

        Looks good to me. Minor comments below.

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment848>

        Should this be public? Isn't it just used internally?

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment847>

        Is this a copy?

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment849>

        You were going to replace these w/ something from guava (or is this it?)

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment850>

        Same here

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment851>

        This creates new Map, pass in Map.Entry instead?

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment852>

        I hate that this is even an option (smile)

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment853>

        w can never be null here? (There was a null check previously)

        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        <http://review.hbase.org/r/167/#comment854>

        Missing javadoc on new param

        • stack
        HBase Review Board added a comment -

        Message from: "Todd Lipcon" <todd@cloudera.com>

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1439

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1439>

        >

        > Should this be public? Isn't it just used internally?

        good call, will make it package private

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1442

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1442>

        >

        > Is this a copy?

        Yep. I did it the simple way following the pseudocode in the JIRA, but we can definitely do it more efficiently by indexing into a list or arraylist. My bet is that the efficiency gains are marginal compared to the cost of actually writing to the WAL, etc, and this patch should already be a big gain over what we used to have.

        After this is committed let's file a followup JIRA to remove extra shallow copies. I'll add a TODO in the code also.

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1467

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1467>

        >

        > You were going to replace these w/ something from guava (or is this it?)

        The guava Preconditions thing is supposed to be more for parameter checking, etc, than for internal assertions. Currently these don't get run, but I think we should enable -ea at least for unit tests.
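        To illustrate the distinction being drawn here (hypothetical method, not HBase code): Guava Preconditions always run and guard caller-supplied arguments, while plain asserts guard internal invariants and only fire under -ea.

            import com.google.common.base.Preconditions;

            class BatchArgsExample {
              void putRows(byte[][] rows, int firstIndex) {
                // Argument checks: always on, fail fast on bad caller input.
                Preconditions.checkNotNull(rows, "rows must not be null");
                Preconditions.checkArgument(firstIndex >= 0, "firstIndex was %s", firstIndex);

                // Internal invariant: only checked when the JVM runs with -ea,
                // e.g. under Surefire in unit tests.
                assert firstIndex <= rows.length : "index ran past the batch";
              }
            }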

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1482

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1482>

        >

        > This creates new Map, pass in Map.Entry instead?

        I thought both .keySet() and .entrySet() were just views into the existing map? (and p.getFamilyMap() just returns a member)

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1777

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1777>

        >

        > w can never be null here? (There was null check previous)

        yea, since it's assigned first in the try block, and that function doesn't throw exceptions.

        • Todd

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review181
        -----------------------------------------------------------

        HBase Review Board added a comment -

        Message from: stack@duboce.net

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1442

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1442>

        >

        > Is this a copy?

        Todd Lipcon wrote:

        Yep. I did it the simple way following the pseudocode in the JIRA, but we can definitely do it more efficiently by indexing into a list or arraylist. My bet is that the efficiency gains are marginal compared to the cost of actually writing to the WAL, etc, and this patch should already be a big gain over what we used to have.

        After this is committed let's file a followup JIRA to remove extra shallow copies. I'll add a TODO in the code also.

        Fine by me

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1467

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1467>

        >

        > You were going to replace these w/ something from guava (or is this it?)

        Todd Lipcon wrote:

        The guava Preconditions thing is supposed to be more for parameter checking, etc, than for internal assertions. Currently these don't get run, but I think we should enable -ea at least for unit tests.

        Never mind. I just saw this: "By default, Surefire enables JVM assertions for the execution of your test cases." So keep your asserts as is (we should all take them on).

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1482

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1482>

        >

        > This creates new Map, pass in Map.Entry instead?

        Todd Lipcon wrote:

        I thought both .keySet() and .entrySet() were just views into the existing map? (and p.getFamilyMap() just returns a member)

        Looks like I'm wrong, at least as regards JDK 1.6. Internally it seems to use entrySet. Below is from java.util.AbstractMap:

        public Set<K> keySet() {
            if (keySet == null) {
                keySet = new AbstractSet<K>() {
                    public Iterator<K> iterator() {
                        return new Iterator<K>() {
                            private Iterator<Entry<K,V>> i = entrySet().iterator();
                            ....

        • stack

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review181
        -----------------------------------------------------------

        HBase Review Board added a comment -

        Message from: "Todd Lipcon" <todd@cloudera.com>

        On 2010-06-11 09:59:29, stack wrote:

        > src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java, line 1439

        > <http://review.hbase.org/r/167/diff/1/?file=1237#file1237line1439>

        >

        > Should this be public? Isn't it just used internally?

        Todd Lipcon wrote:

        good call, will make it package private

        Actually, all the other put methods are also public, even though HRegion itself is basically internal to the RS

        • Todd

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review181
        -----------------------------------------------------------

        HBase Review Board added a comment -

        Message from: "Todd Lipcon" <todd@cloudera.com>

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/
        -----------------------------------------------------------

        (Updated 2010-06-11 15:18:10.087969)

        Review request for hbase, Kannan Muthukkaruppan and Ryan Rawson.

        Changes
        -------

        Significantly changed patch - now batch put returns a list of status codes. Also added some tests to make sure all the behavior is correct with locking, bad families, etc.

        I'd like to change HRegionInterface and HTable to return lists of status codes at some point too, but I think that can wait for a later patch. This one at least gets our performance back up by batching the syncs.

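        To make the "list of status codes" shape concrete, a hedged sketch of the batched entry point — the actual enum and signature in the committed patch may differ:

            interface BatchingRegion {
              enum OpStatus { SUCCESS, BAD_FAMILY, FAILURE }

              // One status per input Put, in order: a row that fails (for example an
              // unknown column family) no longer has to abort the whole batch.
              OpStatus[] put(org.apache.hadoop.hbase.client.Put[] puts);
            }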
        Summary
        -------

        I implemented the "mini batching" idea we talked about on the JIRA.

        This currently breaks some of the error handling, so I dont intend to commit as is, but everyone is busy so wanted to put a review up now while I tidy up the rest.

        This addresses bug HBASE-2353.
        http://issues.apache.org/jira/browse/HBASE-2353

        Diffs (updated)


        src/main/java/org/apache/hadoop/hbase/HConstants.java 1e59533
        src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 62617ac
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java adc505b
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 541ec9b
        src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java ba04820
        src/test/java/org/apache/hadoop/hbase/MultithreadedTestUtil.java 870f925
        src/test/java/org/apache/hadoop/hbase/TestAcidGuarantees.java 8a5206c
        src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java e75767d

        Diff: http://review.hbase.org/r/167/diff

        Testing
        -------

        Some PEs on a real sync-enabled cluster, seems faster but haven't done scientific benchmarking.

        Thanks,

        Todd

        HBase Review Board added a comment -

        Message from: stack@duboce.net

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review197
        -----------------------------------------------------------

        Ship it!

        Looks good to me. No comments.

        • stack
        HBase Review Board added a comment -

        Message from: "Ryan Rawson" <ryanobjc@gmail.com>

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review204
        -----------------------------------------------------------

        generally looks interesting and sound. This may need to be reconciled with http://review.hbase.org/r/151/ - batch operations for put/delete/get.

        • Ryan
        stack added a comment -

        @Todd Translating the above comments from myself and Ryan, I think that's two +1s you've got there.

        Todd Lipcon added a comment -

        Committed to trunk. Thanks for review, Stack and Ryan.

        Todd Lipcon added a comment -

        Attaching patch I committed (r2 from review board)

        HBase Review Board added a comment -

        Message from: "Kannan Muthukkaruppan" <kannan@facebook.com>

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review247
        -----------------------------------------------------------

        Todd: I know this is already reviewed and committed. I just got around to reviewing this and the related HBase-2733. Looks really good!!

        One side effect of this change is that now a single WALEdit could contain data for more than 1 row. Previously all KVs inside a WALEdit corresponded to the same row. I don't think it should cause any problems, but thought I should call it out in case others can think of potential issues.

        • Kannan
        HBase Review Board added a comment -

        Message from: stack@duboce.net

        On 2010-06-16 20:24:22, Kannan Muthukkaruppan wrote:

        > Todd: I know this is already reviewed and committed. I just got around to reviewing this and the related HBase-2733. Looks really good!!

        >

        > One side effect of this change is that now a single WALEdit could contain data for more than 1 row. Previously all KVs inside a WALEdit corresponded to the same row. I don't think it should cause any problems, but thought I should call it out in case others can think of potential issues.

        >

        I think it's going to be alright, Kannan. The patch over in HBASE-1025, the fixup to the replay of split edits, applies the KVs it finds in a WALEdit value one at a time. Since they are for different rows but of the same region, it should be fine, I'd say.
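        A rough sketch of why multi-row WALEdits are harmless on replay, assuming replay hands each KeyValue back individually (the helper name is a placeholder, not the HBASE-1025 code):

            import java.util.List;
            import org.apache.hadoop.hbase.KeyValue;

            abstract class ReplaySketch {
              abstract void restoreToMemstore(KeyValue kv);   // placeholder for the real restore path

              void replay(List<KeyValue> editKvs) {           // the KVs carried by one WALEdit
                for (KeyValue kv : editKvs) {
                  // Each KV names its own row, family, qualifier and timestamp, so it lands
                  // in the right place no matter which row its neighbours belong to.
                  restoreToMemstore(kv);
                }
              }
            }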

        • stack

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.hbase.org/r/167/#review247
        -----------------------------------------------------------


          People

          • Assignee:
            Todd Lipcon
          • Reporter:
            ryan rawson
          • Votes:
            0
          • Watchers:
            5
