Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The flow during an HRegionServer.put() seems to be the following. [For now, let's just consider a single-row Put containing edits to multiple column families/columns.]

      HRegionServer.put() does a:

          HRegion.put();
          syncWal();    /* the HDFS sync call; this assumes we have HDFS-200 */

      HRegion.put() does a:

          for each column family {
            HLog.append(all edits for the column family);
            write all edits to Memstore;
          }

      HLog.append() does a:

          foreach edit in a single column family {
            doWrite();
          }

      doWrite() does a:

          this.writer.append();

      There seem to be two related issues here that could result in inconsistencies.

      Issue #1: A put() does a bunch of HLog.append() calls. These in turn do a bunch of "write" calls on the underlying DFS stream. If we crash after having written out some appends to DFS, recovery will run and apply a partial transaction to the memstore.

      Issue #2: The updates to memstore should happen after the sync rather than before. Otherwise, there is the danger that the write to DFS (sync) fails for some reason & we return an error to the client, but we have already taken edits to the memstore. So subsequent reads will serve uncommitted data.
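      A minimal sketch of the ordering these two issues point toward (illustrative only; appendAllToWal, syncWal and addToMemstore are hypothetical stand-ins for the HLog.append(), sync and Memstore calls):

          // Sketch: do all WAL work for the row before any MemStore update.
          void putRowAtomically(List<KeyValue> editsForRow) throws IOException {
            appendAllToWal(editsForRow);  // one logical append for the whole row (Issue #1)
            syncWal();                    // force to DFS; on failure nothing is in the MemStore yet
            for (KeyValue kv : editsForRow) {
              addToMemstore(kv);          // apply only after a successful sync (Issue #2)
            }
          }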

      1. rowLevelAtomicity_2283_v3.patch
        37 kB
        Kannan Muthukkaruppan
      2. rowLevelAtomicity_2283_v2.patch
        36 kB
        Kannan Muthukkaruppan
      3. rowLevelAtomicity_2283_v1.patch
        31 kB
        Kannan Muthukkaruppan


          Activity

          Kannan Muthukkaruppan added a comment -

          Followup discussions on the hbase-dev:

          JD wrote: <<< Indeed. The syncWal was taken back up in HRS as a way to optimize batch Puts but the fact it's called after all the MemStore operations is indeed a problem. I think we need to fix both (#1) and (#2) by ensuring we do only a single append for whatever we have to put and then syncWAL once before processing the MemStore. But, the other problem here is that the row locks have to be taken out on all rows before everything else in the case of a Put[] else we aren't atomic. And then I think some checks are ran under HRegion that we would need to run before everything else.>>>

          Ryan wrote: <<< Do we really need a single actual DFS atomic write operation? If we had some kind of end-of-row marker, would that help instead?>>>

          Yes, a marker or length-prefixed approach would suffice to recognize and ignore incomplete transactions during recovery.

          Ryan wrote: <<< But as you said, what happens if hlog append fails? The obvious thing would be to remove the additions from the memstore. But how to accomplish this easily?>>>

          Wouldn't moving all the memstore updates to happen after the sync suffice?

          stack added a comment -

          So, we should pull WAL.append out of Store and instead do the WAL appending up in the HRegionServer: add all edits for a row to the WAL before proceeding, column family by column family, to add the edits to each MemStore.

          stack added a comment -

          Moving into 0.20.4.

          dhruba borthakur added a comment -

          >Issue #2: The updates to memstore should happen after the sync rather than before.

          A related question is: what should the region server do if a write/sync to the HLog fails? A simple option would be to shut itself down. Another option would be to roll the logs?

          Kannan Muthukkaruppan added a comment -

          If I recall correctly, on write failure the log is indeed already rolled, and an exception is thrown to the client (for the failure of the current transaction). I would have to check what we do if sync fails. But in either case, rolling the logs seems like a good option. Shutting down the server might be a more heavy-handed option.

          Was the thought that if we went with "shutting down the server" option, we could punt on Issue #2? My guess is the refactoring required for Issue #1 will make it easy to fix #2 also as part of the changes.

          Kannan Muthukkaruppan added a comment -

          Regarding Issue #1: It seems like all edits for a row (including any start/end txn markers) should in fact happen in one call to WAL append; otherwise, you can have interleaved edits in the log coming from multiple transactions. Agreed?

          dhruba borthakur added a comment -

          If we say that we restart the region server when a write/sync to the region log fails, then we can defer a fix for #2. If we do that, then we do not need any refactoring of the code at all. We can solve #1 by putting a sync marker at the end of each transaction and making the HLog reader handle this marker correctly (details not yet worked out).

          dhruba borthakur added a comment -

          Oh, actually we have to do the refactoring; otherwise we could satisfy a read request from an uncommitted transaction. So, my above comment is not applicable.

          Jean-Daniel Cryans added a comment -

          > Oh, actually we have to do the refactoring, otherwise we could satisfy a read request from an uncommitted transaction.

          WRT serving stuff that's not committed, I'd say it's more in the scope of HBASE-2248 since we don't serve from the WAL. In the case of a region server failure, then like you said, if we don't encounter the marker with HLog.Reader then those edits won't be replayed?

          Kannan Muthukkaruppan added a comment -

          Have coded the changes so that all the "appends" happen first, then the "sync", and then all the memstore edits.

          Now, trying to work on collapsing all the appends into a single append operation. The marker can simply be the number of edits we expect to find in the append operation.

          Today, if a transaction contains 3 edits to c1, c2, c3 for row R, the HLog has three log entries:

          <logseq1-for-edit1>:<KeyValue-for-edit-c1>
          <logseq2-for-edit2>:<KeyValue-for-edit-c2>
          <logseq3-for-edit3>:<KeyValue-for-edit-c3>

          In the new model, the plan is to have something like:

          <logseq#-for-entire-txn>:<3, <Keyvalue-for-edit-c1>, <KeyValue-for-edit-c2>, <KeyValue-for-edit-c3>>

          Implementing the change itself should be reasonably straightforward.
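          For illustration, a rough sketch of how such a grouped entry might be serialized (the field layout and method name are assumptions, not the actual patch): the leading count doubles as the marker, so a reader that cannot read the advertised number of KeyValues knows the trailing transaction is incomplete.

              // Sketch: one log value carrying every edit of the transaction.
              // 'out' is the DataOutput of the underlying log writer.
              void writeGroupedEdits(DataOutput out, List<KeyValue> edits) throws IOException {
                out.writeInt(edits.size());            // number of edits the reader should expect
                for (KeyValue kv : edits) {
                  out.writeInt(kv.getLength());        // each KeyValue is length-prefixed
                  out.write(kv.getBuffer(), kv.getOffset(), kv.getLength());
                }
              }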

          But one quick question: In the new version, I presumably have to worry about supporting both the old & the new log format, correct? (i.e. for cases when someone upgraded from an older version of HBase to this version)

          stack added a comment -

          > In the new version, I presumably have to worry about supporting both the old & the new log format, correct? (i.e. for cases when someone upgraded from an older version of HBase to this version)

          Well, a clean shutdown should have no outstanding WAL logs since, as part of the shutdown, we flush memstores. Would it be hard reading both types? I think that'd be the best thing to do if it's not too hard. Otherwise, you could fail the startup if you find old-style files (make it easy on yourself and give the new-style files a different name format) and have the admin run a small script to convert old-style to new?

          Kannan Muthukkaruppan added a comment -

          Thanks for your input.

          Have made the code changes to support this in an upward-compatible way. The serialized format of a KeyValue starts with an "int" length. We now overload that length for versioning: if the length is the special value -1, the rest of the data is interpreted in the new format; otherwise it is interpreted in the old format.

          An HLog entry could now either be:

          <HLogKey>:<KeyValue>
          or,
          <HLogKey>:<-1 <# of edits, <KeyValue1>, <KeyValue2>, ...>>

          I think we have the overall fix pretty much code complete. Will put up the patch after some basic testing (hopefully by tomorrow). And then will continue more detailed testing in parallel.
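          A rough sketch of what such a versioned readFields() could look like (field and helper names are assumptions, not the actual patch); it also clears any state left over when the instance is reused by the sequence file reader, and it prefixes the grouped form from the earlier write sketch with the -1 marker:

              // Sketch of a backward-compatible readFields() for the WAL value.
              // Old format:  <KeyValue>                         (starts with its positive int length)
              // New format:  -1, <# of edits>, <KeyValue>, <KeyValue>, ...
              public void readFields(DataInput in) throws IOException {
                kvs.clear();                            // the instance may be reused across reads
                int versionOrLength = in.readInt();
                if (versionOrLength == -1) {            // new, grouped format
                  int numEdits = in.readInt();
                  for (int i = 0; i < numEdits; i++) {
                    kvs.add(readKeyValue(in, in.readInt()));
                  }
                } else {                                // old format: the int we read was the KeyValue length
                  kvs.add(readKeyValue(in, versionOrLength));
                }
              }

              // Hypothetical helper: read 'length' bytes and wrap them in a KeyValue.
              private KeyValue readKeyValue(DataInput in, int length) throws IOException {
                byte[] bytes = new byte[length];
                in.readFully(bytes);
                return new KeyValue(bytes, 0, length);
              }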

          Kannan Muthukkaruppan added a comment -

          Attaching patch for initial review.

          • The order is now append/sync/memstore for all edits (such as Put, checkAndPut, Delete, etc. operations).
          • Also, edits within a single Put (row) are appended as a single edit (KeyValueList) to the WAL. This ensures recovery doesn't apply partial transactions.
          • I still need to make one small change to recovery: if we get an error while reading the WAL due to a partial txn at the end of the WAL (i.e. an incomplete HLogKey/KeyValueList), then ignore the error/partial txn (see the sketch after this list). [This should happen, if at all, only for the last transaction in a WAL.]
          • I manually tested killing the region server, and ensured that edits were recovered correctly.
          • I still need to run other unit tests, and need to add unit tests for these cases.
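          A rough sketch of that recovery behavior (illustrative; 'reader', 'key', 'edits' and applyEdits are stand-ins for the actual split/replay code): an EOF in the middle of the last entry is treated as the end of the usable log rather than a fatal error.

              // Sketch: replay loop that tolerates a truncated trailing transaction.
              while (true) {
                try {
                  if (!reader.next(key, edits)) {
                    break;                 // clean end of log
                  }
                } catch (EOFException e) {
                  // The writer died mid-entry; only the last transaction can be affected.
                  LOG.warn("Ignoring incomplete trailing WAL entry", e);
                  break;
                }
                applyEdits(key, edits);    // replay only complete entries
              }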

          (Thanks to Aravind/Karthik/Dhruba for discussions/suggestions on the issue & fix.)

          stack added a comment -

          Here's some comments.

          I think KeyValueList is not the best name for a List that is purposed for appending to a WAL log. KVL implies List<KeyValue>, which it replaces in your code, but in fact it does much more. For one, it's a Writable. It also takes care of your fancy new transaction start/end markings. Rename it WALEdit or HLogEdit or something? Should it also be moved to the wal package? (o.a.h.h.regionserver.wal).

          Minor: Should it implement List<KeyValue>? Would that help? You wouldn't need to have a getList?

          I love the class comment on KVL.

          readNonLengthData should instead be an overloading of readFields, just add the length param?

          Minor: The following if (kvlist.size() > 0) { is usually cheaper if done as !kvlist.isEmpty(), IIRC, especially if the list is one of the concurrent implementations where the size calculation is expensive (probably not pertinent here).

          It's interesting: our number of entries in the HLog will change now that you group up puts, etc.

          Patch looks great Kannan.

          Kannan Muthukkaruppan added a comment -

          Stack: thanks for the comments. I will take care of them and resubmit a patch along with a few other test changes I had to make.

          Kannan Muthukkaruppan added a comment -

          Currently, with my patch, TestGetClosestAtOrBefore:testUsingMetaAndBinary() (in regionserver) is broken.

          Debugged this a bit, and it seems that my change has somehow broken the interaction between the scanner & delete. What's the expected semantics when a delete happens in the middle of a scan, as the test does here:

              byte [] firstRowInC = HRegionInfo.createRegionName(Bytes.toBytes("" + 'C'),
                HConstants.EMPTY_BYTE_ARRAY, HConstants.ZEROES);
              Scan scan = new Scan(firstRowInC);
              s = mr.getScanner(scan);
              try {
                List<KeyValue> keys = new ArrayList<KeyValue>();
                while (s.next(keys)) {
                  mr.delete(new Delete(keys.get(0).getRow()), null, false);
                  keys.clear();
                }
              } finally {
                s.close();
              }
          

          Is the scanner expected to have snapshot semantics (i.e. not be affected by deletes that are happening)? With my patch, the scanner seems to be affected by deletes (need to debug why) – but I was curious to hear if the old behavior is the expected one?

          stack added a comment -

          What are you seeing? Yes, the delete above should have an effect such that the subsequent gets return nothing and the ongoing Scan should progress unaffected.

          Todd Lipcon added a comment -

          Can we discuss the semantics questions in HBASE-2294? I really strongly believe we need to nail down these questions before we push forward with code that interacts so closely with them.

          Kannan Muthukkaruppan added a comment -

          Scratch my comment (@15/Mar/10 07:14 PM). I was misreading my debug output. The ongoing scan is indeed unaffected by the delete. There is some other issue with this test + my patch that I am still trying to nail down.

          Kannan Muthukkaruppan added a comment -

          Found the issue with the TestGetClosestAtOrBefore:testUsingMetaAndBinary() failures. It was indeed due to my patch.

          During a put(), updateKeys() is called to convert the special "LATEST" timestamp to "now". My patch had the inadvertent side effect of calling updateKeys() inside the if (writeToWAL) check instead of doing so always. This particular test passes "false" for writeToWAL, so the puts ended up having LATEST timestamps, as their timestamps never got adjusted. This causes subsequent deletes with "now" timestamps to become no-ops.
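          In other words, the timestamp substitution has to happen unconditionally, roughly like this (sketch; the updateKeys() signature is approximated, appendAndSync and addToMemstore are hypothetical helpers, and the rest of the put() path is elided):

              // Sketch: timestamp massaging must not depend on writeToWAL.
              byte[] byteNow = Bytes.toBytes(System.currentTimeMillis());
              updateKeys(edits, byteNow);      // always convert LATEST_TIMESTAMP to "now"
              if (writeToWAL) {
                appendAndSync(edits);          // WAL work only when requested
              }
              addToMemstore(edits);            // memstore update happens either way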

          Kannan Muthukkaruppan added a comment -

          Stack: re: KeyValueList, you wrote: <<<Should it also be moved to the wal package? (o.a.h.h.regionserver.wal).>>>

          I think the *.wal package is only in 0.21. Should I move it to o.a.h.h.regionserver instead for 0.20?

          stack added a comment -

          Oh, yeah, wal is in TRUNK only... so yes, I'd suggest putting in o.a.h.h.regionserver for 0.20.

          Kannan Muthukkaruppan added a comment -

          Stack:

          btw, I think there might be a preexisting issue with timestamps & the WAL for deletes that come in with the LATEST timestamp. Could you check the code and confirm?

          Basically, in:

           delete(byte [] family, List<KeyValue> kvs, boolean writeToWAL)
          

          the "kv.updateLatestStamp(byteNow);" time stamp massaging happens after the WAL log.append() call. So the keyvalues written to the HLog does not have the massaged timestamp. On recovery, when these entries are replayed, we add them back to reconstructionCache but don't do anything with timestamps.

          stack added a comment -

          @Kannan I confirm you found an ugly bug. Will I open an issue, or do you want to fix it as part of this one? Good on you.

          Kannan Muthukkaruppan added a comment -

          Might be simpler to take care of this as part of my fix. Let me do that.

          Kannan Muthukkaruppan added a comment -

          V2 of the patch. Addresses Stack's earlier comments. Also fixes unit tests that were using some of the lower-level APIs so that they use the modified APIs.

          Unit tests pass now. Manually tested crashing RS and triggering recovery for single puts & multiputs. Also tried manual tests for delete (but that revealed some pre-existing issues with delete). The "delete" issue is being tracked in a separate JIRA (HBASE-2337).

          Will create a separate JIRA for adding automated tests for these. That'll be my next task. Might need some pointers from Stack or others on this.

          Kannan Muthukkaruppan added a comment -

          A small one-line improvement to the v2 patch based on internal feedback. The change makes it so that if someone reuses an instance of WALEdit across multiple calls to the sequence file reader, the readFields() method clears out any old state of the object.

          stack added a comment -

          Patch looks good. Applied to the 0.20 branch. Had to mess in transactional hbase to make it use the new WALEdit rather than KeyValue as the HLog value. Working on the forward port to TRUNK.

          stack added a comment -

          Committed to TRUNK after some massage, including disabling the replication contrib build at J-D's request... he wants to update replication to match this patch himself.

          Thanks for the patch Kannan.

          Jean-Daniel Cryans added a comment -

          Looks like the patch for trunk disables deferred log flush, in HLog.append:

          // sync txn to file system
          this.sync(info.isMetaRegion());
          

          Doesn't check info.getTableDesc().isDeferredLogFlush()

          stack added a comment -

          Let me reopen so we get deferred log flush back in again. J-D, what should be happening in append? We check the deferred flush flag and, if set, do not call sync? Thanks.

          Jean-Daniel Cryans added a comment -

          Yeah, check if info.isMetaRegion() || !info.getTableDesc().isDeferredLogFlush()
          You want to force the sync if it's a catalog region or if deferred log flush is off. Actually I guess you could even get rid of the first check, unless we want to guard ourselves against a user setting .META. as deferred log flushed?
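          Concretely, something like the following inside HLog.append (a sketch based on the condition above; the surrounding append code is elided):

              // Sketch: only force the sync for catalog regions or for tables that
              // have not opted into deferred log flush; deferred tables rely on the
              // background LogSyncer instead.
              if (info.isMetaRegion() || !info.getTableDesc().isDeferredLogFlush()) {
                this.sync(info.isMetaRegion());
              }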

          Kannan Muthukkaruppan added a comment -

          Is the "deferred log flush" trunk specific? What is its use case/intended semantics? [Are you ok with the memstore edits being done before a sync for this use case?]

          Jean-Daniel Cryans added a comment -

          This is from HBASE-1944 (see our use case there) and it is currently trunk-specific since it's a new feature that came along at the same time as group commit. It relies on the awaitNanos timer in HLog.LogSyncer.run to hflush entries that were appended but not flushed to the DNs. This is turned on by default in trunk (edits are less durable) after a vote came along in November and, if I remember correctly, Stack wasn't ok with the idea of much slower inserts out of the box compared to the 0.20 branch.
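          For readers unfamiliar with that mechanism, a simplified sketch of the kind of deferred-sync loop HLog.LogSyncer implements (illustrative only; the interval, field names and hflushPendingAppends helper are assumptions):

              // Sketch: group appended edits and flush them periodically instead of
              // on every append, trading a small durability window for throughput.
              class LogSyncerSketch implements Runnable {
                private final long flushIntervalMs = 1000;   // assumed config value
                private volatile boolean closed = false;

                public void run() {
                  while (!closed) {
                    try {
                      Thread.sleep(flushIntervalMs);   // stand-in for the awaitNanos wait
                      hflushPendingAppends();          // push appended-but-unflushed edits to the DNs
                    } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      break;
                    }
                  }
                }

                private void hflushPendingAppends() { /* elided in this sketch */ }
              }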

          stack added a comment -

          Marking these as fixed against 0.21.0 rather than against 0.20.5.


            People

             • Assignee: Kannan Muthukkaruppan
             • Reporter: Kannan Muthukkaruppan
             • Votes: 0
             • Watchers: 5
