HBASE-5954

Allow proper fsync support for HBase

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      At least get a recommendation into the 0.96 doc and some numbers from running with this HDFS feature enabled.

      Attachments

      1. 5954-trunk-hdfs-trunk.txt
        14 kB
        Lars Hofhansl
      2. 5954-trunk-hdfs-trunk-v2.txt
        17 kB
        Lars Hofhansl
      3. 5954-trunk-hdfs-trunk-v3.txt
        17 kB
        Lars Hofhansl
      4. 5954-trunk-hdfs-trunk-v4.txt
        18 kB
        Lars Hofhansl
      5. 5954-trunk-hdfs-trunk-v5.txt
        18 kB
        Lars Hofhansl
      6. 5954-trunk-hdfs-trunk-v6.txt
        18 kB
        Lars Hofhansl
      7. hbase-hdfs-744.txt
        7 kB
        Lars Hofhansl

        Issue Links

          Activity

           Lars Hofhansl added a comment - edited

          Over in HDFS-744 I propose changes to HDFS to allow fsync on files.
           Without that, neither the WAL nor HFiles (resulting from compactions/flushes) are guaranteed to be on disk at the DFS replicas.

           HDFS-744 is becoming big, so I'm not sure how the Hadoop folks will receive it. In any case, this issue tracks the necessary changes on the HBase side.

          Lars Hofhansl added a comment -

          Here's what I have. Not configurable.
          With this and HDFS-744, it should be possible to test this out.

          I will be doing that.

          Lars Hofhansl added a comment -

          Here's a patch against HBase-trunk matching the v2-trunk patch in HDFS-744.

          Again, this is not yet configurable and just for testing.

          Lars Hofhansl added a comment -

          Btw. the last attached patch also includes a changed pom.xml file. -Dhadoop.profile=30 can be used to build HBase against hadoop 3.0.0 snapshots.

          In order to build hadoop-trunk and hbase-trunk just do:

           1. mvn -Pnative -Pdist -Dtar -DskipTests clean install (in the hadoop-trunk checkout)
           2. mvn clean install -Dhadoop.profile=30 -DskipTests (in the hbase-trunk checkout)
          Lars Hofhansl added a comment -

          Patch that makes durable sync configurable (separately for WAL and HFiles).

          Should probably allow all file creation to be durable (losing tableinfo or regioninfo files would be bad too).

          Luke Lu added a comment -

          Hi Lars, could you post some preliminary benchmark numbers for hsync vs hflush? Thanks.

          Lars Hofhansl added a comment -

           Will do. I assume the result will be devastating.
          With the HDFS replica chaining we'd eat the fsync time N times (for N replicas).

           Lars Hofhansl added a comment - edited

           It's hard to get even a single run of PE to completion (even without durable sync enabled). I'm getting various exceptions. Sometimes "could only replicate to 0 datanodes", sometimes NameNode problems: a "LeaseExpiredException" on an HLog file.

           At this point I do not think these are due to my HDFS changes; this is all in code that I did not touch.

          Luke Lu added a comment -

          Actually, I don't think the result should be too bad (within an order of magnitude), as fsync on replicas should happen in parallel (upstream DN should forward the sync packet before doing its own fsync and wait for the ack from downstream DN). I do expect that the increased sync latency would expose more bugs (especially lock contention and races) both in HDFS and HBase.

          Lars Hofhansl added a comment -

          This is only partially true. It is true for the sync packet stuff that I added, because this does not require closing the block.

           If the block needs to be closed (which causes it to be fsync'ed) it is done after the ack from the downstream DN and before the ack to the upstream DN, so in that case the fsyncs are serial. Looking at the code, that part seems hard to change.

          Good news is: HLog files are smaller than a DFS block, so for HBase we never run into the 2nd issue.
           Semi bad news: HFiles also need to be fsync'ed at least on block close, so here we'd see the issue. But since HFiles are written asynchronously it should be OK.

          Lars Hofhansl added a comment -

          Finally some numbers for "PerformanceEvaluation --nomapred --rows=100000 --presplit=3 randomWrite 10"
          Without WAL/HFile sync: ~18s
          With WAL sync: ~34s

           With WAL sync on, I see a constant ~70MB/s write load. Without WAL sync I see a few spikes of far higher IO load.

          This is all on a single machine with HBase in fully distributed mode on top of a pseudo distributed HDFS.

          Note that my patch does not yet do HFile sync'ing correctly.

          Lars Hofhansl added a comment -

          Ok... Now with HFile sync.
          With HFile sync: ~20s
          Both WAL and HFile sync: ~35s

           With HFile sync enabled, I see occasional IO spikes on top of the constant WAL IO.

          Lars Hofhansl added a comment -

           Patch that works correctly for WAL and HFile sync. (This is the one I used for the performance tests above.)

          Luke Lu added a comment -

          Thanks for the numbers, Lars! Are you using ext3? I wonder what the numbers would look like if you enable barrier=1 in the mount options or just use ext4 (with barrier turned on by default). If the underlying fs doesn't do barrier, the result is somewhat meaningless (you might as well use hflush).

          Lars Hofhansl added a comment -

          I'm using ext4 with default mount options (indeed the numbers would be useless otherwise)

          Lars Hofhansl added a comment -

          Patch matching latest patch on HDFS-744. Will probably change again soon.

          Lars Hofhansl added a comment -

          Minor change, matching latest HDFS-744 patch.

          Lars Hofhansl added a comment -

           Previous patch accidentally always enabled WAL sync if HDFS-744 is present.

          Lars Hofhansl added a comment -

          HDFS-744 was committed. Time to finalize this patch.
          Todd suggested at the hack-a-thon to have a sync option per column family rather than per HBase cluster.

          Todd Lipcon added a comment -

          Even per-put would be pretty nice, actually – I can imagine some applications where most updates are "unimportant" but the occasional one should be hard-persisted.

          Lars Hofhansl added a comment -

          HDFS-744 is in Hadoop-2.0 now.
          I'd like to make a clean patch with only the cluster wide option first. That way, this can be enabled and allow HBase to still be a nice HDFS citizen.

          Then we can add other options (per CF, per Put, etc).

          Lars Hofhansl added a comment -

          I almost feel like the HFile sync should be enabled by default. This ensures that after an HFile is written and closed all of its blocks are sync'ed to disk. The overhead of this is "fairly" minimal.
          The edit-by-edit sync for the WAL is definitely optional.

          Lars Hofhansl added a comment -

          Would be nice to be able to enable this (when using Hadoop-2).
          Not sure about the 0.94 target, but since the change is not disruptive to HBase and purely optional it would be a good addition.

           Lars Hofhansl added a comment - edited

           I think the API could go multiple ways (these are not mutually exclusive):

          1. hsync for HFiles (would guard compactions, etc, very lightweight), enabled with a config option (default on I think)
          2. hsync all WAL edits (very expensive, but would not require client changes), enabled with a config option (default off)
          3. hsync for tables or column families for HFiles (configured in the table/column descriptor)
          4. hsync for tables or column families for the WAL (configured in the table/column descriptor)
           5. WAL hsync per Put. Gives control to the application. A batch put would hsync the WAL if at least one Put in the batch was marked with hsync. What about deletes? In 0.94 they are not batched; we could do it at the end of the operation there.
           6. WAL hsync per RPC. Could send a flag with the RPC from the client. I.e., HTable would have a put(List<Put> puts, boolean hsync) method.
          7. HTable.hsync. Client calls this when WAL must be sync'ed. Most flexible, but incurs an extra RPC to the RegionServer just to force the hsync.

          Comments welcome.

          Edit: Forgot some options.

          stack added a comment -

          Would be nice to be able to enable this (when using Hadoop-2).

          Can we do it in the hadoop2 compatibility module?

           I think it's fine it's optionally off in 0.94 and that in 0.96 it's on by default.

           I like your list. Would think a CF/Table option and HTable#hsync are the more important options to offer (though on the latter, perhaps a Put+hsync would be better given the extra RPC).

          Lars Hofhansl added a comment -

           Thinking more about HTable#hsync, I think that would be hard to make useful to an application. The application would need to know which RegionServers to hsync the WAL on (unless we want to flush for a Table, which means all RegionServers hosting regions for this table would need to hsync their WAL, and that does not seem to be useful).

          A similar argument goes for the RPCs (which will be split to multiple RegionServers).
          So #6 and #7 are out I think.

           I think Todd was right after all (it just took me a long time to come around to it): a flag per Put/Delete/Increment/Append/etc. would be the best option.

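           For illustration, a minimal sketch of what such a per-mutation flag could look like on the client side (hypothetical at this point; Durability and setDurability() are placeholder names, not a committed API):

             import org.apache.hadoop.hbase.HBaseConfiguration;
             import org.apache.hadoop.hbase.client.HTable;
             import org.apache.hadoop.hbase.client.Put;
             import org.apache.hadoop.hbase.util.Bytes;

             // Hypothetical sketch of the per-mutation flag discussed above.
             HTable table = new HTable(HBaseConfiguration.create(), "mytable");  // table name is a placeholder
             Put put = new Put(Bytes.toBytes("row-1"));
             put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
             put.setDurability(Durability.FSYNC_WAL);  // ask the RegionServer to hsync the WAL for this edit
             table.put(put);
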
          Lars Hofhansl added a comment -

           Another question: HBase 0.96 now has the hadoop-{1|2}-compat projects.
           On the other hand I do not want to have completely diverging implementations of this for 0.94 and 0.96 (which would mean using reflection in both branches). Any thoughts on that?

          Andrew Purtell added a comment -

          a flag per Put/Delete/Increment/Append/etc, would be best option

          Makes sense, since both you and Todd got here after giving it consideration.

          On the other hand I do not want to have completely diverging implementations for this for 0.94 and 0.96

           I think we have to take this pain: a reflection-based strategy in 0.94 and a module-based strategy in 0.96. In this case I'd judge it worth it. But that's going to be a high bar for other things.

          Andrew Purtell added a comment -

           More on my comment above. We have two options: we can start breaking out reflection into modules in 0.94 too, or save all of that for 0.96. I don't have a strong opinion, but if I had to make a choice the refactoring should be in the next major version / trunk / currently unstable.

          Lars Hofhansl added a comment -

          Created HBASE-6492.
          Since I am interested in having this in 0.94 I'll start with the reflection based approach (but still in trunk for HadoopQA).

          This is what I am going to do:

           1. global HFile hsync option upon block close (this will also apply to sync'ing the WAL on close)
           2. global WAL edit hsync option
          3. hsync CF's HFiles
          4. hsync CF's WAL edits
          5. WAL hsync per Put/Delete/Append/Increment/etc
          Lars Hofhansl added a comment -

           Sigh... The WAL files would still need to be sync'ed upon block close. Since they mix data from different stores, there's no telling ahead of time. Which leads to:

           1. global hsync option upon block close for HFiles and HLogs (makes no sense to sync HLogs but not HFiles or vice versa)
          2. global WAL edit hsync option
          3. hsync CF's WAL edits
          4. WAL hsync per Put/Delete/Append/Increment/etc
          Luke Lu added a comment -

          Hi Lars,

          We just noticed that HDFS-744 did not implement the correct hsync semantics (mostly due to HDFS-265) so that the hsync is slower AND (arguably) less durable than hflush in Hadoop 1.x.

          Lars Hofhansl added a comment -

          Hi Luke,

           You mean hflush in Hadoop 2 is less durable than hflush in Hadoop 1. hsync is still better (even when it is not on the synchronous path, so there's a little gap where a client was told that everything is on disk when in fact it isn't).

           Filed HDFS-3979 (you know that, just for the benefit of others reading here), which needs some testing to be committed.

          Since the hsync code is only in hadoop 2.1.0+ we'd need a new shim here for that version (or reflect the sh*t out of it).

          I'm still happy to commit this to 0.94.x.

          Luke Lu added a comment -

          hsync is still better (even when it is not on the synchronous path, so there's a little gap where a client was told that everything is on disk when in fact it isn't).

           No. It (hsync) is worse in the sense that there will be data loss (i.e., inconsistency: acked writes missing) if people bounce HDFS while HBase is still writing. Bouncing HDFS happens much more often in practice than actual machine/PDU failures. To put it more strongly, hsync in the current Hadoop 2 is semantically wrong.

          Lars Hofhansl added a comment -

           I disagree: having the signal out for the DNs to sync durably to disk now (even if they only do so as soon as they get to it) is better than having an indefinite amount of time in which the data is only in volatile memory (which is as good as we can get with only hflush).

           But let's not harp on this point... We all agree that it needs to be fixed.

           I could use some help testing HDFS-3979. I believe the pipeline tests cover the failure scenarios (which are the same whether we flush/sync or not); the test needed there is verifying that the fsync is on the synchronous path of the client.

          Kan Zhang added a comment -

           It's true hsync is better than hflush in terms of persisting to disk. However, IMO, what's important for apps is whether it is safe to discard data from their buffers when acknowledged (without having to worry about retrying the writes in case of cluster failures). The current hsync doesn't give you that assurance, while both the pre- and post-HDFS-265 hflush implementations do, with respect to the semantics they support.

          Lars Hofhansl added a comment -

          Wait, in all cases the data was flushed to the DN. hsync and hflush give exactly the same guarantees here (it's the same code path).
           If hsync is broken here, so is hflush.

          What can currently happen (past HDFS-265) for both hflush and hsync is that the data is still in the DN buffers (not in the OS buffers - in case of hflush, or on disk - in case of hsync)... Unless I seriously misunderstand.

          Kan Zhang added a comment -

          Hi Lars, I think you did misunderstand what I said, esp. on the part "... with respect to the semantics they support." As you know, post HDFS-265 hflush implements API3 and when hflush returns data is guaranteed to be in the DN buffers on all replica nodes, which is what API3 promises and client can't expect anything more than that by calling hflush. But client would expect data to hit disk by calling hsync, which is not guaranteed to happen in the current implementation. The expected semantics are simply different for hflush and hsync.

          Lars Hofhansl added a comment -

          I see, you're right, the expectations are different. Let's fix HDFS-3979 already

          Lars Hofhansl added a comment -

          Unscheduling from both 0.94 and 0.96 until HDFS-3979 is committed.

          Liang Xie added a comment -

           Hi Lars Hofhansl, HDFS-3979 has been committed, so maybe we can have a clearer target/fix version plan on HBASE-5954 now, right?

          Lars Hofhansl added a comment -

          I'm planning to pick this up again, soon.

           Lars Hofhansl added a comment - edited

          While I'm at it, I'd like to add deferred log flush as a per operation option as well.
          So we'd have:

          1. No WAL update (for the existing option to disable writing to the WAL)
          2. deferred log flush
          3. flush WAL (default)
          4. sync WAL

          If there are multiple mutations in a batch the highest option will be used for the entire batch.

           Since these options cannot be combined in a sensible way, this is best represented by an enum.
          In 0.96, we can break wire compatibility. I'll just add a protobuf enum, and remove the current writeToWal bit. The actual logic will be put in the hadoop-2 shim module.

          In 0.94 this is a bit more tricky. Both in terms of doing this in a wire compatible way and in terms of being forced to use reflection to detect and use Hadoop-2 vs Hadoop-1. Leaning towards only doing this in 0.96, even though I really wanted this in 0.94.

          Comments?

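           A rough sketch of how the four levels above could be expressed as an enum, with the "highest option wins" batch rule from this comment (names are illustrative only, not the final API):

             // Illustrative placeholder names for the proposed per-operation WAL durability levels.
             public enum WalDurability {
               SKIP_WAL,    // 1. no WAL update
               ASYNC_WAL,   // 2. deferred log flush
               FLUSH_WAL,   // 3. hflush the WAL to the DataNodes (default)
               SYNC_WAL;    // 4. hsync the WAL to disk

               // For a batch of mutations, the strongest (highest ordinal) setting wins.
               static WalDurability forBatch(Iterable<WalDurability> settings) {
                 WalDurability result = SKIP_WAL;
                 for (WalDurability d : settings) {
                   if (d.ordinal() > result.ordinal()) {
                     result = d;
                   }
                 }
                 return result;
               }
             }
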
          Liang Xie added a comment -

           Sounds good to me. Hope more people keep an eye on it.

           Lars Hofhansl added a comment - edited

           In 0.96, Put.java, Delete.java, and Increment.java still have readFields() and write() methods from Writable.
           Were they left over by accident? I assume those can be removed now?

          Jimmy Xiang added a comment -

          Those methods should be removed. But old applications may not compile any more if someone happens to use them. Are we ok with that?

          Lars Hofhansl added a comment -

           I'd vote for removing them. If we keep these we have failed with wire compatibility and all the protobuf stuff was for nothing.

           Put/Delete is still used as a Writable at least in these cases:

          • IdentityTableReduce.java
          • MultiPut.java
          • HRegionServer.checkAndMutate
          Andrew Purtell added a comment -

          I'd vote for removing them. If we keep these we have failed with wire compatibility and all the protofuf stuff was for nothing.

          +1

          Lars Hofhansl added a comment -

          Filed HBASE-7215.

          Varun Sharma added a comment -

           Btw, are we going to provide a hard option of whether to do either a "sync" or a "flush" per transaction? A middle-ground feature like syncing every N seconds (like there is in Redis) or every N edits (like there is in MySQL) would be nice to have. This might also be doable on the client side by forcing the N-th RPC to be a sync operation, but would be nice on the server side.

          Harsh J added a comment -

          A middle ground feature like syncing every N seconds (like there is in REDIS) or every N edits (like there is in MySQL) would be a nice to have.

          Are you looking for deferred WAL flush (per-CF property)?

          Varun Sharma added a comment -

           No, I am not talking about deferred WAL flush. This is what I know, but I may be wrong:
          1) HBase uses hflush for WAL which ensures that data is in OS buffers and leaves the data in the hands of the OS - after that the time from OS cache -> disk persistence is variable
          2) With sync, we will synchronize the WAL to disk so there is no data loss

           I am asking about the possibility of intermittent sync(s) performed by the region server every N edits - so N edits where we do hflush and then we do hsync, or every N seconds. Because going from hflush -> hsync for the WAL will kill performance. If we can have guarantees that say the last 1 or 0.5 seconds worth of data is intact, or similarly that you can lose at most 1000 edits in case of power failure - that is nice to have.

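           For reference, the two calls being compared here are hflush() and hsync() on Hadoop 2's FSDataOutputStream. A minimal sketch (the path is a placeholder):

             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.fs.FSDataOutputStream;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;

             public class FlushVsSync {
               public static void main(String[] args) throws Exception {
                 FileSystem fs = FileSystem.get(new Configuration());
                 try (FSDataOutputStream out = fs.create(new Path("/tmp/wal-example"))) {  // placeholder path
                   out.write(new byte[] {1, 2, 3});
                   out.hflush();  // data reaches the DataNode processes (lost if the machines lose power)
                   out.hsync();   // data is fsync'ed to disk on the DataNodes (HDFS-744, Hadoop 2+)
                 }
               }
             }
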
          Lars Hofhansl added a comment -

          We can make this arbitrarily complicated.

          A flush or deferred log flush gets us pretty far. It'll flush the data to the data nodes, which will then asynchronously (but ASAP) flush it to the OS buffers. In Linux the dirty page cache is periodically flushed to disk (that can be configured - default is 30s). I am not sure what else you want?

          sync'ing only really makes sense (IMHO) when it is done synchronously.

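           For reference, the periodic flush mentioned above is governed by the standard Linux writeback sysctls (typical defaults shown; they may vary by distribution):

             vm.dirty_expire_centisecs = 3000       # dirty pages older than ~30s must be written back
             vm.dirty_writeback_centisecs = 500     # flusher threads wake up roughly every 5s
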
          Varun Sharma added a comment -

          Firstly, I certainly don't want to complicate this too much - this is a really nice functionality to have and we can worry about the details later.

           So, in a lot of systems there is variability in the amount of edits per second - let's say high in the night but low during the day - so you lose more data during the night than during the day if this is only time bound. Some systems like MySQL have a bound on the number of edits before syncing - that said, I am happy with what we have in its current form. What I suggest is only a nice to have...

           Thanks

          Dave Latham added a comment -

          Would be great to see this go in to 0.96. Lars or anyone still chewing on it?
          (Recently suffered a datacenter wide power loss and lost some hfiles from regions that completed a major compaction seconds beforehand).

          Lars Hofhansl added a comment -

          With HBASE-7801 and HBASE-8375 we're almost there.
          The only part missing is to actually hook in the fsync part.
          Did not do it yet, because of reflection hell. In 0.96 we have separate modules for hadoop versions, so we can avoid reflection there.

          Lars Hofhansl added a comment -

          Note that for performance reasons you probably do not want this (sync edits in the WAL). You probably just want to enable sync-on-close in HDFS.

          Enis Soztutar added a comment -

          Would be great to see this go in to 0.96.

           Agreed. Even if it is not performant initially, this will enable us to work on the performance issues as well.

          You probably just want to enable sync-on-close in HDFS.

          Should we make this default?

          stack added a comment -

          Should we make this default?

          I'd vote for that.

          Suresh Srinivas added a comment -

          Should we make this default?

          +1 from me as well.

          Dave Latham added a comment -

          You probably just want to enable sync-on-close in HDFS.

          Referring to HDFS-1539?

          +1

          Lars Hofhansl added a comment -

           This is an HDFS server-side setting, though, so we cannot enforce it via an HBase config.
           From my tests I found that this must be paired with the sync-behind-writes hint, otherwise file creation is quite slow (50 or more ms per file in a real cluster).

          And obviously this does not help with last edits in the WAL, as they are not in a closed block.

          Lars Hofhansl added a comment -

           And yes, HDFS-1539 as well as HDFS-2465 for the fadvise hints.

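           For readers looking for the concrete knobs, these are DataNode-side properties in hdfs-site.xml (standard names for the HDFS-1539 and HDFS-2465 options):

             dfs.datanode.synconclose = true           (HDFS-1539: fsync block files when a block is closed)
             dfs.datanode.sync.behind.writes = true    (HDFS-2465: sync data behind writes so the final fsync is cheap)
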
          Dave Latham added a comment -

           Should we update the recommended HDFS configuration in the book then? I think losing a region of data after a compaction and power failure should be prevented by default.

          stack added a comment -

           What is to be done to finish this up for 0.95? I've marked it critical so it gets attention.

          Lars Hofhansl added a comment -

          HBASE-7801 and HBASE-8375 put the right client APIs in place, and in 0.95+ we have the Hadoop1/Hadoop2 modules to avoid reflection hell.

          This is then just a matter of:

          1. Instantiate a Writer with sync-on-close enabled
          2. pass the fsync through to our log syncer and issue the sync when requested

          I think we would not allow pairing fsync with asynchronous sync'ing (the current API would not support it anyway).

          I'll see if I can find some time next week. Although my guess is that most folks want global sync-on-close and global sync-behind-writes on the HDFS cluster backing HBase.

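           A minimal sketch of step 1 above, assuming the Hadoop 2 create() overload that takes CreateFlags (path, buffer size, and replication are placeholders):

             import java.util.EnumSet;
             import org.apache.hadoop.conf.Configuration;
             import org.apache.hadoop.fs.CreateFlag;
             import org.apache.hadoop.fs.FSDataOutputStream;
             import org.apache.hadoop.fs.FileSystem;
             import org.apache.hadoop.fs.Path;
             import org.apache.hadoop.fs.permission.FsPermission;

             // Create the writer stream with SYNC_BLOCK so each block is fsync'ed when it is closed.
             Configuration conf = new Configuration();
             FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(
                 new Path("/hbase/example-file"),  // placeholder path
                 FsPermission.getFileDefault(),
                 EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE, CreateFlag.SYNC_BLOCK),
                 4096,                             // buffer size
                 (short) 3,                        // replication
                 fs.getDefaultBlockSize(),
                 null);                            // no Progressable
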
          stack added a comment -

          This can go in any time. Doesn't have to be plugged against 0.96.0

          haosdent added a comment -

           Hi, Lars Hofhansl, do your disks have RAID or not? I have tested the hsync of HDFS again and again. I found it spends nearly 50ms while hflush spends just 2ms on non-RAID disks.

          haosdent added a comment -

          Lars Hofhansl My test result:
          Without WAL/HFile sync: ~13s
          With WAL/HFile sync: ~120s

          Anything wrong?

          haosdent added a comment -

           Only when we enable the write barrier and mount the disk with "data=ordered" can we make sure that the data has been flushed to the physical disk after we call the fsync system call.

          Liang Xie added a comment -

           haosdent, IMHO, the fsync + write barrier combination should guarantee the data is written to disk (by issuing a disk cache flush instruction). Is it related to the "data=ordered" mount option? Thanks.

          haosdent added a comment -

           Liang Xie If the disk is mounted with "data=writeback", the dirty data may still be in the disk cache after the fsync system call returns. Until the data exceeds a certain ratio of the disk cache or a timer is triggered, it will not be flushed to physical storage. We could improve the performance of hsync by disabling the journal and the write cache. But after disabling the write cache, the overall write performance is worse than before. Fsync is a very heavy system call; I think it is unfeasible to call fsync after every write operation. Just posting my rough fsync test results below:

           1. ext4, noatime, barrier=1, data=ordered, disk write cache enabled, journal enabled, append 4k to a file
              fdatasync 25ms
              fsync 25ms
           2. ext4, noatime, barrier=0, data=writeback, disk write cache disabled, journal enabled, append 4k to a file
              fdatasync 33ms
              fsync 33ms
           3. ext4, noatime, barrier=0, data=writeback, disk write cache disabled, journal disabled, append 4k to a file
              fdatasync 8ms
              fsync 8ms

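           For comparison, a rough sketch of the kind of local fsync micro-benchmark behind numbers like these (the file path and the 4k append size are placeholders):

             import java.io.FileOutputStream;

             public class FsyncTiming {
               public static void main(String[] args) throws Exception {
                 byte[] buf = new byte[4096];  // 4k append, as in the tests above
                 try (FileOutputStream out = new FileOutputStream("/data1/fsync-test", true)) {  // placeholder path
                   long start = System.nanoTime();
                   out.write(buf);
                   out.getFD().sync();  // fsync(2) on the underlying file descriptor
                   System.out.printf("append + fsync took %.1f ms%n", (System.nanoTime() - start) / 1e6);
                 }
               }
             }
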
          haosdent added a comment -

           Liang Xie Because there is only one journal file per disk in extN, the system will commit all files' metadata transactions when the write barrier is enabled. After flushing all files' metadata transactions to the journal file on the physical disk, the system will flush the data (both metadata and blocks) of the specific file to disk. So fsync would spend more time when there is a lot of IO on a disk. My Weibo is http://weibo.com/haosdent, welcome for more discussion.

          Lars Hofhansl added a comment -

          @haosdent, we can't break the laws of physics.
          If you sync every single edit you'll see terrible performance, how can we expect otherwise?
           HBase (even without fsync) wants things in batches; in PE, HTable is doing its default batching (2MB batches), so that's where the cost is amortized.

          Enabling sync behind writes should improve this too (since we're writing immutable data), since by the time we issue the sync some data will already be sync'ed.

          Lastly, fsync is fsync (or rather fdatasync and friends since we're sync'ing files and not filesystems)... Once executed, previously cached data is on disk no matter what the filesystem chooses to cache during normal operations; only barriers are needed for correctness (AFAIK).

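           For context, the client-side batching referred to above looks roughly like this with the 0.94-era HTable API (table name, row count, and value size are placeholders):

             import org.apache.hadoop.hbase.HBaseConfiguration;
             import org.apache.hadoop.hbase.client.HTable;
             import org.apache.hadoop.hbase.client.Put;
             import org.apache.hadoop.hbase.util.Bytes;

             HTable table = new HTable(HBaseConfiguration.create(), "TestTable");  // placeholder table name
             table.setAutoFlush(false);                  // buffer puts on the client
             table.setWriteBufferSize(2 * 1024 * 1024);  // ~2MB write buffer (the default)
             for (int i = 0; i < 100000; i++) {
               Put put = new Put(Bytes.toBytes("row-" + i));
               put.add(Bytes.toBytes("info"), Bytes.toBytes("data"), new byte[1000]);
               table.put(put);                           // goes into the write buffer, not straight to the server
             }
             table.flushCommits();                       // one buffered batch amortizes the per-sync cost
             table.close();
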
          haosdent added a comment -

           When we use the hsync of HDFS, the JVM in the DataNode will call fsync or fdatasync to ensure the dirty data of the file is flushed to stable storage. When I do the same test as Lars Hofhansl, I get this test result:
           Without WAL/HFile sync: ~13s
           With WAL sync, without HFile sync: ~120s (Sorry, I made some input mistakes before.)

           I think you may be unclear about the differences between "data=writeback" and "data=ordered, barrier=1". These posts may help you understand them:
           1. https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/writebarr.html
           2. http://lwn.net/Articles/457667/

          Luke Lu added a comment -

           As you've found out, the fsync performance is very sensitive to the disk controller and its settings. You need a SATA/SAS controller (RAID or not) with a battery- or (more recently) flash-backed write cache (FBWC in HP cards, NVCache in Dell cards) to get good and safe sync performance on HDFS, as the additional seek to sync the checksum file is really expensive without a write cache to queue the writes. This calls for HDFS-2699, which I think should be a per-file option.

          Lars Hofhansl added a comment -

          haosdent From the article you linked: "With barriers enabled, an fsync() call will also issue a storage cache flush.", which is exactly what I said.

          Lars Hofhansl added a comment -

          Also, my tests above were run with write barriers enabled and data=ordered. Did you run PE, or a different test?

          haosdent added a comment -

          >my tests above were run with write barriers enabled and data=ordered.
           Lars Hofhansl It seems very interesting. Did you use RAID?

          Lars Hofhansl added a comment -

          Not sure on which machine I ran this now. I can redo. On my work machine I have 4 disks in RAID10.

          haosdent added a comment -

           Lars Hofhansl Haha, I have tested hsync() on RAID10 before. An hsync() call would spend 4ms. Because the data is written to the RAID card cache, it is very fast.

          Lars Hofhansl added a comment -

          Nope. Software RAID.

          haosdent added a comment -

           Lars Hofhansl Oh, are you sure your disks didn't have a RAID card? I have tested hsync() again and again. The test results show hsync() is a very heavy operation.

          Andrew Purtell added a comment -

          Where are we with this issue?

          Lars Hofhansl added a comment -

          It's probably best to unschedule this for now.

          Andrew Purtell added a comment -

          Unscheduled.


            People

            • Assignee:
              Lars Hofhansl
              Reporter:
              Lars Hofhansl
             • Votes:
               1
               Watchers:
               34
