HBASE-2467: Concurrent flushers in HLog sync using HDFS-895

    Details

    • Hadoop Flags:
      Reviewed

      Description

      HDFS-895 changes hflush() to be able to run concurrently from multiple threads, where flushes can be concurrent with further writes to the same file.

      We need to rip out/amend the group commit code a bit to take advantage of this.
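      To make the description concrete, here is a rough sketch (not code from HDFS-895 or this patch) of what "concurrent" means here: several threads write their own edits to the same open file and each flushes just its own data, with the flush no longer blocking the other writers. The hflush() name follows the description; on the 0.20-append branch the equivalent call was sync().

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ConcurrentHflushSketch {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            final FSDataOutputStream out = fs.create(new Path("/tmp/wal-sketch"));
            Runnable writer = new Runnable() {
              public void run() {
                try {
                  synchronized (out) {          // writes themselves stay serialized
                    out.write("one edit".getBytes());
                  }
                  out.hflush();                 // with HDFS-895 this need not block
                                                // other threads' writes or flushes
                } catch (Exception e) {
                  throw new RuntimeException(e);
                }
              }
            };
            Thread t1 = new Thread(writer);
            Thread t2 = new Thread(writer);
            t1.start(); t2.start();
            t1.join(); t2.join();
            out.close();
          }
        }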

      Attachments

      1. hbase-2467.txt
        11 kB
        Todd Lipcon
      2. HBASE-2467-v2.patch
        11 kB
        Jean-Daniel Cryans
      3. HBASE-2467-v3.patch
        10 kB
        Jean-Daniel Cryans
      4. HBASE-2467-v4.patch
        9 kB
        Jean-Daniel Cryans

          Activity

          Todd Lipcon added a comment -

          Here's a preliminary, total-hack patch. I ran a 5-node random-write PE on it with 40 writers in multithreaded mode and got about 12.7K rows/sec across 5 RSes.

          (not to be committed yet, needs a bit more work)

          stack added a comment -

          Patch looks good to me (what do you think, J-D?). I need hdfs-895 in place to play with this, right?

          Todd Lipcon added a comment -

          Yep, you need HDFS-895. It still needs to be cleaned up a lot and tested.

          Jean-Daniel Cryans added a comment -

          We need to rip out/amend the group commit code a bit to take advantage of this.

          -1. The advantage of group commit is that you don't sync on every append. Dhruba created HDFS-895 because syncs were slowing down group commit (syncs would wait on appends from other threads, which doesn't make sense).

          Todd Lipcon added a comment -

          JD, the issue here is that group commit ends up creating "bubbles". With the HDFS-895 code, calling sync doesn't block other syncers or writers, so each thread can just sync its own edits.

          There's a slight improvement to be made by doing batching on a very small scale, to increase HDFS packet size if the edits are <512 bytes, but this patch resulted in a 3-5x speed increase without having to turn on deferred log flush.
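          To make the "each thread syncs its own edits" point concrete, here is a hypothetical sketch of the per-handler pattern; the WAL interface and method names are illustrative, not the HLog API.

            import java.io.IOException;

            // Each handler appends its edit and then waits only for its own data to be
            // durable, instead of parking behind a group-commit thread that syncs
            // everyone's edits in batches.
            interface WalSketch {
              long append(byte[] edit) throws IOException;  // serialized internally
              void syncUpTo(long seq) throws IOException;   // may run concurrently per HDFS-895
            }

            class HandlerSketch {
              private final WalSketch wal;

              HandlerSketch(WalSketch wal) { this.wal = wal; }

              void applyEdit(byte[] edit) throws IOException {
                long seq = wal.append(edit); // still one writer at a time into the file
                wal.syncUpTo(seq);           // no "bubble": other handlers keep writing and syncing
              }
            }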

          Jean-Daniel Cryans added a comment -

          But at a lower level you still have to wait for another thread to finish before writing your own stuff to the block, right? So what you're telling me is that 895 basically moves the grouping to a lower level?

          Todd Lipcon added a comment -

          Writing is still synchronized, but flushing isn't. Basically, each packet that the DFS client sends out gets a sequence number. Flush has two parts: (a) any data that hasn't been sent yet gets sent, and (b) the client needs to wait for an ack that this data has been received by the full pipeline. We used to do both parts synchronized. HDFS-895 makes it so only part (a) is synchronized, and then the waiting can happen concurrently with more writes, flushes, etc.
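          A minimal sketch of that mechanism, with assumed names (this is not the DFSClient code): only part (a), queueing any unsent bytes, holds the write lock; part (b), waiting for the ack on that packet's sequence number, runs outside it.

            class OutputStreamSketch {
              private final Object writeLock = new Object();
              private long lastQueuedSeqno = -1;  // highest packet queued for the pipeline
              private long lastAckedSeqno = -1;   // highest packet acked by all datanodes

              void hflush() throws InterruptedException {
                long seqnoToWaitFor;
                synchronized (writeLock) {
                  // Part (a): push buffered bytes into a packet and queue it.
                  seqnoToWaitFor = queueCurrentPacket();
                }
                // Part (b): wait for the ack without holding writeLock, so other
                // threads can keep writing and flushing in the meantime.
                synchronized (this) {
                  while (lastAckedSeqno < seqnoToWaitFor) {
                    wait();
                  }
                }
              }

              private long queueCurrentPacket() {
                // ... hand the buffered data to the pipeline sender thread ...
                return ++lastQueuedSeqno;
              }

              // Called by the thread that processes datanode acks.
              synchronized void onAck(long seqno) {
                lastAckedSeqno = Math.max(lastAckedSeqno, seqno);
                notifyAll();
              }
            }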

          Jean-Daniel Cryans added a comment -

          Cool +1 then, thanks for the explanation.

          stack added a comment -

          @J-D: What do you think of Todd's patch?

          Jean-Daniel Cryans added a comment -

          I think, like he says, it needs more work. The forceSync isn't used anymore but it's still in hflush, we lose the ability to set flushLogEntries, and there's refactoring around logRollRequested that's out of scope.

          stack added a comment -

          Need this for 0.20.5, 0.21.

          dhruba borthakur added a comment -

          I would not support putting this on 0.20.5, especially because this is a performance fix. Isn't it possible to defer this to a new major release?

          dhruba borthakur added a comment -

          Just to elaborate, has somebody already measured the gain from this one?

          Todd Lipcon added a comment -

          dhruba: I've been running all of my tests with these patches for the last several weeks and it's been very stable. I haven't done a recent benchmark but it was a substantial gain - close to 2x if I remember correctly. Before this patch gets finally committed I'll do a proper benchmark comparison.

          dhruba borthakur added a comment -

          Awesome. Thanks Todd!

          stack added a comment -

          Bulk move of 0.20.5 issues into 0.21.0 after the vote up on the list that we merge the branch into TRUNK.

          Jean-Daniel Cryans added a comment -

          Kind of a refreshed patch for trunk. I applied it to our internal branch and tested with YCSB (and cdh3); on pure-insert tests I get around 25-30% better throughput with the patch.

          One issue to solve is that this removes deferred log flushing, which can be pretty useful in cases like ours where we keep tons of counters and don't really mind missing 2-3 increments.
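          For readers unfamiliar with it, deferred log flush is the per-table switch that trades a small durability window for throughput. As best I recall the HTableDescriptor API of this era, enabling it looks roughly like this (an illustration, not part of this patch):

            import org.apache.hadoop.hbase.HTableDescriptor;

            public class DeferredFlushExample {
              public static HTableDescriptor countersTable() {
                HTableDescriptor desc = new HTableDescriptor("counters");
                // Method name from memory of the 0.90-era API: edits are acknowledged
                // without an immediate WAL sync, and a background thread syncs
                // periodically, so a crash can lose the last few increments.
                desc.setDeferredLogFlush(true);
                return desc;
              }
            }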

          stack added a comment -

          What did you do to the patch J-D? In the hadoop we ship, we should include hdfs-895? (It's a dfsclient-only fix, right?) What should we do about group commit? Open a new issue? If hdfs-895 is not in place, we just run slower, is that right?

          On the patch:

          Given the concurrency, should 'private boolean logRollRequested' be volatile? (Maybe not necessary; I see the check is done in a synchronized method?)

          I don't get this bit:

          +      synchronized (this) {
          +        this.syncTime += System.currentTimeMillis() - now;
          +        this.syncOps++;
          +      }
          +
          +      synchronized (this.updateLock) {
          

          Why not do the syncTime and syncOps under the updateLock rather than do a synchronize on 'this'? Wouldn't it have the same effect but be less obnoxious? (I haven't looked at the rest of this class to check if syncOps and syncTime updates are done under a synchronize(this).)
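          An illustrative sketch of the alternative being suggested (the field and lock names come from the quoted diff, the rest is hypothetical): fold the metrics update into the block already guarded by updateLock instead of taking a separate lock on 'this'.

            class SyncMetricsSketch {
              private final Object updateLock = new Object();
              private long syncTime;
              private long syncOps;

              void afterSync(long startMillis) {
                synchronized (updateLock) {
                  // One lock acquisition covers both the metrics bookkeeping and
                  // whatever work the method already does under updateLock.
                  syncTime += System.currentTimeMillis() - startMillis;
                  syncOps++;
                  // ... work previously done under synchronized (this.updateLock) ...
                }
              }
            }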

          Jean-Daniel Cryans added a comment -

          What did you do to the patch J-D?

          Mostly just a refresh, and I removed a now-unused boolean on sync().

          In the hadoop we ship, we should include hdfs-895?

          I think it would be great.

          What should we do about group commit? Open a new issue?

          We don't need it anymore, but having deferred log flush could still be useful. Should be in the same issue.

          If hdfs-895 is not in place, we just run slower, is that right?

          Yes, under concurrent writers it's much slower without 895.

          Given the concurrency, should 'private boolean logRollRequested' be volatile?

          As a matter of fact, it's under 2 different synchronized blocks (this and updateLock). Maybe safer to make it volatile?

          Why not do the syncTime and syncOps under the updateLock rather than do a synchronize on 'this'?

          Makes sense.

          Todd Lipcon added a comment -

          We don't need it anymore, but having deferred log flush could still be useful. Should be in the same issue.

          There is one case in which group commit is still helpful: the way sync works is that it always sends the last "chunk" of the file, even if some of the bytes were already sent. The chunks here are 512 bytes (i.e. checksum boundaries). So, if you're always syncing very small edits (i.e. <512 bytes) then there is a benefit to waiting a milli or two to cross that 512-byte boundary. Otherwise, with each write we will re-send the previous writes as well during the sync.

          We can do a little better on the HDFS side to get this "for free", though, so probably not worth worrying about in this patch.
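          A sketch of that trade-off with a hypothetical helper (not from the patch): syncing a 100-byte edit re-sends the whole partially filled 512-byte chunk, so when small edits arrive back to back it can pay to wait a millisecond or two for the chunk to fill before syncing.

            class MicroBatchSketch {
              static final int CHUNK_SIZE = 512;      // HDFS checksum chunk size
              static final long MAX_WAIT_MILLIS = 2;  // "a milli or two"

              // How long is it worth waiting for more edits before syncing?
              static long suggestedSyncDelay(long bytesWrittenInCurrentChunk) {
                long filled = bytesWrittenInCurrentChunk % CHUNK_SIZE;
                if (filled == 0) {
                  return 0; // on a chunk boundary; nothing would be re-sent
                }
                // Mid-chunk: each sync re-sends the 'filled' bytes, so briefly
                // batching small edits amortizes the re-send across several of them.
                return MAX_WAIT_MILLIS;
              }
            }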

          Jean-Daniel Cryans added a comment -

          Rebased the patch after HBASE-2922; I also changed the synchronization bit that Stack referred to and added the missing bits for deferred log flushing to be fully working. I did some loading tests: with deferred log flush it's 80-100% faster than without it, while still using HDFS-895 along with the rest of this patch (which is a good bit faster already). To sum up:

          • With this patch and HDFS-895, it's 25-30% faster.
          • With this patch and deferred log flush on, it's another 80-100% faster, but less durable (obviously).
          ryan rawson added a comment -

          HDFS-895 is client side only, so we should patch that into a build that we ship with HBase and also apply this patch.

          Jonathan Gray added a comment -

          What's the story on this? We're waiting for 895 to be committed to the 20-append branch, then this is closed? Or is there still HBase code to be committed?

          Todd Lipcon added a comment -

          We can't commit this to HBase until HDFS-895 is committed to HDFS. If you run this HBase patch without HDFS-895 you lose group commit and everything goes much slower. HDFS-895 needs to go into both trunk and 20-append.

          ryan rawson added a comment -

          Since 895 is client-side only, we can ship special jars with HBase, thus closing this issue.

          stack added a comment -

          We have to have this to ship. Made it a blocker. HDFS-895 is client-side only (check) so we can make our own hadoop jar if we have to.

          Jean-Daniel Cryans added a comment -

          Refreshed patch that doesn't include the deferred log flushing re-enabling, since that was done in HBASE-3204.

          Nicolas Spiegelberg added a comment -

          How much operational experience have you had with production machines running HDFS-895 + HBASE-2467? I did a code review of HDFS-895 and feel like it's pretty solid. We have it applied to our internal branch, but we don't have HBASE-2467 applied. Although I think the HDFS-895 code is good, there's no question that it's very intricate and needs plenty of test time with large traffic before we should rely on it. I hear some of you guys have been running it on production traffic for months, is that correct? If so, I think that's sufficient test time and we can commit this.

          Jean-Daniel Cryans added a comment -

          We've been running it since August on all our clusters, never had a single issue.

          Todd Lipcon added a comment -

          The only potential issue I'm aware of is this possible case where HBase will call hflush() on a closed stream... maybe that problem is gone in the most recent iteration of this patch. JD, have you seen that come up on recent versions? I think we can look carefully through the HBase side to see if this race still exists – it was happening when the log rolled in between the append and hflush. We may need an rwlock to protect log rolling.
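          One possible shape for that rwlock, sketched with assumed names (not the committed fix): appends and their hflush() share a read lock while rolling the log takes the write lock, so a roll cannot swap the writer out between an append and its flush.

            import java.util.concurrent.locks.ReentrantReadWriteLock;

            class LogRollGuardSketch {
              private final ReentrantReadWriteLock rollLock = new ReentrantReadWriteLock();

              void appendAndSync(Runnable append, Runnable hflush) {
                rollLock.readLock().lock();
                try {
                  append.run();   // write the edit to the current writer
                  hflush.run();   // flush the same writer; a roll cannot intervene here
                } finally {
                  rollLock.readLock().unlock();
                }
              }

              void rollWriter(Runnable closeOldAndOpenNew) {
                rollLock.writeLock().lock();
                try {
                  closeOldAndOpenNew.run(); // safe: no append/sync holds the read lock now
                } finally {
                  rollLock.writeLock().unlock();
                }
              }
            }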

          Jean-Daniel Cryans added a comment -

          JD, have you seen that come up on recent versions?

          Here's one that happened while the server was going down:

          2010-10-29 03:11:33,017 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
          java.io.IOException: Cannot append; log is closed
                  at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:840)
                  at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1641)
                  at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1261)
                  at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1777)
                  at sun.reflect.GeneratedMethodAccessor110.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:560)
                  at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1027)
          2010-10-29 03:11:33,053 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
          java.lang.NullPointerException
                  at org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:975)
                  at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1017)
                  at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:984)
                  at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.append(SequenceFileLogWriter.java:111)
                  at org.apache.hadoop.hbase.regionserver.wal.HLog.doWrite(HLog.java:996)
                  at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:851)
          

          Another that happens when the RS is going down:

          2010-11-03 03:18:40,137 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
          :java.lang.NullPointerException
          	at org.apache.hadoop.io.SequenceFile$Writer.getLength(SequenceFile.java:1048)
          	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.getLength(SequenceFileLogWriter.java:132)
          	at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:913)
          

          But I don't see it during normal runtime.

          stack added a comment -

          Todd updated the hdfs-895 patch for 0.20-append. I built a hadoop with his new patch. Ryan posted it up on his repo. Going to test that it basically works.

          Jean-Daniel Cryans added a comment -

          Committed the last patch to branch and trunk after running unit tests and doing some cluster testing. Thanks for the great work, Todd!


            People

            • Assignee:
              Todd Lipcon
            • Reporter:
              Todd Lipcon
            • Votes:
              0
            • Watchers:
              4