HBase / HBASE-2315

BookKeeper for write-ahead logging

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

      Description

      BookKeeper, a contrib of the ZooKeeper project, is a fault-tolerant and high-throughput write-ahead logging service. This issue provides an implementation of write-ahead logging for HBase using BookKeeper. Apart from the expected throughput improvements, BookKeeper also has stronger durability guarantees compared to the implementation currently used by HBase.

      1. HBASE-2315.patch
        14 kB
        Flavio Junqueira
      2. zookeeper-dev-bookkeeper.jar
        150 kB
        Flavio Junqueira
      3. bookkeeperOverview.pdf
        140 kB
        Flavio Junqueira

        Issue Links

          Activity

          Lars Hofhansl added a comment -

           Sounds like a fine plan. If you write per-region logs you do not need to split at all, but I don't know whether that is feasible to do in BK or not.

          Ted Yu added a comment -

          The plan is good.
          Please use HBASE-5937 for the first step.

          Thanks

          Flavio Junqueira added a comment -

          Based on your feedback and our own observations from inspecting the code, here is a rough idea of what we would like to do.

          In the first step, we make HLog an interface exposing the public methods of the current HLog class and make the current HLog class an implementation of the interface. We also create an HLog factory to allow us to instantiate different HLog implementations. Eventually we will have this factory creating BKHLog when we tell it to do so via configuration. So far there is no new functionality.
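           As a rough sketch of what the first step could look like (hypothetical names and signatures, nothing final):

           import java.io.IOException;
           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.util.ReflectionUtils;

           // Public surface of today's HLog pulled up into an interface (simplified here).
           interface WriteAheadLog {
             long append(byte[] regionName, byte[] tableName, byte[] edit) throws IOException;
             void sync() throws IOException;
             void close() throws IOException;
           }

           // Factory that picks the implementation from configuration, so a BookKeeper-backed
           // implementation can be swapped in later without touching callers.
           final class WriteAheadLogFactory {
             static final String IMPL_KEY = "hbase.regionserver.hlog.impl"; // hypothetical key

             static WriteAheadLog create(Configuration conf, Class<? extends WriteAheadLog> defaultImpl) {
               Class<? extends WriteAheadLog> clazz =
                   conf.getClass(IMPL_KEY, defaultImpl, WriteAheadLog.class);
               // Implementations need a no-arg constructor; setConf(conf) is called if Configurable.
               return ReflectionUtils.newInstance(clazz, conf);
             }
           }

           BKHLog would then just be one more implementation selected through that key.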

          In the second step, we implement BKHLog and decide what to do with the splitter. It is still not entirely clear how to adapt the splitter to BK. Perhaps we don't need a splitter at all with BookKeeper?

          Let me know if there is any comment about these steps. Otherwise, I'll create two subtasks and start working on the first.

          stack added a comment -

          Flavio:

           You have what you need? Jesse and Jon, I think, did a good job above filling in bits I waved hands over when we talked (I like Jesse's suggestion of undoing even our dependency on FS).

           When we talked, we also figured that hbase would need access to bk or the bkfs or the bk interface at replay-of-edits time, on region open. In particular, we'd need to have a factory like the WAL log factory hereabouts in the code: http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#2721 This is called when the region is opened by the regionserver. It does all its init and, as part of the init, it looks to see if there are edits to replay. If edits-to-replay are out in bk, we need to be able to get at them at this very point.

          Ted Yu added a comment -

           The ctor of HLog takes a FileSystem parameter. Since the FileSystem isn't important to bookkeeper, my feeling is that the approach in the previous patch makes sense.

          You can remodel that patch by introducing a new hbase module.

          Thanks

          Flavio Junqueira added a comment -

          Hi Ted, fs is part of the issue I was discussing before. We don't have a filesystem implementation for bookkeeper, so we can't use the filesystem instance passed.

          About the reader and the writer, I was configuring them in the hbase-default configuration file:

           <property>
             <name>hbase.regionserver.hlog.reader.impl</name>
             <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogReader</value>
             <description>The HLog file reader implementation.</description>
           </property>
           <property>
             <name>hbase.regionserver.hlog.writer.impl</name>
             <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogWriter</value>
             <description>The HLog file writer implementation.</description>
           </property>

          I assumed previously that HLog was instantiated elsewhere.

          Ted Yu added a comment -

          @Flavio:
          Looking at the attached patch:

          +    public void init(FileSystem fs, Path path, Configuration conf){
          

          Parameter fs isn't used.

          Further, you implemented HLog.Reader and HLog.Writer. I don't see where HLog is constructed.

          Thanks

          Jonathan Hsieh added a comment -

          @Flavio

           I'm suggesting you use the Factory pattern to create the HLog: a setting in the config file would have it instantiate a new BKHLog (which doesn't even use FS or Path) that returns its own instances of BKReader/BKWriter.

           I only see a handful of non-test places where HLogs are constructed, so this part shouldn't be too hard to make factory-pluggable:

          https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1267
          https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L3719
          https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HMerge.java#L161
           (I think there are a few more.)

          My guess is that it will take a little bit of work but won't be too onerous to do.

          Flavio Junqueira added a comment -

           @Jonathan My issue has been exactly with the initialization of the log reader and writer. They take a FileSystem and a Path as input parameters. As I mentioned above, I've been looking at implementing a BK FileSystem, but it looks a bit hacky.

           If I get your suggestion, you're saying that we could reuse most of HLog and focus on generalizing its constructors and the init methods. Is that right?

          Jonathan Hsieh added a comment -

          @Flavio - with the encapsulation comment – basically if you look at the interface there is very little that is hdfs-specific about it – append, open, close, roll operations all basically take hbase constructs and could have any implementation behind it. The hdfs-specifics have been contained in its constructor and the init functions of the reader/writer classes.

          Jesse Yates added a comment -

          @Flavio - yup, that's what I was getting at.

          Flavio Junqueira added a comment -

           If I understand Jesse's proposal, the idea is to have a WAL interface and essentially bring all hdfs-dependent WAL code under FsWAL? We would then implement an equivalent BkWAL for a BookKeeper backend?

          I didn't understand the point about the HLog class being relatively well encapsulated, Jonathan. Do you mean to say that it would be relatively easy to implement FsWAL by using HLog?

          Jonathan Hsieh added a comment -

           +1 to what Jesse said. The HLog class is relatively well encapsulated. It already has its own Reader and Writer interfaces (I'd probably ignore the FS argument or somehow hide BK-specific stuff in a constructor), and a Metric helper class. Excluding its constructors, HLog only has a few hdfs-specific public methods (hsync, hflush, sync) and some public but static methods.
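           To make that concrete, here is a minimal sketch (hypothetical names, not the code from the attached patch) of a BK-backed writer against a Reader/Writer-style contract, where the FileSystem argument is simply ignored:

           import java.io.IOException;
           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           // Minimal stand-in for the writer contract being discussed.
           interface WalWriter {
             void init(FileSystem fs, Path path, Configuration conf) throws IOException;
             void append(byte[] edit) throws IOException;
             void sync() throws IOException;
             void close() throws IOException;
           }

           // Hypothetical BookKeeper-backed writer: the HDFS specifics are confined to
           // init(), so a BK implementation can ignore the FileSystem and derive its log
           // identity from the path name alone.
           class BookKeeperWalWriter implements WalWriter {
             private String logName;

             @Override
             public void init(FileSystem fs, Path path, Configuration conf) throws IOException {
               // fs is intentionally unused: BookKeeper exposes ledgers, not a FileSystem.
               this.logName = path.getName();
               // open/create a ledger keyed by logName with the BookKeeper client (elided)
             }

             @Override public void append(byte[] edit) throws IOException { /* addEntry to the ledger */ }
             @Override public void sync()  throws IOException { /* wait for outstanding adds */ }
             @Override public void close() throws IOException { /* close the ledger */ }
           }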

          Jesse Yates added a comment -

           personally, I'd prefer to see a logical higher-level WAL interface and then have an FsWAL subclass that might have some more specifics to make an easier match to an underlying FS implementation (hadoop or otherwise). Maybe a bit more pain upfront, but it will (likely) pay off in the end.

          Flavio Junqueira added a comment -

           Hi Jonathan, a BK backend for logging can potentially deal with the issues you're raising. The main issue is that it doesn't expose a file system interface, and HBase currently seems to rely upon one. If we have an interface that matches the BK API more naturally, then we will possibly be able to deal with the issues you're raising without much effort.

          I was actually trying to implement a BookKeeperFS class that extends FileSystem, but it doesn't feel natural. I was wondering if we should try to work on the interface first instead of hacking it in by implementing an HLog.Reader and an HLog.Writer.

          Jonathan Hsieh added a comment -

          @flavio

           Li's been inactive for a while, so I'd suggest taking HBASE-5937 and making the WAL an interface. This would allow for experimentation here and with the other alternate-WAL-implementation JIRAs.

           HBASE-6116 is about potentially improving the latency when writing to a single hdfs file. Normally writes to hdfs get pipelined to 2 other replicas on 2 different data nodes, and this incurs some overhead. The intuition is that for small writes the overhead may cause more latency than the data transfer itself, and the hope is that by having the client send to the replicas in parallel (broadcast) we'd reduce the overhead latency.

           HBASE-5699 is about improving throughput and latency by writing to multiple hdfs files. This attacks the scenario where a data node may be blocked (gc/swapping/etc), causing all other writes to be blocked. Currently a region server serves many regions, but all of a region server's writes go to a single WAL file, with writes to different regions intermingled.

           Let's say these writes normally take 10ms but in a gc case one ends up taking 1000ms. In one scenario, with a wal per region, writes to another region could go to a different hdfs file and not get blocked.

           In another, we'd have two wal files for the region server and we could detect that one write was blocking (let's say after 20ms) and then retry against another set of data nodes. Ideally we've improved our worst case from 1000ms to 30-40ms.

           Ted Yu added a comment - edited

          Here is the JIRA that tracks parallel hdfs writes:
          HBASE-6116 Allow parallel HDFS writes for HLogs

           HBASE-5699 would allow multiple WALs per region server. Grouping of edits can be per table, or there can be one WAL per region.
          WAL interface abstraction appears in the discussion there.

          Feel free to propose WAL interface for this JIRA.

          Flavio Junqueira added a comment -

           Hi Ted, I don't fully understand what HBASE-5699 is proposing. It mixes parallel writes and HDFS. From my perspective, we are able to perform efficient parallel writes out of the box, and we can benefit from an interface that exposes enough information for us to separate the edits into different logs (e.g., one per region). If the outcome of HBASE-5699 is a set of changes to the interface that allow this to happen, then we could benefit from it. For a first cut of this issue, I don't think we need parallel writes, though, so I would say it is not dependent, but I'd be happy to hear a different opinion.

          Ted Yu added a comment -

          Nice summary, Flavio.

           bq. to have logs per region with BookKeeper

          HBASE-4529 has not been active.
          HBASE-5699 is the one receiving attention recently.
          Does this JIRA depend on HBASE-5699 ?

          Flavio Junqueira added a comment -

          Apologies for not being very active here. Yesterday we had a chat with Stack about moving forward with this jira. Here are a few key points I got from the discussion:

          • To use the reader and writer interfaces, we need to implement a BookKeeper filesystem because the init method expects such an object;
           • The RegionServer maintains a list of existing logs and we need to map them to ledgers. The easiest way is probably to maintain this map in ZooKeeper and manage it through the Filesystem implementation (a rough sketch follows this list);
           • Currently there is one single FileSystem object in the RegionServer and we would need to have at least two, one being used for BookKeeper;
           • Stack suggested that it would be a great idea to have logs per region with BookKeeper. BookKeeper is designed exactly to serve use cases that have multiple concurrent logs being written to. According to Stack, this feature has the potential of reducing time to recover significantly. Enabling this feature might require some more changes to the wal interface, though, since we would need to separate the edits according to region. (Related to HBASE-4529?)
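           A rough sketch of how that log-to-ledger map could be kept in ZooKeeper (hypothetical layout, nothing decided yet):

           import java.nio.charset.Charset;
           import org.apache.zookeeper.CreateMode;
           import org.apache.zookeeper.KeeperException;
           import org.apache.zookeeper.ZooDefs;
           import org.apache.zookeeper.ZooKeeper;

           // One znode per log under a fixed root; the znode data is the ledger id.
           class LedgerMap {
             private static final Charset UTF8 = Charset.forName("UTF-8");
             private final ZooKeeper zk;
             private final String root;   // e.g. a made-up root such as "/hbase/bk-wals"

             LedgerMap(ZooKeeper zk, String root) { this.zk = zk; this.root = root; }

             // Record which ledger backs a given log name.
             void put(String logName, long ledgerId) throws KeeperException, InterruptedException {
               byte[] data = Long.toString(ledgerId).getBytes(UTF8);
               zk.create(root + "/" + logName, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
             }

             // Look up the ledger to replay for a log name.
             long get(String logName) throws KeeperException, InterruptedException {
               byte[] data = zk.getData(root + "/" + logName, false, null);
               return Long.parseLong(new String(data, UTF8));
             }
           }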

          Let us know if anyone has a comment, and otherwise we will push it forward.

          Flavio Junqueira added a comment -

           Hi Zhihong, thanks for bringing this up, and yes, I'm still interested in providing an updated patch. I'll work on one.

          Ted Yu added a comment -

          @Flavio:
          I searched Omid code and didn't find BookKeeperLogWriter being used.

           If you still prefer to add BookKeeperLog{Writer,Reader} to HBase, please provide an updated patch (including pom.xml).

          stack added a comment -

          No.

          If you want us to switch to an interface, just say (will happen faster if you put up a patch).

          Benjamin Reed added a comment -

          we looked into the problem of figuring out the path to use for the WAL and found the following: it appears that the assumption that the WAL is stored in HDFS is embedded in HBase. when looking up a WAL, for example, the FileSystem object is used to check existence. Deletion of logs also happens outside of the WAL interfaces. to be truly pluggable a WAL interface should be used to enumerate and delete logs. have you guys thought about doing this?

          stack added a comment -

           bq. I have started benchmarking hbase+bookkeeper with the Yahoo! benchmark, but I stumbled upon a problem. When I wrote this preliminary patch, I didn't have a nice way of creating a unique znode name and keeping it for fetching the last log of a region server, given the current interface of WAL. My understanding of the code is that the hdfs file where the log is stored is passed to the reader/writer. I'm considering the current hdfs path as the znode path (or at least a part of it), but I would accept suggestions if anyone is willing to give me one.

          The HDFS path should map to a znode path? That'd work. The WAL HDFS path has the regionserver name in it IIRC. You probably don't need the full thing. Just from the regionserver name on down (should be shallow).
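           Something like the following would probably do (just a sketch with a made-up helper and root, not actual code):

           import org.apache.hadoop.fs.Path;

           // Keep only the regionserver directory and the log name from the WAL's HDFS
           // path, and hang them under a fixed znode root.
           final class WalZNodeNames {
             static String toZNodePath(Path hdfsWalPath, String znodeRoot) {
               String regionServerDir = hdfsWalPath.getParent().getName();
               String logName = hdfsWalPath.getName();
               return znodeRoot + "/" + regionServerDir + "/" + logName;
             }
           }

           So a WAL at .../.logs/<regionserver>/<logname> would map to <znodeRoot>/<regionserver>/<logname>.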

          Let us know if you want us to change stuff in here around WAL. You might want to wait a day or so till hbase-2437 goes in (refactoring of hlog). Stuff should be easier to grok after that patch has had its way.

          Is HBASE-2527 of any use to you? It just went into trunk.

          Flavio Junqueira added a comment -

          Just a couple of quick points about the developments of this jira:

          1. We have introduced deletion and garbage collection of ledgers to BookKeeper (ZOOKEEPER-464);
          2. Our next goal as far as features for bookkeeper goes is ZOOKEEPER-712 (see my previous post);
           3. I have started benchmarking hbase+bookkeeper with the Yahoo! benchmark, but I stumbled upon a problem. When I wrote this preliminary patch, I didn't have a nice way of creating a unique znode name and keeping it for fetching the last log of a region server, given the current interface of WAL. My understanding of the code is that the hdfs file where the log is stored is passed to the reader/writer. I'm considering the current hdfs path as the znode path (or at least a part of it), but I would accept suggestions if anyone is willing to give me one.
          Flavio Junqueira added a comment -

          Hi Todd, Great points:

           1. We plan to have a mechanism to re-replicate the ledger fragments of a bookie; some of our applications require it. There is a wiki page I had created with some initial thoughts, but I need to update it: http://wiki.apache.org/hadoop/BookieRecoveryPage. I also thought there was a jira open about it already, but I couldn't find it, so I created a new one: ZOOKEEPER-712.
           2. We currently don't have a notion of rack awareness, but it is a good idea and not difficult to add. We just need to have a bookie add that information to its corresponding znode and use it during the selection of bookies upon creating a ledger. I have also opened a jira to add it: ZOOKEEPER-711.

           On the documentation, please feel free to open a jira and mention all points that you believe could be improved.

          Todd Lipcon added a comment -

          Thanks for attaching the design doc, Flavio. Here are a few questions generally about BK (would be great to address these in the docs too, not just this JIRA):

          • Logs are stored replicated on several bookies. If a bookie goes down or loses a disk, the replication count obviously decreases. Will BK re-replicate the "under-replicated" parts of the log files? Or do we assume that logs are short-lived enough that we'll never lose all the replicas during the span of time for which the log is necessary?
          • Similar to the above, is there any notion of rack awareness? eg does it ensure that each part of the log is replicated at least on two racks so that a full rack failure doesn't lose the logs?

          Basically on a broad level what I'm asking is this: HDFS takes a lot of care to ensure availability of data "forever" across a variety of failure scenarios. From the design doc it wasn't clear that BK provides the same levels of safety. Can you please clarify what features (if any) are missing from BK that are present in HDFS with regard to reliability, etc?

          Flavio Junqueira added a comment -

          Todd, I'm attaching the BK overview document as promised.

          Flavio Junqueira added a comment -

          Ryan: I don't have much to add to what Ben said in his comment. I just wanted to mention that in the current patch, I have added a configuration property to set the number of replicas for each write:

          hbase.wal.bk.quorumsize
          

          the default is 2.
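           For example, overriding it in hbase-site.xml would look like the other properties (the description text below is just illustrative, not from the patch):

           <property>
             <name>hbase.wal.bk.quorumsize</name>
             <value>3</value>
             <description>Number of bookies each WAL entry is written to; the patch default is 2.</description>
           </property>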

           Andrew: As we reduce the number of bytes in each write, the overhead per byte increases, so batching writes and appending writes on the order of 1 kbyte would give us a more efficient use of the BK client. Achieving 1M ops/s over 100 nodes (or larger values if you will) depends on the length of writes, the replication factor, and the amount of bandwidth (both I/O and network) you have available. In our observations, it is not a problem for a client to produce more than 10k appends/s of 1-4 kbytes, so in your example, it is just a matter of provisioning your system appropriately with respect to disk drives and network.

          Benjamin Reed added a comment -

           to get performance you need a dedicated drive, not necessarily dedicated hardware. bk takes care of the batching for you. there is opportunistic batching that happens in the network channel and to the disk. (the latter being the most important.)

          i don't think ryan's persistence question has been answered. when bk responds successfully to a write request, it means that the data has been synced to the requisite number of devices. when testing we usually use 3 or 5 bookies with a replication factor of 2 or 3. (bookkeeper does striping.)

          we have an asynchronous write interface to allow high throughput by pipelining requests, we invoke completions as each write completes successfully.

          Andrew Purtell added a comment -

          @Flavio: Thanks for your kind reply.

           bq. In fact, with a previous version of BK, we were able to get 70k writes/s for writes of 128 bytes

          Dedicated hardware, I presume. How many BK nodes would be needed alongside HBase+Hadoop nodes to achieve 1M ops/sec over 100 concurrent ledgers? HBase WAL has group commit but that can only batch so much... depends on the order of edits coming from userland. So what amount of batching? (Average transaction size to BK for example.)

          Flavio Junqueira added a comment -

          Thanks for all comments and questions so far. Let me try to answer all in a batch:

          Stack: We are certainly interested in exploring, and learning your requirements would certainly help to guide our exploration. On making BK transparent to the user installing BK, I don't have any reason to think that it would be difficult to achieve. The script I use to start a bookie is very simple and we'll soon have the feature I was mentioning above of a bookie registering automatically with ZooKeeper. In any case, I made sure that it is possible to use BK optionally, so hdfs is still used by default with the current patch.

           Ryan, Andrew: The number of nodes going from 20 to 100 is not a problem; we have tested BK with a few thousand concurrent ledgers. One important aspect to pay attention to is the length of writes. In our experience, BK is more efficient when we batch small requests. If it is possible to batch small requests, then I don't see an issue with obtaining the throughput figures you're asking for. In fact, with a previous version of BK, we were able to get 70k writes/s for writes of 128 bytes.

          Todd: There is some documentation available on the ZooKeeper site, and we have added more for the next release (3.3.0). You can get the new documentation from trunk, but I'll also add a pdf here for your convenience.

          Mahadev konar added a comment -

          todd,
           here is some documentation that might help you understand the architecture of BookKeeper:

          http://hadoop.apache.org/zookeeper/docs/r3.2.2/

           this documentation has been improved in 3.3 (not branched yet), so you can check out

          http://svn.apache.org/repos/asf/hadoop/zookeeper/trunk/docs/bookkeeperOverview.html

          and read that as well.

          hope this helps.

          Todd Lipcon added a comment -

          Hey Flavio,

          Would you mind pointing us to (or uploading) a design doc / architectural overview for BK? I think that would be very helpful for us to understand the scaling characteristics better.

          -Todd

          Andrew Purtell added a comment -

          On performance, how about 100-200k ops/sec with data sizes about 150 bytes or so? This would be aggregately generated on 20 nodes.

          Extend that to 100. What does this look like? Can BK cope?

          stack added a comment -

          One other thing, can you speak to the following?

          HBase gets dissed on occasion because we are 'complicated', made up of different systems – hdfs, zk, etc. – so adding a new system (bk to do wals) would make this situation even worse?

          In our integration of zk, we did our best to hide it from user. We manage start up and shutdown of ensemble for those who don't want to be bothered with managing their own zk ensemble. Would such a thing be possible with a bk cluster?

          ryan rawson added a comment -

          A few more questions...

          What's the persistence story? How many nodes is the log data stored on?

          On performance, how about 100-200k ops/sec with data sizes about 150 bytes
          or so? This would be aggregately generated on 20 nodes.

          stack added a comment -

          Thanks for the interesting comments.

          WAL is per regionserver, not per region.

           On WAL throughput: since the 'append' to the WAL is what takes the bulk of the time making an hbase update, depending on hardware and how often we flush, we'll see variously 1-10k writes a second. We will see hdfs holding up writes on occasion when struggling, with latency going way up.

          I'm up for more exploration if you fellas are. Let me take a look at this patch of yours Flavio.

          Thanks for happening by lads.

          Benjamin Reed added a comment -

           one thing that would be great for BookKeeper is to try to characterize your load and then run some saturation tests to see how the performance looks. previously we varied message sizes from 128 bytes to 8K. should we try something like 100 bytes, 1000 bytes, 10K, 100K, 1M?

           we should also vary the number of clients writing. what is a good number? 10-1000 or 10-10000? is the WAL per server or per region?

          do you have WAL throughput numbers for the throughput you are getting from HDFS?

          just to give you an idea, we are in the 10s of thousands of messages/sec for 1K messages. (the performance increases as you add more servers, aka bookies.) for larger messages it really is the network bandwidth that becomes the bottleneck.

          Flavio Junqueira added a comment -

          Those are good questions, Stack. BookKeeper scales throughput with the number of servers, so adding more bookies should increase your current throughput if your traffic is not saturating the network already. Consequently, if you don't have a constraint on the number of bookies you'd like to use, your limitation would be the amount of bandwidth you have available.

          Just to give you some numbers, we have so far been able to saturate the network when writing around 1k bytes per entry, and the number of writes/s for 1k writes is of the order of tens of thousands for 3-5 bookies. Now, if I pick the largest numbers in your small example to consider the worst case (5 nodes, 1MB writes, 5k writes/s), then we would need a 40Gbit/s network, so I'm not sure you can do it with any distributed system unless you write locally in all nodes, in which case you can't guarantee the data will be available upon a node crash. Let me know if I'm misinterpreting your comment and you have something else in mind.

          I also have to mention that we added a feature to enable thousands of concurrent ledgers with minimal performance penalty on the writes, so I don't see any trouble in increasing the number of concurrent nodes as long as the BookKeeper cluster is provisioned accordingly. Of course, it would be great to measure it with hbase, though.

          stack added a comment -

           Thank you for this work, Flavio. Before going to the patch: a few of us, when BookKeeper first showed up, had this notion of writing the HBase WAL to a BK ensemble, but we never did any work on it because we figured it just wouldn't be able to keep up.

           For example, posit a smallish hbase cluster of 3-5 nodes, each taking on writes of 10 bytes-1MB of edits at rates of 100-5k writes a second. Would BK be able to keep up? What if it was a 10-20 node cluster? What do you think?

          Flavio Junqueira added a comment -

          To compile hbase with this patch, it is necessary to add the bookkeeper jar to the lib folder. I've attached the jar to this jira for convenience, but you can also get it and compile from zookeeper/trunk.

          This patch requires a few new configuration properties on hbase-site.xml:

           <property>
             <name>hbase.wal.bk.zkserver</name>
             <value>xxx.yyy.zzz:10281</value>
           </property>
           <property>
             <name>hbase.regionserver.hlog.reader.impl</name>
             <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogReader</value>
           </property>
           <property>
             <name>hbase.regionserver.hlog.writer.impl</name>
             <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogWriter</value>
           </property>

           The zookeeper server (or servers) used with bookkeeper can be the same as the one used with hbase, but the value of the property has to be set regardless of which server you're using. If this is not ok, we may consider changing it in the next iteration, assuming there will be one.

          It is also important to create the following znodes in the zookeeper instance pointed by hbase.wal.bk.zkserver:

          /ledgers
          /ledgers/available/
          /ledgers/available/<bookie-addr>
          

          where "/ledgers/available/<bookie-addr>" is a node representing a bookie and the format of <bookie-addr> should be "xxx.yyy.zzz:port". Consequently, there must be one such znode for each available bookie. We have a jira open to make bookie registration automatic, and it should be available soon.

          I have been creating these znodes manually for my tests, but I acknowledge that a script to bootstrap the process would be quite handy.
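           As a rough illustration of what such a bootstrap step could look like with the plain ZooKeeper Java client (just a sketch, not part of the patch):

           import org.apache.zookeeper.CreateMode;
           import org.apache.zookeeper.ZooDefs;
           import org.apache.zookeeper.ZooKeeper;

           // Creates the znodes listed above for a fixed set of bookies. A real script
           // would need retries, a proper Watcher, and error handling.
           public class BookieZNodeBootstrap {
             public static void main(String[] args) throws Exception {
               String zkServer = args[0];   // the value of hbase.wal.bk.zkserver
               ZooKeeper zk = new ZooKeeper(zkServer, 10000, event -> { });
               createIfMissing(zk, "/ledgers");
               createIfMissing(zk, "/ledgers/available");
               for (int i = 1; i < args.length; i++) {
                 createIfMissing(zk, "/ledgers/available/" + args[i]);   // args are "xxx.yyy.zzz:port"
               }
               zk.close();
             }

             private static void createIfMissing(ZooKeeper zk, String path) throws Exception {
               if (zk.exists(path, false) == null) {
                 zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
               }
             }
           }

           It would be run once, passing the ZooKeeper address followed by the list of bookie addresses.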

          Finally, I would be happy to get some feedback and suggestions for improvement.

          Flavio Junqueira added a comment -

          Attaching BookKeeper jar for testing purposes. The same jar can be also obtained from the ZooKeeper trunk.

          Flavio Junqueira added a comment -

          I'm attaching a preliminary patch. I haven't tested extensively, but it works in my small setup with one computer running hbase+dfs, three bookies, and one zookeeper server.


            People

             • Assignee: Unassigned
             • Reporter: Flavio Junqueira
             • Votes: 0
             • Watchers: 37
