HBASE-2170

hbase lightweight client library as a distribution

    Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      As a wish - it would be nice to have an hbase client library (a subset of the current hbase distribution) containing only what needs to be present on the hbase client side to interact with the master/region servers.

      From an app-integration standpoint, users of hbase could then link against just the client library instead of pulling in the entire distribution, as sketched below.
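
      A minimal sketch of the wished-for end state, assuming a hypothetical slim client-only artifact were published (the coordinates and version below are illustrative, not an existing artifact):

          <!-- Hypothetical: no such client-only artifact exists yet. -->
          <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>0.21.0</version>
          </dependency>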

          Activity

          Paul Smith added a comment -

          One of the things I was actually starting to look at was related to this: over on HBASE-2051 there is a utility library, one tiny piece of which is an HBaseConfigurationFactory class for Spring users (such as myself).

          In many cases the lightweight client library only needs the basic parameters: the zookeeper location and the root path on HDFS.

          
          // Minimal client-side configuration: where the cluster's root
          // directory lives and how to reach the ZooKeeper quorum.
          Map<String, String> propertyMap = ImmutableMap.of(
                  "hbase.rootdir", "file:///tmp/hbase-${user.name}/hbase",
                  "hbase.cluster.distributed", "false",
                  "hbase.zookeeper.quorum", "localhost");

          HBaseConfigurationFactory factory = new HBaseConfigurationFactory(propertyMap);

          HBaseConfiguration config = (HBaseConfiguration) factory.getObject();
          
          

          I think it's less about needing a smaller library and more about a cleaner interface for creating a logical 'connection' to the HBase cluster.

          ryan rawson added a comment -

          in theory a client will only need the client/ and ipc/ directories and maybe a few more. Also zookeeper. And some other libraries.

          Paul Smith added a comment -

          in theory a client will only need the client/ and ipc/ directories and maybe a few more. Also zookeeper. And some other libraries.

          that would make it simple to pull those packages out into a 'hbase-client' Maven module (see HBASE-2099). We could start by having it depend on hbase-core, excluding all transitive dependencies, and then adding back in the specific ones needed (zookeeper et al) - along the lines of the sketch below.
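
          A rough sketch of that wiring, with illustrative coordinates and versions (the wildcard exclusion requires a newer Maven; with older versions each transitive dependency has to be excluded explicitly):

              <!-- Hypothetical hbase-client POM fragment: depend on hbase-core,
                   drop its transitive dependencies, then add back only what the
                   client actually needs. -->
              <dependency>
                <groupId>org.apache.hbase</groupId>
                <artifactId>hbase-core</artifactId>
                <version>0.21.0-SNAPSHOT</version>
                <exclusions>
                  <exclusion>
                    <groupId>*</groupId>
                    <artifactId>*</artifactId>
                  </exclusion>
                </exclusions>
              </dependency>
              <dependency>
                <groupId>org.apache.zookeeper</groupId>
                <artifactId>zookeeper</artifactId>
                <version>3.2.2</version>
              </dependency>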

          Karthik K added a comment -
          that would make it simple to pull those packages out into a 'hbase-client' Maven module (see HBASE-2099). We could start by having it depend on hbase-core, excluding all transitive dependencies, and then adding back in the specific ones needed (zookeeper et al).

          +1. Along ryan's suggestion - it might need some minor refactoring to carve out the separate package, but it might be worth the effort so that releases can focus on the server side.

          Lars Francke added a comment -

          Now that we've moved to Maven I had a look at this. While it might be possible to build custom assemblies for the client, it won't be easy to automate - I hope Kay Kay or Paul can comment on that, as they have more experience with this.

          We're also looking at all the dependencies and marking them as "optional" if they are only needed for certain features (client, server, ...). But it's Maven's philosophy that an optional dependency is a sign of an "inefficient" build layout and that the project should be split into multiple modules.

          I haven't taken a close look at this and I'm not very familiar with the HBase source code, but it seems to me that it would make sense to split the current "core" module into at least three parts: common, client and server. This would let us bundle all the classes used by both components in the common module, which hopefully has few or no external dependencies.

          I know that this would mean yet another big change to the source code layout, so take it as a proposal for one possible solution to the problem. I've not been with the project for very long, but I've seen questions about how to use HBase from Java multiple times. This would hopefully make that easier and clear things up a bit.

          If you decide this is a good idea I'd be willing to spend some time on it.
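
          For concreteness, the three-way split described above would amount to a parent POM along these lines (a sketch only; module names as proposed):

              <modules>
                <module>hbase-common</module>
                <module>hbase-client</module>
                <module>hbase-server</module>
              </modules>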

          Karthik K added a comment -
          current "core" module in at least three parts: common, client and server.
          .. the common module which hopefully has none or very little external dependencies.

          +1 . That is something along the lines of what I was thinking.

          It is true that it might need another round of code refactoring. I was thinking of putting the ipc and client packages, plus thrift/generated (and the stargate client), in the client module.

          I believe it could be very useful long term for adoption to have the client available on its own, with a specific focus on examples, while more attention goes to the server module. And yes - the client library would be much more lightweight than the entire app server.

          I can spend some time on this one too; now that we have a stable build, it might be good to have a look at this.

          dhruba borthakur added a comment -

          It appears that from a build perspective (inefficient maven dependencies, etc.) you would want to split hbase into multiple jars. But from a deployment & operations perspective, it is a nightmare. Can somebody please enumerate which jars are currently required by an hbase client application, and which jars will be needed after this patch?

          Also, can somebody who is using hbase vouch for the fact that the splitting of jars is helpful?

          The analogy I draw is that the hadoop libraries are the same for the hadoop clients as well as the servers. I have found it to be a great help for operational purposes: there is less chance of mixing up incompatible versions.

          ryan rawson added a comment -

          common would depend on hadoop-core, hadoop-hdfs at least (and possibly hadoop-mapred, alas).

          and i somewhat agree with dhruba: in a world of copying jars around, a single jar reduces the mess-up possibility significantly. In an automated-ops world this is somewhat less important, I hope.

          Karthik K added a comment -
          But from a deployment & operations perspective, it is a nightmare.

          Why? It is a nightmare now. Theoretically, if there were a client for HBase doing some M-R / inserts / scans, we have to add the hbase-*.jar and everything in lib/*.jar as well, because it is monolithic. Currently we just add hbase-0.xx.jar and then add other jars as required until there is no ClassNotFoundException, which is really not the job of the client developer; it is the job of the system publisher to make the distinction.

          I guess that holds true for any client/server system: publish a lightweight client so that server development can proceed on its own release cycle, depending on the need.

          Can somebody please enumerate which jars are currently required by an hbase client application, and which jars will be needed after this patch?

          The point is to shift the responsibility from the hbase client user to the hbase maintainers to decide which dependencies need to ship with the client, so the client does not need to follow the 'add a jar from lib until no ClassNotFoundException occurs' algorithm. Off the top of my head, I can think of log4j / zk / thrift / rest coming in here.

          Also, can somebody who is using hbase vouch for the fact that the splitting of jars is helpful?

          To begin with, we do. We have an HBase farm set up, with the data set being inserted from the outside world, and independent developers who build on top of HBase (on top of the ramp-up curve they already face to get onto the platform) are plain confused by the list of jars to add to the client namespace. It makes no sense to add, say, hdfs.jar to the client lib of hbase if all they wanted was to run some scans / M-R over the hbase data; that information is immaterial to them. We have plans to start using mahout to build algorithms on top of the platform, and it makes no sense whatsoever to surface hbase's hidden dependencies to the hbase client and discourage those developers entirely.

          The analogy I draw is that the hadoop libraries are the same for the hadoop clients as well as the servers. I have found it to be a great help for operational purposes: there is less chance of mixing up incompatible versions.

          But the hbase client is not meant to be on the same machine as a server, is it? For operational purposes you will have a different set of jars for the client and the server. And the common jar would not be released explicitly, but would be part of the client and the server as appropriate.

          in a world of copying jars around, a single jar reduces the mess-up possibility significantly. In an automated-ops world this is somewhat less important, I hope.

          The client changes far less often than the server. If there were 2 jars - client and server - wouldn't that make it clearer to developers and ops which jar needs to be replaced and which machines are affected?

          Paul Smith added a comment -

          btw, there's no reason not to produce an 'hbase-uber' artifact that combines everything into one as there is now; that can be done with a fairly simple assembly (see the sketch below). Those that need/like having slimmer client libraries can still do that; those that feel safer with a fat jar, go right ahead.
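
          For illustration, a minimal such assembly using the maven-assembly-plugin's stock jar-with-dependencies descriptor (the plugin wiring below is an assumption, not the project's actual build):

              <!-- Hypothetical POM fragment: bundle everything into a single 'uber' jar. -->
              <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                  <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                  </descriptorRefs>
                </configuration>
                <executions>
                  <execution>
                    <!-- Build the fat jar as part of 'mvn package'. -->
                    <phase>package</phase>
                    <goals>
                      <goal>single</goal>
                    </goals>
                  </execution>
                </executions>
              </plugin>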

          Karthik K added a comment -

          $ mvn jdepend:generate

          The generated jdepend-report.xml can then be turned into HTML with a small ant script:

          <project name="xml2html" default="change">
            <target name="change">
              <style basedir="." destdir="."
                     includes="jdepend-report.xml"
                     style="${ant.home}/etc/jdepend.xsl" />
            </target>
          </project>

          The resulting report makes an interesting starting point for estimating the dependencies before venturing along this route.

          Karthik K added a comment -

          The report is a bit misleading because java/javax and other external packages outside hbase are not ignored. Will submit a new one after fixing that.

          dhruba borthakur added a comment -

          +1 to Paul Smith's proposal that we create an hbase-uber artifact. Can this be done as part of this patch, Kay Kay?

          Karthik K added a comment -
          btw, there's no reason not to produce an 'hbase-uber' artifact that combines everything into one as there is now; that can be done with a fairly simple assembly. Those that need/like having slimmer client libraries can still do that; those that feel safer with a fat jar, go right ahead.

          Thanks Paul for the hint.

          +1 to Paul Smith's proposal that we create an hbase-uber artifact. Can this be done as part of this patch, Kay Kay?

          Sure, Dhruba.

          There will be some work to move around some classes/methods to achieve a decent separation between client / common / server before we venture down the road of 3 separate artifacts. So yes - there will be an option for the single jar as well, for distribution.

          stack added a comment -

          An argument for making hbase-common, hbase-server, and hbase-client that is not really mentioned above: if we adopt these componentizations, and follow through with them, we'll end up with a software base that is easier to maintain than the hairball we currently have.

          I'm good w/ going this route. What gives me pain is that we'd have to do it using mvn modules. Having ejected all contribs, we have the opportunity just now to make our mvn build simple; we could undo module support. But module support, I think, is required to do the above componentization.

          Karthik K added a comment -

          As the first step (in the current code base), refactoring the code, taking the jdepend reports as a cue (w.r.t. the stability of the packages), might be a good place to begin; there we would separately identify the list of packages to be grouped.

          After that is done, it would be easier to split the modules into server, client and common artifacts.

          ryan rawson added a comment -

          would we need to do common, client, server? could we do:

          hbase-client (contains o.a.h.h.client and o.a.h.h.util & others)
              ^
              |
          hbase-server (everything else)

          Bruno Dumon added a comment -

          In case this is useful for someone: as a temporary solution in Lily we have a hbase-client wrapper project which defines the necessary excludes:

          http://dev.outerthought.org/svn_public/outerthought_lilycms/trunk/hbase-client/

          This might save some time for someone who wants to do the same.

          As I found it too much work to enter all these exclusions manually, and to be sure no new dependencies sneak in when switching to newer hbase versions, I made a little Maven plugin:

          http://dev.outerthought.org/svn_public/outerthought_lilycms/trunk/tools/hbase-exclusions-plugin/

          Lars Francke added a comment -

          I think I would spend some time on this but only if we have a decision on how to go forward.

          I'd strongly prefer a modularization of HBase for various reasons. Stack laid out a few of them in terms of software design and maintainability.
          My first step would be to see if we need two or three modules (as per Ryan's latest comment) and go forward from there.

          stack added a comment -

          @Lars What are you thinking? A client rewrite? (Out of interest, have you seen https://github.com/stumbleupon/asynchbase?) What other components are you thinking of?

          Lars Francke added a comment -

          I had hoped that a client rewrite would not be needed. I did take a look at it a few months back and the interdependencies weren't as bad as I had thought at first. I might be wrong though.

          I've seen that asynchronous client but never used it. Very promising though! Are you thinking about bringing it into the core?

          What I would do is to identify everything that the client package depends on and then try to find a way to limit its dependencies to the absolute minimum and create a Maven module out of it. This might involve some code changes but without taking a closer look it's hard to say.

          Does this modularization need a vote on the mailing list? Just to see if you want to go ahead with it or not.

          stack added a comment -

          @Lars I don't think we'll be bringing asynchbase into core (Benoît won't let us! Ahead of his refusal, he makes a good argument that it makes no sense bringing it in). I think a separate maven module would be necessary for making the separation; that is probably the first step, or rather the second step. The first step is finding an advocate willing to take on this gnarly issue.

          There would be a few advantages to extracting the client. Way back, the gumgum lads tried to make the client/server go against Interfaces only. Currently our Interface has zk pollution, a pollution that has gotten worse since the gumgum boys last tried it. Cleaning that up might be the third step we'd take in making a separate client lib.

          stack added a comment -

          Modularization would be pretty disruptive. I'd say we should wait on the branching of 0.92 before we do it? What do you think?

          Todd Lipcon added a comment -

          I am +1 on this getting done "at some point". Won't comment on pre- or post-0.92.

          We could get started on the necessary refactoring before doing the build changes, though, by using a tool like JDepend to analyze inter-class and inter-package dependencies. Anyone have experience with this? (I don't, I've just heard of it)
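
          A minimal sketch of wiring JDepend into the build, assuming the Codehaus jdepend-maven-plugin (the same plugin behind the mvn jdepend:generate run earlier in this thread; coordinates and version are assumptions):

              <!-- Hypothetical reporting configuration; `mvn jdepend:generate`
                   or `mvn site` would then produce the package-dependency report. -->
              <reporting>
                <plugins>
                  <plugin>
                    <groupId>org.codehaus.mojo</groupId>
                    <artifactId>jdepend-maven-plugin</artifactId>
                    <version>2.0-beta-2</version>
                  </plugin>
                </plugins>
              </reporting>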

          stack added a comment -

          We have (or can easily get) a license for http://www.headwaysoftware.com/products/structure101/index.php. They are Irish, so it must be good software.

          Benoit Sigoure added a comment -

          @Lars I don't think we'll be bringing asynchbase into core (Benoît won't let us! Ahead of his refusal, he makes a good argument that it makes no sense bringing it in)

          Yeah, I don't think asynchbase will ever be part of core.

          • asynchbase is .. well.. async. Most programmers don't feel comfortable with this programming paradigm. Plus Java has extremely poor support for asynchronous programming (because the syntax required to write callbacks is extremely verbose and pollutes the code and because the standard libraries don't play nicely with asynchronous programming). I still think parallel asynchronous programming is the best option for truly scalable application servers, but I can also understand that average programmers need simple APIs that use paradigms they're used to.
          • Having said that, an async interface can easily be turned into a sync interface (the opposite isn't true), but I have no interest in maintaining a sync interface that wraps the asynchbase interface.
          • asynchbase uses a license that is in "Category X" (excluded) for Apache projects (LGPLv3+). I have no interest in debating politics / licenses in this thread. I just want to point out that, contrary to popular belief, LGPLv3+ isn't incompatible with the Apache license (version 2); it's just not allowed by the Apache Software Foundation in Apache projects. A lot of people believe there is an incompatibility due to popular misinformation.
          • I personally do not wish to work with SVN and JIRA. Even with git-svn. I'm currently more than happy to maintain asynchbase on GitHub.

          I hope you understand.

          Having said that, I think asynchbase contains a sane implementation and I'd be happy to answer any question you might have about it. It contains detailed documentation of the HBase RPC protocol and greatly outperforms the traditional HBase client. After switching OpenTSDB to it, I was able to push up to an order of magnitude more write throughput to HBase. When doing a batch import where the data source is single-threaded, I can push up to 200k KeyValues per second to HBase (without WAL) on a 4-year-old CPU.

          dhruba borthakur added a comment -

          > I can push up to 200k KeyValues per second to HBase (without WAL) on a 4-year-old CPU.

          This is an impressive number. Just curious whether you were able to run the same benchmark with the WAL turned on, and what numbers you see then.

          stack added a comment -

          @Benoît Can you post your comment to the dev list? It's a waste having it as a comment in JIRA.

          Benoit Sigoure added a comment -

          This is an impressive number. Just curious whether you were able to run the same benchmark with the WAL turned on, and what numbers you see then.

          Curiously enough, I see the same numbers.

          This is the 1st import I did Thursday (no WAL):

          $ ./src/tsdb import /tmp/data.gz
          [...]
          2011-03-17 18:45:51,797 INFO  [main] TextImporter: ... 1000000 data points in 6688ms (149521.5 points/s)
          2011-03-17 18:45:56,836 INFO  [main] TextImporter: ... 2000000 data points in 5044ms (198255.4 points/s)
          2011-03-17 18:46:01,823 INFO  [main] TextImporter: ... 3000000 data points in 4986ms (200561.6 points/s)
          2011-03-17 18:46:06,848 INFO  [main] TextImporter: ... 4000000 data points in 5025ms (199005.0 points/s)
          2011-03-17 18:46:11,865 INFO  [main] TextImporter: ... 5000000 data points in 5016ms (199362.0 points/s)
          2011-03-17 18:46:14,315 INFO  [main] TextImporter: Processed /tmp/data.gz in 29211 ms, 5487065 data points (187842.4 points/s)
          2011-03-17 18:46:14,315 INFO  [main] TextImporter: Total: imported 5487065 data points in 29.212s (187838.4 points/s)
          

          Note: 1 data point = 1 KeyValue.

          I commented out dp.setBatchImport(true); in TextImporter.getDataPoints and ran the same import again. Note: this isn't exactly an apples-to-apples comparison because I'm going to overwrite existing KeyValues instead of creating new ones. The table has VERSIONS=>1, but I think we disabled major compactions so we don't delete old data (Stack/JD correct me if I'm mistaken about our setup).

          $ ./src/tsdb import /tmp/data.gz
          [...]
          2011-03-19 19:09:36,102 INFO  [main] TextImporter: ... 1000000 data points in 6699ms (149276.0 points/s)
          2011-03-19 19:09:41,101 INFO  [main] TextImporter: ... 2000000 data points in 5004ms (199840.1 points/s)
          2011-03-19 19:09:46,051 INFO  [main] TextImporter: ... 3000000 data points in 4949ms (202061.0 points/s)
          2011-03-19 19:09:51,006 INFO  [main] TextImporter: ... 4000000 data points in 4955ms (201816.3 points/s)
          2011-03-19 19:09:56,017 INFO  [main] TextImporter: ... 5000000 data points in 5010ms (199600.8 points/s)
          2011-03-19 19:09:58,422 INFO  [main] TextImporter: Processed /tmp/data.gz in 29025 ms, 5487065 data points (189046.2 points/s)
          2011-03-19 19:09:58,422 INFO  [main] TextImporter: Total: imported 5487065 data points in 29.026s (189041.3 points/s)
          

          So... this totally surprises me. I expected to see a big performance drop with the WAL enabled. I wondered if I didn't properly recompile the code or if something else was still disabling the WAL, but I verified with strace that the WAL was turned on in the RPC that was going out:

          $ strace -f -e trace=write -s 4096 ./src/tsdb import /tmp/data.gz
          [...]
          [pid 21364] write(32, "\0\0\312\313\0\0\0\3\0\10multiPut\0\0\0\00199\0\0\0\1Btsdb,\0\3\371L\301[\360\0\0\7\0\2;,1300586854474.a2a283a471dfcf5dcda82d05f2d468ed.\0\0\0:\1\r\0\3\371MZ2\200\0\0\7\0\0\216\177\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\1\0\0\0\1\1t\0\0\0(\0\0\6\340\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\0{\177\377\377\377\377\377\377\377\4\0\0\0\0C>\0\0\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\1k\177\377\377\377\377\377\377\377\4\0\0\0\0Cd...
          

          This shows that the WAL is enabled. Having the source of asynchbase's MultiPutRequest greatly helps make sense of this otherwise impossible to understand blob:

          • We can easily see where the region name is, it contains an MD5 sum followed by a period (.).
          • After the region name, the next 4 bytes are the number of edits for this region: \0\0\0: = 58
          • Then there's a byte with value 1, the "versioning" of the Put object: \1
          • Then there's the row key of the row we're writing to: \r\0\3\371MZ2\200\0\0\7\0\0\216 where:
            • \r is a vint indicating that the key length is 13 bytes
            • The first 3 bytes of the row key in OpenTSDB correspond to the metric ID: \0\3\371
            • The next 4 bytes in OpenTSDB correspond to a UNIX timestamp: MZ2\200. Using Python, it's easy to confirm that:
              >>> import struct
              >>> import time
              >>> struct.unpack(">I", "MZ2\200")
              (1297756800,)
              >>> time.ctime(*_)
              'Tue Feb 15 00:00:00 2011'
              
            • The next 6 bytes in OpenTSDB correspond to a tag:
              • 3 bytes for a tag name ID: \0\0\7
              • 3 bytes for a tag value ID: \0\0\216
          • Then we have the timestamp of the edit, which is unset, so it's Long.MAX_VALUE which is \177\377\377\377\377\377\377\377
          • Then we have the RowLock ID. In this case no row lock is involved, so the value is -1L: \377\377\377\377\377\377\377\377
          • Then we have one byte indicating whether or not to use the WAL. In this case, the byte is \1 so the WAL is enabled.

          After undoing my change, to test once more without the WAL, here's the output of strace:

          [pid 21727] write(32, "\0\0\312\313\0\0\0\3\0\10multiPut\0\0\0\00199\0\0\0\1Btsdb,\0\3\371L\301[\360\0\0\7\0\2;,1300586854474.a2a283a471dfcf5dcda82d05f2d468ed.\0\0\0:\1\r\0\3\371MZ2\200\0\0\7\0\0\216\177\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\0\1\1t\0\0\0(\0\0\6\340\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\0{\177\377\377\377\377\377\377\377\4\0\0\0\0C>\0\0\0\0\0,\0\0\0\34\0\0\0\10\0\r\0\3\371MZ2\200\0\0\7\0\0\216\1t\1k\177\377\377\377\377\377\377\377\4\0\0\0\0Cd...
          

          Note that the RPC is exactly the same modulo this byte that indicates whether the WAL is enabled or not. In the last excerpt the byte is \0 (it's the first byte after the first long sequence of \377), which indicates that as expected the WAL is disabled.

          In both cases I can consistently do 200k KeyValue inserts per second. asynchbase was written specifically for high-throughput server applications.

          I'm running the tests above on a machine with a single crappy Intel E5405 (2.00GHz, 1 physical CPU, 4 cores/CPU, 1 hardware thread/core = 4 hw threads total). The HBase cluster I'm writing to is one of StumbleUpon's clusters. It's a fairly small cluster, but its size is largely irrelevant because the keys imported here only live in 3 different regions on 3 different servers. During the import, the client consumes about 130% CPU (or 1 core and a third, if you prefer).

          PS: Stack, I replied here because even though I hate JIRA, it was easier to format my message here than on the mailing list.


            People

            • Assignee: Unassigned
            • Reporter: Karthik K
            • Votes: 4
            • Watchers: 17
