HBase / HBASE-794

Language neutral IPC as a first class component of HBase architecture

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Client, IPC/RPC, master, regionserver
    • Labels: None

      Description

      This issue considers making a language neutral IPC mechanism and wire format a first class component of HBase architecture. Clients could talk to the master and regionserver using this protocol instead of HRPC at their option.

      Options for language neutral IPC include:

        Issue Links

          Activity

          Andrew Purtell added a comment -

          Prevailing project opinion is to wait until Avro is ready and use it. Closing this issue. Reopen if this changes.

          Andrew Purtell added a comment -

          Posting on the AVRO list shows it is better than them all – somehow. Matt Massie @ Cloudera is making C bindings. Chad @ Powerset is trying to get code to dig in.

          Show
          Andrew Purtell added a comment - Posting on the AVRO list shows it is better than them all – somehow. Matt Massie @ Cloudera is making C bindings. Chad @ Powerset is trying to get code to dig in.
          Hide
          Bryan Duxbury added a comment -

          I think it'd be simpler than what you suggest. My thinking is that when we encounter a binary on the wire, if we're set up to do so (right server, right compiler options, etc), then we just capture the offset and length from the underlying buffer. We change the Thrift object API slightly to have the option of returning the buf/off/len object, so that you can do the only copy when you're finally writing it out to wherever it's going. There shouldn't be any "overhead" to operating on binaries. However, for the non-data portion of the stream (the so-called control stream), we'll still have to do things like unpack ints from bytes and the like. Without digging in too much deeper, I think that only binary and maybe strings could benefit from this approach. I'd strongly advise against using strings in your Thrift objects, though, if you care about performance, since UTF-8 en/decoding seems to be a real dog at times.
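A minimal Java sketch of the buf/off/len idea Bryan describes: instead of copying a binary field out of the wire buffer, capture a view over it and defer the one unavoidable copy until the bytes are finally written out. All names here are invented for illustration; real Thrift-generated code differs.

```java
import java.io.ByteArrayOutputStream;

public class ByteSliceDemo {
    /** A zero-copy view over a region of a larger backing buffer. */
    static final class ByteSlice {
        final byte[] buf;
        final int off;
        final int len;
        ByteSlice(byte[] buf, int off, int len) {
            this.buf = buf; this.off = off; this.len = len;
        }
        /** The only copy happens here, at write-out time. */
        void writeTo(ByteArrayOutputStream out) {
            out.write(buf, off, len);
        }
    }

    /** "Deserialize" a binary field by recording its position, not copying it. */
    public static ByteSlice capture(byte[] frame, int off, int len) {
        return new ByteSlice(frame, off, len);
    }

    public static void main(String[] args) {
        byte[] frame = "headerPAYLOADtrailer".getBytes();
        ByteSlice payload = capture(frame, 6, 7);   // view over "PAYLOAD"
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        payload.writeTo(out);
        assert new String(out.toByteArray()).equals("PAYLOAD");
    }
}
```

The point of the design is that no per-field byte[] is allocated while reading; the control-stream fields (ints, field headers) still get unpacked normally.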

          Andrew Purtell added a comment -

          @Bryan: Thanks. If I understand correctly, we'd like to hold on to the byte buffers into which frames were read from the wire, and build some very lightweight structures which index individual KeyValues contained within. We'd also want to avoid any unmarshalling overhead as the data type is plain byte[]. This is like exposing some middle layer of the Thrift stack. (A data stream.) I think it would also be useful if structured data could be mixed inline. (A control stream.) So maybe we'd have some convenient structured metadata at the start of a transaction which can be processed using functions at the top of the Thrift stack, followed by a lot of instances of a KeyValue type, which we could drop down to slice up the backing byte buffer for zero copy from there. I don't know how feasible this is, just thinking out loud here.

          Bryan Duxbury added a comment -

          Composition might be just fine, but I was suggesting that approach so that you never had to copy stuff around - it'd already be there.

          I'll open a ticket with Thrift for buf/off/len approach to byte arrays.

          stack added a comment -

          @Ryan True. I should have thought of that.

          @Bryan Not inherit, but maybe it could have a TCell; but then Thrift is all over our application, not just at the porch. Thrift should take a byte array, offset and length; copying is a waste.

          Bryan Duxbury added a comment -

          I'm not sure that this will work out for you, but you could make Cell inherit from TCell. If you did that, then the Thrift serialization code would be the same, but you could have additional methods on Cell that implemented all your domain specific logic. Obviously, this won't work if Cell needs to extend some other class.

          I have also been thinking if it would be possible for Thrift to take either native byte[] or some sort of buffer and offset/length structure, which is what KeyValue sounds like (correct me if I'm wrong).
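A hypothetical sketch of the inheritance idea above: the domain class extends the (imagined) Thrift-generated struct, so serialization code can treat it as a TCell while application code gets domain-specific methods. The field and method names are invented, not real HBase or Thrift APIs.

```java
public class CellInheritDemo {
    /** Stand-in for a Thrift-generated struct. */
    static class TCell {
        byte[] value;
        long timestamp;
    }

    /** Domain class: same serialized shape, extra behavior. */
    static final class Cell extends TCell {
        boolean isExpired(long now, long ttlMillis) {
            return now - timestamp > ttlMillis;
        }
    }

    public static void main(String[] args) {
        Cell c = new Cell();
        c.timestamp = 1000L;
        assert c.isExpired(5000L, 1000L);
        assert !c.isExpired(1500L, 1000L);
        assert c instanceof TCell;  // serialization path sees a plain TCell
    }
}
```

As Bryan notes, this only works if the domain class does not already need to extend something else, which is exactly the limitation that rules it out for KeyValue below.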

          ryan rawson added a comment -

          KeyValue isn't a subclass of byte[] and never will be...

          This is because a KeyValue isn't just a wrapper over byte[]; it is designed to reference a larger byte buffer. One will always need to use an array offset and length when accessing sub-parts of a KeyValue.

          Andrew Purtell added a comment -

          In a related issue (HBASE-1367), Stack posted this comment:

          On the conversions from Cell and Cell[] to Lists of TCells, I suppose there is no way to avoid this, even in the new stuff where we have lists of KeyValue (though KeyValue is nothing but a byte[] really). If we could have KeyValue subclass "byte[]" then we could pass the list of KeyValues to Thrift - but it's not possible to subclass byte[]. I suppose there is no way to have Thrift use KeyValue lists directly - treat them as containers of byte[]?

          See https://issues.apache.org/jira/browse/HBASE-1367?focusedCommentId=12705881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12705881

          Jonathan Gray added a comment -

          There are two instances where it would be helpful to have an RPC call where we could access the received byte[] directly:

          Sending of PUTs and receiving of GETs. This is being changed in 880 to List<KeyValue> (which does not require serialization/deserialization). I believe, by far, the biggest improvement we could make to help with our GC woes during imports is doing this for writes.

          Just thought we should keep this in mind as we explore rpc stuff.

          Right now, we allocate data for writes at least 3 times on the way in to the memcache. The last two times, it is at the individual keyval level (lots of small allocations). Using KeyValue, we could read the entire buffer into a single byte[] and pass around KeyValue pointers all the way to the Memcache. These would be grouped by row in the case of batching.

          I do not think we need to go down the path of pooling. Changes like above, and moving away from Trees wherever possible as we're working on now, should help tremendously. If we do end up needing more improvements, the optimizations would be isolated to the data structure backing Memcache. I think it'd be a fun project... but i agree that it can bite you hard with a GC and we should avoid it at all costs (short of rewrites in c)
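A minimal sketch of the single-buffer approach Jonathan describes: read the whole batch into one byte[], then walk it once, emitting lightweight (offset, length) pointers instead of copying each cell into its own small array. The length-prefixed framing here is invented for illustration; the real KeyValue wire format differs.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class KeyValuePointerDemo {
    /** A pointer into the shared backing buffer; no data is copied. */
    static final class KvRef {
        final int off, len;
        KvRef(int off, int len) { this.off = off; this.len = len; }
    }

    /** Frame each cell as [int length][bytes] into one buffer. */
    public static byte[] frame(List<byte[]> cells) {
        int total = 0;
        for (byte[] c : cells) total += 4 + c.length;
        ByteBuffer bb = ByteBuffer.allocate(total);
        for (byte[] c : cells) { bb.putInt(c.length); bb.put(c); }
        return bb.array();
    }

    /** One pass over the shared buffer; no per-cell byte[] allocations. */
    public static List<KvRef> index(byte[] buf) {
        List<KvRef> refs = new ArrayList<>();
        ByteBuffer bb = ByteBuffer.wrap(buf);
        while (bb.remaining() >= 4) {
            int len = bb.getInt();
            refs.add(new KvRef(bb.position(), len));
            bb.position(bb.position() + len);
        }
        return refs;
    }

    public static void main(String[] args) {
        byte[] buf = frame(List.of("row1".getBytes(), "row2!".getBytes()));
        List<KvRef> refs = index(buf);
        assert refs.size() == 2;
        assert new String(buf, refs.get(1).off, refs.get(1).len).equals("row2!");
    }
}
```

This is what collapses the "at least 3 allocations per write" down to one buffer read plus cheap pointer objects on the way to the memcache.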

          ryan rawson added a comment -

          As for the API of HBase, there are a number of special cases:

          • get()/scanner()/next() return null values all the time, or 0 length arrays (in the case of next()).
          • nearly all the data is simple, byte[]. Big wins by not copying the same bytes around a dozen times.
          • Our internal 0-copy systems are called KeyValue, and allow many many values to share the same underlying bytes (reading from a hfile) or reducing the number of copies from the RPC -> IO with HDFS.

          The actual implementation of a new RPC is a 0.21 task, so we are still just evaluating things.

          One thing we will not do in the future is serializing trees to the RPC and back again. Our new API may be mostly arrays of KeyValues. Since a client has to be smart, there are no problems with pushing detailed knowledge of how we store columns/values to the client.

          I'm still not sure how much gain we are going to get by pooling/reusing objects in Java... Outsmarting the GC tends to bite you hard.

          Bryan Duxbury added a comment -

          Wow, guess I really should have been watching this issue. I'll try and address some things.

          Returning null: Thrift methods can't return null directly, but they can return a non-null struct with none of its fields set, or a non-null struct with a flag set. This isn't anything new necessarily, but I should note that we do this all over the place at Rapleaf to get around this restriction. You definitely do not need to use exceptions to communicate "null". Moreover, using exceptions this way is probably worse than you think, as I think returning an exception causes the connection to close, at least in some libraries. Also, it might be possible to allow null to be returned by Thrift methods in general, with only C++ unable to return null. If this is a do-or-die issue, please help us out by opening a ticket over on the Thrift JIRA so we can discuss solutions.

          Thrift's Java RPC layer: I did in fact write a bunch of the server layer to use native Java NIO. This code lives in TNonblockingServer (single threaded) and THsHaServer (thread pool) respectively. Both server implementations also add some nice stuff like fixed total read buffer size (to protect server from overload). It's been very robust in our use of the code at Rapleaf so far. I would recommend it on the strength of my experiences.

          Garbage/instantiation cost: Thrift objects are probably a little more memory inefficient than they need to be right now due to some slightly naive implementation decisions, but I've taken some steps toward reducing the overhead of an object. Additionally, you could probably reuse some instances of objects at the top level with almost no work. With a little work in the library, you could probably reuse most objects all the way down your instance's object tree, saving you memory. If you are more interested in this bullet, shoot me an email and we can talk about it in more detail.

          Zero copy system: Right now, Thrift is not zero-copy. I think it would be very cool, though, to create the framework to make that happen. We'd probably only need to make a few transport interface changes. Maybe we should open a ticket?

          Framed Transport: This is very effective at improving the performance of the Thrift IO stuff, especially if you're doing real IO without a buffer somewhere in between. It's also mandatory for using the nonblocking servers.

          Custom protocols: Certainly, if you wanted to, you could write your own Thrift protocol. However, I would say this defeats the purpose of Thrift, in giving you a respectable cross-platform library out of the box. Further, protobuf as a Thrift protocol has been proposed before, and the two systems are not trivially compatible.

          "Raw" RPC: If your goal is to avoid deserializing some stuff, Chad has previously suggested having the ability to specify that you don't want certain fields deserialized. I don't know if this is your objective. If your keys and values are actually just byte arrays on either side, then there isn't any serialization to speak of, beyond the byte[] copy off the wire. I could imagine doing something to make this a non-copy operation, though. (See comment above on zero-copy architecture.)

          I think Andrew's idea of making a simulator is a great idea. Otherwise it's going to mean a ton of work and a subjective evaluation.

          I also want to say that there are few things I would like to improve as much as Thrift performance. Thrift is a cornerstone at Rapleaf, so anything we can do to make it faster is a big win. I am eager and willing to work with anyone who can show me use cases that identify slowness in Thrift so that I can erase the problem.
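A minimal sketch of the "non-null struct with a flag set" workaround Bryan describes for Thrift's no-null-returns rule: the RPC always returns a result struct, and a boolean records whether a value is actually present. The struct and method names are invented for illustration.

```java
public class MaybeResultDemo {
    /** Stand-in for a Thrift result struct: always non-null on the wire. */
    static final class GetResult {
        final boolean found;
        final byte[] value;       // meaningful only when found == true
        private GetResult(boolean found, byte[] value) {
            this.found = found; this.value = value;
        }
        static GetResult of(byte[] value) { return new GetResult(true, value); }
        static GetResult notFound()       { return new GetResult(false, null); }
    }

    /** Server side: never returns null and never throws on a miss. */
    public static GetResult get(String row) {
        if ("existing-row".equals(row)) return GetResult.of("cell".getBytes());
        return GetResult.notFound();   // instead of null or an exception
    }

    public static void main(String[] args) {
        assert get("existing-row").found;
        assert !get("missing-row").found;
    }
}
```

This avoids the exception-per-miss pattern, which (as noted above) can be slow and may close the connection in some client libraries.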

          Andrew Purtell added a comment -

          I'm sitting in the lounge now in Taipei waiting to get on the plane, back on 5/4 with time for this. Pointers on graft points most welcome.

          stack added a comment -

          On 3. above, I asked Bryan (On 1., no thanks to double IDLing, and thanks Chad for digging in on eishay.com).

          @Andrew

          Looks great. Drop the transactional table test from your list; not necessary to our figuring what's better. You going to work on this now, Andrew? I can help w/ where to make the cuts for the graft.

          Andrew Purtell added a comment -

          Chad, I agree that the eishay.com numbers are interesting but are not sufficient to make a decision here. This is why I propose to move ahead with actually trying to put pbufs and Thrift into a running system as replacements for HRPC and see what happens given several use cases. So far we have a suite of repeated runs of:

          • Insert of single BatchUpdate
          • Insert of a batch of BatchUpdate, let's do 1000
          • Fetch single RowResult (equiv to scan with batching of 1 RowResult)
          • Scan with batching of 30 RowResult
          • Scan with batching of 1000 RowResult
          • Transactional table / secondary index transaction

          If someone thinks the above would not capture enough information in some way, please advise.

          Andrew Purtell added a comment -

          On the list Ryan indicates the target is 25-30K ops/sec per machine.

          Chad Walters added a comment -

          25-30k ops/sec total across how many machines?

          ryan rawson added a comment -

          my system currently does 25-30k ops/sec under high load. the current system really cranks through a lot of garbage generation. i would want to see a system that meets several performance characteristics:

          • overall throughput
          • low per request latency
          • low garbage gen
          • works with the 0 copy system currently in place

          to me, performance is the most important thing.

          Chad Walters added a comment -

          > I would set my performance targets at 30-50k ops/sec (what I currently get on my systems).
          > Given the other perf graphs, it's probably a question of "will any serialization be fast enough".
          > It'd be nice to get a kb/op measure of garbage gen activity.

          Ryan, can you elaborate a little further on these numbers? I'd love to understand your application usage patterns a little more fully. Also, when you say "will any serialization be fast enough" what are you comparing against?

          Chad Walters added a comment -

          Frankly I am a little puzzled by a lot of the data posted at that site. Thrift and protobufs are both slower than JSON? More investigation is needed (as I noted on the thrift-dev list).

          Chad Walters added a comment -

          Re 1: Agreed with Andrew – not a good way to go

          Re 2 and 4: Looking at the test code at http://code.google.com/p/thrift-protobuf-compare it's clear that TBinaryProtocol was used for the tests, not TCompactBinaryProtocol. They also used TIOStreamTransport – can't speak to the quality of this transport one way or another, although I gotta think that wrapping it with TFramedTransport would likely reduce some overhead.

          Re 3: No idea. I gotta imagine that some work is needed here since the Java stuff is less mature than the C++ stuff. Is Bryan watching this ticket? He's the guy to ask for sure.

          Re 5: We should look at where we really want nulls and see if this is too onerous.

          Re 6: Let's please not go down the route of custom RPC – let's leverage something reasonable such that the RPC isn't really part of the problem.

          ryan rawson added a comment -

          Scanning isn't always 30 rows small - I personally set my scanner caching to 1000 rows.

          I would set my performance targets at 30-50k ops/sec (what I currently get on my systems). Given the other perf graphs, it's probably a question of "will any serialization be fast enough". It'd be nice to get a kb/op measure of garbage gen activity.

          With the cms with capped parnew as a stock gc setup, we need to keep garbage gen as low as possible.

          Andrew Purtell added a comment -

          Re 1. Stacking pbufs on Thrift means two IDLs, two marshalling/unmarshalling layers, and glue. HBase would be dependent on both Thrift and protobuf toolchains. In my opinion that's not the way to go.

          Re 2. Based on the eishay.com results, Thrift is competitive with protobufs, but I can't say if the new Thrift compact binary protocol was used. I'll ensure this is the case in our tests.

          Re 5. We could do what Chad suggested and handle interfaces within Thrift that return null with a boolean plus optional value.

          Re 6. Nod. I'll move forward with putting in test pbuf and Thrift RPC on the master, regionserver, and clients for supporting insert, get, and scan ops.

          stack added a comment -

          1. Could we do our own thrift protocol, one that uses protobufs so protobufs uses thrift as its rpc?
          2. Protobufs is more compact than the new thrift binary? (Important when keys and values are small)
          3. Anyone comment on the quality of the java rpc in thrift? (Bryan making it nio was the last I heard).
          4. The comparison that Andrew posts is interesting; protobufs does slightly better usually but is way worse doing object creation.
          5. The lack of null is a pain; otherwise, I'd think it wouldn't take much hacking up a thrift IDL to do the client/regionserver back and forth to drop in thrift in place of what we have; it's only a few methods. Absence of null means client code has to be modified to handle whatever the null replacement is (if it's exceptions, that'd make thrift look bad performance-wise).
          6. Regarding opening a new issue to do a raw RPC: wistfully, I'd love it if we didn't have to; it would be cool if we could do this issue instead. I'd rather not have to write our own RPC. (And while on the wishful thinking, ignoring for a sec the issues raised above, I wish we could just use Thrift: it's open, there is expertise to hand, and we'd get help from the community.)

          Andrew Purtell added a comment -

          So should we open another issue for prototyping async I/O into and out of bytebuffers from the IPC handler and use the buffers to back internal data structures as is done currently for storefile I/O? Should 794 be tabled in favor of a raw/custom protocol?

          ryan rawson added a comment -

          we also need to verify exactly how much garbage serialization/deserialization creates. Our initial thought with the KeyValue patches (HBASE-1234) is to use a raw protocol that is totally custom and wouldn't require copying data over and over inside the server. My own tests show that GC overhead is a significant part of HBase; it would be nice to do a data recv() then not copy that and use KeyValue instead of object conversion. Stack notes that in this profiling, a substantial amount of CPU time goes into marshall/demarshall code.
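The zero-copy idea above can be sketched in a few lines. This is purely illustrative: `CellView`, the toy one-byte-length wire format, and the parse loop are hypothetical stand-ins, not actual HBase or KeyValue code. The point is that cells are exposed as (buffer, offset, length) views over the single receive buffer, with no per-cell copying.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// A view into a shared receive buffer: no bytes are copied at parse time.
final class CellView {
    final byte[] buf;   // the shared recv() buffer
    final int offset;
    final int length;

    CellView(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    // Materialize only on demand, e.g. when handing data to user code.
    String asString() {
        return new String(buf, offset, length, StandardCharsets.UTF_8);
    }
}

public class ZeroCopyDemo {
    // Parse a toy wire format: [len][bytes][len][bytes]... (len is one byte).
    static List<CellView> parse(byte[] recvBuf) {
        List<CellView> cells = new ArrayList<>();
        int pos = 0;
        while (pos < recvBuf.length) {
            int len = recvBuf[pos] & 0xFF;
            cells.add(new CellView(recvBuf, pos + 1, len)); // view, no copy
            pos += 1 + len;
        }
        return cells;
    }

    public static void main(String[] args) {
        byte[] wire = new byte[] {3, 'f', 'o', 'o', 4, 'b', 'a', 'r', 's'};
        List<CellView> cells = parse(wire);
        System.out.println(cells.get(0).asString());  // foo
        System.out.println(cells.get(1).asString());  // bars
        System.out.println(cells.get(1).buf == wire); // true: same backing array
    }
}
```

The trade-off is lifetime management: the views pin the whole receive buffer, so the server must track when every view has been consumed before the buffer can be reused.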

          Andrew Purtell added a comment -

          Interesting head to head testing results here:

          http://www.eishay.com/2009/03/more-on-benchmarking-java-serialization.html

          But I want to retest using the latest trunk from both the protobuf and Thrift trees (I have that also), using Bryan's recommendations for Thrift: TCompactProtocol, FramedTransport, and THsHaServer. Also, comparisons for the HBase anticipated use case of first class integration, which in this context means all IPC/RPC between master, regionservers, and clients, not just as a client access option:

          • Insert of single BatchUpdate
          • Insert of a batch of BatchUpdate
          • Fetch single RowResult (equiv to scan with batching of 1 RowResult)
          • Scan with batching of 30 RowResult
          • Transactional table / secondary index transactions

          This is why this issue lingers. We should either make a simulator or do direct addition of test code on the master, regionservers, and client library that supports real actions. My feeling is ultimately the latter option is the better one. Recently I had been waiting for all of the architectural changes to the regionserver – e.g. KeyValue – to settle, and have otherwise not had the available personal time. Both of those considerations have now changed.

          Chad Walters added a comment -

          >I just wish that compact binary protocols wasn't a late addition. It doesn't give me
          >confidence by the late addition of what should have been there from day one.

          I think you are assuming a lot of things with this statement. I suggested testing both protocols out – the performance characteristics are different, not necessarily monotonically better one way or another. It's not like the original Thrift writers didn't consider more compact protocols – they were just optimizing for a particular performance space. At least they had the foresight to allow for different protocol implementations (albeit not without some restrictions and wrinkles). I suspect that you might not find much difference between the two for HBase's needs but I am curious to see what you come up with.

          >Perhaps you can outline exactly how you can work around lack of NULLs in thrift
          >that does not involve exceptions? I think it should not be required to use exceptions
          >during non-exceptional events - many systems have lower performance under thrown
          >exceptions, and the code path is complex.

          Haven't thought about it too deeply and it's late on Friday so here are some extremely off-the-cuff answers:

          1. Return a struct that has two fields – a required boolean and an optional field of the object type. Certainly not ideal but could be used in the hopefully limited cases where you really want to be able to return null sometimes.

          2. Return a flag object somehow outside the range of valid responses that indicates null. Not possible in all cases if all values are valid but certainly viable in some.

          I am sure we could come up with more if need be. That said, and like I said before, please go ahead and contribute a "nullable" type annotation for Thrift.
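Workaround 1 above, as it would look to client code: a minimal sketch in Java, assuming a generated-style result struct with a required `present` flag and an optional value. `MaybeRow` and its field names are hypothetical, not part of any real Thrift IDL or the HBase API; a thin client shim maps the flag back to an ordinary null.

```java
public class NullableDemo {
    // Stand-in for what a generated Thrift result struct might look like.
    static final class MaybeRow {
        final boolean present;  // required field in the IDL
        final String row;       // optional field, only meaningful if present

        MaybeRow(boolean present, String row) {
            this.present = present;
            this.row = row;
        }
    }

    // Server side: never returns a bare null (or throws) over the wire.
    static MaybeRow get(String key) {
        if ("exists".equals(key)) {
            return new MaybeRow(true, "row-data");
        }
        return new MaybeRow(false, null); // signals "not found" without an exception
    }

    // Client-side shim: unwrap the flag back into an ordinary null.
    static String unwrap(MaybeRow r) {
        return r.present ? r.row : null;
    }

    public static void main(String[] args) {
        System.out.println(unwrap(get("exists")));  // row-data
        System.out.println(unwrap(get("missing"))); // null
    }
}
```

Compared to the exception approach, a missing key stays on the ordinary return path, which avoids the throw/catch cost ryan is concerned about.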

          ryan rawson added a comment -

          one last comment, protobuf has an RPC framework, it just requires one to provide a transport.

          ryan rawson added a comment -

          Perhaps you can outline exactly how you can work around lack of NULLs in thrift that does not involve exceptions? I think it should not be required to use exceptions during non-exceptional events - many systems have lower performance under thrown exceptions, and the code path is complex.

          I just wish that the compact binary protocol wasn't a late addition. The late addition of what should have been there from day one doesn't inspire confidence.

          Chad Walters added a comment -

          I am biased here since I have worked on and backed Thrift extensively but I am in favor of sticking with Thrift.

          I'll grant you that Thrift isn't as mature as protocol buffers but it has a lot of useful features that protocol buffers lacks. Protocol buffers isn't a complete enough solution for my tastes – no RPC, lacks many languages that Thrift supports, etc.

          One of the great efficiencies of C++ is the fact that you can have objects as values instead of references. You probably won't get much traction suggesting that that be pulled out of Thrift altogether, although you could probably argue for the addition of type annotations that would allow for C++ to have references instead of values for some fields. Then you could introduce nulls. This is actually a feature I would very much like to see in Thrift. The lack of nulls can be worked around in other ways as well.

          Definitely follow Bryan's suggestion of testing out the CompactBinaryProtocol as well as the default BinaryProtocol.

          ryan rawson added a comment -

          +1 on protobufs. Generally:

          • extremely stable, used extensively within google
          • well tested, good test suite, etc.

          One small aspect I dislike about Thrift RPC is the inability to return null objects. This means that the thrift protocol for hbase must throw exceptions when a key is not found. For clients that don't like that, they have to write wrappers around the thrift API. I looked into it, the reason is the C++ binding returns objects, not pointers, thus you cannot have a 'null' object. Obviously the fix is to change the C++ API, but that seems like a bit much within the context of 'fix HBASE's RPC method'.

          We absolutely require a non-versioned RPC protocol. We need to be able to piecemeal upgrade our clusters. I would like to avoid ever using bin/stop-hbase.sh moving forward.

          Andrew Purtell added a comment -

          No release depends on this issue, which is advisory only. I will get this in for the 0.20 timeframe so we can settle the question, but removed "fix version".

          Andrew Purtell added a comment -

          Yes I can get this done in that time.

          stack added a comment -

          Shall we move these out of 0.20.0 Andrew? You think they'll be done in next week or two?

          Andrew Purtell added a comment -

          From Ryan Rawson on hbase-user@:
          > Avro isnt language independent yet.

          Andrew Purtell added a comment -

          Not yet. I updated to Thrift trunk the other day and put together a stripped down service definition for bulk benchmarking because I see THRIFT-110 went in.

          I've been holding off due to private communication that Cutting had something in the works. Now I think we can proceed.

          Bryan Duxbury added a comment -

          Did you guys ever experiment with Thrift to any interesting outcome?

          Andrew Purtell added a comment -

          AVRO may be an option.

          From https://issues.apache.org/jira/browse/HBASE-1015?focusedCommentId=12695265&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12695265 :

          http://people.apache.org/~cutting/avro.git/
          
          To learn more:
          
          git clone http://people.apache.org/~cutting/avro.git/ avro
          cat avro/README.txt
          
          Bryan Duxbury added a comment -

          THRIFT-110 contains a patch that has a beta version CompactProtocol implementation. If you want to be more "official" you can use TBinaryProtocol, but that takes up noticeably more space on the wire.
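The wire-space difference comes largely from varint encoding of integers: TBinaryProtocol writes every i32 as a fixed 4 bytes, while a compact varint spends 7 payload bits per byte, so small values shrink. The sketch below is a generic unsigned-varint size calculation for illustration, not Thrift's actual TCompactProtocol implementation (which also zigzag-encodes signed values).

```java
public class VarintDemo {
    // Number of bytes an unsigned varint needs for the given 32-bit value.
    static int varintSize(int value) {
        long v = value & 0xFFFFFFFFL; // treat as unsigned
        int size = 1;
        while (v >= 0x80) {           // 7 payload bits per byte
            v >>>= 7;
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(varintSize(1));       // 1 byte, vs 4 for a fixed i32
        System.out.println(varintSize(300));     // 2 bytes
        System.out.println(varintSize(1 << 28)); // 5 bytes: large values can cost more
    }
}
```

So for the small keys, lengths, and field ids that dominate HBase traffic, the compact encoding wins; only values near the top of the 32-bit range pay an extra byte.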

          Carlos Valiente added a comment -

          > Close - use TCompactProtocol, FramedTransport, and THsHaServer (since you probably want nonblocking with multithreading).

          'grep -r CompactProtocol .' does not return anything on my Thrift source tree - Which protocol is that, Bryan?

          Andrew Purtell added a comment -

          Thanks Bryan.

          Bryan Duxbury added a comment -

          Close - use TCompactProtocol, FramedTransport, and THsHaServer (since you probably want nonblocking with multithreading).

          Andrew Purtell added a comment -

          So for Thrift, that's DenseProtocol with FramedTransport for transport and NonblockingServer on the HBase side, got it.

          Bryan Duxbury added a comment -

          Don't use buffered transport - use FramedTransport. More efficient, and dovetails with the NonblockingServer implementation in Thrift, if you decide to use that.

          stack added a comment -

          And don't forget buffered transport when checking thrift.

          Andrew Purtell added a comment -

          Need a bake-off between Thrift and Google protobufs. Evaluate against several criteria:

          • performance
          • wire format economy
          • ease of use / expressiveness of IDL
          • extensibility, future proofing
          • support for binary data

          Use the latest SVN checkout from both projects and also apply the patch for THRIFT-110 (TDenseProtocol) before evaluating Thrift.

          Andrew Purtell added a comment -

          Up on IRC stack mentions Hadoop work in this area for their 0.20.0. Investigate.

          Andrew Purtell added a comment -

          Refactored issue to make it more general in focus.


            People

            • Assignee:
              Unassigned
              Reporter:
              Andrew Purtell
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development