I'm hijacking this thread for the description (as opposed to the title). Let's start thinking about a high-performance, secure transport for Avro.
Here's a dump of my current thoughts on this topic, after reading up a bit on SASL, and reading through some of the Hadoop security patches.
First off, we should probably call this a "protocol". It's a bit tricky, since we've already got a notion of Avro protocols, but "transport" reminds people of http://en.wikipedia.org/wiki/Transport_Layer, i.e., UDP vs TCP, and that's not what we're discussing here. (On the TCP vs UDP front, let's focus our efforts first on a TCP protocol. There might be a lot of value in having a UDP protocol as well, but it's clear that we'll need a TCP one.)
It's a bit meta, but I'd like us to consider describing Avro's protocol in terms of (and here the terminology falls down) an Avro protocol, or at least in terms of Avro records. Instead of saying "and then there shall be a long, encoded like so, and then it shall be followed by that many bytes", we should just say "and then we shall receive a record with the following schema". We already do so in part, and I think that's the right direction. I think it will make the description of the protocol clearer, and it will let implementations re-use some schema functionality. (I think implementations should use the most type-safe APIs they have available to them, but, hey, that's by definition an implementation detail.)
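To make the "describe it as a record" idea concrete, here is a minimal sketch in Python. The record name `Command` and its fields `id` and `payload` are hypothetical, invented only for illustration. The two encoders follow Avro's binary encoding for `long` (zig-zag, then variable-length) and `bytes` (a long length followed by the data); the point is that the schema alone determines the bytes on the wire, so the protocol description never has to spell them out.

```python
# Hypothetical record schema for one protocol message (names are assumptions).
COMMAND_SCHEMA = {
    "type": "record",
    "name": "Command",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "payload", "type": "bytes"},
    ],
}


def encode_long(n):
    """Encode a 64-bit signed integer as an Avro long (zig-zag + varint)."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes stay small
    out = bytearray()
    while z & ~0x7F:                  # more than 7 bits left
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data):
    """Encode Avro bytes: length as a long, then the raw data."""
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "bytes": encode_bytes}


def encode_record(schema, datum):
    """Encode a record by concatenating its fields in schema order."""
    return b"".join(
        ENCODERS[field["type"]](datum[field["name"]])
        for field in schema["fields"]
    )
```

A real implementation would of course use its existing schema-driven encoder rather than this toy one, which only handles two primitive types.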
In terms of the "primitives", here's what I can think of:
- CALL: this is the work-horse of the RPC, analogous to http://hadoop.apache.org/avro/docs/1.2.0/spec.html#Call+Format. If we decide to do schema resolution at the handshake level, we would do it here. Returns the response. May throw AuthenticationRequired.
- AUTHENTICATE: this is the command for authentication. SASL sometimes requires a back and forth (until it's "done"); we'd put the hooks for all of that here.
- DISCOVER: Asks the server for information about itself. Specifically, servers may tell clients what protocols they support. This may throw AuthenticationRequired or return nothing, if the server wants to be cagey. This is in some sense similar to FB303: https://svn.apache.org/repos/asf/incubator/thrift/trunk/contrib/fb303/if/fb303.thrift . In a friendly environment, a server could tell you who's running it (a username), what machine it's on, and arbitrary key/value statistics.
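The three primitives above can be sketched as a toy dispatcher. This is only an illustration of the control flow: the class `SketchServer`, the payload field names, and the single-token stand-in for SASL are all assumptions, not an existing Avro API. A real AUTHENTICATE would loop through as many SASL round trips as the mechanism needs.

```python
class AuthenticationRequired(Exception):
    """Raised when a command arrives before authentication has completed."""


class SketchServer:
    """Toy dispatcher for CALL, AUTHENTICATE and DISCOVER (hypothetical)."""

    def __init__(self, protocols, require_auth=True):
        # protocols: protocol name -> {message name -> callable}
        self.protocols = protocols
        self.require_auth = require_auth
        self.authenticated = False

    def handle(self, command, payload):
        if command == "AUTHENTICATE":
            return self._authenticate(payload)
        if self.require_auth and not self.authenticated:
            raise AuthenticationRequired(command)
        if command == "DISCOVER":
            # A cagey server could return nothing here instead.
            return {"protocols": sorted(self.protocols)}
        if command == "CALL":
            handler = self.protocols[payload["protocol"]][payload["message"]]
            return handler(*payload["args"])
        raise ValueError(f"unknown command: {command}")

    def _authenticate(self, payload):
        # Stand-in for a SASL exchange: one shared token completes it.
        self.authenticated = payload.get("token") == "secret"
        return {"done": self.authenticated}
```

For example, `SketchServer({"Echo": {"echo": lambda s: s}})` rejects a DISCOVER with AuthenticationRequired until an AUTHENTICATE with the right token has succeeded.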
We absolutely need to support piggy-backing of commands. One way to do that is for clients to simply be able to send multiple commands in a row, without waiting for the response. Another is to let commands include subcommands.
We need to support out-of-order responses and "one way" (don't wait for a response) commands.
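One common way to get pipelining, out-of-order responses, and one-way commands at once is to tag every request with an id and match responses against a table of pending callbacks. The sketch below assumes that approach; the class `PendingCalls` and the dict-shaped frames are hypothetical and stand in for whatever record schema the protocol ends up using.

```python
import itertools


class PendingCalls:
    """Client-side bookkeeping for pipelined requests (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._callbacks = {}

    def send(self, message, on_response=None):
        """Build a request tagged with a fresh id.

        Passing on_response=None marks the request as one-way: nothing is
        recorded, so no response is expected.
        """
        request_id = next(self._ids)
        if on_response is not None:
            self._callbacks[request_id] = on_response
        return {
            "id": request_id,
            "one_way": on_response is None,
            "message": message,
        }

    def receive(self, response):
        """Route a response to its callback, in whatever order it arrives."""
        callback = self._callbacks.pop(response["id"])
        callback(response["result"])

    def outstanding(self):
        """Number of requests still waiting for a response."""
        return len(self._callbacks)
```

The client can call `send` several times back to back, which is the piggy-backing case, and the server is free to answer the second request before the first.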
We still need to do framing. Also, SASL requires that all bytes after the successful SASL authentication are wrapped by SASL, so servers and clients need to have a state machine that understands that, and wraps appropriately. (We could maybe have avoided framing if we supported framing directly in Avro's string primitive type, like we do in Avro's map type, by having a negative string length indicate a string that is continued.)
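A minimal framing sketch, assuming length-prefixed buffers: each buffer is a four-byte big-endian length followed by that many bytes, and a zero-length buffer ends the message. The `wrap` and `unwrap` parameters are hypothetical hooks marking where a SASL security layer would plug in once authentication has completed; before that point they stay as the identity function.

```python
import struct


def frame(message, max_buffer=8192, wrap=lambda b: b):
    """Split a message into length-prefixed buffers, ending with a 0 length."""
    out = bytearray()
    for start in range(0, len(message), max_buffer):
        chunk = wrap(message[start:start + max_buffer])
        out += struct.pack(">I", len(chunk)) + chunk
    out += struct.pack(">I", 0)       # zero-length buffer terminates the message
    return bytes(out)


def unframe(data, unwrap=lambda b: b):
    """Read one framed message; return (message, number of bytes consumed)."""
    parts = []
    offset = 0
    while True:
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        if length == 0:
            return b"".join(parts), offset
        parts.append(unwrap(data[offset:offset + length]))
        offset += length
```

The state machine mentioned above then amounts to swapping the identity hooks for the SASL wrap and unwrap functions at the moment the exchange reports that it is done.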
Finally, we need to think hard about how to version this protocol itself. It's appealing to be able to add commands in the future ("oneway" is an example) or to enrich the response of commands like "DISCOVER". It's noteworthy that text-based protocols like IMAP have had little trouble extending themselves to stuff like SASL, because they could just augment what existing commands did. (RFC 4959 is pretty short.) A simple approach would be to bootstrap it by sending hash(avro protocol schema), and doing much like we do with calls right now.
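The hash-based bootstrap could look roughly like this. Everything here is a sketch under assumptions: the choice of MD5, the sorted-key JSON form used before hashing, the class `HandshakeServer`, and the status names `KNOWN`, `SEND_PROTOCOL`, and `LEARNED` are all made up for illustration. The idea is that a client normally sends only the hash of the protocol version it speaks, and sends the full protocol only when the server has not seen that hash before.

```python
import hashlib
import json


def protocol_hash(protocol):
    """Hash a protocol description (MD5 over sorted-key JSON; an assumption)."""
    canonical = json.dumps(protocol, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).digest()


class HandshakeServer:
    """Remembers every protocol version it has been shown (hypothetical)."""

    def __init__(self, own_protocol):
        self.own_protocol = own_protocol
        self.known = {protocol_hash(own_protocol): own_protocol}

    def handshake(self, client_hash, client_protocol=None):
        if client_hash in self.known:
            return {"status": "KNOWN"}
        if client_protocol is None:
            # Unknown hash: ask the client to retry with the full protocol.
            return {"status": "SEND_PROTOCOL"}
        self.known[protocol_hash(client_protocol)] = client_protocol
        # Reply with our own version so the client can resolve differences.
        return {"status": "LEARNED", "server_protocol": self.own_protocol}
```

With both versions in hand, each side can see which commands the other understands, so a newer command like "oneway" can be added without breaking older peers.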
Anyway, that's where I am right now. Looking forward to more discussion.