Thanks for the feedback, Aaron. Here are my thoughts on the questions you raised:
- Ease of Deployment:
- Will Sqoop have an embedded server to support localhost execution? This is certainly a possibility. I do feel that it is more of a packaging concern than a design concern, though. For example, BigTop should be able to easily package the system with, say, Tomcat or other middleware.
- ...will Sqoop still support the ability to use the existing 'ad hoc' connection mechanism? Absolutely. The idea of connections as first class objects is a prerequisite for tighter security. But nothing stops a user from deploying Sqoop in a mode where security is not enabled, or where the operator has admin privileges as well.
- Command Line Backward compatibility: Will Sqoop2 be backwards-compatible with these arguments? My inclination is to not be backward compatible, in order to control overall implementation complexity. To a certain degree we could preserve the same command line interface if that is a critical requirement, so you would be able to run with most Sqoop 1.x standard command line options. But there are things that are interpreted differently by different connectors, and in order to be fully backward compatible the connectors too would have to be backward compatible and respect the same semantics as before. There is no harm in doing so, except that the overall burden on the new implementation would put a drag on its progress towards further improvement. I would prefer that Sqoop 2 not be required to be backward compatible with the current implementation, as long as there is an easy migration path from the previous system to the new one.
- Metadata store: How and where does Sqoop store information about Connections, resource limits, etc? Even though the writeup does not talk about this, I imagine having a pluggable store interface that is backed by an embedded Derby database by default. Keeping the store pluggable will allow Sqoop to integrate with HCatalog when it is ready for production.
- How, if at all, do we guard against end-users starting a second Sqoop server to get around resource limits? We should provide an implementation that uses the metadata store to manage resource limits etc. As you point out, this is easy to bypass if the user has access to connection information: they can set up a new instance pointing to a different metastore and so violate these restrictions. But that is no different from abusing the resources outside of Sqoop by directly running programs/sessions against the database. Such use/abuse is beyond the scope of the security implementation in Sqoop IMO.
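To make the pluggable store idea above a bit more concrete, here is a rough sketch. The writeup does not define any of this, so every name here (MetadataStore, InMemoryMetadataStore, the method names, the example URL) is made up for illustration. The in-memory class only stands in for what would be a Derby-backed implementation.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface: none of these names come from the Sqoop 2 writeup.
interface MetadataStore {
    void saveConnection(String name, String jdbcUrl);

    Optional<String> findConnection(String name);
}

// In-memory stand-in. A Derby-backed (or later HCatalog-backed) class
// would implement the same interface and be swapped in by configuration.
final class InMemoryMetadataStore implements MetadataStore {
    private final Map<String, String> connections = new ConcurrentHashMap<>();

    @Override
    public void saveConnection(String name, String jdbcUrl) {
        connections.put(name, jdbcUrl);
    }

    @Override
    public Optional<String> findConnection(String name) {
        return Optional.ofNullable(connections.get(name));
    }
}

final class MetadataStoreDemo {
    public static void main(String[] args) {
        // Callers depend only on the interface, not on the backing store.
        MetadataStore store = new InMemoryMetadataStore();
        store.saveConnection("sales-db", "jdbc:mysql://db.example.com/sales");
        System.out.println(store.findConnection("sales-db").orElse("not found"));
        System.out.println(store.findConnection("hr-db").orElse("not found"));
    }
}
```

The point of the sketch is only that the rest of the system talks to the interface, so changing the backing store should not ripple into other modules.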
I also don't believe that it's productive for the command-line client to use the REST API directly. Starting a server (even on localhost) as a pre-req for running a command-line tool seems overly complicated to me.
I agree that there are differences in how one uses a tool vs how one uses a service. Services have the added burden of being managed and monitored, whereas tools are usually controlled entirely by the user. Once the service is started/available, though, using a service becomes far easier than using a tool. The end-user does not have to worry about classpath details or making sure that they have the correct drivers installed. The client provides a thin facade to access the service, can be run from anywhere, and on its own does not require any management. This generally scales very well compared to a heavy client that requires individual installations to be managed.
I think a better architecture may be to define a number of Operations internally. Each Operation can have a programmatic (Java) API that executes it. Each Operation can also be bound to a REST API endpoint. But this way a user can still simply run the command-line application without configuring an entire server. The command-line app would run the Operation directly, as opposed to running it in the address space of a separate process somewhere. This would reduce the number of layers of complexity when debugging what goes wrong. Involving the network (even loopback) where none is needed seems like asking for trouble.
I think that underneath the covers the logic of most of Sqoop 2 will indeed be implemented as operations that can be invoked without needing a web based service, for testing purposes. The difference is that it won't be that way for the packaged system, which will be wired to work in a service model. Testability is certainly a core requirement for any system, and any implementation that does not lend itself to this is deficient. Given that, I don't think debugging would be that much more difficult than it is right now when, say, debugging MR applications.
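As a rough illustration of what I mean, the sketch below shows one operation that can be called directly from a test and is also reachable through a service-style routing layer. All of the names (Operation, ListTablesOperation, OperationRouter, the "/v1/tables" path) are hypothetical, and the router is a plain map standing in for a real REST layer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: an Operation is plain Java logic with no knowledge of HTTP.
interface Operation {
    String execute(Map<String, String> params);
}

final class ListTablesOperation implements Operation {
    @Override
    public String execute(Map<String, String> params) {
        // Placeholder logic; a real operation would query the database.
        return "tables for " + params.getOrDefault("connection", "<none>");
    }
}

// Stand-in for the service layer: maps endpoint paths to operations.
final class OperationRouter {
    private final Map<String, Operation> routes = new HashMap<>();

    void bind(String path, Operation operation) {
        routes.put(path, operation);
    }

    String handle(String path, Map<String, String> params) {
        Operation operation = routes.get(path);
        if (operation == null) {
            throw new IllegalArgumentException("No operation bound to " + path);
        }
        return operation.execute(params);
    }
}

final class OperationDemo {
    public static void main(String[] args) {
        Map<String, String> params = Map.of("connection", "sales-db");

        // Direct invocation, as a unit test would do it.
        System.out.println(new ListTablesOperation().execute(params));

        // The same logic reached through the service-style routing layer.
        OperationRouter router = new OperationRouter();
        router.bind("/v1/tables", new ListTablesOperation());
        System.out.println(router.handle("/v1/tables", params));
    }
}
```

Both calls run the same code, so anything that can be debugged through the service can also be reproduced in a test without the service in the way.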
Finally, on the front of API compatibility: Arvind, in an offline discussion, we talked about having a separate API package of interfaces that would have "api level" versioning (a la the Servlet API) that is distinct from the implementation version. Is that still part of your vision for Sqoop 2? I don't see it described in this proposal.
Thanks for pointing that out - yes it is. For those who were not part of our offline discussion, the summary is that Sqoop 2 will expose a versioned REST API that would automatically bind to different clients. So technically you could upgrade Sqoop 2 to Sqoop 2.5 etc, which may have a new API, but the old clients will continue to work as is. The only caveat is that we may not be able to retrofit it to support Sqoop 1.x, based on the discussion above.
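One way this could work, sketched very roughly: the server keeps a handler per API version, and each client states which version it speaks. This is only an assumption about the mechanism, not something we have settled; VersionedApi, the version numbers, and the response strings are all invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch: one handler per API version, all served by the same server build.
final class VersionedApi {
    private final Map<Integer, Function<String, String>> handlers = new TreeMap<>();

    void register(int apiVersion, Function<String, String> handler) {
        handlers.put(apiVersion, handler);
    }

    String call(int apiVersion, String jobName) {
        Function<String, String> handler = handlers.get(apiVersion);
        if (handler == null) {
            throw new IllegalArgumentException("Unsupported API version: " + apiVersion);
        }
        return handler.apply(jobName);
    }
}

final class VersionedApiDemo {
    public static void main(String[] args) {
        VersionedApi api = new VersionedApi();
        // Version 1 stays registered after the upgrade, so old clients keep working.
        api.register(1, job -> job + ":RUNNING");
        // Version 2 adds detail without changing what version 1 callers see.
        api.register(2, job -> job + ":RUNNING:progress=40%");

        System.out.println(api.call(1, "import-orders"));
        System.out.println(api.call(2, "import-orders"));
    }
}
```

The API version here is deliberately independent of the server release number, which is the Servlet-API-like property Aaron asked about.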
I looked through the proposed source layout for this. Without a README specifying what goes in which directories, it's hard for me to understand what you're trying to accomplish. What's the "infra" project for?
The infra project would be the Sqoop infrastructure. We could name it "core" or "arch" or other commonly used names. The purpose of this project is to be able to define the core system architecture and design which gets used by other modules where necessary.
I think based on what I said above about Operations, etc, there should be a "libsqoop" project that corresponds to the guts of the project. The "server" should just be a REST API implementation (perhaps w/ an embedded Jetty server, but also perhaps deployable as a WAR on a fully-administered Tomcat instance) that embeds libsqoop to perform the Operations. And the client, similarly, is a thin command-line-arg parsing shell that embeds libsqoop to perform Operations directly.
I believe the infra module is what we are talking about here. I am hesitant to give it a name that suggests it is a library since there will be a bit of logic in dealing with extensions, job lifecycle, and other operational details which actively define the overall functioning of the system. Effectively though, it will still do the same thing as what you have suggested for libsqoop.
Is infra ~= libsqoop in this idea? Or is that about independent testing of connectors, etc?
Yes - infra ~ libsqoop.
I think there should also be a plugin-api library (libsqoopapi?) which the connector/*/ projects link against, rather than libsqoop itself. This API would also be used by third-party SqoopTool implementations.
Good suggestion - we can have a separate module for the Sqoop extension API. It probably belongs in the connection/api module.
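Roughly, the split could look like the sketch below. The comments mark which module each piece would live in. The names (SqoopConnector, MySqlConnector, ConnectorRegistry) are placeholders I made up, and explicit registration is used here only to keep the example self-contained; how connectors actually get discovered is still open.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// --- Would live in the extension API module (hypothetical names). ---
interface SqoopConnector {
    String name();

    boolean supports(String jdbcUrl);
}

// --- Would live in a connector project, compiled against the API module only. ---
final class MySqlConnector implements SqoopConnector {
    @Override
    public String name() {
        return "mysql";
    }

    @Override
    public boolean supports(String jdbcUrl) {
        return jdbcUrl.startsWith("jdbc:mysql:");
    }
}

// --- Would live in infra: it sees connectors only through the API interface. ---
final class ConnectorRegistry {
    private final Map<String, SqoopConnector> connectors = new LinkedHashMap<>();

    void register(SqoopConnector connector) {
        connectors.put(connector.name(), connector);
    }

    SqoopConnector resolve(String jdbcUrl) {
        for (SqoopConnector connector : connectors.values()) {
            if (connector.supports(jdbcUrl)) {
                return connector;
            }
        }
        throw new IllegalArgumentException("No connector supports " + jdbcUrl);
    }
}

final class ConnectorDemo {
    public static void main(String[] args) {
        ConnectorRegistry registry = new ConnectorRegistry();
        registry.register(new MySqlConnector());
        System.out.println(registry.resolve("jdbc:mysql://db.example.com/sales").name());
    }
}
```

With this shape, a third-party connector needs only the API module on its compile classpath and never links against infra directly.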