SQOOP-365

Proposal for next major revision of Sqoop.

    Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This issue tracks the design and development of the next major revision of Sqoop. The proposal has been articulated on the wiki at the following location:

      https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2

      Please review the proposal and add your comments to this JIRA.

      Attachments

      • sqoop2.tar.gz (7 kB, attached by Arvind Prabhakar)


          Activity

          Kathleen Ting added a comment -

          Please RSVP to kathleen at apache dot org if you would like to attend the weekly scheduled conference call this Wednesday (5/23) at 9am PDT to discuss the current state of Sqoop 2 and the work that is going on. Once you RSVP, you'll receive the call details. Thanks.

          Arvind Prabhakar added a comment -

Thanks to Bilung for checking in the client shell code. More information at the following page: Sqoop2 - Shell Client

          Arvind Prabhakar added a comment -

Updates since the last meeting: Sqoop 2 Manual Setup, Resource Layout

          Arvind Prabhakar added a comment -

Here is a message that was just sent to the dev list. Please RSVP to arvind at apache dot org if you would like to attend.

          Dear Sqoop Developers,
          
          We will be hosting a conference call next Wednesday (5/9) 
          at 9am PDT to discuss the current state of Sqoop 2 and 
          the work that is going on. This call will be a good starting 
          point for those who are eager to understand the current 
          structure of the code and how we are planning on developing 
          it further. Please RSVP to arvind at apache dot org (without 
          replying to the list) if you would like to attend the call. 
          
          I will send out the call details to those who RSVP by Tuesday
          5/8 9am PDT. It will be great if you could also include any 
          specific questions/comments regarding Sqoop 2 that you would 
          like to discuss on the call and we can budget the time for the
          call accordingly. Once the call is over, we will be publishing 
          the notes from it on the project wiki.
          
          Thanks,
          Arvind Prabhakar
          
          Arvind Prabhakar added a comment -

Venkatesh, thanks for prodding on this. I will be adding a wiki page soon to detail how the branch can be built and tested. I also intend to host a meeting to go over what has been developed so far and where we can benefit from contributions. Stay tuned.

          Venkatesh Seetharam added a comment -

          Any update on this?

          Ken Krugler added a comment -

          The three biggest challenges we ran into using Sqoop were:

          1. Needing better control over the connection process, e.g. being able to set the connection timeout (see the sketch at the end of this comment).

          2. Password management. Since passwords can be managed via a separate system, making it easy to integrate with such a system would be critical for at least one of our clients. The key is avoiding passwords on the command line, in logging, and being sent in the clear to slaves in the cluster.

          3. Synchronization with third party plugins (primarily OraOop). As new functionality is added to core Sqoop, currently it seems like there's a significant burden placed on the plug-in developer to keep in sync, or to detect/handle the existence (or not) of some new functionality. Improving ease of extension will hopefully also address issues with ease of compatibility.

          – Ken
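
          (For illustration, a minimal Java sketch of the timeout control point 1 asks for. DriverManager.setLoginTimeout is standard JDBC; the connectTimeout/socketTimeout property names are specific to MySQL Connector/J, and the URL, credentials, and values are placeholders, not anything Sqoop provides today.)

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.util.Properties;

          public class ConnectTimeoutSketch {
            public static void main(String[] args) throws Exception {
              // Global JDBC login timeout, in seconds; honored by most drivers.
              DriverManager.setLoginTimeout(30);

              // Driver-specific fine-grained control via connection properties
              // (MySQL Connector/J names shown; both values are in milliseconds).
              Properties props = new Properties();
              props.setProperty("user", "sqoop");
              props.setProperty("password", "secret"); // see point 2: better read from a secured store
              props.setProperty("connectTimeout", "10000");
              props.setProperty("socketTimeout", "60000");

              // Requires the MySQL JDBC driver on the classpath.
              try (Connection conn =
                       DriverManager.getConnection("jdbc:mysql://db-host/db", props)) {
                System.out.println("Connected to " + conn.getMetaData().getURL());
              }
            }
          }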

          David Robson added a comment -

It would be great if you could run a Sqoop export directly from an HQL statement. For example, if I want to create a summary of data in Hadoop and then load it into a database for reporting, to use Sqoop I would have to create a temporary table in Hive and copy this across. Currently Quest provides this functionality with Data Transporter for Hive, but it would be nice if this was in Sqoop so users did not have to go to another tool or write extra steps of creating a temporary table. This seems like the main gap in Sqoop functionality: with import you can specify a where clause to filter data, and you can import into Hive, but then you can't export the Hive query results back out.
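
          (As a concrete illustration of the workaround described above, a hypothetical Java sketch: materialize the HQL result into a temporary Hive table over JDBC, then export that table's warehouse directory with Sqoop 1's export tool via org.apache.sqoop.Sqoop.runTool. The URLs, table names, and paths are placeholders, and the Hive JDBC driver is assumed to be on the classpath.)

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.Statement;

          public class ExportHqlResultSketch {
            public static void main(String[] args) throws Exception {
              // Step 1: create a temporary Hive table holding the query result.
              try (Connection hive = DriverManager.getConnection(
                       "jdbc:hive2://hive-host:10000/default", "user", "");
                   Statement stmt = hive.createStatement()) {
                stmt.execute("CREATE TABLE tmp_summary AS "
                    + "SELECT region, SUM(sales) AS total FROM sales GROUP BY region");
              }

              // Step 2: export the temporary table's HDFS directory to the
              // reporting database. Hive's default field delimiter is \001.
              int ret = org.apache.sqoop.Sqoop.runTool(new String[] {
                  "export",
                  "--connect", "jdbc:mysql://db-host/reports",
                  "--table", "summary",
                  "--export-dir", "/user/hive/warehouse/tmp_summary",
                  "--input-fields-terminated-by", "\001"
              });
              System.exit(ret);
            }
          }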

          Arvind Prabhakar added a comment -

          Branch created with basic build structure. Initial code drop expected in a few days.
          https://svn.apache.org/repos/asf/incubator/sqoop/branches/sqoop2/

          Aaron Kimball added a comment -

          +1

          Jarek Jarcec Cecho added a comment -

Creating an experimental branch for Sqoop 2 is fine with me.

          Arvind Prabhakar added a comment -

It seems that the conversation on this thread has wound down. I therefore propose that we create a separate branch to do a proof-of-concept implementation for Sqoop2. Once we have some critical code in that branch, we can discuss whether it is worth taking forward as trunk.

          Unless I hear any comments on this, I will go ahead and create the branch in the next 24 hours.

          Aaron Kimball added a comment -

@Arvind, these responses all make sense. Thank you for clarifying this.

          @Rob I'd point out that "there is no risk of version conflicts" is perhaps optimistic. The client still has to speak a subset of the REST API that the server supports. As I read the spec, it's still an open question as to how the REST API will be versioned (if at all?) and what the plan is for the REST server to support older versions of its protocol.

          Arvind Prabhakar added a comment -

@Brock - your preference is noted. Please see my response to the same question from Aaron above. In summary, I would rather preserve functional backward compatibility than exact interface compatibility.

          That said, it is a good idea and patches are always welcome!

          Brock Noland added a comment -

          I'd like to see a mostly backwards compatible command line client.

          Rob Weltman added a comment -

          Having only thin REST clients and instead keeping all functionality in a server has many benefits beyond those listed in https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2 and in Arvind's comment above. It is easy to invoke from any programming language. It is easier to debug because all logging and all execution is in one place. There is no risk of version conflicts between different clients (the only version that matters is the server's). Auditing (in addition to authentication and authorization which were already discussed) is much easier. Much less painful to upgrade in environments where there are multiple clients with different owners.

          Arvind Prabhakar added a comment -

Thanks for the feedback, Aaron. Here are my thoughts on the questions you raised:

          • Ease of Deployment:
            • Will Sqoop have an embedded server to support localhost execution? This is certainly a possibility. I do feel that it is more of a packaging concern, though, and not a design concern. For example, BigTop should be able to easily package the system with, say, Tomcat or other middleware.
            • ...will Sqoop still support the ability to use the existing 'ad hoc' connection mechanism? Absolutely. The idea of connections as first-class objects is a prerequisite for tighter security. But nothing stops a user from deploying Sqoop in a mode where security is not enabled, or where the operator has admin privileges as well.
          • Command Line Backward Compatibility: Will Sqoop2 be backwards-compatible with these arguments? My inclination is not to be backward compatible, for reasons of controlling overall implementation complexity. To a certain degree we could preserve the same command-line interface if that is a critical requirement, so you would be able to run with most Sqoop 1.x standard command-line options. But there are things that are interpreted differently by different connectors, and in order to be fully backward compatible, the connectors too would have to be backward compatible and respect the same semantics as before. There is no harm in doing so, just that the overall burden on the new implementation would put a drag on its progress towards further improvement. I would prefer that Sqoop 2 not be required to be backward compatible with the current implementation, as long as there is an easy migration path from the previous system to the new one.
          • Metadata Store: How and where does Sqoop store information about Connections, resource limits, etc? Even though the writeup does not talk about this, I imagine having a pluggable store interface that is backed by an embedded Derby database (a sketch follows this list). This would allow Sqoop to integrate with HCatalog when it is ready for production.
            • How, if at all, do we guard against end-users starting a second Sqoop server to get around resource limits? We should provide an implementation that uses the metadata store to manage resource limits etc. This, as you point out, is easy to bypass if the user has access to connection information - they can set up a new instance pointing to a different metastore that violates these restrictions. But that is no different from abusing the resources outside of Sqoop by directly running programs/sessions against the database. Such use/abuse is beyond the scope of the security implementation in Sqoop, IMO.
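
          (To make the pluggable-store idea above concrete, a hypothetical sketch: a small repository interface with one implementation backed by embedded Derby. The interface name, methods, and schema are illustrative only, not a committed Sqoop 2 design; the jdbc:derby URL with ";create=true" is standard embedded Derby usage.)

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.PreparedStatement;
          import java.sql.ResultSet;
          import java.sql.SQLException;
          import java.sql.Statement;

          interface MetadataRepository {
            void saveConnection(String name, String jdbcUrl) throws SQLException;
            String findConnectionUrl(String name) throws SQLException;
          }

          class DerbyMetadataRepository implements MetadataRepository {
            private final Connection db;

            DerbyMetadataRepository(String path) throws SQLException {
              // ";create=true" creates the embedded database on first use.
              db = DriverManager.getConnection("jdbc:derby:" + path + ";create=true");
              try (Statement s = db.createStatement()) {
                s.executeUpdate("CREATE TABLE connections "
                    + "(name VARCHAR(64) PRIMARY KEY, url VARCHAR(512))");
              } catch (SQLException tableAlreadyExists) {
                // Table is already present from a previous run; nothing to do.
              }
            }

            public void saveConnection(String name, String jdbcUrl) throws SQLException {
              try (PreparedStatement ps =
                       db.prepareStatement("INSERT INTO connections VALUES (?, ?)")) {
                ps.setString(1, name);
                ps.setString(2, jdbcUrl);
                ps.executeUpdate();
              }
            }

            public String findConnectionUrl(String name) throws SQLException {
              try (PreparedStatement ps =
                       db.prepareStatement("SELECT url FROM connections WHERE name = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                  return rs.next() ? rs.getString(1) : null;
                }
              }
            }
          }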

          I also don't believe that it's productive for the command-line client to use the REST API directly. Starting a server (even on localhost) as a pre-req for running a command-line tool seems overly complicated to me.

I agree that there are differences in how one uses a tool versus how one uses a service. Services have the added burden of being managed and monitored, whereas tools are usually controlled entirely by the user. Once the service is started and available, using a service becomes far easier than using a tool. The end user does not have to worry about classpath details or making sure that they have the correct drivers installed. The client provides a thin facade to access the service, can run from anywhere, and on its own does not require any management. This generally scales very well compared to a heavy client that requires individual installations to be managed.

          I think a better architecture may be to define a number of Operations internally. Each Operation can have a programmatic (Java) API that executes it. Each Operation can also be bound to a REST API endpoint. But this way a user can still simply run the command-line application without configuring an entire server. The command-line app would run the Operation directly, as opposed to running it in the address space of a separate process somewhere. This would reduce the number of layers of complexity when debugging what goes wrong. Involving the network (even loopback) where none is needed seems like asking for trouble.

I think that underneath the covers the logic of most of Sqoop 2 will indeed be implemented as operations that can be invoked without needing a web-based service for testing purposes. The difference is that it won't be that way for the packaged system, which will be wired to work in a service model. Testability is certainly a core requirement for any system, and any implementation that does not lend itself to this is deficient. Given that, I don't think debugging would be much more difficult than, say, debugging MR applications is right now.
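
          (A hypothetical sketch of the Operation idea being discussed here: the same object can be executed in-process by a command-line client or bound to a REST endpoint by a server. All names are illustrative, not from any Sqoop codebase.)

          import java.util.Collections;
          import java.util.Map;

          // Each operation exposes a plain Java API...
          interface Operation<R> {
            R execute(Map<String, String> params) throws Exception;
          }

          // ...with concrete operations supplying the actual logic.
          class ListConnectionsOperation implements Operation<String> {
            public String execute(Map<String, String> params) {
              return "connections: []"; // stand-in for real repository access
            }
          }

          class CommandLineClientSketch {
            public static void main(String[] args) throws Exception {
              // In-process execution: no server, no network hop, simple to debug.
              Operation<String> op = new ListConnectionsOperation();
              System.out.println(op.execute(Collections.<String, String>emptyMap()));
              // A packaged server deployment would instead route, e.g.,
              // GET /v1/connections to the same Operation instance behind a
              // versioned REST layer.
            }
          }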

          Finally, on the front of API compatibility: Arvind, in an offline discussion, we talked about having a separate API package of interfaces that would have "api level" versioning (a la the Servlet API) that is distinct from the implementation version. Is that still part of your vision for Sqoop 2? I don't see it described in this proposal.

Thanks for pointing that out - yes, it is. For those who were not part of our offline discussion, the summary is that Sqoop 2 will expose a versioned REST API that will automatically bind to different clients. So technically you could upgrade Sqoop 2 to Sqoop 2.5 etc., which may have a new API, but the old clients will continue to work as is. The only caveat is that we may not be able to retrofit it to support Sqoop 1.x, based on the discussion above.

          I looked through the proposed source layout for this. Without a README specifying what goes in which directories, it's hard for me to understand what you're trying to accomplish. What's the "infra" project for?

The infra project would be the Sqoop infrastructure. We could name it "core" or "arch" or another commonly used name. The purpose of this project is to define the core system architecture and design, which gets used by other modules where necessary.

          I think based on what I said above about Operations, etc, there should be a "libsqoop" project that corresponds to the guts of the project. The "server" should just be a REST API implementation (perhaps w/ an embedded Jetty server, but also perhaps deployable as a WAR on a fully-administered Tomcat instance) that embeds libsqoop to perform the Operations. And the client, similarly, is a thin command-line-arg parsing shell that embeds libsqoop to perform Operations directly.

I believe the infra module is what we are talking about here. I am hesitant to give it a name that suggests it is a library, since there will be a bit of logic dealing with extensions, job lifecycle, and other operational details which actively define the overall functioning of the system. Effectively, though, it will still do the same thing as what you have suggested for libsqoop.

          Is infra ~= libsqoop in this idea? Or is that about independent testing of connectors, etc?

          Yes - infra ~ libsqoop.

          I think there should also be a plugin-api library (libsqoopapi?) which the connector/*/ projects link against, rather than libsqoop itself. This API would also be used by third-party SqoopTool implementations.

Good suggestion - we can have a separate module for the Sqoop extension API. It probably belongs to the connection/api module.

          Aaron Kimball added a comment -

          This proposal looks like a good start! Here are some questions I have about it:

          • One of the main advantages of Sqoop in its current form is its ease of deployment by end-users. Like Pig, it can be installed on a client machine without burdening cluster operators.
            • How will we maintain this ease-of-deployment in the face of the web-based app? Can/will Sqoop come with a self-contained server (e.g. Jetty?) to support 'localhost' execution of the web app? (A sketch of such embedding follows this list.)
            • I like the idea of pre-defined connections. But will Sqoop still support the ability to use the existing 'ad hoc' connection mechanism? For users who already have a username/password they can use to connect to a database, it may be useful for them to get started easily with their existing credentials, without requiring an operator to configure a connection.
          • Many production deployments count on running Sqoop in command-line mode, using the existing command-line arguments to specify the job. Will Sqoop2 be backwards-compatible with these arguments?
          • How and where does Sqoop store information about Connections, resource limits, etc?
            • How, if at all, do we guard against end-users starting a second Sqoop server to get around resource limits? Are the resource limits and temporary locking info, etc, stored in the target database itself? (If so, how do we guard against stale locks..?)
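
          (On the first bullet, a hypothetical sketch of what a self-contained server could look like: a Sqoop 2 process embedding Jetty so localhost execution needs no separately administered container. The handler body and port are placeholders; only the embedding pattern is the point.)

          import java.io.IOException;
          import javax.servlet.http.HttpServletRequest;
          import javax.servlet.http.HttpServletResponse;
          import org.eclipse.jetty.server.Request;
          import org.eclipse.jetty.server.Server;
          import org.eclipse.jetty.server.handler.AbstractHandler;

          public class EmbeddedServerSketch {
            public static void main(String[] args) throws Exception {
              Server server = new Server(12000); // placeholder port
              server.setHandler(new AbstractHandler() {
                @Override
                public void handle(String target, Request baseRequest,
                                   HttpServletRequest request,
                                   HttpServletResponse response) throws IOException {
                  // A real server would dispatch to REST resources here.
                  response.setContentType("application/json");
                  response.getWriter().println("{\"status\": \"ok\"}");
                  baseRequest.setHandled(true);
                }
              });
              server.start();
              server.join();
            }
          }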

          I also don't believe that it's productive for the command-line client to use the REST API directly. Starting a server (even on localhost) as a pre-req for running a command-line tool seems overly complicated to me.

          I think a better architecture may be to define a number of Operations internally. Each Operation can have a programmatic (Java) API that executes it. Each Operation can also be bound to a REST API endpoint. But this way a user can still simply run the command-line application without configuring an entire server. The command-line app would run the Operation directly, as opposed to running it in the address space of a separate process somewhere. This would reduce the number of layers of complexity when debugging what goes wrong. Involving the network (even loopback) where none is needed seems like asking for trouble.

          Finally, on the front of API compatibility: Arvind, in an offline discussion, we talked about having a separate API package of interfaces that would have "api level" versioning (a la the Servlet API) that is distinct from the implementation version. Is that still part of your vision for Sqoop 2? I don't see it described in this proposal.

          I looked through the proposed source layout for this. Without a README specifying what goes in which directories, it's hard for me to understand what you're trying to accomplish. What's the "infra" project for?

          I think based on what I said above about Operations, etc, there should be a "libsqoop" project that corresponds to the guts of the project. The "server" should just be a REST API implementation (perhaps w/ an embedded Jetty server, but also perhaps deployable as a WAR on a fully-administered Tomcat instance) that embeds libsqoop to perform the Operations. And the client, similarly, is a thin command-line-arg parsing shell that embeds libsqoop to perform Operations directly.

          Is infra ~= libsqoop in this idea? Or is that about independent testing of connectors, etc?

          I think there should also be a plugin-api library (libsqoopapi?) which the connector/*/ projects link against, rather than libsqoop itself. This API would also be used by third-party SqoopTool implementations.

          This document's off to a great start – this is definitely in line with the next evolution of Sqoop as a first-class mechanism for getting data into Hadoop. Looking forward to your answers!

          Cheers,
          Aaron

          Arvind Prabhakar added a comment -

I have attached a stubbed layout of what the Sqoop 2 workspace might look like. I suggest that we take it into a new branch for a proof-of-concept prototype.


            People

            • Assignee: Arvind Prabhakar
            • Reporter: Arvind Prabhakar
            • Votes: 0
            • Watchers: 21