Hadoop Common
HADOOP-6894

Common foundation for Hadoop client tools

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      As Hadoop spreads and matures, the number of tools and utilities available to users keeps growing.

      Some of them are bundled with Hadoop core, some with Hadoop contrib, some ship on their own, and some are full-fledged servers in their own right. For example, to name just a few: distcp, streaming, pipes, har, pig, hive, oozie.

      Today there is no standard mechanism for making these tools available to users, nor is there a standard mechanism for these tools to integrate with one another and be distributed together.

      The lack of a common foundation creates issues for developers and users.

      1. deployment.pdf
        61 kB
        Owen O'Malley
      2. HadoopClientTools.pdf
        339 kB
        Alejandro Abdelnur
      3. HadoopClientToolsV2_1.pdf
        447 kB
        Alejandro Abdelnur
      4. HadoopClientToolsV2.pdf
        443 kB
        Alejandro Abdelnur


          Activity

          Alejandro Abdelnur added a comment -

          This proposal defines simple and well understood guidelines for:

          • Development (optional)
          • Directory layout (mandatory)
          • Packaging (mandatory)
          • Platform packaging (optional)
          • Stack release (optional)
          • Execution (mandatory)

          The Hadoop 'hadoop' command-line tool is out of scope.

          Server tools (like start/stop Hadoop or Oozie) are out of scope.

          Owen O'Malley added a comment -

          Comments:
          1. We desperately need a packaging and deployment strategy in Hadoop. The current tarball-based deployment is insufficient.
          2. A lot of discussion of potential layouts for deployment already happened on HADOOP-6255, which you've completely ignored. Installing each tool in a separate directory with the tool and version name is how it was done before we had package managers. Package managers have taken over because they are far more convenient than tarballs with install instructions. I don't think this proposal is moving the project in the correct direction. Granted, we probably need to produce tarballs for operating systems without package managers, but I'd certainly expect most installations to use packages.
          3. Any packaging and/or deployment strategy that ignores cluster deployment won't work.
          4. Naming versions based on the Hadoop version makes sense, but should only go to the branch (i.e. pig-0.7-0.20, not pig-0.7-0.20.2). Naturally this Jira isn't the place to discuss changes in other sub-projects (i.e. Pig).

          Alejandro Abdelnur added a comment -

          Thanks Owen,

          I have not ignored HADOOP-6255; the proposal integrates some of the ideas discussed there (though there is still no consensus on some of them in HADOOP-6255):

          • A common directory layout
          • Support for native components
          • TAR files as the initial distribution from which native packages are created

          Regarding your other comments:

          1. Yes, we need a clear story for packaging.
          2. Tarballs on their own are not sufficient; IMO tarballs with metadata that allows the creation of native packages are. If you use tarballs directly, you have to resolve dependencies by hand; if you use native packages, that is done for you. A packager project can use the tarballs plus metadata to create native packages (see the sketch after this list).
          3. Cluster deployment is a completely different beast, which is why this proposal ignores it; the focus here is a solution for users (the client side of things).
          4. This Jira only recommends (it does not impose) a consistent versioning scheme for client tools that would make things simpler for users.
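
          As a hypothetical illustration of the tarball-plus-metadata idea (the file name, field names, and versions below are made up for illustration, not taken from the attached proposal), a tool's tarball could carry a small descriptor that a packager project reads to generate an RPM spec or Debian control file, resolving the dependencies that a plain-tarball user would otherwise handle by hand:

              pig-0.7-0.20.tar.gz
                bin/
                lib/
                tool.properties          (hypothetical descriptor)
                    name=pig
                    version=0.7
                    hadoop.branch=0.20
                    requires=hadoop-core, commons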

          Alejandro Abdelnur added a comment -

          Attached is a proposal (V2) that Owen and I worked out together.

          This proposal enables a complementary integration with HADOOP-6255.

          What has changed (a hypothetical layout sketch follows the list):

          • hadoop client commands are just another tool
          • the 'tools/' dir is now 'share/'
          • there can be a special tool 'commons' to which 'plugin' tools like har, viewfs, hdfsprx are added as dependencies
          • all tools registered in 'commons' are always available to any tool that depends on 'commons'
          • there is an 'etc/' directory where tool configurations live
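
          A minimal sketch of what such a layout might look like (the tool names and versions are illustrative only, not taken from the attached document):

              share/
                hadoop/        the hadoop client commands, packaged as just another tool
                commons/       special tool; 'plugin' tools such as har and viewfs registered as dependencies
                pig-0.7/       an ordinary tool; depending on 'commons' makes all its plugins available
              etc/
                hadoop/        per-tool configuration
                pig/
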
          Alejandro Abdelnur added a comment -

          Took care of some typos.

          Clarified that etc/ must be in the classpath.

          Owen O'Malley added a comment -

          Ok, this is a major revision of this proposal to discuss the packaging and deployment of Hadoop. It describes the packages that we need, the deployment directory structure, and the mechanisms to add additional tools to the deployment.


            People

             • Assignee: Owen O'Malley
             • Reporter: Alejandro Abdelnur
             • Votes: 1
             • Watchers: 27
