BIGTOP-316: split up hadoop packages into common, hdfs, mapreduce (and yarn)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: General
    • Labels: None

      Description

      Here are the new names I would like to propose in hadoop-0.23 branch:

      • hadoop (for hadoop-common; but since it won't be limited to just the hadoop-common project, it is simply hadoop)
      • hadoop-hdfs
      • hadoop-yarn
      • hadoop-mapreduce
      • hadoop-hdfs-namenode
      • hadoop-hdfs-secondarynamenode
      • hadoop-hdfs-datanode
      • hadoop-yarn-resourcemanager
      • hadoop-yarn-nodemanager
      • hadoop-mapreduce-historyserver
      • hadoop-libhdfs
      • hadoop-conf-pseudo
      • hadoop-docs

      Given that they look a tad too long, an alternative would be to drop the hadoop- prefix and go with
      the shorter versions (even though technically all these projects are sub-projects of Hadoop):

      • hadoop
      • hadoop-conf-pseudo
      • hadoop-docs
      • hdfs
      • yarn
      • mapreduce
      • hdfs-namenode
      • hdfs-secondarynamenode
      • hdfs-datanode
      • yarn-resourcemanager
      • yarn-nodemanager
      • mapreduce-historyserver
      • libhdfs

      Please leave your opinion on which one is better in the comments section.

      P.S. A quick search over at http://pkgs.org revealed no name clashes for either the
      long or the short versions.

      Attachments

      1. BIGTOP-316.patch.txt
        57 kB
        Roman Shaposhnik
      2. BIGTOP-316.patch2.txt
        59 kB
        Roman Shaposhnik


          Activity

          Matt Foley added a comment -

          Please give a detailed rationale for this split-up. Thank you.

          Roman Shaposhnik added a comment -

          Sure. Just keep in mind that this JIRA only applies to Hadoop post 0.21.

          Rationale: it seems that there's a growing interest in mixing and matching individual subprojects of Hadoop:

          • running mixed versions of HDFS and the mapreduce framework (e.g. 1.0 mapreduce on top of 0.22/0.23 HDFS)
          • substituting mapreduce with alternative frameworks such as MPI (only available in .23)
          • substituting HDFS with alternative implementations of distributed filesystems
          • using standalone HDFS+HBase clusters

          On top of that, from a deployment perspective there's an increased level of interest in precise
          control over which bits end up being installed on every node of a cluster.

          Implementing the fix for this JIRA will pave the way for much more flexible ways of utilizing individual
          sub-projects of Hadoop.

          Finally, from the standpoint of the packaging guidelines of every major distribution, it is highly desirable
          to split loosely coupled components into individual packages. Debian developers would go so far
          as to insist that every jar file be its own package.

          Implementing the fix for this JIRA will pave the way for Bigtop to be used as a basis for Hadoop
          packaging in major Linux distributions.

          Please let us know if this raises any concerns on your side.

          eric baldeschwieler added a comment -

          Does all of this end up only in BigTop or does it affect the Hadoop project trees? What's involved? Do you plan to work on this?

          Konstantin Boudnik added a comment - edited

          > Do you plan to work on this?
          And this is relevant exactly why?

          Konstantin Boudnik added a comment -

          The modularization of a platform is one of the most effective ways to keep APIs in check, IMO. Physically splitting the packages will only help the cause. While I have my doubts that much of the flexibility can be achieved in the current state of affairs (even with the best packaging structure put in place), I would still +1 it because this is a step in the right direction.

          Peter Linnell added a comment -

          As someone who does a lot of distro packaging for rpm, splitting these monolithic packages into more discrete components is definitely a sane thing to do. On the contrary, continuing to keep all the bits in their current form makes builds and maintenance a chore.

          I would hope it lands both in Bigtop and Hadoop.

          +1 +10 if I could

          Matt Foley added a comment -

          One of the benefits of monolithic packages is that you know the set of bits within is claimed to be self-consistent. If that is split up into many packages, then I would like to at least be able to say "give me the version 1.0.0 of all packages", and easily get them. Furthermore, I need to be able to look at an installed system (in the field) and say, "yes, all the pieces are version 1.0.0" – or alternatively, "here's the problem: this piece is only version 0.20.204, but the other pieces are version 1.0.0, and that's not compatible."

          While people are welcome to experiment with mix-and-match, it's going to be a minefield of late-found problems in the field. Let's do whatever is necessary to make it easy to spot when mix-and-match has been attempted. How do we address this? Can a naming convention for the piece-wise packages help address the problem?

          Bruno Mahé added a comment -

          Packages are already split, so this is not really new. It only makes separation much cleaner and deeper.

          Matt> I agree.
          Package dependencies already enable such behaviours. Bigtop has not really enforced them that much, but that could change if we start going that way.
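
          As an aside, here is a minimal sketch of what such a dependency looks like in an RPM spec. The subpackage stanza below is hypothetical (the names and summary are made up for the example); only the version-pinning idiom matters:

              # Hypothetical subpackage stanza: the versioned Requires pins
              # hadoop-hdfs to the exact build of the base hadoop package,
              # so mismatched combinations refuse to install.
              %package hdfs
              Summary: The Hadoop Distributed File System
              Requires: %{name} = %{version}-%{release}

              %description hdfs
              HDFS, split out of the monolithic hadoop package.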

          Konstantin Boudnik added a comment -

          Matt, most of your concerns are addressed by package dependencies. We do this between Hadoop and Pig or Oozie or Hive. Why is Hadoop inherently different? The whole split of the project, which happened back in 0.21, was done with that exact purpose: to provide better flexibility and cleaner functionality isolation.

          According to your logic, the whole stack on top of Hadoop would need to be delivered as a monolithic package to get the benefit of said self-consistency.

          Matt Foley added a comment -

          Cos, I don't think I deserve the aggressive response. Please re-read my comment. I specifically did not say that we should remain with a monolithic packaging, and in fact explicitly said letting people experiment with mixing and matching was welcome. "According to my logic", I have only expressed a requirement to be able to readily identify when someone ignorant has mixed-and-matched something they shouldn't have – because I think it is likely to happen a LOT in the field.

          Regarding package dependencies being an adequate solution: I see that they would support my first concern - "give me the version 1.0.0 of all packages" - I suppose by defining a super-package dependent on all the right versioned sub-components. However, if the user is trying to experiment with mix-and-match, then they can't use that super-package, because it will enforce the monolithic versioning, right? So they'll install the bits and pieces individually. How do we (a) make sure they get all the pieces they need, and (b) identify which pieces are mis-fits if the net result doesn't work?

          BTW, if you are thinking that all the sub-components can be given dependencies that make it impossible to install them with other sub-components that don't meet their needs, I don't buy it. There's no way you're going to successfully do that while also maintaining the goal to allow mix-and-match experimentation; the result will be over-constrained and meet neither goal well.
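
          (For concreteness, here is a minimal sketch of the super-package idea, with hypothetical names: a metapackage whose exact-version requirements give the "give me version 1.0.0 of all packages" behaviour, while the individual pieces remain installable on their own.)

              # Hypothetical metapackage: installing hadoop-full pulls in one
              # version-matched set of components; experimenters can skip it
              # and install the pieces individually.
              %package -n hadoop-full
              Summary: Metapackage for a complete, version-matched Hadoop stack
              Requires: hadoop = %{version}-%{release}
              Requires: hadoop-hdfs = %{version}-%{release}
              Requires: hadoop-mapreduce = %{version}-%{release}
              Requires: hadoop-yarn = %{version}-%{release}

              %description -n hadoop-full
              Installs all Hadoop components at a single consistent version.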

          Bruno Mahé added a comment - edited

          Matt, I believe Roman was aiming at contributors being able to mix and match versions, not really users. I don't see how users would be able to do so, and I personally see Bigtop as a coherent stack instead of a kit to build stacks (although nothing prevents anyone from selecting their own favourite downstream versions).
          Even if Bigtop components did not specify any dependencies, there would still be some work needed to make sure components compile against each other (for instance, a maven build of hbase would pull hadoop from its default profile, not the one we want to exercise, so the way a component is built may have to be tweaked), as well as testing their integration.
          But mix-and-matching is only one of many benefits of this proposal.

          Konstantin Boudnik added a comment -

          > Cos, I don't think I deserve the aggressive response.
          Matt, I don't think you got any. I was merely inducing upon your premises.

          Eli Collins added a comment -

          I think the description and discussion make this change sound like a much bigger deal than it actually is. The motivation is that the packages should be updated to reflect the project split. Eg we should be able to install hdfs w/o mapreduce, or hdfs w/o yarn, or yarn w/o mapreduce (now that MR is user-side there's no reason to install it on all the servers, right?), etc.

          Currently there is one hadoop spec file that results in a bunch of hadoop packages (hadoop-<version>, hadoop-datanode, hadoop-namenode, hadoop-jobtracker, etc) where most packages depend on the primary hadoop-<version> package and introduce just service-specific bits (eg a datanode service script). What this change proposes is that instead of a bunch of hadoop-* packages we subdivide into common, hdfs, mapreduce, and yarn packages. Note that there is still a single coherent set of packages for a given bigtop release. We'll enforce this with version dependencies the same way we do today to make sure hadoop-datanode depends on the right hadoop version, and eg hbase package x depends on zookeeper package x; ie this doesn't allow for arbitrary mixing and matching, it just updates the package structure to reflect the project split.
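
          A rough sketch of that structure in spec form (names follow the proposal above; the file list is illustrative): a daemon subpackage carries little more than its service script plus a pinned dependency on the component it serves.

              # Hypothetical fragment of the split spec: hadoop-hdfs owns the
              # HDFS bits, while hadoop-hdfs-namenode adds only the
              # service-specific pieces on top of it.
              %package hdfs
              Summary: The Hadoop Distributed File System
              Requires: %{name} = %{version}-%{release}

              %package hdfs-namenode
              Summary: NameNode service for HDFS
              Requires: %{name}-hdfs = %{version}-%{release}

              %files hdfs-namenode
              %attr(0755,root,root) /etc/init.d/hadoop-hdfs-namenode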

          Konstantin Boudnik added a comment -

          > ie this doesn't allow for arbitrary mixing and matching
          One can produce a downstream stack definition which will depend on whatever version of the hdfs and mapreduce packages one is pleased with. How does that not allow mix-and-matching?

          Eli Collins added a comment -

          I mean that a given bigtop repo will only contain matched packages for that release (you'd have to install multiple bigtop repos to have hdfs 24 and yarn 23 running side by side), and that even though the projects are split we should use versions in the dependencies, eg hdfs-23-* should depend on common-23-* instead of common-*, so that until common, hdfs, and mr are released independently and work compatibly we prevent mixing and matching packages that we know won't work together.

          Konstantin Boudnik added a comment -

          That's true for official releases. However, for one-off builds and validations it would work great, which is exactly the purpose of the mix-n-match.

          Peter Linnell added a comment -

          +1 for the latter

          Bruno Mahé added a comment -

          +0 with a preference for the former.
          My main concern with the latter is indexing. For instance, it is convenient to know that typing "yum search hadoop" will bring up all hadoop-related packages, and through the naming of the packages they will be directly related to it. With the latter, typing "yum search hadoop" will bring up a bunch of packages, but I would have to pay a lot more attention to the results.

          Konstantin Boudnik added a comment -

          I agree with Bruno on this one: having a common prefix makes sense, actually.
          +1 otherwise

          Roman Shaposhnik added a comment -

          Here's a preliminary patch for naming convention #1. It is still not final, since we're blocked on upstream HADOOP-7939 for now. However, the workarounds for HADOOP-7939 seem reasonable, and this patch overall is a very useful incremental step forward.

          Keep an eye on FIXME/HADOOP-7939 annotations to see what will go away once HADOOP-7939 is fixed.

          Also, this patch does NOT (yet):

          • split configuration locations (multiple packages contribute to /etc/hadoop/conf)
          • split actual file locations (multiple packages contribute to /usr/lib/hadoop)
          • make real use of /var/[log|run]/[hdfs|mapreduce] (for now they are simply symlinks to hadoop)

          Once again, all of the above is going to go away once HADOOP-7939 gets fixed. Please don't mind it too much.
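
          For reference, the interim symlink arrangement described above might look like this in the %install section (paths taken from the comment; the fragment itself is illustrative, not from the patch):

              # Hypothetical %install fragment: until HADOOP-7939 lands, the
              # per-component log and run directories are plain symlinks back
              # into the shared hadoop directories.
              %install
              mkdir -p $RPM_BUILD_ROOT/var/log/hadoop $RPM_BUILD_ROOT/var/run/hadoop
              ln -s hadoop $RPM_BUILD_ROOT/var/log/hdfs
              ln -s hadoop $RPM_BUILD_ROOT/var/run/hdfs
              ln -s hadoop $RPM_BUILD_ROOT/var/log/mapreduce
              ln -s hadoop $RPM_BUILD_ROOT/var/run/mapreduce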

          Bruno Mahé added a comment - edited

          Overall, it looks great!
          Some notes:

          • Any reason to set hadoop.tmp.dir to /var/lib/hdfs/cache/${user.name} instead of /tmp?
          • I don't really like moving /usr/libexec to /usr/lib/hadoop/libexec. But if it makes it work, I don't really have a choice.
          • Where does ${bin_wrapper/hadoop/yarn} come from?
          • I would set the hdfs user's home directory to /var/lib/hdfs. Same for the other users (I know it's not new).
          • The %preun sections for hdfs, yarn and mapreduce in the spec file won't work; %service_macro will take care of the services, so please delete these %preun sections.
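
          (For context, a conventional hand-rolled scriptlet of the kind being asked to be removed looks like the hypothetical sketch below; per the note above, %service_macro already generates the equivalent, so a hand-written copy just duplicates it.)

              # Hypothetical hand-written scriptlet: on full uninstall
              # ($1 == 0), stop the daemon and deregister it from the init
              # system. %service_macro emits the equivalent automatically.
              %preun hdfs-namenode
              if [ $1 -eq 0 ] ; then
                service hadoop-hdfs-namenode stop > /dev/null 2>&1 || :
                /sbin/chkconfig --del hadoop-hdfs-namenode
              fi
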
          Roman Shaposhnik added a comment -

          Bruno, thanks a million for the very nice and helpful feedback. I've attached a patch that now builds on RPM and takes care of some of your concerns (home dirs for daemons and %preun).

          To answer your other questions:
          1. Where does ${bin_wrapper/hadoop/yarn} come from?

          That's just a hack to have the hadoop launcher script include yarn. Ugly, but it'll go away along with the rest of the HADOOP-7939 related things. I suggest we just keep it for now.

          2. Any reason to set hadoop.tmp.dir to /var/lib/hdfs/cache/${user.name} instead of /tmp?

          The hadoop.tmp.dir is not really a tmp dir, but more of a prefix for other things that Hadoop constructs on the fly if they are not set up explicitly.

          Thanks,
          Roman.

          Bruno Mahé added a comment -

          Awesome!
          Thanks a lot
          +1 for that second patch

          Roman Shaposhnik added a comment -

          Perfect! I filed BIGTOP-340 to keep track of one of the issues you've raised.

          Peter Linnell added a comment -

          Strong +1. Nicely done.


            People

            • Assignee: Roman Shaposhnik
            • Reporter: Roman Shaposhnik
            • Votes: 0
            • Watchers: 14
