Bigtop / BIGTOP-358

Now that the Hadoop packages have been split, we have to update the dependencies on the downstream packages

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: backlog
    • Component/s: None
    • Labels: None

      Description

      This is actually slightly more complicated than it sounds: it is pretty straightforward to replace a dependency on hadoop with a dependency on hadoop-mapreduce, but it is less clear what to do with HDFS. Strictly speaking, HDFS is not a hard dependency (one can run on a local filesystem just fine).
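
      For a downstream package whose real dependency is MapReduce, the straightforward half of this is a one-line change in its package metadata; a minimal sketch in RPM spec terms, assuming the post-split package name hadoop-mapreduce from this discussion:

          # downstream package .spec, before the Hadoop split
          Requires: hadoop

          # after the split: depend only on the piece actually used
          Requires: hadoop-mapreduce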

      Thoughts?

      Attachments

      1. bigtop.png (58 kB) - Roman Shaposhnik
      2. bigtop.dot (0.5 kB) - Roman Shaposhnik

        Activity

        Bruno Mahé added a comment -

        First thought would be to decouple mapreduce/yarn from hdfs, then: have hdfs, or any other filesystem for Hadoop, carry a virtual provide such as hadoop-filesystem, and make mapreduce/yarn depend on that. That way, if a downstream project only depends on the mapreduce/yarn API, it does not have to pull in hdfs when it is not needed.

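        A minimal sketch of Bruno's virtual-provide idea in RPM spec terms (the capability name hadoop-filesystem is his example, not a settled name):

            # hadoop-hdfs.spec: the filesystem implementation advertises
            # the virtual capability instead of being depended on by name
            Provides: hadoop-filesystem

            # hadoop-mapreduce.spec: depend on the capability, so any
            # package that provides it can satisfy the dependency
            Requires: hadoop-filesystem

        The same shape works on the Debian side with a virtual package: hadoop-hdfs declares "Provides: hadoop-filesystem" and hadoop-mapreduce depends on hadoop-filesystem.
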
        Peter Linnell added a comment -

        +1 to Bruno's comments. Perfectly legit solution. What else needs decoupling like that?

        Roman Shaposhnik added a comment -

        The attached dot and png files are what I have figured out so far (rectangular boxes represent capabilities that will be provided by actual packages, and dotted lines represent "optional/recommended" dependencies). Now, I still have a few concerns:

        1. I think it is pretty clear by now that the mapreduce dependency has to be on a capability, not an actual package (and then we'll have hadoop-mapreduce "Provide:" that capability). The question is whether we are ready to do the same with hadoop-hdfs, and what those capabilities should be called (my proposal is to call them "mapreduce" and "dfs" respectively, and make the actual packages hadoop-mapreduce and hadoop-hdfs provide those capabilities for now). See the sketch after this list.

        2. For pig, hive, sqoop and mahout the real hard dependency is mapreduce. The dependency on dfs is an optional one (they can run just fine in local mode without ever talking to HDFS). The question is: what's the best mechanism to "recommend" dfs? I know we can do that with Debian packages (the Recommends tag), but what about RPM? Finally, are we doing the right thing here by treating dfs as an optional dependency, or should we enforce it to begin with? (The sketch after this list shows the Debian side.)

        3. HBase is a weird case here: at the Maven level they package all of their dependencies (optional or not) into lib/*, so they end up with a whole bunch of jars there that we're currently replacing with symlinks. Not all of those dependencies are needed by HBase in all cases (in fact the only hard dependency there is Zookeeper), but having dangling symlinks doesn't seem appealing. The question is: what do we do?
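
        A sketch of how points 1 and 2 could look in packaging terms. The capability names "mapreduce" and "dfs" are the proposal above; the rest is illustrative. Note that Debian's Recommends tag gives exactly the "pulled in by default, but skippable" behavior, while stock RPM (as of this discussion) has no weak-dependency equivalent, so on the RPM side the choice is effectively between a hard Requires and no dependency at all.

            # hadoop-mapreduce.spec
            Provides: mapreduce

            # hadoop-hdfs.spec
            Provides: dfs

            # pig.spec (likewise hive, sqoop, mahout): the hard dependency
            # is on the capability, not on a package name
            Requires: mapreduce

        On the Debian side, with dfs as a soft dependency:

            Package: pig
            Depends: mapreduce
            Recommends: dfs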


          People

          • Assignee: Roman Shaposhnik
          • Reporter: Roman Shaposhnik
          • Votes: 0
          • Watchers: 0
