Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: build
    • Labels: None
    • Environment: All

Description

      For ease of deployment, one should not have to change the server jars and restart clusters when minor features change on the client side. This requires separating the client and server jars for hadoop. Version numbers appended to the hadoop jars can reflect compatibility: e.g. the server jar could be at 0.13.1 and the client jar at 0.13.2. In short, we can treat the part following "0." as the "major" version number for now.

      This keeps major client frameworks such as streaming and Pig happy. To my knowledge, Pig uses hadoop's default jobclient, whereas streaming uses its own. I would love to change streaming to use the default hadoop jobclient and modify it (e.g. to print more of the stats that are available from TaskReport), as long as I do not have to deploy the new version of the whole jar to the backend and restart the mapreduce cluster.
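
      As a rough illustration of the kind of client-side change described above (not part of the original report; it assumes the 0.14-era JobClient/TaskReport API, and the class name PrintTaskStats is hypothetical), a streaming client built on the stock JobClient could print per-task stats roughly like this:

      // Sketch only: dump map-task progress for a job via the stock JobClient.
      // Assumes the 0.14-era API where getMapTaskReports() takes a String job id;
      // exact signatures may differ between releases.
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.TaskReport;

      public class PrintTaskStats {
        public static void main(String[] args) throws Exception {
          String jobId = args[0];                       // job id as printed at submission time
          JobClient client = new JobClient(new JobConf());
          for (TaskReport report : client.getMapTaskReports(jobId)) {
            // One line per map task: task id, progress fraction, current state.
            System.out.println(report.getTaskId() + "\t"
                + report.getProgress() + "\t" + report.getState());
          }
        }
      }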

      (I thought there was already a bug filed for separating the client and server jar, but I could not find it. Hence the new Jira. Sorry about duplication, if any.)

Issue Links

Activity

          Doug Cutting added a comment -

          > Is it required that all tools be released at the same frequency as the core hadoop system?

          Pragmatically, yes. We don't want to call separate release votes for each tool. Bundling tools with the core simplifies compatibility: otherwise each tool would need to separately document which version of the core it is compatible with. Etc.

          I'm still on the fence about whether we need separate jars for each logical tool, or whether we should just move all non-kernel code into a single hadoop-tool.jar. In the case of Nutch, having all those plugins separate makes builds go slower, and makes browsing the sources more awkward.

          Milind Bhandarkar added a comment -

          Indeed.

          src/tools is the right place to put such things. Is it required that all tools be released at the same frequency as the core hadoop system? I should probably take a look at nutch release processes, and see if they fit our requirements.

          Doug Cutting added a comment -

          > Would it be a good idea for building a hadoop ecosystem, to separate all these client codes into separate projects?

          I think you mean something different than 'project'. At Apache, each project requires a diverse community, so code with a single contributor, or with all contributors from the same employer, is not a good candidate for an Apache project. But separating client tools into separate jar files might be good. The cleanest way to separate jar files is to use separate source trees. Splitting tools into 'src/tool/' subdirectories would be fine with me. For example, Nutch puts each of its plugins in a separate source tree:

          http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/

          And we already have contrib structured this way. Is that what you meant?

          Milind Bhandarkar added a comment -

          Just want to know other users' opinions.

          Would it be a good idea for building a hadoop ecosystem, to separate all these client codes into separate projects?

          (One project, one artifact is a good thing (tm); Doug agrees on this, based on his previous comments. For a growing hadoop ecosystem, we need several artifacts that can evolve separately, and faster.)

          For example, how about separating distcp into a hadoop-dependent but separate project that provides a high-bandwidth distributed copy?

          How about separating streaming into a separate project as a mechanism for using time-tested channels (i.e. stdin and stdout) for passing data between plugins and hadoop?

          Owen O'Malley added a comment -

          I'm not wild about splitting up the jar into a bunch of pieces. In particular, I think that making "tweaks" to some of the system tools like distcp and melding them into a release is a really bad idea for debuggability and maintainability. I've seen far too many cases of someone making a "minor" fix to a tool that hoses a tool chain, and with Hadoop such a mess-up can affect a lot of users very quickly.

          Arun C Murthy added a comment -

          Ah. I think one could make a case for a hadoop-user jar, that includes standard, supported stuff, like distcp, aggregate, etc.

          Essentially they are standard, supported tools/utilities for users... do we have a case for hadoop-tools.jar then?
          One other nice tool would be one which converts a bunch of input formats (text, compressed text, etc.) to sequence files (say txt2seq for now; a rough sketch follows below). We could encourage users to contribute more of these utils.

          Along similar lines, it would be nice to encourage people to contribute their actual mapper/reducer implementations and build up a comprehensive org.apache.hadoop.mapred.lib, clearly with riders about IP etc. I'm not clear whether it would help to have a hadoop-mapred-lib.jar though. What do others think? Should I file a jira?
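
          To make the txt2seq idea above a little more concrete, a minimal sketch (the tool name, class layout, and <LongWritable, Text> record format are just placeholders taken from the comment above, not an existing Hadoop utility), assuming the old FileSystem/SequenceFile API:

          // Hypothetical txt2seq sketch: copy the lines of a text file into a
          // SequenceFile of <LongWritable, Text> records (key = approximate byte
          // offset of the line, value = the line itself).
          import java.io.BufferedReader;
          import java.io.InputStreamReader;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;

          public class Txt2Seq {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);            // default (configured) filesystem
              Path in = new Path(args[0]);
              Path out = new Path(args[1]);

              SequenceFile.Writer writer =
                  SequenceFile.createWriter(fs, conf, out, LongWritable.class, Text.class);
              BufferedReader reader =
                  new BufferedReader(new InputStreamReader(fs.open(in)));
              try {
                String line;
                long offset = 0;
                while ((line = reader.readLine()) != null) {
                  writer.append(new LongWritable(offset), new Text(line));
                  offset += line.length() + 1;                 // rough offset; ignores multi-byte chars
                }
              } finally {
                reader.close();
                writer.close();
              }
            }
          }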

          Milind Bhandarkar added a comment -

          Yes, that would be awesome! Thanks.

          Doug Cutting added a comment -

          > we faced this problem while using distcp [ ...]

          Ah. I think one could make a case for a hadoop-user jar, that includes standard, supported stuff, like distcp, aggregate, etc., that's not part of the kernel, and should hence not be included in the kernel's jar that's used on servers. A clean way to do this would be to use a separate source tree. So we might split src/java into src/sys and src/user, and replace hadoop-core.jar with hadoop-sys.jar and hadoop-user.jar. Some client-side stuff, like DFSClient and JobClient, would still be in the kernel jar, so the split wouldn't be client/server but rather user/system. Could that work?

          Alternately, we could move such things into src/contrib, which is already built as separate jars and only contains user code. In any case, we should attempt to minimize what's in the system jar file to avoid problems like this.

          Milind Bhandarkar added a comment -

          Actually, we faced this problem while using distcp also. (Sorry for not mentioning it in the original mail.) distcp (i.e. o.a.h.util.CopyFiles) is a mapred and dfs client application. However, since it is part of THE hadoop jar, if we make modifications to it and use the modified version as a client jar, they are not reflected on the server side (i.e. the tasktrackers pick up the original version of CopyFiles rather than the client's). This is also true for streaming, now that the contrib jars are on the class path and are searched before the user-supplied jar. So, in that case, the JobClient-side changes in the user-supplied jar take effect, but the Task-side classes are picked up from the jars deployed on the tasktrackers.

          I do not claim to have a solution for this. But I am sure someone out there has come across this before and solved this problem.

          Doug Cutting added a comment -

          There's lots of code that's used on both the client and the server. Would that go in a third jar?

          Also, you state that your goal is to be able to upgrade the client jar without updating the server jar. Why isn't that possible today? Our current versioning system should already permit this: we should not be making incompatible protocol changes in point releases. So I'm not clear what this will facilitate that's not already possible. We shouldn't have to restart clusters to update the client's jar today, so long as there have been no incompatible protocol changes.


People

    • Assignee: Unassigned
    • Reporter: Milind Bhandarkar
    • Votes: 0
    • Watchers: 2
