Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1638

Remove MapReduce server depedencies on client code

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.24.0
    • Component/s: build, client
    • Labels:
      None

      Description

      I think it makes sense to separate the MapReduce source into client and server parts.

      1. MAPREDUCE-1638.sh
        6 kB
        Tom White
      2. MAPREDUCE-1638.patch
        3 kB
        Tom White
      3. MAPREDUCE-1638.patch
        2 kB
        Tom White
      4. MAPREDUCE-1638.patch
        5 kB
        Tom White

        Issue Links

          Activity

          Hide
          Arun C Murthy added a comment -

          Sorry to come in late, the patch has gone stale. Can you please rebase? Thanks.

          Show
          Arun C Murthy added a comment - Sorry to come in late, the patch has gone stale. Can you please rebase? Thanks.
          Hide
          Todd Lipcon added a comment -

          +1 to the patch from 7/22

          Show
          Todd Lipcon added a comment - +1 to the patch from 7/22
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12487568/MAPREDUCE-1638.patch
          against trunk revision 1149323.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestMRCLI
          org.apache.hadoop.fs.TestFileSystem

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/494//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/494//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/494//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12487568/MAPREDUCE-1638.patch against trunk revision 1149323. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.cli.TestMRCLI org.apache.hadoop.fs.TestFileSystem -1 contrib tests. The patch failed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/494//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/494//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/494//console This message is automatically generated.
          Hide
          Tom White added a comment -

          Updated title and description per Owen's comment.

          Show
          Tom White added a comment - Updated title and description per Owen's comment.
          Hide
          Tom White added a comment -

          > Could the toFullPropertyName method be moved to MRJobConfig and then used by both QueueManager and JobSubmitter? Duplicating it seems ugly.

          MRJobConfig is an interface containing constants, so I've created a private QueueConfig class for the toFullPropertyName method instead.

          Show
          Tom White added a comment - > Could the toFullPropertyName method be moved to MRJobConfig and then used by both QueueManager and JobSubmitter? Duplicating it seems ugly. MRJobConfig is an interface containing constants, so I've created a private QueueConfig class for the toFullPropertyName method instead.
          Hide
          Todd Lipcon added a comment -

          Could the toFullPropertyName method be moved to MRJobConfig and then used by both QueueManager and JobSubmitter? Duplicating it seems ugly.

          Show
          Todd Lipcon added a comment - Could the toFullPropertyName method be moved to MRJobConfig and then used by both QueueManager and JobSubmitter? Duplicating it seems ugly.
          Hide
          Tom White added a comment -

          Yes, these patches are about removing client dependencies on the server.

          Show
          Tom White added a comment - Yes, these patches are about removing client dependencies on the server.
          Hide
          Owen O'Malley added a comment -

          I can understand splitting up the client and server jars, but splitting up the API and implementation only makes sense if you have different implementations and a test suite to test them.

          Cleaning up the dependencies is a good thing, especially removing dependencies from the client on the server code.

          Show
          Owen O'Malley added a comment - I can understand splitting up the client and server jars, but splitting up the API and implementation only makes sense if you have different implementations and a test suite to test them. Cleaning up the dependencies is a good thing, especially removing dependencies from the client on the server code.
          Hide
          Tom White added a comment -

          This patch just does some more clean up to allow the separation to happen. This part can be committed now, while the actual separation of the trees would be done later, probably as a part of the process of getting MR-279 into trunk.

          Show
          Tom White added a comment - This patch just does some more clean up to allow the separation to happen. This part can be committed now, while the actual separation of the trees would be done later, probably as a part of the process of getting MR-279 into trunk.
          Hide
          Tom White added a comment -

          Here's a patch to get started. The script moves the MapReduce server classes into another source tree (leaving them in the same package). Compiling the remaining classes we can see which dependencies of the API on the implementation still need to be taken care of.

          Summarizing the compilation errors:

          • ClusterStatus depends on JobTracker.State. Covered by MAPREDUCE-2337
          • CLI depends on HistoryViewer. Fix: Declare a public API for JobHistory.
          • Cluster depends on JobTracker, LocalJobRunner, JobHistory. Fix: Introduce a factory using ServiceLoader.
          • Application (Pipes) depends on TaskLog.
          • DistributedCache depends on TaskController, MRAsyncDiskService.
          Show
          Tom White added a comment - Here's a patch to get started. The script moves the MapReduce server classes into another source tree (leaving them in the same package). Compiling the remaining classes we can see which dependencies of the API on the implementation still need to be taken care of. Summarizing the compilation errors: ClusterStatus depends on JobTracker.State. Covered by MAPREDUCE-2337 CLI depends on HistoryViewer. Fix: Declare a public API for JobHistory. Cluster depends on JobTracker, LocalJobRunner, JobHistory. Fix: Introduce a factory using ServiceLoader. Application (Pipes) depends on TaskLog. DistributedCache depends on TaskController, MRAsyncDiskService.
          Hide
          steve_l added a comment -

          I know of people who are looking at alternate execution engines to the JT, so having that bit of the API decoupled from the implementation would be good. I don't (yet) see the need for separate JARs though, it only complicates things and creates new problems.

          Show
          steve_l added a comment - I know of people who are looking at alternate execution engines to the JT, so having that bit of the API decoupled from the implementation would be good. I don't (yet) see the need for separate JARs though, it only complicates things and creates new problems.
          Hide
          Tom White added a comment -

          After considering it, I don't think that we should separate the api from the impls. In general, that makes more sense if you have multiple implementations of the api.

          In some sense we already have multiple implementations of the API, if you consider the full distributed implementation and LocalJobRunner as different implementations. (Also MRUnit is a partial implementation.)

          I'm also worried that there would be a circular dependence between the two jars.

          You're right that there are currently many circular dependencies, which are hard to break. Perhaps we can improve this in the future, but it's a bigger project (a project Jigsaw - http://openjdk.java.net/projects/jigsaw/ - for MapReduce?).

          I agree that we need to make the line stronger, but maybe it would be better to move the implementations into new packages?

          This would be a good step. For the moment, I think that a combination of MAPREDUCE-1623 and HADOOP-6658 is enough to define the public user-facing API.

          Show
          Tom White added a comment - After considering it, I don't think that we should separate the api from the impls. In general, that makes more sense if you have multiple implementations of the api. In some sense we already have multiple implementations of the API, if you consider the full distributed implementation and LocalJobRunner as different implementations. (Also MRUnit is a partial implementation.) I'm also worried that there would be a circular dependence between the two jars. You're right that there are currently many circular dependencies, which are hard to break. Perhaps we can improve this in the future, but it's a bigger project (a project Jigsaw - http://openjdk.java.net/projects/jigsaw/ - for MapReduce?). I agree that we need to make the line stronger, but maybe it would be better to move the implementations into new packages? This would be a good step. For the moment, I think that a combination of MAPREDUCE-1623 and HADOOP-6658 is enough to define the public user-facing API.
          Hide
          Owen O'Malley added a comment -

          I support splitting the libraries out to their own source tree. That both supports important use cases (MAPREDUCE-1478) and enforces good style (MAPREDUCE-1453).

          After considering it, I don't think that we should separate the api from the impls. In general, that makes more sense if you have multiple implementations of the api. I'm also worried that there would be a circular dependence between the two jars.

          I agree that we need to make the line stronger, but maybe it would be better to move the implementations into new packages?

          Leave all of the API classes alone and for the implementation classes move them as:

          org.apache.hadoop.mapred.* - > org.apache.hadoop.mapreduce.impl.*
          org.apache.hadoop.mapreduce.* -> org.apache.hadoop.mapreduce.impl.*

          Show
          Owen O'Malley added a comment - I support splitting the libraries out to their own source tree. That both supports important use cases ( MAPREDUCE-1478 ) and enforces good style ( MAPREDUCE-1453 ). After considering it, I don't think that we should separate the api from the impls. In general, that makes more sense if you have multiple implementations of the api. I'm also worried that there would be a circular dependence between the two jars. I agree that we need to make the line stronger, but maybe it would be better to move the implementations into new packages? Leave all of the API classes alone and for the implementation classes move them as: org.apache.hadoop.mapred.* - > org.apache.hadoop.mapreduce.impl.* org.apache.hadoop.mapreduce.* -> org.apache.hadoop.mapreduce.impl.*
          Hide
          Tom White added a comment -

          To flesh this out a little more: roughly speaking the API tree would contain the public part of o.a.h.mapred that contains the (deprecated) user API as well as o.a.h.mapreduce. The library tree would contain o.a.h.mapred.lib and o.a.h.mapreduce.lib (and subpackages). The implementation tree would contain everything else (although there may be exceptions - classes that are considered a part of the public API and should go in the API tree).

          This change would mark a very clear boundary between the public user-facing API and the internal implementation. Having separate source trees is a common approach in many projects. The use of annotations introduced in HADOOP-5073 doesn't provide such a clear demarcation (since you can't conditionally compile according to the presence of an annotation), but is still useful for more fine-grained distinctions.

          Note that this change would not introduce an incompatible change, since classes would be moved between trees and remain in the same packages.

          I see the following advantages:

          • If we created separate JARs for each tree, clients could compile against the API and library JARs without inadvertently introducing dependencies on implementation class that happen to be public. Even if the class is marked as InterfaceAudience.Private, it is easy to accidentally have the IDE pick it up.
          • It makes MAPREDUCE-1478 (shipping modified libraries) easy to implement.
          • We can enforce that the kernel (user API) and implementation don't depend on the libraries. (MAPREDUCE-1453)
          • It helps enforce compatibility. From a review standpoint it would be easier to see if a patch modifies a public API, since it is in its own tree. Similarly, we would publish javadoc for API and library, and generate JDiff against them. Currently JDiff output is very large due to a large number of false positives, so it is difficult to see real incompatibility problems. (HADOOP-6658 helps here, but the approach described in this issue solves the other problems listed above too.)

          Thoughts?

          Show
          Tom White added a comment - To flesh this out a little more: roughly speaking the API tree would contain the public part of o.a.h.mapred that contains the (deprecated) user API as well as o.a.h.mapreduce. The library tree would contain o.a.h.mapred.lib and o.a.h.mapreduce.lib (and subpackages). The implementation tree would contain everything else (although there may be exceptions - classes that are considered a part of the public API and should go in the API tree). This change would mark a very clear boundary between the public user-facing API and the internal implementation. Having separate source trees is a common approach in many projects. The use of annotations introduced in HADOOP-5073 doesn't provide such a clear demarcation (since you can't conditionally compile according to the presence of an annotation), but is still useful for more fine-grained distinctions. Note that this change would not introduce an incompatible change, since classes would be moved between trees and remain in the same packages. I see the following advantages: If we created separate JARs for each tree, clients could compile against the API and library JARs without inadvertently introducing dependencies on implementation class that happen to be public. Even if the class is marked as InterfaceAudience.Private, it is easy to accidentally have the IDE pick it up. It makes MAPREDUCE-1478 (shipping modified libraries) easy to implement. We can enforce that the kernel (user API) and implementation don't depend on the libraries. ( MAPREDUCE-1453 ) It helps enforce compatibility. From a review standpoint it would be easier to see if a patch modifies a public API, since it is in its own tree. Similarly, we would publish javadoc for API and library, and generate JDiff against them. Currently JDiff output is very large due to a large number of false positives, so it is difficult to see real incompatibility problems. ( HADOOP-6658 helps here, but the approach described in this issue solves the other problems listed above too.) Thoughts?

            People

            • Assignee:
              Tom White
              Reporter:
              Tom White
            • Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

              • Created:
                Updated:

                Development