Hadoop Map/Reduce
MAPREDUCE-1478

Separate the mapred.lib and mapreduce.lib classes to a different jar and include the user jar ahead of the lib jar.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: task
    • Labels: None

      Description

      Currently the user can't include updated library jars as part of their job. By pulling the lib classes out, we can include those classes (e.g. TextInputFormat) in the user's jar and pick up their version rather than the system-installed one.
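
      For illustration, here is a minimal old-API job driver that uses TextInputFormat, the kind of library class this proposal would let users override by bundling their own copy. The class and job names below are illustrative only and are not part of this issue.

      {code:java}
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.TextInputFormat;

      public class LibClassDemo {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(LibClassDemo.class);
          conf.setJobName("lib-class-demo");
          // TextInputFormat is a library class. Today it is always resolved from
          // the cluster's installed jar; with the lib classes split out, a copy
          // bundled in the user's job jar could take precedence.
          conf.setInputFormat(TextInputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf); // identity map/reduce by default
        }
      }
      {code}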

      Attachments

      1. MAPREDUCE-1478.patch (63 kB) - Tom White
      2. move.sh (1 kB) - Tom White

        Issue Links

          Activity

          Todd Lipcon added a comment -

          I assume this would be somewhat optional, right? That is, if the user doesn't put a mapreduce-lib.jar in their job.jar/lib/, it will use the system one on the TT's classpath? We'll just allow users to use a local one when they like?

          Arun C Murthy added a comment -

          Yes, we would need to use the installed ones by default - however, this gives users the option to use a newer version of the mapreduce libraries, etc.

          Arun C Murthy added a comment -

          We also need to load the mapreduce.lib jar after loading the user-jars. Currently we load the system jars and then user jars, so the new order would be:

          system-jars, user.libs, mapreduce.lib.
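
          To make the proposed ordering concrete, here is a hedged sketch of a task-side class loader. The names and structure are illustrative assumptions, not the actual task-runner code, which assembles its classpath differently.

          {code:java}
          import java.net.URL;
          import java.net.URLClassLoader;

          public class TaskClasspathSketch {
            // With the lib classes pulled out of the system jars, listing the
            // user's lib jars before the mapreduce lib jar lets a user-supplied
            // copy of a library class shadow the stock one, while the parent
            // (the system/core jars) is still consulted first.
            public static ClassLoader build(URL[] userLibJars, URL mapreduceLibJar,
                                            ClassLoader systemLoader) {
              URL[] ordered = new URL[userLibJars.length + 1];
              System.arraycopy(userLibJars, 0, ordered, 0, userLibJars.length);
              ordered[userLibJars.length] = mapreduceLibJar; // searched last
              // Effective order: system-jars, user.libs, mapreduce.lib
              return new URLClassLoader(ordered, systemLoader);
            }
          }
          {code}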

          Tom White added a comment -

          This is related to MAPREDUCE-1453. Not all the old API library classes are in the o.a.h.mapred.lib package (e.g. TextInputFormat) - would these be included in the separate JAR too?

          Tom White added a comment -

          Here's a first cut at implementing this. The part about including the user's classpath first is covered in MAPREDUCE-1938. This patch creates a new source tree under src/lib for the MapReduce libraries, and the build creates a separate library jar. It compiles, but I haven't done any testing yet. A number of changes are needed to remove core's dependency on the libraries. Most are small, like removing dependencies on constants or configuration names, but here are some of the larger ones I made:

          • Introduce InputSplitCallback so MapTask doesn't have a dependency on FileSplit. Make mapred.FileSplit implement this interface so it can modify JobConf before the mapper is run.
          • Task depends on FileOutputCommitter. Push this code down into FileOutputCommitter implementations of OutputCommitter#setupTask.
          • MapTask, Task depend on the public WrappedMapper, WrappedReducer classes. These need to be constructed reflectively or have private duplicates made (I did the latter in this patch).
          • org.apache.hadoop.mapreduce.util.ConfigUtil reflectively calls the new org.apache.hadoop.mapreduce.lib.ConfigUtil so that the deprecated keys for the libraries are added (see the sketch after this comment).

          You need to run the move.sh script before applying the patch.
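
          As a rough illustration of the reflective call described in the last bullet: only the two class names come from the comment; the method name addDeprecatedKeys() and the error handling are assumptions.

          {code:java}
          public class LibConfigUtilLoader {
            // Core-side helper that initializes the lib-side ConfigUtil without
            // a compile-time dependency on the lib jar.
            public static void addLibDeprecatedKeys() {
              try {
                Class<?> libConfigUtil =
                    Class.forName("org.apache.hadoop.mapreduce.lib.ConfigUtil");
                libConfigUtil.getMethod("addDeprecatedKeys").invoke(null);
              } catch (ClassNotFoundException e) {
                // Lib jar not on the classpath: skip its deprecated-key mappings.
              } catch (Exception e) {
                throw new RuntimeException("Could not initialize lib ConfigUtil", e);
              }
            }
          }
          {code}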

          Scott Carey added a comment -

          This approach will work.

          I see another possibility that is simpler, but has some serious drawbacks.

          Instead of separating the source directories and classpaths at build time, file set includes/excludes could be used at jar creation time to create two jar files.
          Benefit: simpler, requires no code changes.
          Drawback: does not actually disentangle the code, which is more brittle and could lead to confusion depending on what changes are in the user lib jars.

          That is a big drawback.

          The above approach is more complicated but does enforce true separation.
          In the long run, tackling the disentangling of the lib from the system is a requirement but the extra complexity requires more careful review. It also is the bulk of the work required to make the lib a truly separate library – for maven or OSGi.

          I'm not an expert on the details of the API and code changes needed for clean separation as a consequence of this. The callback changes seem like an improvement. Duplication of wrappers and reflective access are usually signs that there is a better way, but that better way almost always requires an API break.

          Overall, I like this approach because it is moving in the right direction for separating the core execution from more volatile or user-extensible library code.

          Tom White added a comment -

          Thanks for taking a look, Scott. Using includes/excludes was my first approach, but the compiler pulls in the transitive closure of the classes, which means you get duplicates between the two jars.

          > Duplication of wrappers and reflective access are usually signs that there is a better way, but that better way almost always requires an API break.

          I'm open to suggestions for a better way. We could remove the reflective access by, e.g., defining a part of the library that is core and stays in the core jar. The wrapper duplication arose because we put the wrappers in lib but use them from core. I'm not sure how much of a problem the duplication is in practice, as it's all boilerplate delegation code.
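
          For readers who haven't looked at the wrappers, this is the shape of the "boilerplate delegation code" being duplicated. The interface and names are invented for illustration and are not the actual WrappedMapper/WrappedReducer API.

          {code:java}
          // Every method simply forwards to an underlying delegate, so a private
          // duplicate carries little logic that could drift out of sync.
          interface RecordSink {
            void write(String key, String value);
            void flush();
          }

          class DelegatingRecordSink implements RecordSink {
            private final RecordSink delegate;

            DelegatingRecordSink(RecordSink delegate) {
              this.delegate = delegate;
            }

            @Override
            public void write(String key, String value) {
              delegate.write(key, value);
            }

            @Override
            public void flush() {
              delegate.flush();
            }
          }
          {code}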


            People

            • Assignee: Unassigned
            • Reporter: Owen O'Malley
            • Votes: 1
            • Watchers: 10
