Hadoop Map/Reduce / MAPREDUCE-2001

Enhancement to SequenceFileOutputFormat to allow user to set MetaData

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat class currently does not provide a way for the user to pass in a MetaData object to be written to the SequenceFile.

      Currently, the only way for a developer to implement this functionality appears to be to create a subclass that overrides SequenceFileOutputFormat's getRecordWriter() method, which is a bit of a kludge.

      This seems to be a common enough request to warrant a fix of some sort. (It's already been brought up twice in the past year: http://www.mail-archive.com/common-user@hadoop.apache.org/msg02198.html and http://www.mail-archive.com/mapreduce-user@hadoop.apache.org/msg00904.html)

      A couple of possible solutions:

      1) Provide a static method SequenceFileOutputFormat.setMetaData(Job, MetaData)

      2) Provide a (non-static) setMetaData() method on the SequenceFileOutputFormat class. The user would create a subclass of SequenceFileOutputFormat which, say, implements Configurable. Then in the setConf() method, the user could create the MetaData object (using data from the Configuration), and then call setMetaData. The SequenceFileOutputFormat would then use this MetaData object when creating the SequenceFile. (Note that the user would have to create a subclass of SequenceFileOutputFormat to make this solution work.)
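Option 1 might look roughly like the following sketch. To keep it self-contained, a plain Map&lt;String, String&gt; stands in for Hadoop's Configuration/JobConf, and the configuration key name and helper methods are hypothetical, not part of any real Hadoop API; the real setMetaData would write into the job's Configuration so that getRecordWriter() could rebuild the metadata on the task side:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of option 1: a static setMetaData that stashes the metadata in the
// job configuration so tasks can recover it later. A HashMap stands in for
// Hadoop's Configuration; the key name below is made up for illustration.
public class MetadataSketch {
    // Hypothetical configuration key under which the metadata is stored.
    static final String METADATA_KEY = "seqfile.output.metadata";

    // Driver side: serialize key=value pairs into one configuration string.
    static void setMetaData(Map<String, String> conf, Map<String, String> meta) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : meta.entrySet()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        conf.put(METADATA_KEY, sb.toString());
    }

    // Task side: what getRecordWriter() would do to rebuild the metadata
    // before handing it to the SequenceFile.Writer.
    static Map<String, String> getMetaData(Map<String, String> conf) {
        Map<String, String> meta = new HashMap<>();
        String raw = conf.get(METADATA_KEY);
        if (raw == null || raw.isEmpty()) return meta;
        for (String pair : raw.split(",")) {
            String[] kv = pair.split("=", 2);
            meta.put(kv[0], kv[1]);
        }
        return meta;
    }
}
```

This naive comma/equals encoding breaks if keys or values contain ',' or '='; the escaping question comes up again later in the discussion.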

        Activity

        anurag tangri added a comment -

        Hi,
        Any update on this? We are facing the same issue as David, and I agree it is a common enough request.

        -Anurag

        Jim Donofrio added a comment -

        Added a static method setMetadata to both the mapred and mapreduce SequenceFileOutputFormats.

        Jim Donofrio added a comment -

        Any thoughts on the patch?

        I added a static method setMetadata on the mapred and mapreduce SequenceFileOutputFormat classes.

        Harsh J added a comment -

        Jim,

        How does this work? I don't see you serializing the mapping so that it propagates to tasks via the JobConf, etc.

        I personally like David's second suggested approach, where instead of overriding the whole getRecordWriter, one can simply extend a specific method and return any metadata they want (and have it executed at task runtime, instead of in the frontend).

        Jim Donofrio added a comment -

        I just added a simple static method setMetadata, which then gets passed to the SequenceFile.Writer constructor. The users would then, in their mapper or reducer configure or setup method, call SequenceFileOutputFormat.setMetadata with the appropriate metadata object that they would create.

        I like your idea better, though: we add some predefined constant to the JobConf and then store the key/value pairs as comma-separated pairs. Then we make SequenceFileOutputFormat JobConfigurable so that ReflectionUtils.newInstance will call configure on it and load the metadata. The user can then set metadata in their driver class or in the mapper or reducer, and doesn't have to subclass SequenceFileOutputFormat. I think we should avoid users having to subclass SequenceFileOutputFormat. Thoughts?

        Harsh J added a comment -

        > The users would then, in their mapper or reducer configure or setup method, call SequenceFileOutputFormat.setMetadata with the appropriate metadata object that they would create.

        The problem exposed by this approach hits upon a possible inconsistency/bug in the framework:

        Record Writer Instantiation:

                        Old API         New API
        Map Task        Before Mapper   After Mapper
        Reduce Task     After Reducer   After Reducer

        See MapTask.java/ReduceTask.java in 1.x, for instance, the run{Old/New}{Mapper/Reducer} methods. This has been so for a very long time, and I do think changing it may break the behavior of several users out there, including some of the code I've written at my former workplace. Though yeah, it's highly strange that no spec doc exists for this; we ought to add one via another JIRA.

        Hence the mapper.configure() approach with a static method would unfortunately fail on old-API runs, for map-only jobs.

        > Then we make the SequenceFileOutputFormat JobConfigurable so that ReflectionUtils.newInstance will call configure on it and load the metadata.

        I imagine this working in a much better way. New-API users may still be able to sneak in changes per map/reduce task, and otherwise (on the old API) rely on the driver to provide these up front.

        > I think we should avoid users having to subclass SequenceFileOutputFormat. Thoughts?

        Agreed, given your new approach via the jobconf. Let's also make sure we serialize with base64 encoding or so, to allow for special chars in metadata if users wish it (because job.xml dislikes special chars).
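The base64 step Harsh suggests could look like this sketch, using plain java.util.Base64. The surrounding JobConf plumbing is omitted, and the class/method names here are made up for illustration; the point is only that encoded values survive job.xml and the comma-separated pair format untouched:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: base64-encode metadata values before they go into job.xml, so
// special characters (newlines, '<', '&', and the ','/'=' used as pair
// separators) survive the round trip through the configuration.
public class Meta64 {
    static String encode(String value) {
        return Base64.getEncoder()
            .encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded),
            StandardCharsets.UTF_8);
    }
}
```

Base64 output is restricted to [A-Za-z0-9+/=], so it can never collide with the ',' and '=' pair delimiters used in the serialized form (padding '=' only appears at the end of a value, so splitting on the first '=' per pair, or encoding keys too, avoids ambiguity).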

        Jim Donofrio added a comment -
            if (numReduceTasks > 0) {
              collector = new MapOutputBuffer(umbilical, job, reporter);
            } else { 
              collector = new DirectMapOutputCollector(umbilical, job, reporter);
            }
            MapRunnable<INKEY,INVALUE,OUTKEY,OUTVALUE> runner =
              ReflectionUtils.newInstance(job.getMapRunnerClass(), job);
        

        Could you give an example of what user code would break if we moved ReflectionUtils.newInstance(job.getMapRunnerClass(), job); above the if (numReduceTasks > 0) block? Why would the configure method of the mapper care whether the recordwriter/outputformat had been created yet? I would think we would want the recordwriter/outputformat to get configured after the configure method, to allow tasks to make task-level config changes to a recordwriter/outputformat.

        > I imagine this working in a much better way. For new API users, they may still be able to sneak in changes per map/reduce task, and otherwise (on Old API) rely on driver to provide these up.

        I am confused by this comment: do you agree with my approach, or are you just disappointed that the behavior will be inconsistent between the old and new API for map-only jobs?

        Harsh J added a comment -

        > Why would the configure method of the mapper care if the recordwriter/outputformat had been created yet?

        It doesn't care on its own, but advanced users may be relying on this 'behavior' to do stuff. For instance, I once relied on this behavior to have my output format inject a few strings into the jobconf upon RW instantiation (some logic dependent on input format initialization that goes even before this), such that I then got the set strings in my mapper's configure. True, I was probably doing something wrong, and what I did can be done in a better/alternate way, but I ended up relying on that behavior, and that's what I'm talking about (in terms of breakage).

        > I would think we would want the recordwriter/outputformat to get configured after the configure method to allow tasks to make task level config changes to a recordwriter/outputformat

        True. I just don't know why it's this way in the old API. Probably an oversight.

        > I am confused by this comment, do you agree with my approach or are you just disappointed that the behavior will be inconsistent between the old and new api for map only jobs?

        Sorry for the confusion. It's just the latter. I agree with your approach.

        Jim Donofrio added a comment -

        OK, I will go with this approach. I will document that the behavior for old-API mappers is inconsistent.

        liu yu added a comment -

        When you run into this problem, you need to check your reduce method.
        Its values parameter should be 'Iterable<VALUEIN>', not 'Iterator<VALUEIN>'.
        The latter won't override the reduce(...) method of Reducer.
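The trap being described is overload versus override: a reduce(...) declared with Iterator is a new method, not an override, so the framework still calls Reducer's default identity reduce. A minimal self-contained illustration of the same trap, with no Hadoop classes (Base and Sub are made-up names standing in for Reducer and a user subclass):

```java
import java.util.Iterator;
import java.util.List;

// Overload-vs-override trap: Base.handle takes an Iterable. A subclass
// method taking an Iterator is a different method entirely and is never
// called through a Base reference -- exactly what happens with a Reducer
// whose reduce(...) mistakenly declares Iterator<VALUEIN>.
class Base {
    String handle(Iterable<String> values) { return "base"; }
}

class Sub extends Base {
    // Compiles fine, but does NOT override handle(Iterable). Adding
    // @Override here would turn the mistake into a compile error.
    String handle(Iterator<String> values) { return "sub"; }
}
```

Calling handle with an Iterable through a Sub instance still runs Base's version; annotating intended overrides with @Override catches this at compile time.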


          People

          • Assignee: Unassigned
          • Reporter: David Rosenstrauch
          • Votes: 2
          • Watchers: 9
