HADOOP-4927 (Hadoop Common)

Part files on the output filesystem are created irrespective of whether the corresponding task has anything to write there

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      All output part files are created regardless of whether the corresponding task has output.

      Description

      When OutputFormat.getRecordWriter is invoked, a part file is created on the output filesystem. But the created RecordWriter is not used until the OutputCollector.collect call is made by the task (user's code). This results in empty part files even if the OutputCollector.collect is never invoked by the corresponding tasks.
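The fix that eventually landed wraps the real RecordWriter so that the part file is only created when the first record is written. A minimal, self-contained sketch of the idea (the class names and the file counter are illustrative stand-ins, not the actual Hadoop API):

```java
// Minimal stand-in for Hadoop's RecordWriter; illustrative only.
interface RecordWriter {
    void write(String key, String value);
    void close();
}

// Eager writer: "creates the part file" in its constructor,
// which is what produces empty files for tasks with no output.
class EagerWriter implements RecordWriter {
    static int filesCreated = 0;
    EagerWriter() { filesCreated++; }
    public void write(String key, String value) {}
    public void close() {}
}

// Lazy wrapper: the real writer (and hence the file) is created
// only on the first write(); closing an untouched writer is a no-op.
class LazyWriter implements RecordWriter {
    private RecordWriter real; // null until the first record
    public void write(String key, String value) {
        if (real == null) real = new EagerWriter();
        real.write(key, value);
    }
    public void close() {
        if (real != null) real.close();
    }
}

public class LazyCreationSketch {
    public static void main(String[] args) {
        RecordWriter idle = new LazyWriter();
        idle.close(); // task wrote nothing: no file is created
        System.out.println("after idle task: " + EagerWriter.filesCreated);

        RecordWriter busy = new LazyWriter();
        busy.write("k", "v"); // first record triggers file creation
        busy.close();
        System.out.println("after busy task: " + EagerWriter.filesCreated);
    }
}
```

Run alone, this prints a count of 0 after the idle task and 1 after the busy one: only the task that actually emitted a record created a file.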

      Attachments

      1. hadoop-4927-y20.patch
        34 kB
        Jothi Padmanabhan
      2. hadoop-4927-v6.patch
        34 kB
        Jothi Padmanabhan
      3. hadoop-4927-v5.patch
        34 kB
        Jothi Padmanabhan
      4. hadoop-4927-v4.patch
        33 kB
        Jothi Padmanabhan
      5. hadoop-4927-v3.patch
        24 kB
        Jothi Padmanabhan
      6. hadoop-4927-v2.patch
        18 kB
        Jothi Padmanabhan
      7. hadoop-4927-v1.patch
        17 kB
        Jothi Padmanabhan
      8. hadoop-4927.patch
        15 kB
        Jothi Padmanabhan

        Activity

        Robert Chansler added a comment -

        Editorial pass over all release notes prior to publication of 0.21.

        Jothi Padmanabhan added a comment -

        Patch for the Y! 20 distribution.

        Hudson added a comment -

        Integrated in Hadoop-trunk #763 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/763/)
        HADOOP-4927. Adds a generic wrapper around OutputFormat to allow creation of output on demand. Contributed by Jothi Padmanabhan.

        Devaraj Das added a comment -

        I just committed this. Thanks, Jothi!

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12400482/hadoop-4927-v6.patch
        against trunk revision 745934.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 14 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/console

        This message is automatically generated.

        Jothi Padmanabhan added a comment -

        Incorporated the review comment to check for nulls in a private method

        Doug Cutting added a comment -

        A minor nit: in FilterOutputFormat, the checks for a null baseOut and null rawWriter might be better factored into private getBaseOut() and getRawWriter methods, so that the body of most filter methods is just one line, e.g., 'return getBaseOut().getRecordWriter(...);'. Other than that, +1.
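Doug's nit, sketched: centralize the null check in one private accessor so each delegating method stays a one-liner. The types below are simplified stand-ins for the real Hadoop classes, so treat this as shape only, not the committed patch:

```java
// Simplified stand-ins for the Hadoop types; illustrative only.
interface OutputFormat {
    String getRecordWriter(String name);
}

class BaseOutputFormat implements OutputFormat {
    public String getRecordWriter(String name) { return "writer:" + name; }
}

class FilterOutputFormat implements OutputFormat {
    private OutputFormat baseOut;

    void setBaseOut(OutputFormat out) { this.baseOut = out; }

    // The null check lives here once ...
    private OutputFormat getBaseOut() {
        if (baseOut == null) {
            throw new IllegalStateException("base OutputFormat not set");
        }
        return baseOut;
    }

    // ... so the body of each filter method is a single line.
    public String getRecordWriter(String name) {
        return getBaseOut().getRecordWriter(name);
    }
}

public class FilterSketch {
    public static void main(String[] args) {
        FilterOutputFormat f = new FilterOutputFormat();
        f.setBaseOut(new BaseOutputFormat());
        System.out.println(f.getRecordWriter("part-00000"));
    }
}
```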

        Jothi Padmanabhan added a comment -

        The test that timed out org.apache.hadoop.mapred.TestTaskLimits.testTaskLimits passed on my local machine and this test is unrelated to the patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12400180/hadoop-4927-v5.patch
        against trunk revision 744406.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 14 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/console

        This message is automatically generated.

        Jothi Padmanabhan added a comment -

        Testpatch and ant test passed on my local box with this patch

        Jothi Padmanabhan added a comment -

        Synched the patch with trunk

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12400110/hadoop-4927-v4.patch
        against trunk revision 744000.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 14 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3844/console

        This message is automatically generated.

        Jothi Padmanabhan added a comment -
        • Patch that implements FilterOutputFormat and FilterRecordWriter that LazyOutputFormat and LazyRecordWriter extend.
        • Pipes and Streaming support the -lazyOutput flag
        • Added test cases for both mapred and mapreduce packages
        Doug Cutting added a comment -
        • LazyOutputFormat should keep a field for the nested output format, not create it again for each call, no?
        • We might implement generic FilterOutputFormat and FilterRecordWriter that LazyOutputFormat and LazyRecordWriter extend. This is probably not the last time someone will need to wrap an OutputFormat or a RecordWriter.
        • JobConf#setOutputFormatClass(Class, boolean) should instead be static LazyOutputFormat#setClass(Job, Class, boolean). This localizes the change, and it's still a one-line change for applications.
        • Similarly, JobContext#getLazyOutputFormatClass() should instead be static LazyOutputFormat#getClass(JobContext). This feature can be entirely contained in LazyOutputFormat and should not require changes to the kernel.
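The pattern Doug describes in the last two points, static helpers on the wrapper so that only the library class knows the configuration key, can be sketched with a map-backed stand-in for the job configuration. The property name below is hypothetical, chosen for illustration, and is not necessarily the key the committed patch used:

```java
import java.util.HashMap;
import java.util.Map;

// Map-backed stand-in for a Hadoop job configuration; illustrative only.
class Conf {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}

// All knowledge of the config key stays inside the wrapper class,
// so the framework ("kernel") needs no changes.
class LazyOutputFormat {
    // Hypothetical property name, for illustration only.
    private static final String BASE_KEY = "lazyoutputformat.baseoutputformat.class";

    static void setOutputFormatClass(Conf conf, Class<?> theClass) {
        conf.set(BASE_KEY, theClass.getName());
    }

    static String getOutputFormatClassName(Conf conf) {
        return conf.get(BASE_KEY);
    }
}

public class ConfigSketch {
    public static void main(String[] args) {
        Conf conf = new Conf();
        // Still a one-line change for applications, as Doug notes:
        LazyOutputFormat.setOutputFormatClass(conf, String.class);
        System.out.println(LazyOutputFormat.getOutputFormatClassName(conf));
    }
}
```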
        Jothi Padmanabhan added a comment -

        Attaching patch to elicit comments. This patch implements a wrapper for OutputFormat. It also handles the streaming case by supporting a flag "-lazyOutput". Pipes still needs to be handled.

        Devaraj Das added a comment -

        It might make sense to add a new API, job.setOutputFormat(OutputFormat, boolean lazy). That would be shorthand for writing the two lines explicitly. Also, we need to handle the streaming/pipes case (where "lazyOutputCreation" could probably be taken as a command-line argument).

        Doug Cutting added a comment -

        > LazyOutputFormat.set(actualoutputformat.class) and
        > job.setOutputFormat(LazyOutputFormat.class)

        Right. That's the two-line penalty of a wrapper. If we built it into FileOutputFormat then it would only take one line:

        FileOutputFormat.setLazyOutput(true);

        but it would then also only work for subclasses of FileOutputFormat, rather than any OutputFormat implementation. This is a tough call, since most, but not all, OutputFormats do subclass FileOutputFormat. I'm leaning towards the wrapper, since, while a bit more complex for users, it is a cleaner layering, making FileOutputFormat less of a kitchen-sink of features.

        Jothi Padmanabhan added a comment -

        Sorry, we set the class objects in setOutputFormat, so it would be

        LazyOutputFormat.set(actualoutputformat.class) and
        job.setOutputFormat(LazyOutputFormat.class)

        Jothi Padmanabhan added a comment -

        So, if I understand this correctly, we have a new wrapper OutputFormat class

        
        class LazyOutputFormat {
          OutputFormat rawOutputFormat;

          LazyOutputFormat(OutputFormat rawOF) {
            this.rawOutputFormat = rawOF;
          }

          RecordWriter getRecordWriter(...) {
            return new LazyRecordWriter(...);
          }
        }
        
        

        Users will then do

        job.setOutputFormat(new LazyOutputFormat(actualOutputFormat));
        

        The other option would be

        FileOutputFormat.setLazy(true);
        
        JobConf::getOutputFormat() {
          OutputFormat o = ...; // the configured output format
          if (o instanceof FileOutputFormat && lazyFlag) {
            return new LazyOutputFormat(o);
          } else {
            return o;
          }
        }
        
        
        Doug Cutting added a comment -

        > Unless there's a non-FileOutputFormat use case [ ... ]

        I see Chris's point and agree. Unless there's a strong reason to put features in the kernel we should prefer to put them in library code, keeping the kernel minimal. Are there non-FileOutputFormats that need this feature?

        A wrapper implementation is a bit harder to use, since folks would need to both set the job's outputformat to the wrapper, and set the wrapper's parameter to the real output format: two changes instead of just setting a single parameter, although it is more generic. We could perhaps implement both: a flag for FileOutputFormat and a wrapper OutputFormat for folks who've not subclassed FileOutputFormat?

        Chris Douglas added a comment -

        > this feature is a sort of generic across the different output formats and having the framework support this would be useful.

        True. Still, while it is generic functionality, it's neither difficult nor inefficient in user-space. Absent either of the latter criteria, putting it into the framework seems premature, at least. If this should be abstracted, wouldn't it make sense as an OutputFormat in lib? That seems no less brittle than using a framework configuration variable and I've difficulty seeing this setting scoped to the whole cluster...

        > I am okay with the current patch in terms of the behavior and the framework support it adds for lazy creation of the destination output "something" (where "something" is easy to explain when interpreted as a file).

        Unless there's a non-FileOutputFormat use case that's also easy to explain, it remains unclear that this is the correct abstraction. Creating a setting on FileOutputFormat seems like a good idea. Whether that's implemented in FileOutputFormat only or via an OutputFormat wrapper class depends on how generally applicable the abstraction is. Since its motivation is expensive clutter in HDFS, it's not obvious to me that the latter is justified, let alone tight integration with the framework.

        Devaraj Das added a comment -

        I am okay with the current patch in terms of the behavior and the framework support it adds for lazy creation of the destination output "something" (where "something" is easy to explain when interpreted as a file).

        Jothi Padmanabhan added a comment -

        True, this feature can be achieved by modifying RecordWriter implementations. But that would mean people who have their own implementations would need to add it themselves; see Devaraj's comment. Also, this feature is generic across the different output formats, and having the framework support it would be useful. No?

        Chris Douglas added a comment -

        This feature doesn't require framework changes, does it? Configuring FileOutputFormat or writing an OutputFormat that lazily creates its files should be sufficient.

        Jothi Padmanabhan added a comment -

        Patch incorporating review comments

        Doug Cutting added a comment -

        A few nits:

        • we need a public setter method for lazy file creation, setLazyOutput(boolean). This should probably go on mapred.JobConf and mapreduce.Job.
        • this does not need to be in hadoop-default.xml. That file is for options that are configured site-wide in hadoop-site.xml, not per-job parameters.
        Jothi Padmanabhan added a comment -

        Doug, could you please review this patch. Thank you.

        Jothi Padmanabhan added a comment -

        The test failure, org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestStartAtOffset.testStartAfterOffset, is not related to this patch

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12398803/hadoop-4927-v1.patch
        against trunk revision 737944.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/console

        This message is automatically generated.

        Jothi Padmanabhan added a comment -

        Patch incorporating review comments

        Jothi Padmanabhan added a comment -

        Canceling patch to incorporate review comments

        Doug Cutting added a comment -

        The class might better be called LazyRecordWriter; I'd prefer that we used no wrapper when lazy output creation is disabled; and we should add a static method to access the config property. Putting these together, we might have a method that looks like:

        static RecordWriter ReduceTask#createRecordWriter(JobConf job, ...) {
          if (ReduceTask.useLazyOutputCreation(job)) {
            return new LazyRecordWriter(...);
          } else {
            return ...;
          }
        }
        
        Jothi Padmanabhan added a comment -

        This patch implements lazy file creation by wrapping the RecordWriter functionality in a wrapper class and instantiating it only on a call to output.collect.

        Devaraj Das added a comment -

        Resetting the Fix Version to 0.21

        Tsz Wo Nicholas Sze added a comment -

        I was setting the "component/s" to mapred but did not intend to change the "issue type". Changing this back to "New Feature".

        Tsz Wo Nicholas Sze added a comment -

        > I believe some users did mention that the feature of having exactly N output files is useful.

        I also believe it is useful in some cases, especially when all output files are empty (ah, you may argue that the entire job is not useful in this case). However, it is costly to maintain empty files in HDFS, and I believe it is USELESS in many cases. Could we have an option for not creating them, or for cleaning them up?

        Doug Cutting added a comment -

        Changing this from a bug to a feature request. It seems reasonable for FileOutputFormat to support a mode where files are created lazily when the first record is written.

        > Out of 30 million files/dirs, 4.5 million part- files were empty. 40 users having more than 10,000 empty files.

        It sounds like there's also perhaps another problem here. Are these folks perhaps specifying way too many reduces? For jobs with lots of empty files, how many non-empty files are there, and how big are they?

        Koji Noguchi added a comment -

        On one of our clusters, I counted the number of empty "part-" files.

        Out of 30 million files/dirs, 4.5 million part- files were empty. 40 users having more than 10,000 empty files.

        > If you specify N output partitions then you should generate N output files,

        I believe some users did mention that the feature of having exactly N output files is useful.

        If we could somehow make the no-empty-part-files feature configurable, it would ease our support work a lot.
        (Instead of asking our users to implement a custom OutputFormat, I could just ask them to set a jobconf property.)
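        The jobconf switch asked for here might look something like the sketch below. Everything in it is hypothetical: the property name "mapreduce.output.lazy" is invented for illustration, and java.util.Properties stands in for Hadoop's JobConf; the real knob, if one is added, may differ.

```java
import java.util.Properties;

// Hypothetical per-job switch for lazy part-file creation.
class LazyOutputSwitch {
    static final String KEY = "mapreduce.output.lazy"; // invented name

    // Framework side: decide whether to defer part-file creation.
    // Defaults to false, i.e. the current eager behavior.
    static boolean isLazy(Properties jobConf) {
        return Boolean.parseBoolean(jobConf.getProperty(KEY, "false"));
    }

    // User side: opt in from the job configuration instead of
    // writing a custom OutputFormat.
    static void enable(Properties jobConf) {
        jobConf.setProperty(KEY, "true");
    }
}
```

        Defaulting the flag to off preserves the exactly-N-output-files behavior that some users rely on, while letting others opt out of the empty files.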

        Doug Cutting added a comment -

        I'm not convinced this is a bug. If you specify N output partitions then you should generate N output files, even if some of them are empty, no? One could write an OutputFormat that lazily creates its output files, but that's not the contract of FileOutputFormat.

        Devaraj Das added a comment -

        Okay, so I figured out that I was referring to the old MapReduce API.
        There seem to be two approaches anyway. For the old API:
        Today, the getRecordWriter calls relevant to the tasks are made in two places - in DirectMapOutputCollector (in the constructor) and in ReduceTask.java (just before starting to call the user's reduce method). We can probably move the calls into the respective OutputCollector.collect implementations:

        if (out == null) {
          out = job.getOutputFormat().getRecordWriter(fs, job, finalName, reporter);
        }
        

        For the new API, I am not yet sure what the good approach is. Maybe we could delay creating the recordwriter until TaskInputOutputContext.write is invoked.

        The other approach is to delay the creation of the files on the output filesystem, until it is necessary, in the respective RecordWriter implementations. But this requires users (who have implemented record writers or will implement them in the future) to be aware of such a change, and is thus vulnerable to problems.

        Thoughts?
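        The first (old-API) approach above can be fleshed out into a self-contained sketch. The types below are simplified stand-ins for DirectMapOutputCollector, RecordWriter, and OutputFormat.getRecordWriter, so the names are illustrative only:

```java
import java.io.IOException;

// Simplified stand-ins for the real Hadoop types (illustrative only).
interface PartFileWriter {
    void write(String key, String value) throws IOException;
    void close() throws IOException;
}
interface PartFileFactory {
    PartFileWriter getRecordWriter() throws IOException; // opens the part file
}

// Old-API approach: the collector holds no writer until the user's code
// actually emits a record, mirroring the "if (out == null)" snippet above.
class DirectCollectorSketch {
    private final PartFileFactory outputFormat;
    private PartFileWriter out; // stays null if collect() is never called

    DirectCollectorSketch(PartFileFactory outputFormat) {
        this.outputFormat = outputFormat;
    }

    void collect(String key, String value) throws IOException {
        if (out == null) {
            out = outputFormat.getRecordWriter(); // part file opened only here
        }
        out.write(key, value);
    }

    void close() throws IOException {
        if (out != null) {
            out.close(); // never opened => no empty part file to close
        }
    }
}
```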


          People

          • Assignee: Jothi Padmanabhan
          • Reporter: Devaraj Das
          • Votes: 0
          • Watchers: 5
