Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3130

Output files are not moved to job output directory after task completion

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.204.0
    • Fix Version/s: None
    • Component/s: task
    • Labels:
      None
    • Environment:

      NA

      Description

      When MultipleOutputs (OutputFormat set to TextOutputFormat) is used to write output files to different namedOutputFiles, turned off normal output file generation(i.e.part-r-<#reduce>) by setting OutputFormat as NullOutputFormat, task output is not moved to job output directory (${mapred.output.dir}).

      After the job completion, found output files in ${mapred.output.dir}/_temporary/_attempt-<jobid>-<#reduce> (task output directory) rather than in the ${mapred.output.dir} (job output dir)

        Issue Links

          Activity

          Hide
          sho.shimauchi Sho Shimauchi added a comment -

          +1 for 3. Most of users read only arguments list and return in API docs. 3 makes more sense for users because user doesn't expect NullOutputFormat creates a new file.

          Show
          sho.shimauchi Sho Shimauchi added a comment - +1 for 3. Most of users read only arguments list and return in API docs. 3 makes more sense for users because user doesn't expect NullOutputFormat creates a new file.
          Hide
          bvandehey Robert Vandehey added a comment -

          I'd vote for option #3. Option #2 could have regression issues where someone depends on a file always being created. Option #1 isn't really a solution just providing more guidance through documentation. Option #3 is the logical solution since the expectation is NullOutputFormat is not producing any output but there is still an expectation of cleaning up after committing.

          Show
          bvandehey Robert Vandehey added a comment - I'd vote for option #3. Option #2 could have regression issues where someone depends on a file always being created. Option #1 isn't really a solution just providing more guidance through documentation. Option #3 is the logical solution since the expectation is NullOutputFormat is not producing any output but there is still an expectation of cleaning up after committing.
          Hide
          qwertymaniac Harsh J added a comment -

          We could do one of these few things to address this:

          1. In new API MultipleOutputs class, we may document to not use NullOutputFormat but instead use plain FileOutputFormat.
          2. In new API FileOutputFormat, we can change the behavior of FileOutputFormat to never create blank empty files. That is, open a file if if and only if at least one KV is passed to the RecordReader, not upon construction of RecordReader.
          3. The NullOutputFormat should instead derive out of FileOutputFormat with a blank record reader as shown above. This wouldn't cause any issues (would be similar to a job that hasn't produced any output at all, but without empty files).

          I'd prefer (3) or (2) - in that order. Thoughts?

          Show
          qwertymaniac Harsh J added a comment - We could do one of these few things to address this: 1. In new API MultipleOutputs class, we may document to not use NullOutputFormat but instead use plain FileOutputFormat. 2. In new API FileOutputFormat, we can change the behavior of FileOutputFormat to never create blank empty files. That is, open a file if if and only if at least one KV is passed to the RecordReader, not upon construction of RecordReader. 3. The NullOutputFormat should instead derive out of FileOutputFormat with a blank record reader as shown above. This wouldn't cause any issues (would be similar to a job that hasn't produced any output at all, but without empty files). I'd prefer (3) or (2) - in that order. Thoughts?
          Hide
          qwertymaniac Harsh J added a comment -

          One workaround to use when wanting a NullOutputFormat without this kinda issues (i.e. one that works with MultipleOutputs and such) is to subclass FileOutputFormat and dumb it down:

          public static class DummyOutputFormat<K,V> extends FileOutputFormat<K,V> {
              @Override
              public void checkOutputSpecs(JobContext job)
                  throws FileAlreadyExistsException, IOException {
                  // Do not check for existing out-dir.
              }
          
              @Override
              public RecordWriter<K,V>
                  getRecordWriter(TaskAttemptContext context) {
                      return new RecordWriter<K,V>(){
                          public void write(K key, V value) {}
                          public void close(TaskAttemptContext context) {}
                      };
              }
          }
          
          Show
          qwertymaniac Harsh J added a comment - One workaround to use when wanting a NullOutputFormat without this kinda issues (i.e. one that works with MultipleOutputs and such) is to subclass FileOutputFormat and dumb it down: public static class DummyOutputFormat<K,V> extends FileOutputFormat<K,V> { @Override public void checkOutputSpecs(JobContext job) throws FileAlreadyExistsException, IOException { // Do not check for existing out-dir. } @Override public RecordWriter<K,V> getRecordWriter(TaskAttemptContext context) { return new RecordWriter<K,V>(){ public void write(K key, V value) {} public void close(TaskAttemptContext context) {} }; } }
          Hide
          qwertymaniac Harsh J added a comment -

          This is cause of the OutputCommitter implementation of NullOutputFormat. This is sort of a regression from the stable API, caused by tighter integration between the o/f and o/c as displayed on MAPREDUCE-2493.

          Show
          qwertymaniac Harsh J added a comment - This is cause of the OutputCommitter implementation of NullOutputFormat. This is sort of a regression from the stable API, caused by tighter integration between the o/f and o/c as displayed on MAPREDUCE-2493 .
          Hide
          kam_iitkgp Bhallamudi Venkata Siva Kamesh added a comment -

          When we set OutputFormat as NullOutputFormal, after the completion of task's execution, task checks, whether it needs to commit the task output. If commit is required, all the output files will be moved from task output dir to job output dir and after that _temporary directory will be deleted. But for NullOutputFormat, commit is not required, and hence output files remains in the _temporary folder itself.

          Show
          kam_iitkgp Bhallamudi Venkata Siva Kamesh added a comment - When we set OutputFormat as NullOutputFormal, after the completion of task's execution, task checks, whether it needs to commit the task output. If commit is required, all the output files will be moved from task output dir to job output dir and after that _temporary directory will be deleted. But for NullOutputFormat, commit is not required, and hence output files remains in the _temporary folder itself.

            People

            • Assignee:
              Unassigned
              Reporter:
              kam_iitkgp Bhallamudi Venkata Siva Kamesh
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Development