Hadoop Map/Reduce / MAPREDUCE-1853

MultipleOutputs does not cache TaskAttemptContext


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.21.0, 0.22.0
    • Fix Version/s: 0.21.0, 0.22.0
    • Component/s: task
    • Labels: None
    • Environment: OSX 10.6, java6
    • Hadoop Flags: Reviewed

    Description

      In MultipleOutputs there is:

       private TaskAttemptContext getContext(String nameOutput) throws IOException {
          // The following trick leverages the instantiation of a record writer via
          // the job thus supporting arbitrary output formats.
          Job job = new Job(context.getConfiguration());
          job.setOutputFormatClass(getNamedOutputFormatClass(context, nameOutput));
          job.setOutputKeyClass(getNamedOutputKeyClass(context, nameOutput));
          job.setOutputValueClass(getNamedOutputValueClass(context, nameOutput));
          TaskAttemptContext taskContext = 
            new TaskAttemptContextImpl(job.getConfiguration(), 
                                       context.getTaskAttemptID());
          return taskContext;
        }
      

      So for every reduce() call it creates a new Job instance, which in turn creates a new LocalJobRunner. That does not sound like a good idea.

      You end up with a flood of "jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized" messages.
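      The natural fix is to build the TaskAttemptContext once per named output and reuse it on subsequent calls. The following is a minimal, self-contained sketch of that memoization pattern (not the attached patch itself): ContextCache and ExpensiveContext are hypothetical stand-ins for MultipleOutputs and TaskAttemptContextImpl, whose construction is what drags in a new Job and LocalJobRunner each time.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextCache {
    static int constructions = 0;

    // Stand-in for TaskAttemptContextImpl; constructing it represents
    // the costly path (new Job, new LocalJobRunner) in getContext().
    static final class ExpensiveContext {
        final String nameOutput;
        ExpensiveContext(String nameOutput) {
            this.nameOutput = nameOutput;
            constructions++; // count how often the costly path runs
        }
    }

    private final Map<String, ExpensiveContext> taskContexts = new HashMap<>();

    // Analogue of MultipleOutputs.getContext(String): build the context
    // once per named output, then return the cached instance.
    ExpensiveContext getContext(String nameOutput) {
        return taskContexts.computeIfAbsent(nameOutput, ExpensiveContext::new);
    }

    public static void main(String[] args) {
        ContextCache cache = new ContextCache();
        for (int i = 0; i < 1000; i++) {   // simulate 1000 reduce() calls
            cache.getContext("text");
            cache.getContext("seq");
        }
        // Only two contexts are ever constructed, not 2000.
        System.out.println(constructions); // → 2
    }
}
```

      With the cache in place, the JvmMetrics re-initialization warning can only be triggered once per named output rather than once per record.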

      This should probably also be added to 0.22.

      Attachments

        1. cache-task-attempts.diff
          2 kB
          Torsten Curdt


            People

              Assignee: tcurdt Torsten Curdt
              Reporter: tcurdt Torsten Curdt
              Votes: 0
              Watchers: 2
