MAPREDUCE-6351 (Hadoop Map/Reduce)

Reducer hung in copy phase.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels: None

    Description

      Problem
      The reducer gets stuck in the copy phase and makes no progress for a very long time. After the task has been killed manually a couple of times, it completes.

      Observations

      • Verified the GC logs. Found no memory-related issues. The logs are attached.
      • Verified the thread dumps. Found no thread-related problems.
      • The logs show that the fetcher threads are not copying map outputs; they are just waiting for a merge to happen.
      • The merge thread is alive and in the wait state.

      Analysis
      On careful examination of the logs, thread dumps, and code, this looks to me like a classic multi-threading issue: the merge thread goes into the wait state after the notification has already been sent.

      Here is the suspect code flow.
      Thread #1
      Fetcher thread - notification comes first
      org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)

            synchronized(pendingToBeMerged) {
              pendingToBeMerged.addLast(toMergeInputs);
              pendingToBeMerged.notifyAll();
            }
      

      Thread #2
      Merge Thread - goes to wait state (Notification goes unconsumed)
      org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()

              synchronized (pendingToBeMerged) {
                while(pendingToBeMerged.size() <= 0) {
                  pendingToBeMerged.wait();
                }
                // Pickup the inputs to merge.
                inputs = pendingToBeMerged.removeFirst();
              }
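
The two code paths above can be exercised in isolation with a minimal sketch like the one below. It is not the Hadoop implementation: the class `GuardedWaitSketch`, the methods `takeNext` and `notifyThenTake`, and the `List<String>` element type are hypothetical stand-ins, and only the `synchronized` / `notifyAll` / `while ... wait` shape is taken from the snippets. The sketch forces the suspected ordering (notification first, consumer second) and reports whether the consumer still picks up the queued inputs.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class GuardedWaitSketch {
    // Hypothetical stand-in for MergeThread.pendingToBeMerged (element type simplified).
    private final LinkedList<List<String>> pendingToBeMerged = new LinkedList<>();

    // Same shape as the startMerge snippet: enqueue under the lock, then notify.
    public void startMerge(List<String> toMergeInputs) {
        synchronized (pendingToBeMerged) {
            pendingToBeMerged.addLast(toMergeInputs);
            pendingToBeMerged.notifyAll();
        }
    }

    // Same shape as the run() snippet: wait only while the queue is empty.
    public List<String> takeNext() throws InterruptedException {
        synchronized (pendingToBeMerged) {
            while (pendingToBeMerged.size() <= 0) {
                pendingToBeMerged.wait();
            }
            return pendingToBeMerged.removeFirst();
        }
    }

    // Forces the "notification comes first" ordering, then starts the consumer.
    // Returns the inputs the consumer picked up, or null if the consumer was
    // still blocked after timeoutMillis.
    public static List<String> notifyThenTake(long timeoutMillis) throws InterruptedException {
        GuardedWaitSketch sketch = new GuardedWaitSketch();
        AtomicReference<List<String>> picked = new AtomicReference<>();

        // Producer runs to completion before any thread is waiting.
        sketch.startMerge(Arrays.asList("map_0", "map_1"));

        Thread merger = new Thread(() -> {
            try {
                picked.set(sketch.takeNext());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.setDaemon(true);
        merger.start();
        merger.join(timeoutMillis);
        if (merger.isAlive()) {
            merger.interrupt(); // consumer is still blocked in wait()
        }
        return picked.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> picked = notifyThenTake(2000);
        System.out.println(picked == null
                ? "consumer still waiting"
                : "consumer picked up " + picked);
    }
}
```

In this simplified version the consumer re-checks the queue size under the same lock before it waits, so an entry enqueued before the consumer starts is observed through the queue state rather than through the notification. The sketch only covers these two snippets; it says nothing about other locks or conditions in the real fetcher and merge code that might be involved in the hang.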
      

      Attachments

        1. jstat-gc.log (0.7 kB, Laxman)
        2. reducer-container-partial.log.zip (1.15 MB, Laxman)
        3. thread-dumps.out (246 kB, Laxman)

          People

            lakshman Laxman
            Votes: 0
            Watchers: 5
