XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Scale cluster has run for months with 2.6.0.
      2 of 200 reduces hang on shuffle

      instance 1 log seems like loop on 1 map output:

      2015-08-06 21:54:14,649 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#1
      2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=20&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
      2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager returned status WAIT ...
      2015-08-06 21:54:14,651 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#1 in 2ms
      2015-08-06 21:54:14,651 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#5
      2015-08-06 21:54:14,651 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#5
      2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=20&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
      2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
      2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#5 in 4ms
      2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#5
      2015-08-06 21:54:14,656 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#5
      2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=20&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
      2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
      2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#5 in 5ms
      2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#5
      2015-08-06 21:54:14,660 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#5
      

      node 2 log seems like loop on 5 map output:

      2015-08-06 21:43:33,626 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-172.bj:22408 with 1 to fetcher#5
      2015-08-06 21:43:33,626 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-172.bj:22408 to fetcher#5
      2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000013_0,attempt_1438689528746_10193_m_000020_0 sent hash and received reply
      2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 - MergeManager returned status WAIT ...
      2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-132.bj:22408 freed by fetcher#3 in 5ms
      2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-179.bj:22408 with 1 to fetcher#3
      2015-08-06 21:43:33,627 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-179.bj:22408 to fetcher#3
      2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000084_0,attempt_1438689528746_10193_m_000046_0 sent hash and received reply
      2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#4 - MergeManager returned status WAIT ...
      2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-71.bj:22408 freed by fetcher#4 in 5ms
      2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-71.bj:22408 with 2 to fetcher#4
      2015-08-06 21:43:33,627 INFO [fetcher#4] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-71.bj:22408 to fetcher#4
      2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000092_0 sent hash and received reply
      2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#2 - MergeManager returned status WAIT ...
      2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-167.bj:22408 freed by fetcher#2 in 3ms
      2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-132.bj:22408 with 2 to fetcher#2
      2015-08-06 21:43:33,628 INFO [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 2 of 2 to node-132.bj:22408 to fetcher#2
      2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000097_0 sent hash and received reply
      2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 - MergeManager returned status WAIT ...
      2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-174.bj:22408 freed by fetcher#1 in 3ms
      2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-174.bj:22408 with 1 to fetcher#1
      2015-08-06 21:43:33,629 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-174.bj:22408 to fetcher#1
      2015-08-06 21:43:33,629 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000093_0 sent hash and received reply
      2015-08-06 21:43:33,629 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 - MergeManager returned status WAIT ...
      2015-08-06 21:43:33,630 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: node-172.bj:22408 freed by fetcher#5 in 3ms
      2015-08-06 21:43:33,630 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning node-172.bj:22408 with 1 to fetcher#5
      2015-08-06 21:43:33,630 INFO [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to node-172.bj:22408 to fetcher#5
      2015-08-06 21:43:33,630 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=22408/mapOutput?job=job_1438689528746_10193&reduce=85&map=attempt_1438689528746_10193_m_000089_0 sent hash and received reply
      2015-08-06 21:43:33,630 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 - MergeManager returned status WAIT ...
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                peng.zhang Peng Zhang
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: