Hadoop Map/Reduce
  MAPREDUCE-4208

The job hangs and never continues until you kill the child process

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Hadoop 0.20.203.0
      Hbase 0.90.3
      Hive 0.80.1

      Description

      I ran a Hive MapReduce query against HBase, but the job never ends.
      The job hangs and never continues until you kill the child process.

      2012-04-28 18:22:33,661 Stage-1 map = 0%, reduce = 0%
      2012-04-28 18:22:59,760 Stage-1 map = 25%, reduce = 0%
      2012-04-28 18:23:04,782 Stage-1 map = 38%, reduce = 0%
      2012-04-28 18:23:07,796 Stage-1 map = 50%, reduce = 0%
      2012-04-28 18:23:08,801 Stage-1 map = 50%, reduce = 8%
      2012-04-28 18:23:17,839 Stage-1 map = 50%, reduce = 17%
      2012-04-28 18:23:19,848 Stage-1 map = 63%, reduce = 17%
      2012-04-28 18:23:32,909 Stage-1 map = 63%, reduce = 21%
      2012-04-28 18:23:57,017 Stage-1 map = 75%, reduce = 21%
      2012-04-28 18:24:09,075 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:25:09,397 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:26:09,688 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:27:09,980 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:28:10,262 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:29:10,522 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:30:10,742 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:31:10,985 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:32:11,238 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:33:11,467 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:34:11,731 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:35:11,968 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:36:12,213 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:37:12,508 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:38:12,747 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:39:12,970 Stage-1 map = 75%, reduce = 25%
      2012-04-28 18:40:13,205 Stage-1 map = 75%, reduce = 25%

      I checked the TaskTracker (TT) log:

      2012-04-28 18:31:53,879 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:31:56,883 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:31:59,887 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:02,892 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:05,897 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:08,902 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:11,906 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:14,910 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:17,915 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:20,920 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:23,924 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:26,929 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:29,934 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:32,938 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:35,943 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:38,948 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:41,953 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:44,957 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:47,961 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:50,966 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:53,970 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:56,974 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:32:59,979 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:02,983 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:05,987 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:08,992 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:11,997 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:15,001 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:18,006 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:21,011 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:24,015 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:27,020 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:30,025 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:33,029 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:36,034 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:39,038 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:42,043 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:45,047 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:48,051 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:51,057 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%
      2012-04-28 18:33:54,062 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204281725_0002_m_000002_0 0.0%

      [hadoop@mem1 logs]$ jps
      3282 Child
      31547 QuorumPeerMain
      1840 TaskTracker
      3469 Jps
      31070 HRegionServer
      30120 DataNode

      [hadoop@mem1 logs]$ kill 3282

      When I kill the child process, the job continues and completes.

      2012-04-28 18:40:51,324 Stage-1 map = 88%, reduce = 25%
      2012-04-28 18:41:04,364 Stage-1 map = 88%, reduce = 29%
      2012-04-28 18:41:31,448 Stage-1 map = 100%, reduce = 29%
      2012-04-28 18:41:43,485 Stage-1 map = 100%, reduce = 100%

        Activity

        Robert Joseph Evans added a comment -

        There are two types of timeouts that happen for a task. The first is a user-configurable timeout with a default of 10 min. If a task does not make any progress for 10 min, then the task will time out. Progress is defined as reading a record of input, writing a record of output, or explicitly calling a method on the Context in your map/reduce code to indicate that you are still making progress.

        The other timeout is that if your task does not heartbeat into the AM for 5 min the task is assumed to be dead.

        In your case you are still making progress, it is just very slow, so neither of the timeouts would fire. You could possibly set the progress interval to something very small, and that would probably detect the problem. We do not want to change the default value for the progress interval, because there are lots of jobs that may be doing significant amounts of computation for each record, and we don't want to kill them off.

        mapreduce.task.timeout is the timeout you would want to configure.
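        The suggestion above can be applied per job rather than cluster-wide. A minimal sketch from the Hive CLI, assuming the 0.20.x property name mapred.task.timeout (newer releases use mapreduce.task.timeout); the 120000 ms value is purely illustrative:

```sql
-- Illustrative per-job override (value is an example): fail a task after
-- 2 minutes without any progress, instead of the 10-minute default.
set mapred.task.timeout=120000;
```

        Note that this timeout only fires when no progress at all is reported, so a slow-but-progressing task like the one in this report would still not be killed.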

        xieguiming added a comment -

        Hi ccw:
        Is speculative execution turned on? And is speculative execution useful for your scenario?

        I think the timeout mechanism is not a good fit here, because the service genuinely does take a long time.

        So if the task is killed by the timeout mechanism, the job in fact fails, which is the same as killing the job.

        I therefore suggest it is more suitable for the user's application to kill the job when it does not complete for a long time.
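        For reference, speculative execution in Hadoop 0.20.x is controlled per job by mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution (both default to true). A minimal sketch of checking/enabling it from the Hive CLI:

```sql
-- Illustrative: ensure speculative execution is on for this job, so a
-- straggling task attempt can be duplicated on another node.
set mapred.map.tasks.speculative.execution=true;
set mapred.reduce.tasks.speculative.execution=true;
```

        Speculative execution would only help here if the duplicate attempt landed on a node with better locality, though; it does not kill the original slow attempt until a duplicate finishes.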

        ccw added a comment -

        Thank you! I found the cause of the problem. The fundamental reason is HBase data access: the data (region) for the map task is not on the local region server, so each data access is very slow.
        But I think there must be a timeout mechanism to prevent such events. If a task is running too slowly, Hadoop should kill it automatically rather than requiring a manual kill.
        Thank you!

        Robert Joseph Evans added a comment -

        So have you been able to reproduce it? I also think that it is a really bad bug that I want to fix, but I don't know where to start debugging it without more information about how it happened.

        ccw added a comment -

        I checked all the logs, but there is no error. I want to know why every job encounters this problem until I kill one or more child processes. I set the property 'mapred.task.timeout', but it did not help. I know the child task is not dead, but it also never works. It is terrible that every time I need to kill the hanging task. I think it needs to be controlled by the JobTracker (JT) or TaskTracker (TT).

        ccw added a comment -

        OK, I will try to reproduce this problem and give you more information about it tomorrow.

        Robert Joseph Evans added a comment -

        ccw,

        Without a heap dump, a stack trace, or even process logs for the hung process, I am not really sure what more we can do except tell you that it looks like an intermittent issue, because it recovered after rerunning the task. The task should have timed out after a default of 10 min with no progress being made, but I don't know if you disabled that, or if there is something the MR code is doing to prevent it. If you can post some of that information to this JIRA we might be able to help; otherwise I think we will have to close this as Incomplete.
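        One way to gather the stack trace requested above, sketched as a hypothetical shell snippet (not from the original thread; the pid extraction mirrors the jps output shown earlier):

```shell
# Hypothetical diagnostic sketch: instead of killing the hung Child JVM,
# capture a thread dump so the hang can be analyzed.

# Extract the pid of the "Child" JVM from `jps` output
# (column 1 = pid, column 2 = main class name).
child_pid_from_jps() {
  awk '$2 == "Child" { print $1; exit }'
}

if command -v jps >/dev/null 2>&1; then
  pid=$(jps | child_pid_from_jps)
  if [ -n "$pid" ]; then
    # SIGQUIT makes the JVM print a full thread dump to its stdout
    # (the task's stdout file under logs/userlogs) without terminating it.
    kill -QUIT "$pid"
    # Alternatively, with a full JDK on the box: jstack "$pid" > "child-$pid.jstack"
  fi
fi
```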


          People

          • Assignee: Unassigned
          • Reporter: ccw
          • Votes: 0
          • Watchers: 3
