Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-19169

llap: Timed out after 90 secs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • llap
    • None

    Description

      a more or less recent hive/master with tpcds1000; while running only 1 query at a time on the cluster; sometimes a "timeout" happens - this could be even a misconfiguration problem...I'm not sure I've set it up correctly.

      what I see from: it seems like the attempt have entered the queue and accepted - after almost 90 seconds there are messages that it will be pre-empted - and the 90 sec timeout happens.

      hive.log; example: attempt_1522319554594_0065_19_05_000119_14 failed

      2018-04-10T13:20:25,178 ERROR [HiveServer2-Background-Pool: Thread-4194]: SessionState (SessionState.java:printError(1214)) - Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
      2018-04-10T13:20:25,260 ERROR [HiveServer2-Background-Pool: Thread-4194]: ql.Driver (SessionState.java:printError(1214)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
      org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
      
      

      tezam logs for the attempt

      2018-04-10 13:18:45,430 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. task=attempt_1522319554594_0065_19_05_000119_14, priority=194, capability=<memory:3072, vCores:1>, hosts=null
      2018-04-10 13:18:45,440 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Assigned #67763, task=TaskInfo{task=attempt_1522319554594_0065_19_05_000119_14, priority=194, startTime=0, containerId=null, uniqueId=69780, localityDelayTimeout=1127144039} on node={lemon-9.openstacklocal:50155, id=2a3562c1-1094-499f-9d83-5c3b5cf34663, canAcceptTask=true, st=10, ac=1, commF=false, disabled=false}, to container=container_222212222_0065_01_067763
      2018-04-10 13:18:45,440 [INFO] [LlapScheduler] |GuaranteedTasks|: Registering attempt_1522319554594_0065_19_05_000119_14; false
      2018-04-10 13:18:45,446 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: TaskAttempt: [attempt_1522319554594_0065_19_05_000119_14] submitted. Is using containerId: [container_222212222_0065_01_067763] on NM: [lemon-9.openstacklocal:50155]
      2018-04-10 13:18:45,446 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:TASK_ATTEMPT_STARTED]: vertexName=Reducer 3, taskAttemptId=attempt_1522319554594_0065_19_05_000119_14, startTime=1523366325446, containerId=container_222212222_0065_01_067763, nodeId=lemon-9.openstacklocal:50155
      2018-04-10 13:18:45,464 [INFO] [TaskCommunicator # 9] |tezplugins.LlapTaskCommunicator|: Successfully launched task: attempt_1522319554594_0065_19_05_000119_14
      2018-04-10 13:18:45,464 [INFO] [TaskCommunicator # 9] |GuaranteedTasks|: Received response for attempt_1522319554594_0065_19_05_000119_14, true
      2018-04-10 13:19:08,743 [INFO] [IPC Server handler 15 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_05_000128_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, 
      2018-04-10 13:19:18,728 [INFO] [IPC Server handler 11 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 
      2018-04-10 13:19:28,726 [INFO] [IPC Server handler 28 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 
      2018-04-10 13:19:38,726 [INFO] [IPC Server handler 15 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 
      2018-04-10 13:19:48,726 [INFO] [IPC Server handler 17 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 
      2018-04-10 13:19:58,726 [INFO] [IPC Server handler 5 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 
      2018-04-10 13:20:08,726 [INFO] [IPC Server handler 2 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 
      2018-04-10 13:20:18,727 [INFO] [IPC Server handler 17 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 
      2018-04-10 13:20:24,594 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:TASK_ATTEMPT_FINISHED]: vertexName=Reducer 3, taskAttemptId=attempt_1522319554594_0065_19_05_000119_14, creationTime=1523366325392, allocationTime=1523366325443, startTime=1523366325446, finishTime=1523366424594, timeTaken=99148, status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, diagnostics=AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs, nodeHttpAddress=http://lemon-9.openstacklocal:15002, counters=Counters: 0
      2018-04-10 13:20:24,600 [INFO] [TaskSchedulerEventHandlerThread] |GuaranteedTasks|: Unregistering attempt_1522319554594_0065_19_05_000119_14; false
      2018-04-10 13:20:24,600 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for task=attempt_1522319554594_0065_19_05_000119_14, state=ASSIGNED, endReason=OTHER
      2018-04-10 13:20:24,603 [INFO] [Dispatcher thread {Central}] |node.AMNodeImpl|: Attempt failedon node: lemon-9.openstacklocal:50155 TA: attempt_1522319554594_0065_19_05_000119_14, container: container_222212222_0065_01_067763, numFailedTAs: 57
      2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1522319554594_0065_19_07 [Reducer 10] attempt: attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: RUNNING
      2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1522319554594_0065_19_11 [Reducer 4] attempt: attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: RUNNING
      2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1522319554594_0065_19_06 [Reducer 9] attempt: attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: RUNNING
      TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs]
      2018-04-10 13:20:25,020 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:VERTEX_FINISHED]: vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, initRequestedTime=1523363710631, initedTime=1523363710645, startRequestedTime=1523363713645, startedTime=1523363713646, finishTime=1523366425017, timeTaken=2711371, status=FAILED, diagnostics=Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
      2018-04-10 13:20:25,157 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:DAG_FINISHED]: dagId=dag_1522319554594_0065_19, startTime=1523363710305, finishTime=1523366425121, timeTaken=2714816, status=FAILED, diagnostics=Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
      
      

      from llap0 logs; filtered for attempt_1522319554594_0065_19_05_000119_14 ; ordered by time:

      java.lang.RuntimeException: attempt_1522319554594_0065_19_05_000119_14 was not registered and couldn't be removed
      2018-04-10T13:18:45,455 INFO  [IPC Server handler 3 on 50155 ()] org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: Queueing container for execution: fragemendId=attempt_1522319554594_0065_19_05_000119_14, am_details=lemon-8.openstacklocal:35408, taskInfo= fragment 119 attempt 14, user=hive, queryId=hive_20180410113951_2fc85570-df7a-4cb1-8499-74d11f227e0d, appIdString=application_1522319554594_0065, appAttemptNum=1, containerIdString=container_222212222_0065_01_067763, dagName=with ws as
      2018-04-10T13:18:45,457  INFO [IPC Server handler 3 on 50155 (1522319554594_0065_19_05_000119_14)] impl.ContainerRunnerImpl: SubmissionState for attempt_1522319554594_0065_19_05_000119_14 : ACCEPTED 
      2018-04-10T13:19:13,533 INFO  [Wait-Queue-Scheduler-0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Adding attempt_1522319554594_0065_19_05_000119_14 to pre-emption queue
      2018-04-10T13:19:13,533 INFO  [Wait-Queue-Scheduler-0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Attempting to execute TaskWrapper{task=attempt_1522319554594_0065_19_05_000119_14, inWaitQueue=true, inPreemptionQueue=false, registeredForNotifications=true, canFinish=true, canFinish(in queue)=true, isGuaranteed=false, firstAttemptStartTime=1523364757720, dagStartTime=1523363710305, withinDagPriority=194, vertexParallelism= 297, selfAndUpstreamParallelism= 1286, selfAndUpstreamComplete= 1243}
      2018-04-10T13:19:13,534 WARN  [ExecutionCompletionThread #0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Received onSuccess/onFailure for a fragment for which a completing message was not received: attempt_1522319554594_0065_19_05_000119_14
      2018-04-10T13:19:13,536 ERROR [ExecutionCompletionThread #0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Failed notification received: Stacktrace: java.lang.RuntimeException: attempt_1522319554594_0065_19_05_000119_14 was not registered and couldn't be removed
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            kgyrtkirk Zoltan Haindrich
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: