Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
a more or less recent hive/master with tpcds1000; while running only 1 query at a time on the cluster; sometimes a "timeout" happens - this could be even a misconfiguration problem...I'm not sure I've set it up correctly.
what I see from: it seems like the attempt have entered the queue and accepted - after almost 90 seconds there are messages that it will be pre-empted - and the 90 sec timeout happens.
hive.log; example: attempt_1522319554594_0065_19_05_000119_14 failed
2018-04-10T13:20:25,178 ERROR [HiveServer2-Background-Pool: Thread-4194]: SessionState (SessionState.java:printError(1214)) - Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed 2018-04-10T13:20:25,260 ERROR [HiveServer2-Background-Pool: Thread-4194]: ql.Driver (SessionState.java:printError(1214)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
tezam logs for the attempt
2018-04-10 13:18:45,430 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Received allocateRequest. task=attempt_1522319554594_0065_19_05_000119_14, priority=194, capability=<memory:3072, vCores:1>, hosts=null 2018-04-10 13:18:45,440 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Assigned #67763, task=TaskInfo{task=attempt_1522319554594_0065_19_05_000119_14, priority=194, startTime=0, containerId=null, uniqueId=69780, localityDelayTimeout=1127144039} on node={lemon-9.openstacklocal:50155, id=2a3562c1-1094-499f-9d83-5c3b5cf34663, canAcceptTask=true, st=10, ac=1, commF=false, disabled=false}, to container=container_222212222_0065_01_067763 2018-04-10 13:18:45,440 [INFO] [LlapScheduler] |GuaranteedTasks|: Registering attempt_1522319554594_0065_19_05_000119_14; false 2018-04-10 13:18:45,446 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: TaskAttempt: [attempt_1522319554594_0065_19_05_000119_14] submitted. Is using containerId: [container_222212222_0065_01_067763] on NM: [lemon-9.openstacklocal:50155] 2018-04-10 13:18:45,446 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:TASK_ATTEMPT_STARTED]: vertexName=Reducer 3, taskAttemptId=attempt_1522319554594_0065_19_05_000119_14, startTime=1523366325446, containerId=container_222212222_0065_01_067763, nodeId=lemon-9.openstacklocal:50155 2018-04-10 13:18:45,464 [INFO] [TaskCommunicator # 9] |tezplugins.LlapTaskCommunicator|: Successfully launched task: attempt_1522319554594_0065_19_05_000119_14 2018-04-10 13:18:45,464 [INFO] [TaskCommunicator # 9] |GuaranteedTasks|: Received response for attempt_1522319554594_0065_19_05_000119_14, true 2018-04-10 13:19:08,743 [INFO] [IPC Server handler 15 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_05_000128_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, 2018-04-10 13:19:18,728 [INFO] [IPC Server handler 11 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 2018-04-10 13:19:28,726 [INFO] [IPC Server handler 28 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 2018-04-10 13:19:38,726 [INFO] [IPC Server handler 15 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 2018-04-10 13:19:48,726 [INFO] [IPC Server handler 17 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 2018-04-10 13:19:58,726 [INFO] [IPC Server handler 5 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 2018-04-10 13:20:08,726 [INFO] [IPC Server handler 2 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 2018-04-10 13:20:18,727 [INFO] [IPC Server handler 17 on 35408] |tezplugins.LlapTaskCommunicator|: The tasks we expected to be on the node are not there: attempt_1522319554594_0065_19_05_000130_14, attempt_1522319554594_0065_19_05_000191_14, attempt_1522319554594_0065_19_08_000018_6, attempt_1522319554594_0065_19_08_000022_6, attempt_1522319554594_0065_19_05_000181_14, attempt_1522319554594_0065_19_05_000131_15, attempt_1522319554594_0065_19_05_000214_15, attempt_1522319554594_0065_19_05_000093_11, attempt_1522319554594_0065_19_05_000119_14, attempt_1522319554594_0065_19_08_000025_7, 2018-04-10 13:20:24,594 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:TASK_ATTEMPT_FINISHED]: vertexName=Reducer 3, taskAttemptId=attempt_1522319554594_0065_19_05_000119_14, creationTime=1523366325392, allocationTime=1523366325443, startTime=1523366325446, finishTime=1523366424594, timeTaken=99148, status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR, diagnostics=AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs, nodeHttpAddress=http://lemon-9.openstacklocal:15002, counters=Counters: 0 2018-04-10 13:20:24,600 [INFO] [TaskSchedulerEventHandlerThread] |GuaranteedTasks|: Unregistering attempt_1522319554594_0065_19_05_000119_14; false 2018-04-10 13:20:24,600 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|: Processing de-allocate request for task=attempt_1522319554594_0065_19_05_000119_14, state=ASSIGNED, endReason=OTHER 2018-04-10 13:20:24,603 [INFO] [Dispatcher thread {Central}] |node.AMNodeImpl|: Attempt failedon node: lemon-9.openstacklocal:50155 TA: attempt_1522319554594_0065_19_05_000119_14, container: container_222212222_0065_01_067763, numFailedTAs: 57 2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1522319554594_0065_19_07 [Reducer 10] attempt: attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: RUNNING 2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1522319554594_0065_19_11 [Reducer 4] attempt: attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: RUNNING 2018-04-10 13:20:24,605 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1522319554594_0065_19_06 [Reducer 9] attempt: attempt_1522319554594_0065_19_05_000119_14 with state: FAILED vertexState: RUNNING TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs] 2018-04-10 13:20:25,020 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:VERTEX_FINISHED]: vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, initRequestedTime=1523363710631, initedTime=1523363710645, startRequestedTime=1523363713645, startedTime=1523363713646, finishTime=1523366425017, timeTaken=2711371, status=FAILED, diagnostics=Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed 2018-04-10 13:20:25,157 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1522319554594_0065_19][Event:DAG_FINISHED]: dagId=dag_1522319554594_0065_19, startTime=1523363710305, finishTime=1523366425121, timeTaken=2714816, status=FAILED, diagnostics=Vertex failed, vertexName=Reducer 3, vertexId=vertex_1522319554594_0065_19_05, diagnostics=[Task failed, taskId=task_1522319554594_0065_19_05_000119, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_1 Timed out after 90 secs], TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_12 Timed out after 90 secs], TaskAttempt 13 killed, TaskAttempt 14 failed, info=[AttemptID:attempt_1522319554594_0065_19_05_000119_14 Timed out after 90 secs], TaskAttempt 15 failed, info=[org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): attempt_1522319554594_0065_19_08_000018_6 was not registered and couldn't be removed
from llap0 logs; filtered for attempt_1522319554594_0065_19_05_000119_14 ; ordered by time:
java.lang.RuntimeException: attempt_1522319554594_0065_19_05_000119_14 was not registered and couldn't be removed 2018-04-10T13:18:45,455 INFO [IPC Server handler 3 on 50155 ()] org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: Queueing container for execution: fragemendId=attempt_1522319554594_0065_19_05_000119_14, am_details=lemon-8.openstacklocal:35408, taskInfo= fragment 119 attempt 14, user=hive, queryId=hive_20180410113951_2fc85570-df7a-4cb1-8499-74d11f227e0d, appIdString=application_1522319554594_0065, appAttemptNum=1, containerIdString=container_222212222_0065_01_067763, dagName=with ws as 2018-04-10T13:18:45,457 INFO [IPC Server handler 3 on 50155 (1522319554594_0065_19_05_000119_14)] impl.ContainerRunnerImpl: SubmissionState for attempt_1522319554594_0065_19_05_000119_14 : ACCEPTED 2018-04-10T13:19:13,533 INFO [Wait-Queue-Scheduler-0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Adding attempt_1522319554594_0065_19_05_000119_14 to pre-emption queue 2018-04-10T13:19:13,533 INFO [Wait-Queue-Scheduler-0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Attempting to execute TaskWrapper{task=attempt_1522319554594_0065_19_05_000119_14, inWaitQueue=true, inPreemptionQueue=false, registeredForNotifications=true, canFinish=true, canFinish(in queue)=true, isGuaranteed=false, firstAttemptStartTime=1523364757720, dagStartTime=1523363710305, withinDagPriority=194, vertexParallelism= 297, selfAndUpstreamParallelism= 1286, selfAndUpstreamComplete= 1243} 2018-04-10T13:19:13,534 WARN [ExecutionCompletionThread #0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Received onSuccess/onFailure for a fragment for which a completing message was not received: attempt_1522319554594_0065_19_05_000119_14 2018-04-10T13:19:13,536 ERROR [ExecutionCompletionThread #0 ()] org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService: Failed notification received: Stacktrace: java.lang.RuntimeException: attempt_1522319554594_0065_19_05_000119_14 was not registered and couldn't be removed