
Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      See HIVE-10648.
      When the AM cannot connect to a node, it appears to stall. In the example log below there are no other interleaving log messages, even though this happens in the middle of Map 1 on TPCH q1, i.e. when there are plenty of tasks scheduled.
      From the "Assigning" messages I can also see that tasks are scheduled to all the nodes before and after the pause, not just to the problematic node.
      The LLAP daemons have corresponding gaps where, between two fragments, nothing is run for a long time on any daemon.

      2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276 Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED
      2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
      2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
      2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 22 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:13:59,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 23 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:00,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 24 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:01,820 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 25 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:02,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 26 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:03,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 27 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:04,822 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 28 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:05,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 29 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:06,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 30 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:06,984 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
      2015-05-07 12:14:07,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 31 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:08,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 32 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:09,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 33 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:10,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 34 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:11,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 35 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:12,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 36 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:13,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 37 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:14,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 38 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:15,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:16,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 40 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:16,996 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting to refresh ServiceInstanceSet 1611673583
      2015-05-07 12:14:17,829 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 41 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:18,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 42 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:19,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 43 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:20,831 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 44 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:21,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:22,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:23,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:24,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:25,834 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server: cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-05-07 12:14:25,836 INFO [TaskCommunicator # 3] tezplugins.LlapTaskCommunicator: Unable to run task: attempt_1429683757595_0784_1_00_000017_0 on containerId: container_222212222_0784_01_000018, Communication Error
      2015-05-07 12:14:25,841 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1429683757595_0784_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0784_1_00_000017_0, startTime=1431026014322, finishTime=1431026065838, timeTaken=51516, status=KILLED, errorEnum=COMMUNICATION_ERROR, diagnostics=Communication Error, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
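
As a rough cross-check of the log above, the sketch below compares the sleep-only duration implied by the logged retry policy (maxRetries=50, sleepTime=1000 ms) with the attempt duration derived from startTime and finishTime in the TASK_ATTEMPT_FINISHED event. The class and method names are hypothetical and are not part of Hive, Tez, or Hadoop. It assumes one fixed sleep per retry and ignores the cost of each connect attempt.

```java
import java.time.Duration;

// Hypothetical helper for reading the log above; not part of Hive, Tez, or Hadoop.
public final class RetryStallEstimate {

    // Sleep-only lower bound on how long one caller stays blocked.
    // Assumption: the policy sleeps once per retry and connect attempts cost nothing.
    static Duration minBlockedTime(int maxRetries, Duration sleepTime) {
        return sleepTime.multipliedBy(maxRetries);
    }

    // Duration of a task attempt from the epoch-millisecond fields in the history event.
    static Duration attemptDuration(long startTimeMs, long finishTimeMs) {
        return Duration.ofMillis(finishTimeMs - startTimeMs);
    }

    public static void main(String[] args) {
        // Values copied from the retry policy and the TASK_ATTEMPT_FINISHED line in the log.
        Duration bound = minBlockedTime(50, Duration.ofMillis(1000));
        Duration observed = attemptDuration(1431026014322L, 1431026065838L);
        System.out.println("sleep-only lower bound: " + bound.toMillis() + " ms");
        System.out.println("observed attempt time:  " + observed.toMillis() + " ms");
    }
}
```

With the logged values this gives a lower bound of 50000 ms against an observed 51516 ms (the reported timeTaken), which is consistent with the killed attempt having spent nearly all of its lifetime in connection retries.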
      

            People

              Assignee: Unassigned
              Reporter: Sergey Shelukhin (sershe)
              Votes: 0
              Watchers: 2
