Samza / SAMZA-1746

Coordinator stuck connecting to a decommissioned NM


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.12.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      We hit an issue while decommissioning a large number of NMs:

      1. We decommissioned a large number of NMs.
      2. The RM allocated containers to a new NM.
      3. That NM was decommissioned immediately after the RM allocation, before the coordinator could start the containers (see the note on retry settings below).
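
      The two retry policies in the log below (maxRetries=45, then RetryUpToMaximumCountWithFixedSleep with maxRetries=10 and a 1000 ms sleep) appear to correspond to the Hadoop client defaults for ipc.client.connect.max.retries.on.timeouts and ipc.client.connect.max.retries / ipc.client.connect.retry.interval, wrapped by the NM proxy's outer retry loop driven by yarn.client.nodemanager-connect.max-wait-ms and yarn.client.nodemanager-connect.retry-interval-ms. A minimal sketch of capping these on the AM side follows; the keys are standard Hadoop/YARN client settings, but the values are purely illustrative, and this only bounds the stall, it does not fix the blocking launch itself.

      // Hedged sketch: bounding how long the AM keeps retrying a single NodeManager.
      // The property keys are standard Hadoop/YARN client settings; the values are
      // illustrative only, not recommended defaults.
      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      public class NmConnectTimeouts {
        public static YarnConfiguration capped() {
          YarnConfiguration conf = new YarnConfiguration();
          // Outer NM-proxy retry loop: give up on an unreachable NM after ~1 minute.
          conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 60_000L);
          conf.setLong("yarn.client.nodemanager-connect.retry-interval-ms", 10_000L);
          // Inner IPC connection retries (the "maxRetries=10, sleepTime=1000" lines below).
          conf.setInt("ipc.client.connect.max.retries", 3);
          conf.setInt("ipc.client.connect.retry.interval", 1000);
          // Connect-timeout retries (the "maxRetries=45" line below).
          conf.setInt("ipc.client.connect.max.retries.on.timeouts", 5);
          return conf;
        }
      }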

      Relevant logs:

      2018-06-08 10:28:12 YarnClusterResourceManager [INFO] Container allocated from RM on x.x.x.x
      2018-06-08 10:28:12 ContainerProcessManager [INFO] Container allocated from RM on x.x.x.x
      2018-06-08 10:28:12 ResourceRequestState [INFO] Host affinity not enabled. Saving the samzaResource container_1525695824467_0091_06_000106 in the buffer for ANY_HOST
      2018-06-08 10:28:15 AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. Assigning request for container_id 0 with timestamp 1528453690917 to resource container_1525695824467_0091_06_000106
      2018-06-08 10:28:15 YarnClusterResourceManager [INFO] Received launch request for 0 on hostname x.x.x.x
      2018-06-08 10:28:15 YarnContainerRunner [INFO] Got available container ID (0) for container: Container: [ContainerId: container_1525695824467_0091_06_000106, NodeId: x.x.x.x:45400, NodeHttpAddress: x.x.x.x:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: x.x.x.x:45400 }, ]
      2018-06-08 10:28:15 YarnContainerRunner [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
      2018-06-08 10:28:15 YarnContainerRunner [INFO] Container ID 0 using command ./__package//bin/run-container.sh
      2018-06-08 10:28:15 YarnContainerRunner [INFO] Container ID 0 using environment variables: 
      SAMZA_CONTAINER_ID=0
      EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106
      SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/
      JAVA_HOME=/opt/jvm/jdk1.8.0_65
      JAVA_OPTS=
      2018-06-08 10:28:15 YarnContainerRunner [INFO] Samza FWK path: ./__package//bin/run-container.sh; env={SAMZA_CONTAINER_ID=0, EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106, SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/, JAVA_HOME=/opt/jvm/jdk1.8.0_65, JAVA_OPTS=}
      2018-06-08 10:28:15 YarnContainerRunner [INFO] Starting container ID 0 using package path http://package.tar.gz
      2018-06-08 10:28:15 YarnContainerRunner [INFO] starting container http://package.tar.gz Container: [ContainerId: container_1525695824467_0091_06_000106, NodeId: x.x.x.x:45400, NodeHttpAddress: x.x.x.x:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: x.x.x.x:45400 }, ] {SAMZA_CONTAINER_ID=0, EXECUTION_ENV_CONTAINER_ID=container_1525695824467_0091_06_000106, SAMZA_COORDINATOR_URL=http://ip-a-a-a-a:33907/, JAVA_HOME=/opt/jvm/jdk1.8.0_65, JAVA_OPTS=} export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec ./__package//bin/run-container.sh 1>logs/stdout 2>logs/stderr
      2018-06-08 10:28:15 YarnContainerRunner [INFO] set package Resource in YarnContainerRunner for scheme: "http" host: "s3-eu-west-1.amazonaws.com" port: -1 file: "/package.tar.gz"
      2018-06-08 10:28:15 ContainerManagementProtocolProxy [INFO] Opening proxy : x.x.x.x:45400
      2018-06-08 10:28:35 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); maxRetries=45
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000016, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 9
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 9 from completed containers
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000010, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 3
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 3 from completed containers
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000002, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 11
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 11 from completed containers
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000102, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 8
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 8 from completed containers
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000005, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 14
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 14 from completed containers
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000011, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 4
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 4 from completed containers
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000008, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 1
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 1 from completed containers
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1525695824467_0091_06_000013, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Completed container had ID: 6
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Removing container ID 6 from completed containers
      2018-06-08 10:28:51 ContainerProcessManager [INFO] Matching container ID found 9 org.apache.samza.clustermanager.SamzaResource@3c12182f
      2018-06-08 10:28:51 ContainerProcessManager [INFO] Got an exit code of -100. This means that container container_1525695824467_0091_06_000016 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. or due to preemption by the RM
      2018-06-08 10:28:51 ContainerProcessManager [INFO] Released container container_1525695824467_0091_06_000016 was assigned task group ID 9. Requesting a new container for the task group.
      2018-06-08 10:28:51 SamzaResourceRequest [INFO] Resource Request created for 9 on ANY_HOST at 1528453731124
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Requesting resources on ANY_HOST for container 9
      2018-06-08 10:28:51 YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
      2018-06-08 10:28:55 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); maxRetries=45
      2018-06-08 10:29:06 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:10 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:14 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:17 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:20 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:23 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:26 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:29 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:32 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:35 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:51 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:54 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:29:57 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:00 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:03 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:06 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:09 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:12 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:15 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:18 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:34 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:38 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:41 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:44 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:47 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:50 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:53 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:56 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:30:59 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      2018-06-08 10:31:02 Client [INFO] Retrying connect to server: ip-x-x-x-x.eu-west-1.compute.internal/x.x.x.x:45400. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

      So the container gets allocated on x.x.x.x, but when the coordinator then tries to connect to that NM it is no longer reachable, and the launch gets stuck in a retry loop indefinitely. As a result, new containers don't get scheduled for that job.
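
      One possible direction (a sketch only, assuming the Hadoop 2.x NMClientAsync API, not a description of the current Samza launch path) is to make the container launch non-blocking, so that a launch against an NM that was decommissioned right after allocation surfaces as an onStartContainerError callback and the coordinator can release the allocation and re-request a resource instead of sitting in the connect-retry loop. The AsyncContainerLauncher class and the requestNewResource hook below are hypothetical placeholders.

      // A minimal sketch (not the Samza implementation): launching containers with Hadoop's
      // non-blocking NMClientAsync, so a dead/decommissioned NM fails via a callback
      // instead of blocking the coordinator thread.
      import java.nio.ByteBuffer;
      import java.util.Map;

      import org.apache.hadoop.yarn.api.records.Container;
      import org.apache.hadoop.yarn.api.records.ContainerId;
      import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
      import org.apache.hadoop.yarn.api.records.ContainerStatus;
      import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      public class AsyncContainerLauncher implements NMClientAsync.CallbackHandler {

        private final NMClientAsync nmClient;

        public AsyncContainerLauncher(YarnConfiguration conf) {
          this.nmClient = NMClientAsync.createNMClientAsync(this);
          this.nmClient.init(conf);
          this.nmClient.start();
        }

        /** Non-blocking: returns immediately; the outcome arrives via the callbacks below. */
        public void launch(Container container, ContainerLaunchContext ctx) {
          nmClient.startContainerAsync(container, ctx);
        }

        @Override
        public void onContainerStarted(ContainerId id, Map<String, ByteBuffer> allServiceResponse) {
          // Container reached the NM; nothing further to do here.
        }

        @Override
        public void onStartContainerError(ContainerId id, Throwable t) {
          // The NM was unreachable (e.g. decommissioned right after allocation).
          // Instead of retrying the dead node indefinitely, ask the RM for a replacement.
          requestNewResource(id);
        }

        @Override public void onContainerStatusReceived(ContainerId id, ContainerStatus status) { }
        @Override public void onGetContainerStatusError(ContainerId id, Throwable t) { }
        @Override public void onContainerStopped(ContainerId id) { }
        @Override public void onStopContainerError(ContainerId id, Throwable t) { }

        private void requestNewResource(ContainerId id) {
          // Hypothetical hook: in Samza this would go back through the cluster-manager layer
          // to issue a new resource request for the failed task group.
        }
      }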


          People

            Assignee: Unassigned
            Reporter: Danil Serdyuchenko (danil)
            Votes: 0
            Watchers: 4
