  1. Apache Twill (Retired)
  2. TWILL-213

Increase of instances while starting up may lead to ignored retries and instance increases


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: yarn
    • Labels: None

    Description

      As seen in the test development for TWILL-181, if the number of instances for a runnable is increased before the ApplicationMasterService has observed the original request as being satisfied, the instance increase and any subsequent retries will be blocked. The cause is the following code in launchRunnable:

          TwillContainerLauncher launcher = new TwillContainerLauncher(
              twillSpec.getRunnables().get(runnableName), processLauncher.getContainerInfo(), launchContext,
              ZKClients.namespace(zkClient, getZKNamespace(runnableName)),
              containerCount, jvmOpts, reservedMemory, getSecureStoreLocation());

          runningContainers.start(runnableName, processLauncher.getContainerInfo(), launcher);

          // Need to call complete to work around a bug in YARN AMRMClient
          if (provisionRequest.containerAcquired()) {
            amClient.completeContainerRequest(provisionRequest.getRequestId());
          }

          /*
           * The provisionRequest will either contain a single container (ALLOCATE_ONE_INSTANCE_AT_A_TIME), or all the
           * containers to satisfy the expectedContainers count. In the latter case, the provision request is complete once
           * all the containers have run, at which point we poll() to remove the provisioning request.
           */
          if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
              provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
            provisioning.poll();
          }
      

      There is a race condition. The sequence:

      • Thread A: runningContainers.start is called and 2 instances are started.
      • Thread B: The runnable from createSetInstanceRunnable executes, sees that 2 instances are started, and updates the expected count to 3.
      • Thread A: Reaches the if check, which compares expectedContainers.getExpected (3) to runningContainers.count (2). Since the comparison fails, poll is not called and the provision request is never removed from the provisioning queue.

      Subsequent attempts to provision the 3rd container are blocked because the first provision request still appears to be unsatisfied.
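
The interleaving above can be replayed deterministically with a small single-threaded model. This is only a sketch: ProvisioningRaceSketch and all of its members are hypothetical stand-ins, not Twill classes. Only the shape of the equality check is taken from the launchRunnable snippet, and the ALLOCATE_ONE_INSTANCE_AT_A_TIME branch is left out.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the bookkeeping in launchRunnable.
// None of these names exist in Twill; they only mirror the fields used in the check.
class ProvisioningRaceSketch {
  private int expected;  // stands in for expectedContainers.getExpected(runnableName)
  private int running;   // stands in for runningContainers.count(runnableName)
  private final Queue<String> provisioning = new ArrayDeque<>();

  ProvisioningRaceSketch(int initialInstances) {
    this.expected = initialInstances;
    provisioning.add("initial request for " + initialInstances + " instances");
  }

  // Thread B's step: raise the expected count and queue a follow-up request.
  void increaseInstances(int newCount) {
    expected = newCount;
    provisioning.add("request to grow to " + newCount + " instances");
  }

  // Thread A's steps, with a hook marking the point where Thread B may run.
  void launch(int containers, Runnable interleaved) {
    running += containers;       // models runningContainers.start(...)
    interleaved.run();           // Thread B may execute here
    if (expected == running) {   // same comparison as in launchRunnable
      provisioning.poll();
    }
  }

  int pendingRequests() {
    return provisioning.size();
  }

  // Returns how many requests remain queued after launching 2 containers.
  static int pendingAfterLaunch(boolean raceOccurs) {
    ProvisioningRaceSketch s = new ProvisioningRaceSketch(2);
    Runnable step = raceOccurs ? () -> s.increaseInstances(3) : () -> { };
    s.launch(2, step);
    return s.pendingRequests();
  }
}
```

In this model, pendingAfterLaunch(false) leaves the queue empty, while pendingAfterLaunch(true) leaves both requests queued with the already satisfied initial request still at the head.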

      The MaxRetriesTestRun.maxRetriesWithIncreasedInstances method can reproduce this case intermittently if the allRunning.await check is replaced with a CountDownLatch that is counted down in an onRunning callback, as EchoServerTestRun does.
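
A minimal sketch of that latch pattern follows. The Controller interface is a hypothetical stand-in for a controller that accepts an onRunning callback. It is not the real test code, and the timeout handling is an assumption.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

class OnRunningLatchSketch {
  // Hypothetical stand-in for a controller exposing an onRunning callback.
  interface Controller {
    void onRunning(Runnable callback, Executor executor);
  }

  // Registers a latch countdown as the onRunning callback and waits for it.
  // Returns true if the callback fired before the timeout elapsed.
  static boolean awaitRunning(Controller controller, long timeout, TimeUnit unit) {
    CountDownLatch running = new CountDownLatch(1);
    // Runnable::run acts as a same-thread executor for the callback.
    controller.onRunning(running::countDown, Runnable::run);
    try {
      return running.await(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // preserve the interrupt status
      return false;
    }
  }
}
```

Waiting on a callback like this, rather than on a check that every instance has started, lets the test issue the instance increase while startup bookkeeping may still be in progress, which is the window the race needs.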

          People

            Assignee: Unassigned
            Reporter: Martin Serrano (mserrano)