[JCLOUDS-1092] Azure: ComputeService.resumeNode spins in a timeout loop that doesn't have a chance to exit early - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.9.2
Fix Version/s: None
Component/s: jclouds-labs
Labels:
- azurecompute

Description

This is going to be a slightly longer text, so please bear with me.

Invoking ComputeService.resumeNode with the Azure provider goes through these layers:

BaseComputeService.resumeNode
AdaptingComputeServiceStrategies.resumeNode
AzureComputeServiceAdapter.resumeNode

The problem manifests when traversing the call stack back up, so let's assume we got down to AzureComputeServiceAdapter.resumeNode. Also, the problem only appears for us when calling suspendNode and then resumeNode in rapid succession, but that's outside of jclouds' control.

When the trackRequest method returns (https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L383), it means that the asynchronous operation "start node" succeeded – but that doesn't mean that the node is already running. In fact, it's only just starting – I was able to confirm that in the debugger by calling api.getDeploymentApiForService(id).get(id) and inspecting the roleInstanceList.

When we get one layer back up, the AdaptingComputeServiceStrategies.resumeNode method calls getNode (see https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/strategy/impl/AdaptingComputeServiceStrategies.java#L164), which delegates to AzureComputeServiceAdapter.getNode.

AzureComputeServiceAdapter.getNode only returns a non-null value when all of the deployment's role instances are in a settled (non-transient) state (see https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L269 ). So when the node is only just starting, AzureComputeServiceAdapter.getNode returns null.

Again one layer back up: AdaptingComputeServiceStrategies.getNode returns null and hence AdaptingComputeServiceStrategies.resumeNode also returns null.

One more layer back up: BaseComputeService.resumeNode calls the nodeRunning predicate with an AtomicReference that holds null, see https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/internal/BaseComputeService.java#L470

The predicate is a ComputeServiceTimeoutsModule.RetryablePredicateGuardingNull, which delegates to Predicates2.RetryablePredicate and through that to AtomicNodeRunning. AtomicNodeRunning is a subclass of RefreshAndDoubleCheckOnFailUnlessStatusInvalid, which always returns false when the resource is null (see https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/predicates/internal/RefreshAndDoubleCheckOnFailUnlessStatusInvalid.java#L63 ). There's also some kind of status refreshing, but that never happens if the resource (the node, in this case) is null, because there's nothing to refresh.

All in all, Predicates2.RetryablePredicate keeps spinning until it times out, because with a null node there's no chance it will exit early.
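
The spinning can be reproduced outside of jclouds with a minimal, self-contained sketch. Note that this is not the jclouds code: the class NullRetrySketch and the methods nodeRunning and retry are hypothetical stand-ins that only mimic the behaviour described above (a check that returns false for a null reference, wrapped in a loop that polls until a timeout), and the node status is modelled as a plain String.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

public class NullRetrySketch {

    /** Hypothetical stand-in for the node-running check: a null reference can never pass. */
    static Predicate<AtomicReference<String>> nodeRunning(AtomicInteger calls) {
        return ref -> {
            calls.incrementAndGet();
            String status = ref.get();
            if (status == null) {
                return false; // nothing to refresh, so the answer can never change
            }
            return "RUNNING".equals(status);
        };
    }

    /** Hypothetical stand-in for a retrying wrapper: polls until success or timeout. */
    static <T> boolean retry(Predicate<T> check, T input, long timeoutMillis, long periodMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (check.test(input)) {
                return true;
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Models what the upper layer receives when getNode returned null.
        AtomicReference<String> node = new AtomicReference<>(null);

        long start = System.nanoTime();
        boolean running = retry(nodeRunning(calls), node, 200, 20);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;

        // The check is evaluated repeatedly, always fails, and the whole timeout is used up.
        System.out.println("running=" + running + ", calls=" + calls.get()
                + ", elapsedMillis=" + elapsedMillis);
    }
}
```

In this model the loop always runs for the full timeout and reports false, regardless of what happens to the real node in the meantime.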

After the timeout, BaseComputeService.resumeNode reports that resuming the node was not successful and returns. The problems are:

- the retrying predicate spins uselessly until the timeout
- we actually have no idea about the status of the node when resumeNode returns
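
For illustration only, here is a sketch of how a polling loop could avoid both problems by looking the node up by its id on every attempt instead of holding on to a null reference. Everything in it is an assumption: RefetchByIdSketch, awaitRunning and statusLookup are hypothetical names, statusLookup stands in for some lookup-by-id call that returns null while the node is unsettled, and the status is again modelled as a plain String. It is not a patch against jclouds.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

public class RefetchByIdSketch {

    /**
     * Polls a (hypothetical) status lookup by node id until the node is RUNNING
     * or the timeout elapses. The last observed status is stored in lastSeen,
     * so the caller knows what was seen even when the wait fails.
     */
    static boolean awaitRunning(String nodeId,
                                Function<String, String> statusLookup,
                                AtomicReference<String> lastSeen,
                                long timeoutMillis,
                                long periodMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        do {
            String status = statusLookup.apply(nodeId); // may be null while unsettled
            lastSeen.set(status);
            if ("RUNNING".equals(status)) {
                return true; // early exit as soon as the node has settled
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } while (System.nanoTime() < deadline);
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger lookups = new AtomicInteger();
        // Simulated backend: the node is "unsettled" (null) for the first three lookups.
        Function<String, String> lookup =
                id -> lookups.incrementAndGet() < 4 ? null : "RUNNING";

        AtomicReference<String> lastSeen = new AtomicReference<>();
        boolean running = awaitRunning("node-1", lookup, lastSeen, 1000, 10);

        System.out.println("running=" + running + ", lookups=" + lookups.get()
                + ", lastSeen=" + lastSeen.get());
    }
}
```

With the simulated backend above, the loop stops on the fourth lookup instead of waiting for the timeout, and lastSeen tells the caller which status was observed last.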

People

Assignee: Unassigned

Reporter: Ladislav Thon

Votes: 0

Watchers: 3

Dates

Created: 15/Mar/16 12:32

Updated: 21/Apr/16 21:52