[SOLR-10904] Unnecessary waiting during failover in case of failed core creation - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 7.0
Fix Version/s: None
Component/s: None
Labels: None

Description

The background failover thread checks for bad replicas. When it finds one, it tries to create a replacement on another node and then waits for the new replica to show up in the cluster state. It waits even if the core creation, which it initiated itself, fails.

This situation does not occur on the happy path of failover, because there the new node has been marked as alive. But if the cluster is in an unstable state, the user is restarting the new node, or the overseer is overloaded, this extra wait holds up the failover thread.

A possible solution:

1. Wait for the result of the core creation.
2. Only if the previous step is successful, proceed to wait for the cluster state change.

In code:

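try {
  // Submit core creation to the update executor, then block until it finishes or times out.
  Future<Boolean> future = updateExecutor.submit(() -> createSolrCore(collection, createUrl, dataDir, ulogDir, coreNodeName, coreName, shardId));
  if (!future.get(30000L, TimeUnit.MILLISECONDS)) {
    return false; // creation reported failure, so skip the cluster state wait
  }
} catch (InterruptedException | ExecutionException | TimeoutException e) {
  if (e instanceof InterruptedException) {
    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
  }
  log.error("Error creating core", e);
  return false;
} finally {
  MDC.remove("OverseerAutoReplicaFailoverThread.createUrl");
}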

Since the failover thread would then block on the result anyway, we could consider running core creation directly in the failover thread instead of submitting it to the updateExecutor.
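To make the proposed ordering concrete outside of Solr, here is a minimal, self-contained sketch. It is not the Solr implementation: the class `FailoverSketch`, the method `createThenWait`, and its parameters (`createCore`, `replicaVisible`, the two timeouts) are hypothetical stand-ins for the core creation call and the cluster state check. It only illustrates that the cluster state wait is skipped when creation fails, throws, or times out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class FailoverSketch {

  /**
   * Step 1: wait for core creation to finish.
   * Step 2: only if it succeeded, poll until the replica becomes visible.
   * All parameters are hypothetical stand-ins, not Solr APIs.
   */
  public static boolean createThenWait(ExecutorService executor, Callable<Boolean> createCore,
      BooleanSupplier replicaVisible, long createTimeoutMs, long stateTimeoutMs) {
    Future<Boolean> future = executor.submit(createCore);
    try {
      if (!Boolean.TRUE.equals(future.get(createTimeoutMs, TimeUnit.MILLISECONDS))) {
        return false; // creation reported failure: skip the cluster state wait
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      future.cancel(true);
      return false;
    } catch (ExecutionException | TimeoutException e) {
      future.cancel(true); // no effect if the task already finished
      return false;
    }

    // Creation succeeded, so waiting for the cluster state is now worthwhile.
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(stateTimeoutMs);
    while (System.nanoTime() - deadline < 0) {
      if (replicaVisible.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return replicaVisible.getAsBoolean();
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      // Failed creation: returns without ever polling for the replica.
      System.out.println(createThenWait(executor, () -> false, () -> true, 1000, 1000)); // prints false
      // Successful creation followed by a visible replica.
      System.out.println(createThenWait(executor, () -> true, () -> true, 1000, 1000)); // prints true
    } finally {
      executor.shutdownNow();
    }
  }
}
```

If core creation were moved into the failover thread as suggested, the `executor.submit` and `future.get` pair would become a direct call to `createCore`, at the cost of losing the timeout on that step.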

I can post a patch with these changes if the solution seems appropriate.

People

Assignee: Unassigned
Reporter: Mihaly Toth
Votes: 0
Watchers: 3

Dates

Created: 16/Jun/17 07:37
Updated: 23/Oct/19 19:40