  Solr
  SOLR-14897

HttpSolrCall will forward a virtually unlimited number of times until ClusterState ZkWatcher is updated after collection delete



    Bug
    Status: Closed
    • Blocker
    Resolution: Fixed
    8.6.3
      While investigating the root cause of some SOLR-14896 related failures, I have seen evidence that if a collection is deleted, but a client makes a subsequent request for that collection before the local ClusterState has been updated to remove that DocCollection, HttpSolrCall will forward/proxy that request a (virtually) unbounded number of times in a very short time period - stopping only once the "cached" local DocCollection is updated to indicate there are no active replicas.

      While HttpSolrCall does track & increment a _forwardedCount param on every request it forwards, it doesn't consult that count unless/until it finds a situation where the (local) DocCollection says there are no active replicas.

      So if you have a collection XX with 4 total replicas on 4 different nodes (A,B,C,D), and you delete XX (triggering sequential core deletions on A,B,C,D that fire successive ZkWatchers on various nodes to update the collection state), a request for XX can bounce back and forth between nodes C & D 20+ times until the ClusterState watcher fires on both of those nodes, so they finally realize that the _forwardedCount=20 is more than the 0 active replicas...
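The scenario above can be modelled with a toy simulation. This is only a sketch, not Solr code: the class `ForwardBounceSketch`, the method `simulate`, and the `staleHops` parameter (how many forwards happen before the watchers have fired on both nodes) are hypothetical names introduced for illustration. It assumes the forward count is only compared against the total replica count once the cached state shows no active replicas, as described in this issue.

```java
// Toy model of the forwarding behaviour described in this issue (not Solr code).
class ForwardBounceSketch {

    /**
     * @param staleHops     hypothetical number of forwards that occur while the
     *                      cached state still lists an active replica elsewhere
     * @param totalReplicas total replicas the cached collection state knows about
     * @return number of forwards performed before the request is rejected
     */
    static int simulate(int staleHops, int totalReplicas) {
        int forwardedCount = 0;
        while (true) {
            boolean cacheSeesActiveReplica = forwardedCount < staleHops;
            if (cacheSeesActiveReplica) {
                // Stale cache: forward without ever looking at the count.
                forwardedCount++;
                continue;
            }
            // Cache now shows no active replicas: only here is the count consulted.
            if (forwardedCount > totalReplicas) {
                return forwardedCount; // request finally errors out
            }
            // Limit not yet exceeded: forward to a non-active replica anyway.
            forwardedCount++;
        }
    }

    public static void main(String[] args) {
        System.out.println("forwards with a long stale window:  " + simulate(20, 4));
        System.out.println("forwards with a short stale window: " + simulate(2, 4));
    }
}
```

In this model the number of forwards is bounded by how long the cache stays stale rather than by the replica count, which matches the 20+ bounces reported above.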

      In the below code snippet from HttpSolrCall, the first call to getCoreUrl(...) is expected to return null if there are no active replicas - but it uses the local cached DocCollection, which may think there is an active replica on another node, so it forwards the request to that node - where the replica may have been deleted, so that node runs the same code and may forward the request right back to the original node....

          String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
              activeSlices, byCoreName, true);
          // Avoid getting into a recursive loop of requests being forwarded by
          // stopping forwarding and erroring out after (totalReplicas) forwards
          if (coreUrl == null) {
            if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas){
              throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
                  "No active replicas found for collection: " + collectionName);
            }
            coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
                activeSlices, byCoreName, false);
          }

      ..the check that is supposed to prevent a "recursive loop" is only consulted once a situation arises where the local ClusterState indicates there are no active replicas - which seems to defeat the point of the forward check?  (At that point, if the total number of replicas hasn't been exceeded, the code is happy to forward the request to a coreUrl which the local ClusterState indicates is not active, which also seems to defeat the point?)
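One way to read the concern above is that the forward count should be consulted before any forward, not only when no active replica is found. The following is a minimal sketch of that ordering. It is an assumption about a possible approach, not a description of what the attached patch does; `ForwardGuardSketch`, `chooseForwardTarget`, and the example URLs are hypothetical and are not part of HttpSolrCall.

```java
// Sketch of checking the forward count before choosing any forward target.
class ForwardGuardSketch {

    /**
     * @param forwardedCount   how many times this request has already been forwarded
     * @param totalReplicas    total replicas according to the local cached state
     * @param activeReplicaUrl URL of a replica the cache considers active, or null
     * @param anyReplicaUrl    URL of any known replica, active or not
     * @return the URL to forward to
     * @throws IllegalStateException if the request has been forwarded too often
     */
    static String chooseForwardTarget(int forwardedCount, int totalReplicas,
                                      String activeReplicaUrl, String anyReplicaUrl) {
        // The limit is enforced first, regardless of what the cache says.
        if (forwardedCount > totalReplicas) {
            throw new IllegalStateException(
                "Request already forwarded " + forwardedCount + " times");
        }
        return activeReplicaUrl != null ? activeReplicaUrl : anyReplicaUrl;
    }

    public static void main(String[] args) {
        System.out.println(
            chooseForwardTarget(0, 4, "http://nodeC/solr/XX", "http://nodeD/solr/XX"));
        try {
            chooseForwardTarget(20, 4, "http://nodeC/solr/XX", "http://nodeD/solr/XX");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this ordering a stale cache can still cause a few unnecessary hops, but the number of forwards stays bounded by the replica count instead of by how long the ZkWatcher takes to fire.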



        1. SOLR-14897.patch
          4 kB
          Munendra S N

              munendrasn Munendra S N
              hossman Chris M. Hostetter
