  1. Solr
  2. SOLR-11484

CloudSolrClient's cache of collection clusterstate can cause RouteExceptions when attempting directUpdates after collection modifications

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.2, 8.0
    • Component/s: None
    • Labels:
      None

      Description

      This was discovered while auditing Jenkins failures from
      TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete (where a test explicitly deletes and then recreates a collection with the same name), but as noted in a comment below, SOLR-11392 is another example of the non-obvious test failures that can pop up because of this bug.

      In practice, it can affect any CloudSolrClient user after changes have been made to a collection (adding or moving replicas, etc.).


      Original jira notes...

      TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete
      seems to fail with non-trivial frequency, so I grabbed the logs from a recent failure and started trying to follow along with the actions to figure out what exactly is happening....

      https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/

         [junit4] ERROR   20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
         [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type application/octet-stream but got text/html. <html>
         [junit4]    > <head>
         [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
         [junit4]    > <title>Error 404 </title>
      

      The crux of this failure appears to be a genuine bug in how CloudSolrClient uses its cached ClusterState info when doing (direct) updates. The key bits seem to be:

      • CloudSolrClient does something (update, query, etc.) with a collection, causing the current cluster state for the collection to be cached
      • The actual collection changes such that a Solr node/core no longer exists as part of the collection
      • CloudSolrClient is asked to process an UpdateRequest which triggers the code paths for the directUpdate() method – which attempts to route the updates directly to a replica of the appropriate shard using the (cached) collection state info
      • CloudSolrClient (may) attempt to send that UpdateRequest to a node/core that doesn't exist, getting a 404 – which does not (seem to) trigger a state refresh, or retry to find a correct URL to resend the update to.
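
The sequence above can be reproduced with a toy model. The following is only a sketch under stated assumptions: it does not use SolrJ, and every name in it (StaleRouteSketch, Cluster, CachingClient, refreshOn404, the core URLs) is hypothetical and invented for illustration. It models a client that caches one core URL per collection, and contrasts the reported behaviour (a 404 is returned to the caller) with one possible mitigation (drop the cached entry and retry once). It is not the change made in the attached patches.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the stale-cache failure. None of these names are SolrJ APIs. */
final class StaleRouteSketch {

    /** Stand-in for the authoritative cluster state plus the nodes themselves. */
    static final class Cluster {
        private final Map<String, String> coreUrlByCollection = new HashMap<>();
        private final Set<String> liveCores = new HashSet<>();

        void createCollection(String name, String coreUrl) {
            coreUrlByCollection.put(name, coreUrl);
            liveCores.add(coreUrl);
        }

        void deleteCollection(String name) {
            String coreUrl = coreUrlByCollection.remove(name);
            if (coreUrl != null) {
                liveCores.remove(coreUrl);
            }
        }

        /** Authoritative lookup: which core currently serves the collection. */
        String lookup(String name) {
            return coreUrlByCollection.get(name);
        }

        /** Simulated HTTP status: 200 if the core exists, 404 otherwise. */
        int send(String coreUrl) {
            return liveCores.contains(coreUrl) ? 200 : 404;
        }
    }

    /** Hypothetical client that routes updates using a per-collection cache. */
    static final class CachingClient {
        private final Cluster cluster;
        private final boolean refreshOn404; // assumption: an illustrative switch
        private final Map<String, String> cachedCoreUrl = new HashMap<>();

        CachingClient(Cluster cluster, boolean refreshOn404) {
            this.cluster = cluster;
            this.refreshOn404 = refreshOn404;
        }

        int update(String collection) {
            // First use populates the cache; later uses trust it blindly.
            String coreUrl = cachedCoreUrl.computeIfAbsent(collection, cluster::lookup);
            int status = cluster.send(coreUrl);
            if (status == 404 && refreshOn404) {
                // Mitigation sketch: evict the stale entry and retry once.
                cachedCoreUrl.remove(collection);
                coreUrl = cachedCoreUrl.computeIfAbsent(collection, cluster::lookup);
                status = cluster.send(coreUrl);
            }
            return status;
        }
    }

    /** Create, update, delete, recreate, then update again; returns the last status. */
    static int statusAfterRecreate(boolean refreshOn404) {
        Cluster cluster = new Cluster();
        CachingClient client = new CachingClient(cluster, refreshOn404);
        cluster.createCollection("testcollection", "node1/testcollection_shard1_replica_n1");
        client.update("testcollection"); // caches the node1 core URL
        cluster.deleteCollection("testcollection");
        cluster.createCollection("testcollection", "node2/testcollection_shard1_replica_n3");
        return client.update("testcollection");
    }

    public static void main(String[] args) {
        System.out.println("no refresh:   " + statusAfterRecreate(false)); // 404
        System.out.println("refresh once: " + statusAfterRecreate(true));  // 200
    }
}
```

With refreshOn404 set to false, the second update goes to the core recorded before the collection was deleted and comes back as a 404, mirroring the failure in the log. With it set to true, the stale entry is evicted and the retry reaches the recreated core.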

      Details to follow in comment....

        Attachments

        1. jenkins.thetaphi.20662.txt
          3.83 MB
          Hoss Man
        2. SOLR-11484.patch
          4 kB
          Hoss Man
        3. SOLR-11484.patch
          6 kB
          Noble Paul

              People

              • Assignee:
                noble.paul Noble Paul
                Reporter:
                hossman Hoss Man
              • Votes:
                0
                Watchers:
                11
