[SOLR-11484] CloudSolrClient's cache of collection clusterstate can cause RouteExceptions when attempting directUpdates after collection modifications - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 7.2, 8.0
Component/s: None
Labels:
None

Description

This was discovered while auditing Jenkins failures from TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete (where a test explicitly deletes and then recreates a collection with the same name), but as noted in a comment below, SOLR-11392 is another example of the non-obvious test failures that can pop up because of this bug.

In practice, it can affect any CloudSolrClient user after changes have been made to a collection (to add/move replicas, etc.).

Original jira notes...

TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete seems to fail with non-trivial frequency, so I grabbed the logs from a recent failure and started trying to follow along with the actions to figure out what exactly is happening...

https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/

   [junit4] ERROR   20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
   [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type application/octet-stream but got text/html. <html>
   [junit4]    > <head>
   [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
   [junit4]    > <title>Error 404 </title>

The crux of this failure appears to be a genuine bug in how CloudSolrClient uses its cached ClusterState info when doing (direct) updates. The key bits seem to be:

1. CloudSolrClient does something (update, query, etc.) with a collection, causing the current cluster state for the collection to be cached.
2. The actual collection changes such that a Solr node/core no longer exists as part of the collection.
3. CloudSolrClient is asked to process an UpdateRequest, which triggers the code paths for the directUpdate() method. That method attempts to route the updates directly to a replica of the appropriate shard using the (cached) collection state info.
4. CloudSolrClient (may) attempt to send that UpdateRequest to a node/core that doesn't exist, getting a 404, which does not (seem to) trigger a state refresh or a retry to find a correct URL to resend the update to.
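
The sequence above can be modeled without any Solr dependency. The sketch below is a toy simulation, not SolrJ code: the class `StaleRouteSketch`, its maps, its methods, and the second core name `testcollection_shard1_replica_n7` are all made up for illustration. It shows how a routing decision taken from a cached state yields a 404 after a delete/recreate, and how refreshing the cache and retrying on 404 would be one possible remedy (not necessarily what the attached patches do).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names exist in SolrJ.
class StaleRouteSketch {
    // Stand-in for the live cluster state: collection name -> core URL serving it.
    static final Map<String, String> liveState = new HashMap<>();
    // Stand-in for the cores that actually exist on some node.
    static final Set<String> liveCores = new HashSet<>();
    // Stand-in for the client's cached copy of the collection state.
    static final Map<String, String> cachedState = new HashMap<>();

    static void createCollection(String name, String coreUrl) {
        liveState.put(name, coreUrl);
        liveCores.add(coreUrl);
    }

    static void deleteCollection(String name) {
        liveCores.remove(liveState.remove(name));
    }

    // Simulated HTTP status: 404 when the targeted core no longer exists.
    static int send(String coreUrl) {
        return liveCores.contains(coreUrl) ? 200 : 404;
    }

    // Routes using the cached state; optionally refreshes and retries once on 404.
    static int directUpdate(String collection, boolean refreshOn404) {
        String url = cachedState.computeIfAbsent(collection, liveState::get);
        int status = send(url);
        if (status == 404 && refreshOn404) {
            cachedState.put(collection, liveState.get(collection));
            status = send(cachedState.get(collection));
        }
        return status;
    }

    // Steps 1-4: cache the state, delete and recreate the collection, update again.
    static int scenario(boolean refreshOn404) {
        liveState.clear();
        liveCores.clear();
        cachedState.clear();
        createCollection("testcollection", "/solr/testcollection_shard1_replica_n3");
        directUpdate("testcollection", refreshOn404);   // step 1: state gets cached
        deleteCollection("testcollection");             // step 2: old core disappears
        createCollection("testcollection", "/solr/testcollection_shard1_replica_n7"); // hypothetical new core
        return directUpdate("testcollection", refreshOn404); // steps 3-4
    }

    public static void main(String[] args) {
        System.out.println("no refresh on 404:   " + scenario(false)); // prints 404
        System.out.println("refresh + retry once: " + scenario(true));  // prints 200
    }
}
```

In this model the only difference between the failing and passing runs is whether a 404 invalidates the cached entry before giving up, which is the behavior step 4 reports as missing.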

Details to follow in comment....

Attachments

- jenkins.thetaphi.20662.txt (13/Oct/17 00:20, 3.83 MB, Chris M. Hostetter)
- SOLR-11484.patch (26/Oct/17 12:22, 6 kB, Noble Paul)
- SOLR-11484.patch (17/Oct/17 17:09, 4 kB, Chris M. Hostetter)

Issue Links

breaks

SOLR-11392 StreamExpressionTest.testParallelExecutorStream fails too frequently

Open

links to

GitHub Pull Request #264

People

Assignee: Noble Paul

Reporter: Chris M. Hostetter

Votes: 0

Watchers: 11

Dates

Created: 13/Oct/17 00:19

Updated: 21/Nov/19 00:41

Resolved: 28/Oct/17 01:38