Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
8.4
None
None
Description
Spin-off from SOLR-14128 and SOLR-13368.
In SolrCloud, when a SolrCore is created and it uses a managed schema, its ManagedIndexSchemaFactory automatically upgrades the initial schema.xml to managed-schema. This includes removing the original schema.xml file.
SOLR-13368 added some locking to make sure the changed resource name (i.e. managed-schema) becomes visible only when this process is complete, and that in-flight requests to /admin/schema block until this process is complete, to avoid returning inconsistent data. This locking mechanism uses simple Object monitors.
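To illustrate the kind of monitor-based gating described above, here is a minimal sketch. It is not the SOLR-13368 code: the class `SchemaUpgradeGate` and its methods are hypothetical names, and it only shows how a plain Object monitor can make readers block until the upgrade has finished and the new resource name is published.

```java
// Hypothetical sketch, not Solr code: readers block on an Object monitor
// until the upgrade completes and the new resource name becomes visible.
class SchemaUpgradeGate {
    private final Object lock = new Object();
    private boolean upgradeInProgress = true;     // assumed to start mid-upgrade
    private String resourceName = "schema.xml";

    // Called by the upgrading thread once managed-schema is fully written.
    void finishUpgrade(String newResourceName) {
        synchronized (lock) {
            resourceName = newResourceName;
            upgradeInProgress = false;
            lock.notifyAll();                     // wake every blocked reader
        }
    }

    // Called by request handlers; blocks while the upgrade is in flight.
    String getResourceName() throws InterruptedException {
        synchronized (lock) {
            while (upgradeInProgress) {           // loop guards against spurious wakeups
                lock.wait();
            }
            return resourceName;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SchemaUpgradeGate gate = new SchemaUpgradeGate();
        Thread reader = new Thread(() -> {
            try {
                System.out.println(gate.getResourceName()); // prints "managed-schema"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();
        gate.finishUpgrade("managed-schema");
        reader.join();
    }
}
```

Note that a monitor like this only coordinates threads inside one JVM, so it cannot protect cores running on other nodes.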
However, if there is more than one node in the cluster, a subsequent request to retrieve the schema may execute on a core that has not yet reloaded its schema (ZkIndexSchemaReader uses a ZK watcher, which may take some time to trigger). The resource name in that stale schema still points to schema.xml, which by this time no longer exists because ManagedIndexSchemaFactory in the first core has already removed it.
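The failure sequence can be reproduced deterministically with a small in-memory model. This is only an illustration under stated assumptions: the `HashMap` stands in for the config set stored in ZooKeeper, and `StaleSchemaDemo`, `Core`, `upgrade` and `onWatchFired` are hypothetical names, not Solr classes or methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of the race; no Solr or ZooKeeper code involved.
class StaleSchemaDemo {
    static class Core {
        private final Map<String, String> configSet;  // stand-in for files in ZK
        String resourceName = "schema.xml";           // what this core believes is current

        Core(Map<String, String> configSet) {
            this.configSet = configSet;
        }

        // Models a schema request served by this core.
        Optional<String> readSchema() {
            return Optional.ofNullable(configSet.get(resourceName));
        }

        // Models the (delayed) ZK watcher callback that reloads the schema.
        void onWatchFired() {
            resourceName = "managed-schema";
        }
    }

    // Models the current behaviour: create managed-schema, delete schema.xml at once.
    static void upgrade(Core core, Map<String, String> configSet) {
        configSet.put("managed-schema", configSet.get("schema.xml"));
        configSet.remove("schema.xml");
        core.resourceName = "managed-schema";
    }

    public static void main(String[] args) {
        Map<String, String> configSet = new HashMap<>();
        configSet.put("schema.xml", "<schema/>");
        Core core1 = new Core(configSet);
        Core core2 = new Core(configSet);

        upgrade(core1, configSet);
        System.out.println(core2.readSchema().isPresent()); // false: stale name, file gone
        core2.onWatchFired();
        System.out.println(core2.readSchema().isPresent()); // true once the watcher fires
    }
}
```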
As I see it, there are two bugs here:
- There is no distributed locking when this upgrade is performed, so multiple cores naturally race against each other to perform it.
- The upgrade process removes schema.xml too early. Creating the managed-schema file triggers all other cores to reload from the new managed schema, but the upgrading core should wait until that reload is complete on all cores. Only then is it safe to delete the non-managed resource, because it is no longer in use by any core.
Issue 1 can be solved by adding an ephemeral znode lock so that only one core can perform the upgrade. Issue 2 can be solved by calling ManagedIndexSchema.waitForSchemaZkVersionAgreement after the upgrade and deleting schema.xml only after it completes.
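The ordering of the two proposed fixes can be sketched as follows. This is an in-memory approximation, not a patch: `putIfAbsent` on a `ConcurrentHashMap` stands in for creating an ephemeral lock znode, a `CountDownLatch` stands in for waiting on schema version agreement, and the names `SchemaUpgradeSketch`, `tryUpgrade`, `onCoreReloaded` and the lock key are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposed ordering: lock, upgrade, wait, then delete.
class SchemaUpgradeSketch {
    static final String LOCK = "schema-upgrade.lock";  // assumed lock name

    // Stand-in for the config set (and the lock node) stored in ZooKeeper.
    final Map<String, String> store = new ConcurrentHashMap<>();
    // Counts down once per other core that has reloaded the managed schema.
    private final CountDownLatch reloaded;

    SchemaUpgradeSketch(int otherCores) {
        store.put("schema.xml", "<schema/>");
        reloaded = new CountDownLatch(otherCores);
    }

    // Models another core finishing its reload from managed-schema.
    void onCoreReloaded() {
        reloaded.countDown();
    }

    // Returns true only if this caller performed the upgrade.
    boolean tryUpgrade(String coreName, long timeoutMs) throws InterruptedException {
        // Issue 1: only the first caller acquires the lock (ephemeral znode analog).
        if (store.putIfAbsent(LOCK, coreName) != null) {
            return false;
        }
        try {
            if (store.containsKey("managed-schema")) {
                return false;                       // another core already upgraded
            }
            store.put("managed-schema", store.get("schema.xml"));
            // Issue 2: delete schema.xml only after every other core has reloaded.
            if (reloaded.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                store.remove("schema.xml");
            }
            return true;
        } finally {
            store.remove(LOCK);                     // an ephemeral znode would also vanish
        }
    }
}
```

If the wait times out, this sketch leaves schema.xml in place rather than risk deleting a resource that a stale core may still reference; whether the real fix should behave that way is a design choice not settled in the description above.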