[SOLR-4658] In preparation for dynamic schema modification via REST API, add a "managed" schema facility - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.3, 6.0
Component/s: Schema and Analysis
Labels: None

Description

The idea is to have a set of configuration items in solrconfig.xml:

<schema managed="true" mutable="true" managedSchemaResourceName="managed-schema"/>

It will be a precondition for future dynamic schema modification APIs that mutable="true". solrconfig.xml parsing will fail if mutable="true" but managed="false".
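The parse-time rule above can be sketched as follows. This is only an illustration of the proposal in this issue, not Solr's actual implementation: the function name `parse_schema_settings` is hypothetical, and the defaults for `mutable` ("false") and `managedSchemaResourceName` ("managed-schema") are assumptions, since the description only states that managed="false" is the default.

```python
import xml.etree.ElementTree as ET


def parse_schema_settings(solrconfig_xml: str) -> dict:
    """Read the proposed <schema .../> element and enforce the precondition.

    Hypothetical helper: mirrors the rule described in this issue only.
    """
    root = ET.fromstring(solrconfig_xml)
    elem = root.find("schema")  # direct child of the config root, if present
    attrs = elem.attrib if elem is not None else {}

    managed = attrs.get("managed", "false") == "true"   # default per the issue
    mutable = attrs.get("mutable", "false") == "true"   # default is an assumption
    resource = attrs.get("managedSchemaResourceName", "managed-schema")  # assumption

    # Parsing must fail when mutable="true" but managed="false".
    if mutable and not managed:
        raise ValueError('mutable="true" requires managed="true"')

    return {"managed": managed, "mutable": mutable, "resource": resource}
```

With this sketch, a config containing `<schema managed="false" mutable="true"/>` is rejected, while a config with no `<schema/>` element at all is treated as non-managed.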

When managed="true", and the resource named in managedSchemaResourceName doesn't exist, Solr will automatically upgrade the schema to "managed": the non-managed schema resource (typically schema.xml) is parsed and then persisted at managedSchemaResourceName under $solrHome/$collectionOrCore/conf/, or on ZooKeeper at /configs/$configName/, and the non-managed schema resource is renamed by appending .bak, e.g. schema.xml.bak.
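A minimal sketch of the upgrade step for the local-filesystem case, assuming the conf/ directory is passed in directly. The function name `upgrade_to_managed` is hypothetical, the ZooKeeper case is not covered, and the XML round trip through ElementTree merely stands in for Solr parsing the schema and persisting it.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def upgrade_to_managed(conf_dir, managed_name="managed-schema",
                       legacy_name="schema.xml") -> bool:
    """Upgrade a non-managed schema to "managed" if not already done.

    Returns True if an upgrade happened, False if the managed resource
    already existed. Hypothetical helper for illustration only.
    """
    conf = Path(conf_dir)
    managed = conf / managed_name
    legacy = conf / legacy_name

    if managed.exists():
        return False  # already managed: nothing to do

    # Parse the non-managed schema (raises if it is missing or ill-formed).
    tree = ET.parse(str(legacy))

    # Persist it under the managed resource name.
    tree.write(str(managed), encoding="utf-8", xml_declaration=True)

    # Rename the old resource by appending .bak, e.g. schema.xml.bak.
    legacy.rename(legacy.with_name(legacy.name + ".bak"))
    return True
```

Calling the function a second time on the same directory is a no-op, which matches the description: the upgrade only happens when the managed resource does not yet exist.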

Once the upgrade has taken place, users can retrieve the full schema from the /schema?wt=schema.xml REST API and use it as the basis for modifications. The result can then be used to manually downgrade back to a non-managed schema: put the schema.xml in place, then add <schema managed="false"/> to solrconfig.xml (or remove the whole <schema/> element, since managed="false" is the default).
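As a rough illustration of the first downgrade step, the schema could be fetched like this. Only the `/schema?wt=schema.xml` path comes from this issue; the core base URL (for example `http://localhost:8983/solr/collection1`) and both function names are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def schema_export_url(core_url: str) -> str:
    """Build the export URL for a core; core_url is an assumed base URL."""
    return core_url.rstrip("/") + "/schema?" + urlencode({"wt": "schema.xml"})


def fetch_schema_xml(core_url: str, timeout: float = 10.0) -> bytes:
    """Download the full schema in schema.xml format (needs a running Solr)."""
    with urlopen(schema_export_url(core_url), timeout=timeout) as resp:
        return resp.read()
```

The returned bytes would be saved as schema.xml in the conf/ directory before switching solrconfig.xml back to managed="false".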

If users take no action, then Solr behaves the same as always: the example solrconfig.xml will include <schema managed="false" ...>.

For a discussion of rationale for this feature, see hossman_lucene@fucit.org's post to the solr-user mailing list in the thread "Dynamic schema design: feedback requested" http://markmail.org/message/76zj24dru2gkop7b:

Ignoring for a moment what format is used to persist schema information, I
think it's important to have a conceptual distinction between "data" that
is managed by applications and manipulated by a REST API, and "config"
that is managed by the user and loaded by solr on init – or via an
explicit "reload config" REST API.

Past experience with how users perceive(d) solr.xml has heavily reinforced
this opinion: on one hand, it's a place users must specify some config
information – so people want to be able to keep it in version control
with other config files. On the other hand it's a "live" data file that
is rewritten by solr when cores are added. (God help you if you want to do a
rolling deploy of a new version of solr.xml where you've edited some of the
config values while clients are simultaneously creating new SolrCores)

As we move forward towards having REST APIs that treat schema information
as "data" that can be manipulated, I anticipate the same types of
confusion, misunderstanding, and grumblings if we try to use the same
pattern of treating the existing schema.xml (or some new schema.json) as a
hybrid configs & data file. "Edit it by hand if you want, the /schema/*
REST API will too!" ... Even assuming we don't make any of the same
technical mistakes that have caused problems with solr.xml round tripping
in the past (ie: losing comments, reading new config options that we
forget to write back out, etc...) i'm fairly certain there are still going
to be a lot of things that will look weird and confusing to people.

(XML may have been designed to be both "human readable & writable" and
"machine readable & writable", but practically speaking it's hard to have a
single XML file be "machine and human readable & writable")

I think it would make a lot of sense – not just in terms of
implementation but also for end user clarity – to have some simple,
straightforward to understand caveats about maintaining schema
information...

1) If you want to keep schema information in an authoritative config file
that you can manually edit, then the /schema REST API will be read only.

2) If you wish to use the /schema REST API for read and write operations,
then schema information will be persisted under the covers in a data store
whose format is an implementation detail just like the index file format.

3) If you are using a schema config file and you wish to switch to using
the /schema REST API for managing schema information, there is a
tool/command/API you can run to do so.

4) If you are using the /schema REST API for managing schema information,
and you wish to switch to using a schema config file, there is a
tool/command/API you can run to export the schema info in a config file
format.

...whether or not the "under the covers in a data store" used by the REST
API is JSON, or some binary data, or an XML file that is just schema.xml w/o
whitespace/comments should be an implementation detail. Likewise is the
question of whether some new config file formats are added – it shouldn't
matter.

If it's config it's config and the user owns it.
If it's data it's data and the system owns it.

: is the risk they take if they want to manually edit it - it's no
: different than today when you edit the file and do a Core reload or
: something. I think we can improve some validation stuff around that, but
: it doesn't seem like a show stopper to me.

The new risk is multiple "actors" (both the user, and Solr) editing the
file concurrently, and info that might be lost due to Solr reading the
file, manipulating internal state, and then writing the file back out.

Eg: User hand edits may be lost if they happen on disk during Solr's
internal manipulation of data. API edits may be reflected in the internal
state, but lost if the User writes the file directly and then does a core
reload, etc....

: At a minimum, I think the user should be able to start with a hand
: modified file. Many people heavily modify the example schema to fit
: their use case. If you have to start doing that by making 50 rest API
: calls, that's pretty rough. Once you get your schema nice and happy, you
: might script out those rest calls, but initially, it's much
: faster/easier to whack the schema into place in a text editor IMO.

I don't think there is any disagreement about that. The ability to say
"my schema is a config file and i own it" should always exist (remove
it over my dead body)

The question is what trade offs to expect/require for people who would
rather use an API to manipulate these things – i don't think it's
unreasonable to say "if you would like to manipulate the schema using an
API, then you give up the ability to manipulate it as a config file on
disk"

("if you want the /schema API to drive your car, you have to take your
foot off the pedals and let go of the steering wheel")
