SOLR-4658 (sub-task of SOLR-3251: dynamically add fields to schema)

In preparation for dynamic schema modification via REST API, add a "managed" schema facility

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, Trunk
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      The idea is to have a set of configuration items in solrconfig.xml:

      <schema managed="true" mutable="true" managedSchemaResourceName="managed-schema"/>
      

      Future dynamic schema modification APIs will require mutable="true". Parsing solrconfig.xml will fail if mutable="true" but managed="false".
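      As a rough illustration of the precondition above, the following Python sketch parses a <schema/> element and rejects the invalid combination. It is not Solr's implementation (Solr is written in Java); the function name parse_schema_element, the default for mutable, and the default resource name are assumptions made for this example. Only the managed="false" default comes from this description.

```python
import xml.etree.ElementTree as ET


def parse_schema_element(xml_text):
    """Return (managed, mutable, resource_name) for a <schema/> element.

    Hypothetical helper for illustration only; not part of Solr.
    """
    elem = ET.fromstring(xml_text)
    # managed="false" is the documented default.
    managed = elem.get("managed", "false").lower() == "true"
    # Assumption: mutable also defaults to false when absent.
    mutable = elem.get("mutable", "false").lower() == "true"
    # Assumption: fall back to "managed-schema" when no name is given.
    resource = elem.get("managedSchemaResourceName", "managed-schema")
    # A mutable schema only makes sense if the schema is managed.
    if mutable and not managed:
        raise ValueError('mutable="true" requires managed="true"')
    return managed, mutable, resource
```

      Calling it on the element shown earlier yields a managed, mutable schema stored as managed-schema, while <schema managed="false" mutable="true"/> raises an error.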

      When managed="true" and the resource named in managedSchemaResourceName doesn't exist, Solr will automatically upgrade the schema to "managed". The non-managed schema resource (typically schema.xml) is parsed and then persisted at managedSchemaResourceName, either under $solrHome/$collectionOrCore/conf/ or on ZooKeeper at /configs/$configName/. The non-managed schema resource is then renamed by appending .bak, e.g. schema.xml.bak.
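      The upgrade steps for the local-filesystem case can be sketched as follows. This is an assumption-laden Python illustration, not the patch's Java code: upgrade_to_managed is a hypothetical name, the ZooKeeper case is not covered, and the sketch copies the file text verbatim whereas Solr parses the schema and persists the parsed form.

```python
from pathlib import Path


def upgrade_to_managed(conf_dir, managed_name="managed-schema",
                       legacy_name="schema.xml"):
    """Upgrade a conf/ directory to a managed schema if needed.

    Returns True if an upgrade happened, False if the managed
    resource already existed. Hypothetical helper, not Solr code.
    """
    conf = Path(conf_dir)
    managed_path = conf / managed_name
    legacy_path = conf / legacy_name
    if managed_path.exists():
        # Already managed: leave everything untouched.
        return False
    # Simplification: copy the text instead of parsing and re-serializing.
    # Raises FileNotFoundError if the non-managed schema is missing too.
    text = legacy_path.read_text(encoding="utf-8")
    managed_path.write_text(text, encoding="utf-8")
    # Rename schema.xml to schema.xml.bak so it is no longer picked up.
    legacy_path.rename(legacy_path.with_name(legacy_path.name + ".bak"))
    return True
```

      Running it twice on the same directory is harmless: the second call sees the managed resource and does nothing.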

      Once the upgrade has taken place, users can get the full schema from the /schema?wt=schema.xml REST API and use it as the basis for modifications. The result can then be used to manually downgrade back to a non-managed schema: put the schema.xml in place, then add <schema managed="false"/> to solrconfig.xml (or remove the whole <schema/> element, since managed="false" is the default).
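      A minimal sketch of the export step, using only the Python standard library. The /schema?wt=schema.xml path comes from this description; the base URL passed in (host, port, and core name) and the function names are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def schema_export_url(core_url):
    """Build the schema export URL for a core's base URL."""
    return core_url.rstrip("/") + "/schema?" + urlencode({"wt": "schema.xml"})


def export_schema(core_url, dest_path):
    """Download the full schema and save it, e.g. as schema.xml.

    Requires a running Solr instance reachable at core_url.
    """
    with urlopen(schema_export_url(core_url)) as resp:
        data = resp.read()
    with open(dest_path, "wb") as f:
        f.write(data)
```

      For example, export_schema("http://localhost:8983/solr/collection1", "schema.xml") would write the exported schema to the current directory, assuming a core is served at that hypothetical address. The file can then be edited and placed in conf/ before switching solrconfig.xml back to managed="false".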

      If users take no action, then Solr behaves the same as always: the example solrconfig.xml will include <schema managed="false" ...>.

      For a discussion of rationale for this feature, see Chris Hostetter's post to the solr-user mailing list in the thread "Dynamic schema design: feedback requested" http://markmail.org/message/76zj24dru2gkop7b:

      Ignoring for a moment what format is used to persist schema information, I
      think it's important to have a conceptual distinction between "data" that
      is managed by applications and manipulated by a REST API, and "config"
      that is managed by the user and loaded by solr on init – or via an
      explicit "reload config" REST API.

      Past experience with how users perceive(d) solr.xml has heavily reinforced
      this opinion: on one hand, it's a place users must specify some config
      information – so people want to be able to keep it in version control
      with other config files. On the other hand it's a "live" data file that
      is rewritten by solr when cores are added. (God help you if you want to do a
      rolling deploy of a new version of solr.xml where you've edited some of the
      config values while simultaneously clients are creating new SolrCores)

      As we move forward towards having REST APIs that treat schema information
      as "data" that can be manipulated, I anticipate the same types of
      confusion, misunderstanding, and grumblings if we try to use the same
      pattern of treating the existing schema.xml (or some new schema.json) as a
      hybrid configs & data file. "Edit it by hand if you want, the /schema/*
      REST API will too!" ... Even assuming we don't make any of the same
      technical mistakes that have caused problems with solr.xml round tripping
      in the past (ie: losing comments, reading new config options that we
      forget to write back out, etc...) i'm fairly certain there is still going
      to be a lot of things that will look weird and confusing to people.

      (XML may have been designed to be both "human readable & writable" and
      "machine readable & writable", but practically speaking it's hard to have a
      single XML file be "machine and human readable & writable")

      I think it would make a lot of sense – not just in terms of
      implementation but also for end user clarity – to have some simple,
      straightforward to understand caveats about maintaining schema
      information...

      1) If you want to keep schema information in an authoritative config file
      that you can manually edit, then the /schema REST API will be read only.

      2) If you wish to use the /schema REST API for read and write operations,
      then schema information will be persisted under the covers in a data store
      whose format is an implementation detail just like the index file format.

      3) If you are using a schema config file and you wish to switch to using
      the /schema REST API for managing schema information, there is a
      tool/command/API you can run to do so.

      4) if you are using the /schema REST API for managing schema information,
      and you wish to switch to using a schema config file, there is a
      tool/command/API you can run to export the schema info in a config file
      format.

      ...whether or not the "under the covers in a data store" used by the REST
      API is JSON, or some binary data, or an XML file that is just schema.xml w/o
      whitespace/comments should be an implementation detail. Likewise is the
      question of whether some new config file formats are added – it shouldn't
      matter.

      If it's config it's config and the user owns it.
      If it's data it's data and the system owns it.

      : is the risk they take if they want to manually edit it - it's no
      : different than today when you edit the file and do a Core reload or
      : something. I think we can improve some validation stuff around that, but
      : it doesn't seem like a show stopper to me.

      The new risk is multiple "actors" (both the user, and Solr) editing the
      file concurrently, and info that might be lost due to Solr reading the
      file, manipulating internal state, and then writing the file back out.

      Eg: User hand edits may be lost if they happen on disk during Solr's
      internal manipulation of data. API edits may be reflected in the internal
      state, but lost if the User writes the file directly and then does a core
      reload, etc....

      : At a minimum, I think the user should be able to start with a hand
      : modified file. Many people heavily modify the example schema to fit
      : their use case. If you have to start doing that by making 50 rest API
      : calls, that's pretty rough. Once you get your schema nice and happy, you
      : might script out those rest calls, but initially, it's much
      : faster/easier to whack the schema into place in a text editor IMO.

      I don't think there is any disagreement about that. The ability to say
      "my schema is a config file and i own it" should always exist (remove
      it over my dead body)

      The question is what trade offs to expect/require for people who would
      rather use an API to manipulate these things – i don't think it's
      unreasonable to say "if you would like to manipulate the schema using an
      API, then you give up the ability to manipulate it as a config file on
      disk"

      ("if you want the /schema API to drive your car, you have to take your
      foot off the pedals and let go of the steering wheel")

      1. SOLR-4658-fix-serialization.patch
        35 kB
        Steve Rowe
      2. SOLR-4658.patch
        79 kB
        Steve Rowe
      3. SOLR-4658.patch
        96 kB
        Steve Rowe

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Steve Rowe
          • Votes:
            0
            Watchers:
            1
