SOLR-3251: dynamically add fields to schema

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 5.0
    • Component/s: None
    • Labels: None

      Description

      One related piece of functionality needed for SOLR-3250 is the ability to dynamically add a field to the schema.

      Attachments

      1. SOLR-3251.patch
        328 kB
        Steve Rowe
      2. SOLR-3251.patch
        345 kB
        Steve Rowe
      3. SOLR-3251.patch
        230 kB
        Steve Rowe
      4. SOLR-3251.patch
        219 kB
        Steve Rowe
      5. SOLR-3251.patch
        106 kB
        Steve Rowe
      6. SOLR-3251.patch
        72 kB
        Steve Rowe
      7. SOLR-3251.patch
        3 kB
        Yonik Seeley


          Activity

          Yonik Seeley added a comment -

          Here's a quick start... no tests or external API yet.

          Yonik Seeley added a comment -

          Any ideas for an external API?

          We could use a single entry point for all things schema related...
          http://localhost:8983/solr/schema
          {"addField":{"myfield":{"type":"int" ...}}}

          Or more specific to fields...
          http://localhost:8983/solr/fields
          OR
          PUT/POST to http://localhost:8983/solr/schema/fields (nesting all schema related stuff under "schema" would help pollute the namespace less)
          {"myfield":{"type":"int" ...}}

          I'm leaning toward the last option. Thoughts?
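          The candidate request shapes above can be sketched concretely. This is purely illustrative: none of these endpoints existed at this point in the discussion, and the payload structure is only what the comment itself floats.

```shell
# Illustrative request shapes for the API options above; the endpoints
# and payloads are hypothetical, taken from the proposal in the comment.
payload='{"myfield":{"type":"int"}}'

# Option 1: one entry point for everything schema-related:
#   curl -X POST -d '{"addField":'"$payload"'}' http://localhost:8983/solr/schema
# Last option: PUT/POST to a nested fields resource:
#   curl -X PUT -d "$payload" http://localhost:8983/solr/schema/fields

echo "$payload"
```

          The nested form keeps the top-level URL namespace cleaner, which is the argument made for the last option.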

          Grant Ingersoll added a comment -

          I like the last option, b/c you can easily see this evolving to support other things like field types, etc. If we go down this route, would be nice to make it RESTful.

          Ryan McKinley added a comment -

          Does this imply that the schema would be writeable?

          The PUT/POST option is nicer.

          Ryan McKinley added a comment -

          What are the thoughts on error handling? Are you only able to add fields that don't exist? What if they exist in the schema but not in the index? What about if the index Analyzer is identical, but the query Analyzer has changed?

          Yonik Seeley added a comment -

          Does this imply that the schema would be writeable?

          The in-memory schema object yes.
          The question is how to persist changes. I was thinking it might be easiest to keep a separate file alongside schema.xml for dynamically added fields for now. The term "dynamicFields" has already been taken though and we probably shouldn't overload it. Maybe extra_fields.json? Or maybe even schema.json/schema.yaml that acts as an extension of schema.xml (and could acquire additional features over time such as the ability to define types too?)

          But a separate file that just lists fields will be much quicker (and easier) to update. Reloading a full schema.xml (along with type instantiation) would currently be somewhat prohibitive.
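          A minimal sketch of what such a companion file might contain, assuming a flat JSON map of field definitions. The name extra_fields.json and the layout are just the possibilities floated above, not a decided format:

```shell
# Hypothetical contents of a separate file holding dynamically added
# fields; filename and structure are assumptions from the comment above.
extra='{"fields":{"myfield":{"type":"int","stored":true}}}'
echo "$extra"
```

          Being a flat list of fields, a file like this could be rewritten on each field addition without re-parsing schema.xml, which is exactly the reload cost the comment is trying to avoid.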

          Sami Siren added a comment - edited

          Any ideas for an external API?

          I like the latter option more.

          Ryan McKinley added a comment -

          separate file alongside schema.xml

          This makes sense.

          As is, the ad-hoc naming conventions in schema make writing out the full schema pretty daunting.

          Hoss Man added a comment -

          Any ideas for an external API?

          I think the best way to support this externally is using the existing mechanism for plugins...

          • a RequestHandler people can register (if they want to support external clients programmatically modifying the schema) that accepts ContentStreams containing whatever payload structure makes sense given the functionality.
          • an UpdateProcessor people can register (if they want to support stuff like SOLR-3250, where clients adding documents can submit any field name and a type is added based on the type of the value) which could be configured with mappings of Java types to fieldTypes and rules about other field attributes – i.e. "if a client submits a new field=value with a java.lang.Integer value, create a new tint field with that name and set stored=true."
          Hoss Man added a comment -

          a low-level implementation detail i would worry about is "snapshotting" the schema for the duration of a single request .. i suspect there are more than a few places in solr that would generate weird exceptions if multiple calls to "req.getSchema().getFields()" returned different things in the middle of processing a single request.

          Yonik Seeley added a comment -

          a low-level implementation detail i would worry about is "snapshotting" the schema for the duration of a single request .. i suspect there are more than a few places in solr that would generate weird exceptions if multiple calls to "req.getSchema().getFields()" returned different things in the middle of processing a single request.

          It's good to think about, but I'm not sure it will be a problem in practice. Adding a new field shouldn't be an issue for most code.
          Removing a field is a different matter... but if a query explicitly references a field (for example) and then it disappears, having that cause an exception is fine if it would also cause an exception if the field were missing.

          Instead of snapshotting, I think we should think about where fields changing could be a problem and then harden the code against that. If it does get too difficult, then we could revisit schema snapshots.

          Yonik Seeley added a comment -

          Regarding PUT, it doesn't seem to be allowed by our current implementation (I think it's a request parser implementation detail).
          Should we change that to allow us to be more REST-like?
          Or should we go further and integrate something like Restlet?

          Grant Ingersoll added a comment -

          I'm a big fan of Restlet, and it could be a nice segue to supporting more things as pure REST. The downside is another moving part. I think Restlet even has some (old) Solr integration.

          Yonik Seeley added a comment -

          Thinking a little further about this, building a new schema when it changes (i.e. making the schema effectively immutable) might be a good idea too.
          For performance reasons, we'd want to share/reuse objects across the different schema instances of course.

          Steve Rowe added a comment -

          I'm interested in working on this issue. I'm new to the ways of Solr dev, though, so I'd appreciate assistance in getting things done right.

          Thinking a little further about this, building a new schema when it changes (i.e. making the schema effectively immutable) might be a good idea too.
          For performance reasons, we'd want to share/reuse objects across the different schema instances of course.

          The DOM for the previous schema could be kept around and compared to the DOM for the new schema, and each object could keep a reference to the DOM node from which it came. When corresponding DOM nodes compare as equal, then reloading that object isn't necessary.

          Keeping the DOM around would also allow for round-tripping comments and whitespace, since those can be stored in the DOM. To make the new schema's DOM on the node handling the add field request, copy the old DOM, then insert a node for the new field. On second thought, I think it makes sense for the DOM to be mutable, and not require a full copy on minting a new schema, since otherwise unchanged objects would need to be modified to point to their new DOM node, and objects in the old schema will no longer refer to the old DOM.

          Erik Hatcher added a comment -

          Keeping the DOM around would also allow for round-tripping comments and whitespace...

          IMO - The XMLness of the current Solr schema needs to be isolated to only one optional way of constructing an IndexSchema instance. We want less XML rather than more. (for example, it should be possible to have a relational database that contains a model of a schema and load it that way)

          Steve Rowe added a comment -

          IMO - The XMLness of the current Solr schema needs to be isolated to only one optional way of constructing an IndexSchema instance. We want less XML rather than more. (for example, it should be possible to have a relational database that contains a model of a schema and load it that way)

          Well, I don't want to change the entire world all at once here. And the serialized representation in Zookeeper won't be a relational DB (dump), but I suppose it could be JSON or YAML instead of XML. AFAICT, YAML isn't used in Solr anywhere. And JSON doesn't support comments, but I think documentation could be included as a "documentation":"comment" pair at the appropriate level, similar to how W3C XML Schema syntax uses <documentation> within <annotation>.

          But I guess you're arguing against depending on an XML-specific intermediate representation (the DOM)?

          Erik Hatcher added a comment - edited

          And the serialized representation in Zookeeper won't be a relational DB (dump), but I suppose it could be JSON or YAML instead of XML

          True, and it could also be broken down into individual keys rather than one big "schema" blob... such that an individual field is defined under a schema tree rather than a config "file" being stored in XML format.

          But I guess you're arguing against depending on an XML-specific intermediate representation (the DOM)?

          Kinda, yeah, but I'm just thinking out loud here and just making sure we don't over XML things further.

          Regarding documentation: perhaps a field could be documented with a "comment" or "description" attribute.

          Erick Erickson added a comment -

          FWIW, I've seen situations in which the actual structure of schema.xml doesn't reflect what we "usually" think of as correct, i.e. I saw something like (going from memory)
          <fields>
          <field>...</field>
          <copyField />
          <field>...</field>
          </fields>

          But since the DOM traversal just asks for all leaf nodes in some situations, this worked just fine. Something to keep in mind when thinking about this in terms of breaking existing installations. That said, I don't think we should strain to preserve this behavior.

          FWIW,
          Erick

          Steve Rowe added a comment -

          copyField ... DOM traversal

          Yeah, I saw that. I also noticed that the comment above the "//copyField" leaf-node-anywhere XPath says that the expression is "/schema/copyField", and that both lines date back to Yonik's 2006 "initial version", so this flexibility is long-standing.

          Steve Rowe added a comment -

          And the serialized representation in Zookeeper won't be a relational DB (dump), but I suppose it could be JSON or YAML instead of XML

          True, and it could also be broken down into individual keys rather than one big "schema" blob... such that an individual field is defined under a schema tree rather than a config "file" being stored in XML format.

          I was thinking that Zookeeper watches would be the way to broadcast changes. Each znode (aka individual key) is distinguished from others by its fully qualified name, so a field's "key" would have to be its "name" attribute. ZK watches can be on znodes that don't exist yet, but it wouldn't be feasible to set watches on the entire space of possible field names. So the watch would have to be on the parent of the znodes for the individual fields.

          Seems like batching schema changes would be useful, both on the sending and receiving end. But having individual znodes for each field wouldn't allow for batching notifications via watches - each change would trigger a watch.

          Steve Rowe added a comment -

          On schema serialization, I prefer not to keep a separate file for new fields, because updated fields will need special handling, and deleted fields will need either a separate file or an operation attribute (add or delete) in the separate file. Just seems like the design will be simpler if we keep it to just one file.

          I'm guessing the intent with the separate file is that it would be simpler to isolate changes and not have to reload the entire schema when a field is added. I think the reload costs can be reduced by comparing each component (field/fieldtype) to its previous serialized form, and only re-instantiating if not identical. (This strategy would require some form of canonicalization to work properly.)

          Steve Rowe added a comment -

          Here's a rough outline of what I'd like to do:

          1. Add schema field & fieldtype details REST API methods getFields, getField, getFieldtypes, and getFieldtype. (A client should be able to find out the current set of fields/fieldtypes before adding a new one.)
          2. Change Solr schema serialization from XML to JSON, and provide an XML->JSON conversion tool.
          3. Add internal addField method:
            • Modify in-memory schema, based on Yonik's patch
            • Persist/reload schema:
              • In standalone mode (and on node handling addField request in SolrCloud mode), after creating new schema, persist (locally or to ZK), then switch to new schema.
              • In SolrCloud mode, set watches on the schema in ZK; when triggered, read/parse schema from ZK, and reload changed parts only. In-flight requests will not see schema changes.
          4. Implement REST API addField method; add Restlet dependency.

          I'm proposing to keep the schema in ZK as a znode, as it is now, rather than having one znode per section or per item.
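          The read-side methods in step 1 might map onto resource URLs like the following sketch. The base URL, the paths, and the names (myfield, text_general) are illustrative assumptions, since these endpoints did not exist yet:

```shell
# Possible URL layout for the proposed read-only schema methods;
# paths and names are placeholders, not a real API.
base='http://localhost:8983/solr/schema'
for path in fields fields/myfield fieldtypes fieldtypes/text_general; do
  echo "GET $base/$path"
done
```

          A client could then inspect the current set of fields and field types with plain GETs before attempting an addField.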

          Steve Rowe added a comment -

          This patch builds on Yonik's patch:

          • Persistence is added, both locally and to ZooKeeper.
          • ZooKeeper persistence uses optimistic concurrency: when attempting to persist after adding a new field, if the remote version is different from the locally cached version, the schema is fetched from ZooKeeper, field definitions are reloaded (not the whole schema), and the field addition is redone and persistence is re-attempted, repeating until it succeeds.
          • JSON PUT is enabled on /schema/fields/(name), in idempotent fashion: repeated identical requests don't result in schema changes (though no special check is done to achieve this - after such a request, the schema is still persisted as though there were a real change.)

          So far there are only standalone tests, which pass. I'm working on REST API tests and SolrCloud tests.

          I'm not sure whether anything needs to be done to enable replication of the managed schema, since it has a different name. I think master/slave mode will just work, since AFAIK replication of config files triggers a core reload on the slaves.

          I'd appreciate feedback on this, I think it's close.

          Steve Rowe added a comment -

          I did some timings loading the example schema, and found that 90-95% of the time is spent loading the fieldtypes' analyzers.

          On my MacBook Pro, reading in the whole schema takes over 900ms the first time, gradually reducing to about 400ms after 6 or 7 trials.

          Robert Muir added a comment -

          by any chance do you have any more fine-grained details on why it's so slow to load the fieldtypes' analyzers?

          maybe there is a bad apple or two in the analysis factories?

          Steve Rowe added a comment -

          by any chance do you have any more fine-grained details on why it's so slow to load the fieldtypes' analyzers?

          maybe there is a bad apple or two in the analysis factories?

          The example schema has a bunch of field types.

          I didn't look at individual analyzer timings, but there weren't any obvious outliers - I was doing millisecond resolution with System.currentTimeMillis(), so relative differences for low single-digit millisecond timings (which they all were) are hard to assess.

          Steve Rowe added a comment - edited

          JSON PUT is enabled on /schema/fields/(name)

          Here's an example:

          edit: forgot to include the (required) application/json content-type - fixed now.

          curl -X PUT -H 'Content-type:application/json' -d '{"type":"text_general","stored":"false"}' http://localhost:8983/solr/schema/fields/newfield
          
          Yonik Seeley added a comment -

          Looking good Steve! I see you went the route of a mutable schema... any thoughts about areas that might be problematic with a changing schema in the middle of a request?
          I originally went with a mutable schema too, but I was starting to lean toward an immutable schema (i.e. a change would create a new schema).

          Steve Rowe added a comment -

          Thanks for taking a look, Yonik.

          I went with mutable mostly because when I looked to see how to handle immutable, it looked way more complicated.

          I couldn't come up with any issues where fields added in the middle of a request would be a problem. This is simplified because the rest of the schema isn't reloaded.

          Erick Erickson added a comment -

          Hmmm, there's already the possibility of sharing schemas; they're cached by path and time as I remember. And I'm also working on config sets as we speak. Any interactions here that spring to mind? I suppose I'll have to look at invalidating any shared config set if any of the underlying files change. I admit I haven't looked into the code at all; maybe this'll all be transparent to the config set caching layer, but it'll be a good thing for me to be aware of when I get back to that JIRA (I've got some work done on it, not testable yet though).

          Steve Rowe added a comment -

          Hmmm, There's already the possibility of sharing schemas, they're cached by path and time as I remember.

          Updating a shared schema will be live for every core that uses it. Persistence will cause new cores that are supposed to share to load a new schema object into the cache, but cores using the no-longer-cached version will continue to use it instead of getting refreshed. This will result in partitioning the cores into groups that really share schemas. Maybe cache keys should use a hash instead of a time stamp?

          I'm also working on config sets as we speak. Any interactions here that spring to mind? I suppose I'll have to be looking at invalidating any shared config set if any of the underlying files change. I admit I haven't looked into the code at all, maybe this'll all be transparent to the config set caching layer but it'll be a good thing for me to be aware of when I get back to that JIRA (I've got some work done on it, not testable yet though).

          Sorry, I'm not sure about the interactions - what I do know is that since updates are on the live schema, persistence happens as a side effect of changes - after startup, the persisted schema is never read again. Since modifications can only be made after turning on the "managed schema" facility, external modification can be ignored. Actually, that argues further for hashes instead of time stamps for cache keys.

          I'm wrapping up testing, and will post a patch soon. If there are no objections, I'll commit this in its current state, and we can make further changes, including the caching changes, on following issues.
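A content-hash cache key along those lines might look like this sketch (hypothetical names, not Solr's actual schema cache code, which keyed on path and timestamp):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: key the schema cache on a digest of the schema content rather
// than a load timestamp, so cores whose schema content is identical share one
// cache entry regardless of when each core loaded it.
final class SchemaCacheKeys {
    static String key(String resourcePath, byte[] schemaBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder(resourcePath).append('#');
            for (byte b : md.digest(schemaBytes)) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);          // SHA-256 is always available
        }
    }
}
```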

          Robert Muir added a comment -

          I don't think you should rush it that quickly, Steve. There are a lot of downsides to a mutable schema... like bringing back all the bugs that SOLR-4417 fixed.

          This needs more discussion.

          Steve Rowe added a comment -

          I don't think you should rush it that quickly, Steve. There are a lot of downsides to a mutable schema... like bringing back all the bugs that SOLR-4417 fixed.

          This needs more discussion.

          Okay. I was going to say, shouldn't the tests for SOLR-4417 catch new problems? But I see that Mark didn't include any tests there...

          Do you have any particular items for discussion, Robert, other than Erick's caching issue? (This issue's been open for over a year with little discussion.)

          Robert Muir added a comment -

          It's bugs of a different form.

          If you make the schema mutable, then you have a "mutable" Codec that IndexWriter is using. This could cause a lot of issues: there is a reason why Codec is 'final' on IndexWriter. Other similar strange bugs can suddenly pop out; I'm just mentioning one.

          I don't think the schema should be mutable. I think it would be less crazy if it were already mutable, but given that it's already immutable, if we want to make this change then there needs to be a lot of review and discussion about what will break.

          As far as the issue being open for over a year with little discussion, I don't care about that. Yesterday a patch went up that made the schema mutable, where the previous discussion before that patch indicated it might be copy-on-write or something else (immutable). I saw Yonik's comment yesterday about immutability and started thinking about what all could go wrong here unless we go that route (it seems to me: a lot).

          I figured yesterday I wouldn't need to comment, that the issue would probably go that direction anyway. However, today when you mentioned you wanted to commit it, it surprised me, so I spoke up.

          Steve Rowe added a comment -

          I figured yesterday I wouldn't need to comment, that the issue would probably go that direction anyway. However, today when you mentioned you wanted to commit it, it surprised me, so I spoke up.

          Fair enough. Thanks for mentioning it before I committed.

          Can you think of testing that ought to happen before you'd be comfortable? I imagine that the same tests would apply to an immutable schema.

          Steve Rowe added a comment -

          I should mention that since at this point only new fields are addable, the entire schema is not mutable: only fields, required fields, and fields with a default value are. After these changes, the analyzers have to be refreshed.

          Related: from IndexSchemaRuntimeFieldTest:

           public void testRuntimeFieldCreation() {
              // any field manipulation needs to happen when you know the core will not
              // be accepting any requests.  Typically this is done within the inform()
              // method.  Since this is a single threaded test, we can change the fields
              // willi-nilly
          
          Robert Muir added a comment - - edited

          Well, here's how I see it honestly (given the situation that the schema is currently immutable, and if you gave me enough beers, I bet I could find piles of code in the current tree that have subtle reliance upon that fact):

          • option A: immutable
          • option B: mutable, but we drink those beers and look for those pieces of code and discuss and fix them.

          I'm just having trouble seeing how option B can really work.

          Typically in a case like this, you'd just add some "safeguards", e.g. shit like clone()/freeze()/factor out abstract schema+unmodifiable()/whatever: then these pieces of code can get immutable "snapshots" and go about their merry way.

          But if you do this, then it really doesn't fix the problem; it simply brings back bugs like SOLR-4417 all over again: "you can add your new field to the schema, and it's instant, but just don't index any documents with it without doing a core reload first, or all kinds of shit like similarity and codecs doesn't work"

          Steve Rowe added a comment -

          Patch with REST API and SolrCloud tests.

          The SolrCloud test reports the times it takes to fetch the schema from ZooKeeper after it's persisted there, and on my Macbook Pro, it averages about 5 ms or so for an eight-shard cluster. The test attempts to find a newly added field on each shard right after persisting, and all shards except the leader report its existence on the first try. The leader sometimes takes 3 or 4 tries (no retry delay) before it reports the new field as being present in the schema, with an additional latency of 15 ms or so.

          The SolrCloud test also reports schema reload times, and for the first change, it's roughly 50 ms, and goes down to about 10 ms after a few changes.

          I'm putting the patch up for reference - I won't commit right away.

          Yonik Seeley added a comment -

          There are two different (but related) issues here:
          1. Should the schema object be immutable?
          2. When (or how often) are schema changes visible?

          A mutable schema answers both questions at once... all changes are seen immediately.
          An immutable schema moves you on to question #2

          Here's a hypothetical issue with a mutable schema:

            if (schema.getField(field).isSingleValued()) {
              // a different thread asynchronously changes the field to a multi-valued field
              assert schema.getField(field).isSingleValued();
              // do something only valid on a single-valued field
            }

            if (schema.getField(field) != null) {
              // a different thread asynchronously removes the field from the schema
              schema.getField(field).foo();
            }
          

          Those types of issues are hard to enumerate and would be ongoing.
          Those types of issues are hard to enumerate and would be ongoing.
          Those issues could pretty much be eliminated for the query side by binding a schema to a request (just as a request gets the same SolrIndexSearcher for its duration), or binding it to the SolrIndexSearcher itself (may be more "cache" friendly if any schema changes could change search results). That would be very simple to do ... SolrCore.schema would be volatile and point to the latest immutable schema object, and the request object would simply copy the reference in its constructor.

          On the indexing side, changes need to be visible more quickly to handle the case of adding a new field and then indexing a document with that new field. The reference to the schema in the SchemaCodecFactory would need to be updated (if we went with immutable schema objects), or the SchemaCodecFactory would need a reference to the SolrCore so it could always use the latest SolrCore.schema to do lookups.

          So I think I'm saying that a schema should be effectively immutable for request scope and maybe SolrIndexSearcher scope, but pretty much "live" for indexing purposes, while I think Robert is saying that the schema should be immutable for the scope of a single IndexWriter. The latter would be a big impediment to where we're going with this (schema-less, type-guessing, etc).
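The request-scoped snapshot idea described above could be sketched roughly like this (hypothetical classes, not Solr's actual SolrCore/IndexSchema API):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: the core holds a volatile reference to the latest immutable schema;
// each request copies that reference once in its constructor, so it sees one
// consistent schema for its whole lifetime even if a publish happens mid-request.
final class Schema {
    final Set<String> fields;
    Schema(Set<String> fields) {
        this.fields = Collections.unmodifiableSet(new HashSet<>(fields));
    }
    Schema withField(String name) {               // copy-on-write addition
        Set<String> copy = new HashSet<>(fields);
        copy.add(name);
        return new Schema(copy);
    }
}

final class Core {
    private volatile Schema schema = new Schema(new HashSet<>());
    Schema latestSchema() { return schema; }      // indexing side always reads the latest
    void publish(Schema next) { schema = next; }  // atomic reference swap
}

final class Request {
    final Schema schema;
    Request(Core core) { this.schema = core.latestSchema(); }  // one snapshot per request
}
```

In-flight requests keep their snapshot while new requests (and the indexing path) pick up the freshly published schema.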

          Robert Muir added a comment -

          But your same hypothetical issue can happen inside e.g. codec code if suddenly the codec returns different things: it's exactly the same problem and would be 'hard to enumerate and ongoing', just like it would be for Solr search code!

          The fact that the current patch can only 'add' fields does not lessen my concerns: Lucene works with field names, so if you add 'foo_s' where previously this field matched some dynamic field '*_s' before, you've effectively changed 'foo_s'.

          Yonik Seeley added a comment -

          But your same hypothetical issue can happen inside e.g. codec code if suddenly the codec returns different things: it's exactly the same problem and would be 'hard to enumerate and ongoing', just like it would be for Solr search code!

          Right, but I'm hoping that the scope of potential problems in Lucene indexing code is much more limited than all of the Solr search code.
          The Codec in SchemaCodecFactory only overrides getPostingsFormatForField and getDocValuesFormatForField. It depends on how those objects are used in Lucene (if those methods are called more than once for the same field or not). Then we need to understand what postings format changes and docvalue format changes are OK (if any).

          At the very least, there are a number of changes at the Solr level that will not change anything at the Lucene level (like changing multiValued from false to true for instance).

          Robert Muir added a comment -

          It may not be a problem given the current implementation, but in my opinion it's unsupported (definitely untested).
          And it raises the point if it really should be: if you asked me a year or so ago, I think it would be a definitive "no".

          And I'm not saying the thing has to be immutable from this point of view necessarily either; I'm just saying we shouldn't rush into it. If we are careful and examine all the issues, discuss, and add appropriate docs and tests, then I would feel better about it. For example, we could add Lucene-level tests for this per-field codec/sim stuff in IndexWriter, and I wouldn't worry as much about that side. But today I worry about it, since I have no idea what would happen.

          On the solr search side, binding to a indexsearcher sounds pretty good, but could still be problematic in some cases because there is nothing to ensure the same indexsearcher is used across multiple phases of a distributed request, right?

          Anyway, this is why I added my comments: I'm not trying to argue for any particular design as much as I'm just saying I don't think it's a good idea to just commit right now and assume this all works today (from the indexing or search side), and open a bunch of nasty bug reports later.

          Yonik Seeley added a comment -

          We can approach this incrementally - the most conservative approach being to initially only allow new field additions (i.e. getFieldOrNull(field) == null)
          and expanding it to "compatible" changes as we feel comfortable.

          Robert Muir added a comment -

          +1

          Steve Rowe added a comment -

          We can approach this incrementally - the most conservative approach being to initially only allow new field additions (i.e. getFieldOrNull(field) == null) and expanding it to "compatible" changes as we feel comfortable.

          Similarly, we could disallow new field additions for field types that specify per-field similarity, docvalues format, or postings format.

          Yonik Seeley added a comment -

          Similarly, we could disallow new field additions for field types that specify per-field similarity, docvalues format, or postings format.

          AFAIK, at the Lucene level, a field that hasn't been used yet is the same as a field that doesn't exist (since we don't pre-define fields).
          If so, adding a new field at the solr level and using a field for the first time should be indistinguishable to Lucene? Any issues would seem to be limited to the interface between Lucene and Solr (such as SchemaCodecFactory).

          Yonik Seeley added a comment -

          Steve: one random implementation idea is that we might want to separate schema modification from schema publishing... say if we want to add more than one field or field type atomically, or add a whole bunch of fields in a batch just for performance reasons.

          One possible way:

            Schema newSchema = currSchema.shallowCopy();
            newSchema.add(...);
            newSchema.add(...);
            publishNewSchema(newSchema);
          

          We really only need the schema to be effectively immutable (i.e. you don't change it after you publish it). The devil is in the details of course...
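Combined with the new-fields-only constraint discussed earlier, that modify-then-publish separation might be sketched as follows (hypothetical names, not ManagedIndexSchema's real API):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch: additions are validated against the current snapshot, applied to a
// copy, and the caller publishes the resulting object; nothing mutates after
// publish, so the schema stays effectively immutable.
final class BatchSchema {
    private final Map<String, String> fields;     // field name -> type name
    BatchSchema(Map<String, String> fields) {
        this.fields = Collections.unmodifiableMap(new HashMap<>(fields));
    }
    boolean hasField(String name) { return fields.containsKey(name); }

    BatchSchema addFields(Map<String, String> newFields) {
        for (String name : newFields.keySet()) {
            if (hasField(name)) {                 // only brand-new fields allowed
                throw new IllegalArgumentException("Field already exists: " + name);
            }
        }
        Map<String, String> copy = new HashMap<>(fields);
        copy.putAll(newFields);
        return new BatchSchema(copy);             // publish step swaps in this object
    }
}
```

The batch is atomic from the readers' point of view: no request ever sees a schema with only some of the new fields.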

          Steve Rowe added a comment -

          one random implementation idea is that we might want to separate schema modification from schema publishing

          Yeah, I thought about this: an addFields() method in addition to addField(). The REST API equivalent is POSTing to /schema/fields

          Steve Rowe added a comment -

          Similarly, we could disallow new field additions for field types that specify per-field similarity, docvalues format, or postings format.

          AFAIK, at the Lucene level, a field that hasn't been used yet is the same as a field that doesn't exist (since we don't pre-define fields).
          If so, adding a new field at the solr level and using a field for the first time should be indistinguishable to Lucene? Any issues would seem to be limited to the interface between Lucene and Solr (such as SchemaCodecFactory).

          Yeah, similarity is not used at index time, so forget I mentioned that.

          But for the docvalues and postings formats, I was talking about SchemaCodecFactory, so those are exactly the interface issues I meant to say we could avoid by not allowing those features for newly added fields. Or maybe I don't get what you're saying?

          My thought was that for schema-less/type-guessing, field types will be fairly basic, and unlikely to need per-field codec settings.

          Robert Muir added a comment -

          Yeah, similarity is not used at index time, so forget I mentioned that

          But it is!

          Steve Rowe added a comment -

          Patch with the following changes:

          • Schema is now effectively immutable: requests see the same schema snapshot for their lifetimes.
          • ManagedIndexSchema.addFields() allows for multiple fields to be added at once, though only the single-field REST API is provided at this point.
          • Only new field additions are allowed: addFields() fails if getFieldOrNull() returns non-null for any of the given new fields.
          • SchemaCodecFactory and SchemaSimilarityFactory don't change codec and similarity when the schema is swapped out: instead they refer to the latest version they have been inform()'d about.
          • Multi-core shared schemas are now handled: the old schema is removed from the schema cache, the new schema is added to the schema cache, and the new schema replaces the old schema in all active cores.
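
The copy-on-write scheme described above (an immutable schema whose addFields() produces a new instance and rejects pre-existing fields) could be sketched roughly like this. Class and method names here are illustrative and greatly simplified, not Solr's actual ManagedIndexSchema code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an immutable schema with copy-on-write addFields();
// not the actual patch's classes.
final class SchemaSketch {
    private final Map<String, String> fields; // field name -> field type name

    SchemaSketch(Map<String, String> fields) {
        this.fields = Map.copyOf(fields); // defensive immutable copy
    }

    String getFieldOrNull(String name) {
        return fields.get(name);
    }

    // Returns a NEW schema instance; fails if any field already exists,
    // mirroring "only new field additions are allowed" above.
    SchemaSketch addFields(Map<String, String> newFields) {
        Map<String, String> merged = new HashMap<>(fields);
        for (Map.Entry<String, String> e : newFields.entrySet()) {
            if (merged.putIfAbsent(e.getKey(), e.getValue()) != null) {
                throw new IllegalArgumentException("Field already exists: " + e.getKey());
            }
        }
        return new SchemaSketch(merged);
    }
}
```

The key property is that a request holding the old instance keeps seeing its unchanged snapshot while new requests pick up the new instance.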

          Robert Muir, I'd appreciate your review of these changes.

          Yonik Seeley added a comment -

          One thing I noticed quickly.... is there a reason this is synchronized?

                //Run the callbacks on SchemaAware now that everything else is done
                synchronized (schemaAware) {
                  for (SchemaAware aware : schemaAware) {
                    aware.inform(this);
                  }
                }
          
          Steve Rowe added a comment -

          Thanks for taking a look, Yonik.

          One thing I noticed quickly.... is there a reason this is synchronized?

                //Run the callbacks on SchemaAware now that everything else is done
                synchronized (schemaAware) {
                  for (SchemaAware aware : schemaAware) {
                    aware.inform(this);
                  }
                }
          

          No, that's a vestige from when I had thought that access to the schema aware collection needed to be synchronized; I forgot to clean it up here. I'll remove the synchronization.
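
For reference, if the schemaAware collection were a concurrent snapshot-based collection such as CopyOnWriteArrayList (an assumption for illustration; the actual collection type isn't shown in this thread), iteration would indeed need no external synchronization:

```java
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch: iterating a CopyOnWriteArrayList needs no synchronized block,
// because its iterators operate over an immutable snapshot of the backing array.
class InformSketch {
    interface SchemaAware { void inform(Object schema); }

    static int informAll(CopyOnWriteArrayList<SchemaAware> schemaAware, Object schema) {
        int informed = 0;
        for (SchemaAware aware : schemaAware) { // snapshot iteration, safe under concurrent adds
            aware.inform(schema);
            informed++;
        }
        return informed;
    }
}
```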

          Robert Muir added a comment -

          Hi Steve, I took a quick glance. One thing I don't quite understand:

          SchemaCodecFactory and SchemaSimilarityFactory don't change codec and similarity when the schema is swapped out: instead they refer to the latest version they have been inform()'d about.

          Can you elaborate on this (maybe just some code comments about which inform() gets called when)? I don't understand why there should be 2 inform methods or what it's doing...?

          Is the idea that these classes just need to be core-aware instead? And the SchemaAware interface is pretty much useless, except it's being used now only as a marker to detect that the sim/codec understands properties on schema elements?

          Can we do something to eliminate the two inform methods?

          Schema is now effectively immutable: requests see the same schema snapshot for their lifetimes.

          well, except it seems for similarity (on indexsearcher)... which could be looking at the latest copy?

          Steve Rowe added a comment -

          SchemaCodecFactory and SchemaSimilarityFactory don't change codec and similarity when the schema is swapped out: instead they refer to the latest version they have been inform()'d about.

          Can you elaborate on this (maybe just some code comments about which inform() gets called when)? I don't understand why there should be 2 inform methods or what its doing... ?

          When a new schema with added fields is produced, the inform(schema) SchemaAware variant is called - this is not just a marker interface.

          The inform(core) SolrCoreAware variant is called when a new core is instantiated, including on SolrCore.reload(). Looking now, though, I can see that in the SolrCore ctor, a new codec is pulled from the CodecFactory, so inform(core) isn't needed for it.

          For similarity, which is hosted on the IndexSchema, inform(core) won't have any effect.

          So it looks like the right thing to do is remove SolrCoreAware from both factories. I'll do that. SchemaAware needs to remain, though, so that the schema references can track the latest versions.

          Schema is now effectively immutable: requests see the same schema snapshot for their lifetimes.

          well, except it seems for similarity (on indexsearcher)... which could be looking at the latest copy?

          Yes, that's right: similarity and codec both will be looking at the latest copy.

          Yonik Seeley added a comment -

          One super-minor concurrency issue: the check to see if a schema is mutable should be within the optimistic concurrency retry loop, else fields could be added to a schema that was just marked as immutable.
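
The race being described, and the fix of re-checking mutability inside the optimistic-concurrency retry loop, could be sketched like this (hypothetical names and a trivialized schema, not the patch's code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of an optimistic-concurrency retry loop that re-checks mutability
// on EVERY attempt, so a schema marked immutable mid-flight is never modified.
class RetryLoopSketch {
    static final class Schema {
        final boolean mutable;
        final int numFields;
        Schema(boolean mutable, int numFields) { this.mutable = mutable; this.numFields = numFields; }
    }

    static Schema addField(AtomicReference<Schema> current) {
        while (true) {
            Schema old = current.get();
            if (!old.mutable) {                        // check INSIDE the loop
                throw new IllegalStateException("schema is not mutable");
            }
            Schema updated = new Schema(true, old.numFields + 1);
            if (current.compareAndSet(old, updated)) { // retry on concurrent swap
                return updated;
            }
        }
    }
}
```

If the check were hoisted above the loop, a concurrent swap to an immutable schema between the check and the compareAndSet could slip a field in.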

          Steve Rowe added a comment -

          One super-minor concurrency issue: the check to see if a schema is mutable should be within the optimistic concurrency retry loop, else fields could be added to a schema that was just marked as immutable.

          Right now ManagedIndexSchemaFactory's mutability setting comes from SolrConfig, and is only changeable on SolrConfig change and reload, so neither mutable->immutable nor immutable->mutable should be possible. Or maybe I don't understand how SolrConfig works?

          Yonik Seeley added a comment -

          Right now ManagedIndexSchemaFactory's mutability setting comes from SolrConfig, and is only changeable on SolrConfig change and reload, so neither mutable->immutable nor immutable->mutable should be possible.

          Ah, ok - I had assumed it was on the schema itself.

          Steve Rowe added a comment -

          Ah, ok - I had assumed it was on the schema itself.

          Well, it is a boolean on ManagedIndexSchema, but there's no setter, only a getter, and the ManagedIndexSchema ctor, which is only called from the factory, is the only place it's set.

          Steve Rowe added a comment -

          Patch:

          SchemaSimilarityFactory and SchemaCodecFactory now implement only SolrCoreAware - SchemaAware alone was insufficient; you were right, Robert.

          Can we do something to eliminate the two inform methods?

          I think the SolrCoreAware one is necessary - I couldn't see how otherwise to pass in the SolrCore.

          I added a POST REST method allowing multiple fields to be added at once, at /schema/fields.
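
A request body for the multi-field POST endpoint might plausibly look like the following; the field names, types, and exact JSON shape here are made up for illustration and are defined by the patch itself, not shown in this thread:

```json
[
  { "name": "price",    "type": "float",  "stored": true },
  { "name": "category", "type": "string", "indexed": true }
]
```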

          I think it's ready.

          Robert Muir added a comment -

          Steve asked me for a review, so I took a quick look, just a few things i noticed (the codec/sim factory is much better without the 2 inform methods, thanks!):

          In CodecFactory:

             public Codec getCodec() {
          -    assert codec != null : "inform must be called first";
          

          Why remove this assert? I think this is pretty useful otherwise you can get a difficult-to-diagnose NPE. Same goes with the SimilarityFactory.

          In SolrCore:

              if (schema instanceof ManagedIndexSchema && schema.isMutable() && resourceLoader instanceof ZkSolrResourceLoader) {
                this.zkIndexSchemaReader = new ZkIndexSchemaReader(this);
              } else {
                this.zkIndexSchemaReader = null;
              }
          

          Why is this in SolrCore? Nothing in SolrCore uses this "zkIndexSchemaReader". I don't think this belongs here: I think it should be in ManagedIndexSchemaFactory... like it should be core-aware or whatever and do this itself.

          In SolrIndexSearcher.java:

             /** Direct access to the IndexSchema for use with this searcher */
          -  public IndexSchema getSchema() { return schema; }
          +  public IndexSchema getSchema() { return core.getSchema(); }
          

          I'm confused about this in conjunction with your previous comment:

          Schema is now effectively immutable: requests see the same schema snapshot for their lifetimes.

          Then isn't it dangerous for things to be pulling moving-target schemas off of SolrCores/SolrIndexSearchers? Shouldn't they be only getting this from the request? I made this package-private just to see the damage, and it's not clear to me that your statement really holds for all this query code.

          In FieldCollectionResource.java:

              ManagedIndexSchema newSchema = ManagedIndexSchema.addFields(getSolrCore(), newFieldsArray);
              getSolrCore().setSchema(newSchema);
          

          It would be nice if we could at least add a TODO to refactor some of this. I think it's a little confusing that IndexSchema itself has getMutable, but operations like this go directly to the implementation (abstraction violation). From a pluggability perspective it would be nice if e.g. addFields was factored down (e.g. IndexSchema becomes abstract and minimal), and the immutable default impl threw UOE for changes or whatever... But I know this is a lot of work; it would be a good followup issue, and probably good to do before schema gets any more hair (there is already tons of backwards cruft thrown about it for compat etc too).

          In ExternalFileField.java:

            /**
             * Informs the {@link org.apache.solr.schema.IndexSchema} provided by the <code>schema</code>
             * parameter of an event (e.g., a new {@link org.apache.solr.schema.FieldType} was added, etc.
             *
             * @param schema The {@link org.apache.solr.schema.IndexSchema} instance that inform of the update to.
             * @since SOLR-1131
             */
            @Override
            public void inform(IndexSchema schema) {
          

          This should be unnecessary duplication... javadoc by default copies this from the overridden interface (SchemaAware). So I'd remove it completely; if there is anything ExternalFileField-specific that needs to be appended to this, then the base doc can be sucked in with inheritDoc.

          (the same goes for several other classes, e.g. i see this in ExternalFileFieldReloader too).

          Yonik Seeley added a comment -

          Then isn't it dangerous for things to be pulling moving-target schemas off of SolrCores/SolrIndexSearchers? Shouldn't they be only getting this from the request?

          On a normal search-side request, yes. Some things may need the latest schema though... like a codec provider, perhaps realtime-get, the future code to add new fields on demand (aka type guessing), etc.

          it would be nice if e.g. addFields was factored down

          +1, but not a big deal or a show stopper though.

          Robert Muir added a comment -

          On a normal search-side request, yes. Some things may need the latest schema though... like a codec provider, perhaps realtime-get, the future code to add new fields on demand (aka type guessing), etc.

          But my problem is with the API: an IndexSearcher is absolutely the worst place to have a getter for a moving target, because it's all about search.

          If for example, realtime-get wants to get the 'latest', it should get it from request.getCore().getCurrentSchema() (please, name it in such a way that its not confusing).

          Otherwise in general things should use request.getSchema(). SolrIndexSearcher should not expose the schema: there need not be 3 different ways, 2 of which have "current" semantics and one of which is immutable across the request. And its own internal use of "moving target" should be carefully contained.
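
The distinction being argued for here, a per-request immutable snapshot versus an explicitly named moving target on the core, could be sketched as follows. The names (getLatestSchema(), a String standing in for the schema) are hypothetical simplifications:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: the core holds the moving target; a request binds one snapshot
// at creation time and uses that snapshot for its whole lifetime.
class SchemaAccessSketch {
    static final class Core {
        private final AtomicReference<String> latestSchema = new AtomicReference<>("v1");
        String getLatestSchema() { return latestSchema.get(); } // moving target, named as such
        void setSchema(String s) { latestSchema.set(s); }
    }

    static final class Request {
        private final String schemaSnapshot; // immutable for the request's lifetime
        Request(Core core) { this.schemaSnapshot = core.getLatestSchema(); }
        String getSchema() { return schemaSnapshot; }
    }
}
```

Code that genuinely needs the latest schema (codec provider, realtime-get) would call the core's getter by its "latest" name; everything else would stay on the request snapshot.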

          Yonik Seeley added a comment -

          But my problem is with the API: an IndexSearcher is absolutely the worst place to have a getter for a moving target: because its all about search.

          Yeah, I can agree with that.
          My comments were about the ability to grab a moving-target schema off of SolrCore, which is desirable/needed.
          Speaking of which, I should go check out how realtime-get is handled in this patch...

          Yonik Seeley added a comment -

          in RealTimeGetComponent, it seems like we should use
          req.getCore().getSchema() instead of req.getSchema()
          since one can be concurrently reading docs out of the tlog at the same time they are being added (and hence the schema bound to the request may be too old)

          Steve Rowe added a comment -

          Thanks Robert and Yonik.

          Replying to Robert first:

          In CodecFactory:

             public Codec getCodec() {
          -    assert codec != null : "inform must be called first";
          

          Why remove this assert? I think this is pretty useful otherwise you can get a difficult-to-diagnose NPE. Same goes with the SimilarityFactory.

          I removed it from SchemaSimilarityFactory because when the schema is constructed, it doesn't have a SolrCore reference, but tries to obtain similarity from the factory, so this assert was always tripped. And if you recall, you asked me to remove the SchemaAware inform(). Catch-22. (I think I removed it from SchemaCodecFactory for consistency with SchemaSimilarityFactory, so I'll try to put the assert back there.)

          In SolrCore:

              if (schema instanceof ManagedIndexSchema && schema.isMutable() && resourceLoader instanceof ZkSolrResourceLoader) {
                this.zkIndexSchemaReader = new ZkIndexSchemaReader(this);
              } else {
                this.zkIndexSchemaReader = null;
              }
          

          Why is this in SolrCore? Nothing in SolrCore uses this "zkIndexSchemaReader". I dont think this belongs here: i think it should be in ManagedIndexSchemaFactory... like it should be core-aware or whatever and do this itself.

          SolrCore is where the schema lives and is updated, and zkIndexSchemaReader keeps its schema up-to-date in SolrCloud mode, so it made sense for the update function to live where the thing being updated lives. But I don't have a strong feeling about this - I'll move it to the factory and make the factory SolrCoreAware.

          In SolrIndexSearcher.java:

             /** Direct access to the IndexSchema for use with this searcher */
          -  public IndexSchema getSchema() { return schema; }
          +  public IndexSchema getSchema() { return core.getSchema(); }
          

          I'm confused about this in conjunction with your previous comment:

          Schema is now effectively immutable: requests see the same schema snapshot for their lifetimes.

          Then isn't it dangerous for things to be pulling moving-target schemas off of SolrCores/SolrIndexSearchers? Shouldn't they be only getting this from the request? I made this package-private just to see the damage and its not clear to me that your statement really holds for all this query code

          I'll investigate.

          In FieldCollectionResource.java:

              ManagedIndexSchema newSchema = ManagedIndexSchema.addFields(getSolrCore(), newFieldsArray);
              getSolrCore().setSchema(newSchema);
          

          It would be nice if we could at least add a TODO to refactor some of this. I think its a little confusing that IndexSchema itself has getMutable, but operations like this go directly to the implementation (abstraction violation). From a pluggability perspective it would be nice if e.g. addFields was factored down (e.g. IndexSchema becomes abstract and minimal), and the immutable default impl threw UOE for changes or whatever... But i know this is a lot of work, it would be a good followup issue and probably good to do before schema gets any more hair (there is already tons of backwards cruft thrown about it for compat etc too).

          I actually originally had addField() in the base class and overrode it in the subclass, but in the shift to immutable schema, it seemed weird to me for it to not affect the instance on which it was being called, so I made it static, but static methods aren't overrideable... If it gets moved back, maybe it should be named addFieldsAfterCloning() or something?

          In ExternalFileField.java:

            /**
             * Informs the {@link org.apache.solr.schema.IndexSchema} provided by the <code>schema</code>
             * parameter of an event (e.g., a new {@link org.apache.solr.schema.FieldType} was added, etc.
             *
             * @param schema The {@link org.apache.solr.schema.IndexSchema} instance that inform of the update to.
             * @since SOLR-1131
             */
            @Override
            public void inform(IndexSchema schema) {
          

          This should be unnecessary duplication... javadocs by default copies this from the overridden interface (SchemaAware). So I'd remove it completely, if there is anything ExternalFileField-specific that needs to be appended to this, then the base doc can be sucked in with inheritDoc.
          (the same goes for several other classes, e.g. i see this in ExternalFileFieldReloader too).

          Thanks, I'll fix. IntelliJ auto-copies javadoc when you tell it to fix unimplemented methods... I'll see if there's a setting to not do that by default.

          If for example, realtime-get wants to get the 'latest', it should get it from request.getCore().getCurrentSchema() (please, name it in such a way that its not confusing).

          I'll rename SolrCore.getSchema() to getLatestSchema().


          Replying to Yonik:

          in RealTimeGetComponent, it seems like we should use
          req.getCore().getSchema() instead of req.getSchema()
          since one can be concurrently reading docs out of the tlog at the same time they are being added (and hence the schema bound to the request may be too old)

          Thanks, I'll fix.

          Show
          Robert Muir added a comment -

          I removed it from SchemaSimilarityFactory because when the schema is constructed, it doesn't have a SolrCore reference, but tries to obtain similarity from the factory, so this assert was always tripped. And if you recall, you asked me to remove the SchemaAware inform(). Catch-22. (I think I removed it from SchemaCodecFactory for consistency with SchemaSimilarityFactory, so I'll try put the assert back there.)

          I admit I don't understand the catch-22, but for these factories to return null I think is a very serious problem.
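The assert-guard pattern under discussion can be sketched minimally as follows (class and member names here are illustrative only, not Solr's actual CodecFactory API): the assert makes misuse fail fast at the getter, instead of letting a null escape and surface as a hard-to-diagnose NPE somewhere downstream.

```java
// Minimal sketch of the assert-guard pattern discussed above.
// Names are illustrative, not Solr's actual API.
class SketchCodecFactory {
  private Object codec; // stands in for org.apache.lucene.codecs.Codec

  void inform(Object core) {
    // initialization that must complete before getCodec() is usable
    this.codec = new Object();
  }

  Object getCodec() {
    // With assertions enabled (-ea) this trips at the misuse site,
    // rather than returning null and causing an NPE far away.
    assert codec != null : "inform must be called first";
    return codec;
  }
}
```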

          Steve Rowe added a comment -

          I admit I don't understand the catch-22, but for these factories to return null I think is a very serious problem.

          I agree - I'll fix it. In the SolrCore constructor, the Sim factory's getSimilarity() is called before inform() is called on registered SolrCoreAware objects, so I'll call the Sim and Codec factories' inform() methods as soon as they're bound to the core, if they're SolrCoreAware.
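The initialization-order fix described here might look roughly like this (simplified, hypothetical types standing in for the real SolrCore constructor code): inform the factory as soon as it is bound to the core, before anything asks it for a similarity or codec.

```java
// Simplified sketch of the initialization-order fix
// (hypothetical types; the real code lives in the SolrCore constructor).
interface CoreAware {
  void inform(String coreName);
}

class SketchSimilarityFactory implements CoreAware {
  private String similarity;

  @Override
  public void inform(String coreName) {
    similarity = "similarity-for-" + coreName;
  }

  String getSimilarity() {
    assert similarity != null : "inform must be called first";
    return similarity;
  }
}

class SketchCore {
  final SketchSimilarityFactory simFactory = new SketchSimilarityFactory();

  SketchCore(String name) {
    // The fix: inform the factory as soon as it is bound to the core,
    // BEFORE the first getSimilarity() call -- otherwise the assert
    // (or an NPE) trips during core construction.
    simFactory.inform(name);
    simFactory.getSimilarity(); // now safe
  }
}
```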

          Steve Rowe added a comment -

          Patch, fixing issues Robert and Yonik raised.

          1. Restored SchemaCodecFactory's and SchemaSimilarityFactory's not-null assertions in their getters.
          2. Moved zkIndexSchemaReader to ManagedIndexSchemaFactory.
          3. Removed SolrIndexSearcher.getSchema() entirely, switching previous calls to pull the schema from the request, if available, or failing that, from the searcher's SolrCore.
          4. Put newField() and addFields() back as member functions of IndexSchema, rather than static methods on ManagedIndexSchema. This is not the full refactoring with an abstract IndexSchema, but at least these methods won't get in the way of that. I'll make a separate JIRA for the schema refactoring so the idea doesn't get lost.
          5. Fixed javadoc duplication on SchemaAware inform() methods.
          6. In RealTimeGetComponent, switched from req.getCore().getSchema() to req.getSchema(). Added a basic test in new class TestAddFieldRealTimeGet to make sure this works.
          Steve Rowe added a comment -

          This is not the full refactoring with an abstract IndexSchema, but at least these methods won't get in the way of that. I'll make a separate JIRA for the schema refactoring so the idea doesn't get lost.

          See SOLR-4726

          Robert Muir added a comment -

          +++ solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java (working copy)
          @@ -120,7 +120,6 @@

           
             private static Logger log = LoggerFactory.getLogger(SolrIndexSearcher.class);
             private final SolrCore core;
          -  private final IndexSchema schema;
          

          Are we sure SolrIndexSearcher cannot just take a snapshot of the schema and just use that? After all, it represents an immutable point-in-time of the index.
          Requests already have indexsearchers assigned to them, so req.getSchema() could just forward to getSearcher().getSchema(). Then the patch would get a lot
          smaller and so much search code would be simpler and probably require no code or API changes at all to work.

          The special cases like real-time get could continue to just call getLatestSchema() from the core.

          This seems to cause a lot of general problems throughout the patch (and require lots of search code to become more complex, taking additional IndexSchema parameters). After making it halfway thru the patch, I think doing a small change here might fix 90% of my other comments (but im including them anyway here, sorry if thats confusing, just want
          to reduce the number of iterations hopefully)

          solr/core/src/java/org/apache/solr/handler/component/FieldFacetStats.java

          +        facetStats.accumulate(searcher.getCore().getLatestSchema(), value, count);
          

          Shouldn't this be using the one from the request instead of the moving target?

          Index: solr/core/src/java/org/apache/solr/handler/component/QueryElevationComponent.java?

             @Override
             public void inform(SolrCore core) {
          -    String a = initArgs.get(FIELD_TYPE);
          -    if (a != null) {
          -      FieldType ft = core.getSchema().getFieldTypes().get(a);
          +    IndexSchema schema = core.getLatestSchema();
          

          Index: solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java

          +        IndexSchema schema = core.getLatestSchema();
          

          These look scary, but seem ok since it only uses the fieldtype (which cannot yet change).
          Maybe add a comment just so its clear (especially in case the fieldtype becomes changeable later)?

          +++ solr/core/src/java/org/apache/solr/handler/component/StatsComponent.java (working copy)

               boolean isShard = params.getBool(ShardParams.IS_SHARD, false);
               if (null != statsFs) {
          +      IndexSchema schema = searcher.getCore().getLatestSchema();
          
             public NamedList<?> getFieldCacheStats(String fieldName, String[] facet) throws IOException {
          -    final SchemaField sf = searcher.getSchema().getField(fieldName);
          +    IndexSchema schema = searcher.getCore().getLatestSchema();
          +    final SchemaField sf = schema.getField(fieldName);
          

          This should be using the one from the request instead of the moving target.

          Index: solr/core/src/java/org/apache/solr/handler/loader/CSVLoaderBase.java

          -    schema = req.getSchema();
          +    core = req.getCore();
               this.literals = new HashMap<SchemaField, String>();
           
               templateAdd = new AddUpdateCommand(req);
          @@ -244,6 +245,7 @@
               CSVLoaderBase.FieldAdder adder = new CSVLoaderBase.FieldAdder();
               CSVLoaderBase.FieldAdder adderKeepEmpty = new CSVLoaderBase.FieldAdderEmpty();
           
          +    IndexSchema schema = core.getLatestSchema();
          

          Is this right? Why not use the one from the request? If there is a reason (which i dont see/understand), can we add a comment?

          Index: solr/core/src/java/org/apache/solr/schema/SchemaField.java

             /** Declared field property overrides */
          -  Map<String,String> args = Collections.emptyMap();
          +  Map<String,?> args = Collections.emptyMap();
          
          -  static SchemaField create(String name, FieldType ft, Map<String,String> props) {
          +  static SchemaField create(String name, FieldType ft, Map<String,?> props) {
          

          Why are we losing type safety here? I'm now unclear on what can be in this properties map...

          Index: solr/core/src/java/org/apache/solr/search/Grouping.java

          @@ -775,6 +776,7 @@
                 // handle case of rows=0
                 if (numGroups == 0) return;
           
          +      IndexSchema schema = searcher.getCore().getLatestSchema();
          

          I don't think this should be using a moving target

          +++ solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java (working copy)

          -      String prefixStr = TrieField.getMainValuePrefix(fromSearcher.getSchema().getFieldType(fromField));
          +      String prefixStr = TrieField.getMainValuePrefix(fromSearcher.getCore().getLatestSchema().getFieldType(fromField));
          

          Nor this

          Robert Muir added a comment -

          And by the way, i'm sorry if this suggestion seems to conflict with my previous one. My problem before was SolrIndexSearcher publicly exposing a moving-target-schema, which I still insist is bad. So its good the patch fixes that, but I'm not sure if removing the getter is the only solution.

          My question is: why does it need a moving target at all (even internally). If it can have an immutable snapshot, then its getter is ok and search code works as before. We could keep it (and maybe deprecate if we dont want to have 87 different ways to get the schema, doesnt matter so much at that point).

          Steve Rowe added a comment -

          And by the way, i'm sorry if this suggestion seems to conflict with my previous one. My problem before was SolrIndexSearcher publicly exposing a moving-target-schema, which I still insist is bad. So its good the patch fixes that, but I'm not sure if removing the getter is the only solution.
          My question is: why does it need a moving target at all (even internally). If it can have an immutable snapshot, then its getter is ok and search code works as before. We could keep it (and maybe deprecate if we dont want to have 87 different ways to get the schema, doesnt matter so much at that point).

          No problem, I understand. I'll recast as bind-schema-to-searcher and see how that goes.

          Steve Rowe added a comment -
          Index: solr/core/src/java/org/apache/solr/schema/SchemaField.java
             /** Declared field property overrides */
          -  Map<String,String> args = Collections.emptyMap();
          +  Map<String,?> args = Collections.emptyMap();
          
          -  static SchemaField create(String name, FieldType ft, Map<String,String> props) {
          +  static SchemaField create(String name, FieldType ft, Map<String,?> props) {
          

          Why are we losing type safety here? I'm now unclear on what can be in this properties map...

          This change was in Yonik's original patch on this issue and I kept it in. If you look at his patch, he uses this to carry Booleans in the props map:

          Index: core/src/java/org/apache/solr/schema/FieldProperties.java
          ===================================================================
          --- core/src/java/org/apache/solr/schema/FieldProperties.java	(revision 1300409)
          +++ core/src/java/org/apache/solr/schema/FieldProperties.java	(working copy)
          @@ -101,12 +101,13 @@
               return (bitfield & props) == 0;
             }
           
          -  static int parseProperties(Map<String,String> properties, boolean which) {
          +  static int parseProperties(Map<String,?> properties, boolean which) {
               int props = 0;
          -    for (Map.Entry<String, String> entry : properties.entrySet()) {
          -      String val = entry.getValue();
          +    for (Map.Entry<String, ?> entry : properties.entrySet()) {
          +      Object val = entry.getValue();
                 if(val == null) continue;
          -      if (Boolean.parseBoolean(val) == which) {
          +      boolean boolVal = val instanceof Boolean ? (Boolean)val : Boolean.parseBoolean(val.toString());
          +      if (boolVal == which) {
                   props |= propertyNameToInt(entry.getKey());
                 }
               }
          
          Steve Rowe added a comment -

          Patch:

          • Restores SolrIndexSearcher.getSchema(), but makes it return the schema snapshot passed into the SolrIndexSearcher ctor, rather than the latest available schema from SolrCore.
          • Converts query request code to pull the schema from a searcher if one is already available.
          • Removes all fields and non-static methods from o.a.s.update.DocumentBuilder - this is dead code.
          • Removes DIH's o.a.s.handler.dataimport.config.Document entirely - this is dead code.
          • Reworks DIH schema handling, so that DIHConfiguration, which is created per-request, hosts a schema snapshot and derived schema info (lowercase field mappings).
          • ExternalFileFieldReloader's newSearcher() callback now checks if the schema has changed, and if so, reloads its FileFloatSource cache.

          I tried converting query code to always pull a searcher from the request and then pull the schema from there, rather than from the request, but this caused lots of imbalanced searcher refcounts, because searchers weren't already bound to the request in some cases, and request.close() apparently wasn't always invoked in some tests. So I backtracked and only pulled the schema from already-available searchers.

          So we'll now have three schema sources:

          1. SolrCore.getLatestSchema()
          2. SolrQueryRequest.getSchema() - schema snapshot at request construction
          3. SolrIndexSearcher.getSchema() - schema snapshot at searcher construction

          Update code will use the schema snapshot from the request, when available, and the latest schema from SolrCore when it's not.

          I believe that since the only permitted schema change now is new fields, it's okay for query code to also pull the schema from the request, and for update code to also pull the latest schema from SolrCore.
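The snapshot scheme described above can be modeled in a few lines (illustrative classes only; the real ones are SolrCore, SolrQueryRequest, and SolrIndexSearcher): the core holds a volatile "latest" reference that is swapped atomically when fields are added, while a request (or searcher) captures an immutable snapshot at construction and uses it for its whole lifetime.

```java
// Simplified model of the schema sources described above
// (illustrative only -- not Solr's actual classes).
final class Schema {
  final int version;
  Schema(int version) { this.version = version; }
}

final class Core {
  // moving target: swapped atomically, e.g. after addFields()
  private volatile Schema latest = new Schema(1);
  Schema getLatestSchema() { return latest; }
  void setSchema(Schema s) { latest = s; }
}

final class Request {
  // frozen at construction; the request sees this snapshot for its lifetime
  private final Schema snapshot;
  Request(Core core) { this.snapshot = core.getLatestSchema(); }
  Schema getSchema() { return snapshot; }
}
```

A request created before setSchema() keeps seeing its original snapshot, while update code that calls core.getLatestSchema() observes the newly added fields immediately.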

          Robert Muir added a comment -

          I tried converting query code to always pull a searcher from the request and then pull the schema from there, rather than from the request, but this caused lots of imbalanced searcher refcounts, because searchers weren't already bound to the request in some cases, and request.close() apparently wasn't always invoked in some tests. So I backtracked and only pulled the schema from already-available searchers.

          So we'll now have three schema sources:

          I don't think we should make bad design decisions because of a few bad tests? They should be closing this thing, and its just random chance that the current implementation doesnt leak anything if nobody called certain methods yet.

          There is a real value i think in having request.getSchema() == request.getSearcher().getSchema().

          I took the patch locally and tried this in SolrQueryRequestBase.java and it didnt seem like such a disaster to me:

            // The index schema associated with this request
            @Override
            public IndexSchema getSchema() {
              SolrIndexSearcher s = getSearcher();
              if (s == null) {
                return null;
              } else {
                return s.getSchema();
              }
            }
          
          Yonik Seeley added a comment -

          There is a real value i think in having request.getSchema() == request.getSearcher().getSchema().

          This introduces a new dependency that did not exist in the past, and I don't think we should do that. There should be no need to get an open searcher to get schema information. As the failing tests show, it can have unintended consequences. getSearcher() is also a blocking operation and if called in the wrong context can lead to deadlock (certain callbacks are forbidden to call getSearcher).

          Steve Rowe added a comment -

          There is a real value i think in having request.getSchema() == request.getSearcher().getSchema().

          This won't work at all for update requests that depend on new fields in a schema newer than that on the request's searcher.

          Robert Muir added a comment -

          As the failing tests show, it can have unintended consequences.

          I don't think this shows anything, these tests arent closing things.

          This won't work at all for update requests that depend on new fields in a schema newer than that on the request's searcher.

          There are other ways to deal with that, for example the update requests can have a getSchema() thats implemented differently. Or it can be required that reopen/commit happens to make the new fields visible.

          I think ideally it would actually be core reload: I'm really worried about the possibility of bugs the way this is heading. I think its dangerous.

          Yonik Seeley added a comment -

          I think ideally it would actually be core reload

          That would be much more complex... reloading a core is very heavy-weight, and doing it in the middle of an update request that adds a new field? Whew.

          I think the current patch is safe for what it does now which is only add new fields. It's also safer than trying to introduce a new dependency on schema (i.e. opening a searcher).
          We know there are going to be additional issues to watch for if/when we get around to changing (or deleting) fields, and I think the current patch leaves enough flexibility to deal with those issues in the future.

          Steve Rowe added a comment -

          I'm inclined to invoke "progress not perfection" here and commit. Robert Muir, if you don't object, I'll do that tomorrow.

          Robert Muir added a comment -

          That would be much more complex... reloading a core is very heavy-weight, and doing it in the middle of an update request that adds a new field? Whew.

          Changing the schema is a big deal. I think it's perfectly acceptable to require a core reload. This isn't something that happens every 5 seconds.

          Otherwise, if you aren't going to do this, then it needs to be designed not to have scary race conditions.
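For add-only changes, "designed not to have scary race conditions" can look like lock-free copy-on-write: writers race only on a compare-and-set and retry on loss, so concurrent adds never clobber each other and readers always see a complete map. A hedged sketch with an invented SchemaHolder class, not the patch's code:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative copy-on-write holder (hypothetical, not from the patch):
// every mutation builds a fresh immutable map and swaps it in atomically.
final class SchemaHolder {
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.emptyMap());

    Map<String, String> snapshot() { return current.get(); }

    // Returns false if the field already exists (add-only, no overwrite);
    // retries if another thread published a newer schema between our read
    // and our compare-and-set.
    boolean addField(String name, String type) {
        while (true) {
            Map<String, String> old = current.get();
            if (old.containsKey(name)) return false;
            Map<String, String> next = new HashMap<>(old);
            next.put(name, type);
            if (current.compareAndSet(old, Collections.unmodifiableMap(next))) {
                return true;
            }
        }
    }
}
```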

          Robert Muir added a comment -

          I'm inclined to invoke "progress not perfection" here and commit. Robert Muir, if you don't object, I'll do that tomorrow.

          I think it's ok for trunk.

          I think the semantics of when things happen, the APIs, and the consistency should all be fleshed out before anything goes to the stable branch (if at all).

          Yonik Seeley added a comment -

          it needs to be designed not to have scary race conditions.

          Of course. If you point out where you see a race condition in the current patch, we can fix it!

          Yonik Seeley added a comment -

          Changing the schema is a big deal. I think it's perfectly acceptable to require a core reload.

          Core reloads in the presence of updates have always been more scary... I think the current approach is much safer, and I feel comfortable with it being committed to both trunk and 4x.

          Steve Rowe added a comment - edited

          I'll commit to trunk in a little bit.

          Robert Muir, to be clear, are you vetoing a commit to branch_4x? If not, I'll commit there too.

          Robert Muir added a comment -

          Steve, no. I'm just giving my opinion.

          I would just ask myself (and maybe some other developers, other than me), if the change is ready to go into a release tomorrow.

          If the intent is to go about this iteratively (you said progress not perfection), then I think trunk is the place for that.

          Steve Rowe added a comment -

          Steve, no. I'm just giving my opinion.

          Thanks for the clarification.

          I would just ask myself (and maybe some other developers, other than me), if the change is ready to go into a release tomorrow.

          I'd be comfortable releasing it tomorrow, in part because of the future bugs you worry about: I want them known sooner rather than later, and releases are where the rubber meets the road in terms of widespread usage. Worst case: the new feature fails for some people - the fix is to discontinue use of the new feature. I'm fine with that, especially since those (theoretical) failures will factor into future improvements.

          If the intent is to go about this iteratively (you said progress not perfection), then I think trunk is the place for that.

          I disagree with your implication that iteration may not take place on the stable branch. If that were true, new features would be really hard to introduce, and that seems to me like a step backward in the great progress Lucene and Solr have made in regularly releasing new features.

          Does anybody else have an opinion about the releasability of the current patch?

          Commit Tag Bot added a comment -

          [trunk commit] sarowe
          http://svn.apache.org/viewvc?view=revision&revision=1470539

          SOLR-3251: Dynamically add fields to schema.

          Commit Tag Bot added a comment -

          [trunk commit] sarowe
          http://svn.apache.org/viewvc?view=revision&revision=1470686

          SOLR-3251: Wait longer before failing when modified schema doesn't show up right away
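The "wait longer before failing" fix is the classic poll-until-visible pattern for eventually-consistent state: instead of asserting immediately, the test polls until the condition holds or a deadline passes. A generic sketch with a hypothetical Wait helper, not the test's actual code:

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper (not TestManagedSchema's real code): poll a
// condition until it holds or the deadline passes, sleeping between tries.
final class Wait {
    static boolean until(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) return true;
            Thread.sleep(pollMs);
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }
}
```

A test would then write something like Wait.until(() -> schemaHasField("price"), 30_000, 250) — where schemaHasField is a hypothetical check against the live schema — so a slow-to-propagate schema change fails only after the full timeout, not on the first look.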

          Commit Tag Bot added a comment -

          [trunk commit] sarowe
          http://svn.apache.org/viewvc?view=revision&revision=1470723

          SOLR-3251: TestManagedSchema: Explicitly close managed-schema FileInputStream so that Windows can delete the file in the @After-annotated method that cleans up the temp directory.
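Behind this commit is a Windows quirk: a file cannot be deleted while any stream still holds it open, so the stream must be closed before cleanup runs. try-with-resources makes the close unconditional. An illustrative example, not the actual TestManagedSchema code:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: an input stream left open keeps the file locked on
// Windows, making delete-on-cleanup fail. Closing via try-with-resources
// guarantees the handle is released before the delete.
public class CloseBeforeDelete {
    public static void main(String[] args) throws Exception {
        Path schemaFile = Files.createTempFile("managed-schema", ".xml");
        Files.write(schemaFile, "<schema/>".getBytes(StandardCharsets.UTF_8));

        try (InputStream in = Files.newInputStream(schemaFile)) {
            System.out.println("first byte: " + (char) in.read());
        } // stream closed here, even if read() had thrown

        Files.delete(schemaFile); // now safe on every platform
        System.out.println("deleted: " + !Files.exists(schemaFile));
    }
}
```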

          Commit Tag Bot added a comment -

          [branch_4x commit] sarowe
          http://svn.apache.org/viewvc?view=revision&revision=1470979

          SOLR-3251: Dynamically add fields to schema. (merged trunk r1470539, r1470686, and r1470723)

          Steve Rowe added a comment -

          Committed to trunk and branch_4x.

          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues


            People

            • Assignee: Steve Rowe
            • Reporter: Yonik Seeley
            • Votes: 4
            • Watchers: 14