Now that we have the ability to dynamically add schema fields (SOLR-3251), I want to push forward on this issue.
Value-based dynamic field capabilities for document updates - which I'll sometimes refer to as schemaless mode - will a) determine the type for field names that don’t match explicit or dynamic fields in the schema; b) add these field names to the schema with their determined types; and c) complete the document update request as normal. This process should apply equally to new doc additions, atomic updates, and regular updates.
In a conversation with Chris Hostetter about this feature, he suggested that configuration for parsing/converting String-typed field values into the appropriate Java objects could be separated from configuration of mappings from Java object types to schema field types. In this way, components built for schemaless mode could be reused for other purposes.
JSON and Javabin content streams already carry some type information for their field values. The ContentStreamLoader-s corresponding to these, JsonLoader and JavabinLoader, should set field value object types in the SolrInputDocument according to the content stream's data types. (Currently JavabinLoader does this correctly, but JsonLoader stores everything as String-s; this will need to be fixed.) As a result, for the Java object types supported by these content streams and their loaders (as well as other update processors, etc. that set field values' Java object types), String parsing/conversion won't be required, and only the Java object type -> schema field type mappings will be necessary to determine the schema field type for new fields.
On SOLR-2802, Hoss wrote that FieldMutatingUpdateProcessor-s that parsed dates, numbers and booleans would be generally useful. I plan on going that route to implement String-typed field value parsing. These field value parsing update processors should operate on String-valued fields that either a) are not in the schema, or b) have a schema field type with an appropriate typeClass.
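To make the intended parsing behavior concrete, here is a minimal stand-alone sketch in plain Java. It is not tied to Solr's update processor API: the class name `StringValueParser` and its `parse` method are hypothetical, the separate per-type processors are collapsed into one method for brevity, and the Long, Double, Boolean order is an assumption. Date parsing is omitted.

```java
/** Hypothetical illustration only; not part of Solr's API. */
final class StringValueParser {

    /**
     * Converts a String value to Long, Double, or Boolean when it parses
     * cleanly. Non-String values and unparseable Strings are returned as is.
     */
    static Object parse(Object value) {
        if (!(value instanceof String)) {
            return value; // already typed (e.g. by a loader): leave untouched
        }
        String s = ((String) value).trim();
        try {
            return Long.valueOf(s);
        } catch (NumberFormatException e) {
            // not an integer; fall through to the next candidate type
        }
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            // not a floating-point number; fall through
        }
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
            return Boolean.valueOf(s);
        }
        return value; // nothing matched: keep the original String
    }

    public static void main(String[] args) {
        System.out.println(parse("42").getClass().getSimpleName());
        System.out.println(parse("3.5").getClass().getSimpleName());
        System.out.println(parse("hello").getClass().getSimpleName());
    }
}
```

Note that `Double.valueOf` is lenient: it also accepts inputs such as "NaN" or "1d", so a real processor would probably want a stricter format check.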
After the new parsing update processors detect and convert field values to the appropriate Java object types, an update processor that adds fields to the schema as needed can be configured with a mapping from Java object type to schema field type.
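As a rough illustration of the Java object type -> schema field type mapping, the sketch below uses an ordered map from classes to field type names. Everything in it is an assumption made for the example: the class `TypeMappingSketch`, the method `fieldTypeFor`, the field type names ("tlong", "tdouble", and so on), and the fallback to a text type. In practice these mappings would come from configuration.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Solr's API. */
final class TypeMappingSketch {

    // Assumed field type names; a real setup would read these from config.
    static final String DEFAULT_FIELD_TYPE = "text_general";

    // Insertion order matters: more specific classes must precede Number.
    private static final Map<Class<?>, String> MAPPINGS = new LinkedHashMap<>();
    static {
        MAPPINGS.put(Boolean.class, "boolean");
        MAPPINGS.put(Long.class, "tlong");
        MAPPINGS.put(Integer.class, "tlong");
        MAPPINGS.put(Number.class, "tdouble");
        MAPPINGS.put(Date.class, "tdate");
    }

    /** Returns the first mapped field type whose class matches the value. */
    static String fieldTypeFor(Object value) {
        for (Map.Entry<Class<?>, String> entry : MAPPINGS.entrySet()) {
            if (entry.getKey().isInstance(value)) {
                return entry.getValue();
            }
        }
        return DEFAULT_FIELD_TYPE; // Strings, nulls, and unmapped types
    }

    public static void main(String[] args) {
        System.out.println(fieldTypeFor(1L));
        System.out.println(fieldTypeFor(1.5));
        System.out.println(fieldTypeFor("some text"));
    }
}
```

Keeping this lookup separate from String parsing matches the separation suggested above: already-typed values from a loader can go straight to the mapping step.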
Here is the list of things I think need to happen - I plan on making JIRA issues for each of these:
- Fix JsonLoader to create field values using the JSON-supplied type, rather than making everything a String.
- Add a new field update processor selector that will configure the processor to select fields that match any schema field, or that match no schema field, depending on its boolean parameter: <bool name="fieldNameMatchesSchemaField">
- Add new FieldMutatingUpdateProcessorFactory subclasses ParseFooUpdateProcessorFactory, where Foo includes Date, Double, Long, and Boolean. If they see a field value that is not String-valued, or can't parse the value, they will ignore it and leave it as is. For multi-valued fields, they should be all-or-nothing: either every value is converted or none is.
- Add a new AddSchemaFieldsUpdateProcessorFactory, with configurable mappings from Java object type to schema field type, that will dynamically add fields to the schema, as needed.
- Add a new example config set for schemaless mode.
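The all-or-nothing rule for multi-valued fields mentioned in the list above could look roughly like the following plain-Java sketch for the Long case. The class `AllOrNothingSketch` and the method `parseLongs` are hypothetical names for illustration and do not reflect the eventual processor implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of Solr's API. */
final class AllOrNothingSketch {

    /**
     * Returns a new list of Longs if every value is a parseable String;
     * otherwise returns the original list unchanged.
     */
    static List<Object> parseLongs(List<Object> values) {
        List<Object> parsed = new ArrayList<>(values.size());
        for (Object value : values) {
            if (!(value instanceof String)) {
                return values; // one non-String value: convert nothing
            }
            try {
                parsed.add(Long.valueOf(((String) value).trim()));
            } catch (NumberFormatException e) {
                return values; // one unparseable value: convert nothing
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<Object> clean = new ArrayList<>(Arrays.asList("1", "2", "3"));
        List<Object> mixed = new ArrayList<>(Arrays.asList("1", "two", "3"));
        System.out.println(parseLongs(clean)); // every element is now a Long
        System.out.println(parseLongs(mixed)); // original Strings, untouched
    }
}
```

Building the converted values in a separate list means a failure partway through never leaves the field with a mix of Strings and Longs.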