  1. Solr
  2. SOLR-3250

Dynamic Field capabilities based on value not name

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      When one already knows the schema of their content, having to declare it again in Solr is cumbersome. For instance, if all your content is in JSON (or can easily be generated as JSON) or another typed serialization, then you already have a schema defined. It would be nice to have support for dynamic fields that use whatever name is passed in, but pick the appropriate FieldType for the field based on the value of the content. For instance, if the input is a number, it would select the appropriate numeric type. If it is a plain text string, it would pick the appropriate text field (you could even add language detection here). If it is comma separated, it would treat the values as keywords, and so on. We could also likely send in a hint as to the type.
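      A minimal sketch of the value-based selection described above, assuming the value arrives as a raw string. The class `ValueTypeGuesser`, the method `guessType`, and the type names `long`, `double`, `keywords`, and `text` are hypothetical placeholders for illustration, not names from Solr or from this issue.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration: guess a field type name from a raw string value. */
public class ValueTypeGuesser {
    private static final Pattern INTEGER = Pattern.compile("[+-]?\\d+");
    private static final Pattern DECIMAL = Pattern.compile("[+-]?\\d+\\.\\d+");

    /** Returns a placeholder type name; these names are assumptions, not Solr field types. */
    public static String guessType(String value) {
        String v = value.trim();
        if (INTEGER.matcher(v).matches()) return "long";    // whole numbers
        if (DECIMAL.matcher(v).matches()) return "double";  // numbers with a fraction
        if (v.contains(",")) return "keywords";             // comma-separated values
        return "text";                                      // anything else is plain text
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "100.0", "red,green,blue", "plain text"}) {
            System.out.println(s + " -> " + guessType(s));
        }
    }
}
```

      Note that a sentence containing a comma would be classified as keywords by this naive check, which is one reason a type hint from the client could be useful.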

      With this approach, you of course have a "first in wins" situation, but assuming you have this schema defined elsewhere, it is likely fine.

      Supporting such cases would allow us to be schemaless where that suits the content, while still offering the benefits of schemas where they matter. Naturally, one could mix and match these too.

          Activity

          gsingers Grant Ingersoll added a comment -

          Note, a core reload is not something I would want to do.

          yseeley@gmail.com Yonik Seeley added a comment -

          Of course hopefully everyone knows "schemaless" is mostly marketing b.s. - when people do this, there is still a schema, but it's guessed on first use (and hence generally a horrible idea for production systems).

          It would be easy enough on a single node... but how does one handle a cluster?
          Say you index price=0 on nodeA, and price=100.0 on nodeB?

          A quick thought on how it might work:

          • have a separate file auto_fields.json that keeps track of the mappings that would be the same for all cores using that schema
          • when we run across a field we haven't seen before, we must guess a type for it, then grab a lock - update the auto_fields.json
          • we can update our in-memory schema with any new fields we find in auto_fields.json
          • works the same in ZK mode... it's just the auto_fields.json is in ZK, and we would use something like optimistic locking to update it
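          The optimistic-locking idea in the list above can be sketched roughly as follows. This is an in-memory stand-in only: the class `AutoFieldsStore` and its methods are hypothetical, and an `AtomicReference` compare-and-set plays the role that a version-checked write to the shared `auto_fields.json` would play in ZK mode.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory stand-in for a shared auto_fields.json with versioned writes. */
public class AutoFieldsStore {
    /** Immutable snapshot: the field-to-type mappings plus a version number. */
    static final class Snapshot {
        final int version;
        final Map<String, String> fields;

        Snapshot(int version, Map<String, String> fields) {
            this.version = version;
            this.fields = Collections.unmodifiableMap(fields);
        }
    }

    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot(0, new HashMap<>()));

    /**
     * Registers a guessed type for a field unless a type is already registered.
     * Returns the type that is actually in effect (first in wins).
     */
    public String registerIfAbsent(String field, String guessedType) {
        while (true) {
            Snapshot seen = current.get();
            String existing = seen.fields.get(field);
            if (existing != null) return existing; // another writer won; adopt its type
            Map<String, String> updated = new HashMap<>(seen.fields);
            updated.put(field, guessedType);
            Snapshot next = new Snapshot(seen.version + 1, updated);
            // Succeeds only if the store is unchanged since we read it; otherwise retry.
            if (current.compareAndSet(seen, next)) return guessedType;
        }
    }

    public int version() {
        return current.get().version;
    }

    public static void main(String[] args) {
        AutoFieldsStore store = new AutoFieldsStore();
        System.out.println(store.registerIfAbsent("price", "long"));   // nodeA saw price=0
        System.out.println(store.registerIfAbsent("price", "double")); // nodeB saw price=100.0
    }
}
```

          In the price example, both calls report `long`: nodeA's guess is registered first, and nodeB adopts it instead of its own `double` guess. Whether nodeB's value can then be indexed under that type is a separate question this sketch does not address.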
          steve_rowe Steve Rowe added a comment -

          Now that we have the ability to dynamically add schema fields (SOLR-3251), I want to push forward on this issue.

          Value-based dynamic field capabilities for document updates - which I'll sometimes refer to as schemaless mode - will a) determine the type for field names that don’t match explicit or dynamic fields in the schema; b) add these field names to the schema with their determined types; and c) complete the document update request as normal. This process should apply equally to new doc additions, atomic updates, and regular updates.

          In a conversation with Chris Hostetter about this feature, he suggested that configuration for parsing/converting String-typed field values into the appropriate Java objects could be separated from configuration of mappings from Java object types to schema field types. In this way, components built for schemaless mode could be reused for other purposes.

          JSON and Javabin content streams already carry some type information for their field values. The ContentStreamLoader-s corresponding to these, JsonLoader and JavabinLoader, should set field value object types in the SolrInputDocument according to the content stream's data types. (Currently JavabinLoader does this correctly, but JsonLoader stores everything as String-s; this will need to be fixed.) As a result, for the Java object types supported by these content streams and their loaders (as well as other update processors, etc. that set field values' Java object types), String parsing/conversion won't be required, and only the Java object type -> schema field type mappings will be necessary to determine the schema field type for new fields.

          On SOLR-2802, Hoss wrote that FieldMutatingUpdateProcessor-s that parsed dates, numbers and booleans would be generally useful. I plan on going that route to implement String-typed field value parsing. These field value parsing update processors should operate on String-valued fields that either a) are not in the schema, or b) have a schema field type with an appropriate typeClass.

          After the new parsing update processors detect and convert field values to the appropriate Java object types, an update processor that adds fields to the schema as needed can be configured with a mapping from Java object type to schema field type.
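          As a rough illustration of such a mapping, here is a sketch that resolves a field type from the Java class of a value and records the field in an in-memory map standing in for the schema. The class `TypeMappingSketch`, its methods, and the field type names used in `main` are all hypothetical; this is not the update processor's actual API.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a Java-object-type to schema-field-type mapping. */
public class TypeMappingSketch {
    private final Map<Class<?>, String> typeMappings = new LinkedHashMap<>();
    private final Map<String, String> schemaFields = new LinkedHashMap<>(); // stand-in schema
    private final String defaultFieldType;

    public TypeMappingSketch(String defaultFieldType) {
        this.defaultFieldType = defaultFieldType;
    }

    /** Configures one mapping, e.g. Long.class -> "long". */
    public void map(Class<?> valueClass, String fieldType) {
        typeMappings.put(valueClass, fieldType);
    }

    /** Adds the field to the stand-in schema if it is unknown, then returns its field type. */
    public String resolve(String fieldName, Object value) {
        String known = schemaFields.get(fieldName);
        if (known != null) return known; // existing fields keep their type
        String fieldType = defaultFieldType;
        for (Map.Entry<Class<?>, String> e : typeMappings.entrySet()) {
            if (e.getKey().isInstance(value)) {
                fieldType = e.getValue();
                break;
            }
        }
        schemaFields.put(fieldName, fieldType);
        return fieldType;
    }

    public static void main(String[] args) {
        TypeMappingSketch mapping = new TypeMappingSketch("text"); // placeholder type names
        mapping.map(Long.class, "long");
        mapping.map(Double.class, "double");
        mapping.map(Boolean.class, "boolean");
        mapping.map(Date.class, "date");
        System.out.println(mapping.resolve("count", 5L));
        System.out.println(mapping.resolve("created", new Date()));
        System.out.println(mapping.resolve("title", "hello"));
    }
}
```

          Keeping this mapping separate from string parsing reflects the separation suggested earlier in this comment: values that already arrive as typed Java objects only need the mapping step.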

          Here is the list of things I think need to happen - I plan on making JIRA issues for each of these:

          1. Fix JsonLoader to create field values using the JSON-supplied type, rather than making everything a String.
          2. Add a new field update processor selector that will configure the processor to select fields that match any schema field, or that match no schema field, depending on its boolean parameter: <bool name="fieldNameMatchesSchemaField">
          3. Add new FieldMutatingUpdateProcessorFactory subclasses ParseFooUpdateProcessorFactory, where Foo includes Date, Double, Long, and Boolean. If they see a field value that is not String-valued, or can't parse the value, they will ignore it and leave it as is. For multi-valued fields, they should be all-or-nothing.
          4. Add a new AddSchemaFieldsUpdateProcessorFactory, with configurable mappings from Java object type to schema field type, that will dynamically add fields to the schema, as needed.
          5. Add a new example config set for schemaless mode.
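          The all-or-nothing behaviour from item 3 can be sketched for the Long case as follows. The class `ParseLongSketch` and its method are hypothetical and operate on a plain list of values, not on Solr's update processor API; the sketch only shows the intended handling of non-String and unparseable values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of all-or-nothing parsing of a multi-valued field to Long. */
public class ParseLongSketch {
    /**
     * Returns a new list of Long values if every value is a String that parses as a long;
     * otherwise returns the original list unchanged.
     */
    public static List<Object> parseAllOrNothing(List<Object> values) {
        List<Object> parsed = new ArrayList<>(values.size());
        for (Object v : values) {
            if (!(v instanceof String)) {
                return values; // not String-valued: leave the field as is
            }
            try {
                parsed.add(Long.valueOf(((String) v).trim()));
            } catch (NumberFormatException e) {
                return values; // one unparseable value leaves every value as is
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseAllOrNothing(Arrays.<Object>asList("1", "2", "3")));
        System.out.println(parseAllOrNothing(Arrays.<Object>asList("1", "two", "3")));
    }
}
```

          Returning the original list on any failure means a later processor in the chain still sees the untouched String values and can try a different type.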
          steve_rowe Steve Rowe added a comment -

          The constituent issues have all been committed.

          Further improvements can be made in new issues.

          steve_rowe Steve Rowe added a comment -

          Bulk close resolved 4.4 issues


            People

            • Assignee: Steve Rowe
            • Reporter: Grant Ingersoll
            • Votes: 2
            • Watchers: 5
