Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Spun off from an idea in SOLR-6016...
We could add a SchemaGeneratorHandler which would generate the "best" schema.
You wouldn't need/want a handler for this – you'd just need an UpdateProcessorFactory to use in place of RunUpdateProcessorFactory that would look at the datatypes of the fields in each document w/o doing any indexing and pick the least common denominator.
So then you'd have a chain with all of your normal update processors, including the TypeMapping processors configured with the precedence orders and locales and format strings you want, and at the end you'd have your BestFitSchemaGeneratorUpdateProcessorFactory that would look at all those docs, study their values, and throw them away, until a commit comes along, at which point it does all the under-the-hood schema field addition calls.
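A rough sketch of the "study the values, throw the docs away" bookkeeping might look like the following. It is plain Java with no Solr dependencies. The class name BestFitTypeTracker, the GuessedType enum, and the widening rules are assumptions made up for illustration, not an existing Solr API. It also assumes the upstream TypeMapping processors have already turned strings into Java values, so the final processor only has to look at object types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: not a Solr class, just the per-field type bookkeeping. */
final class BestFitTypeTracker {

    /** Illustrative set of candidate field types (an assumption, not Solr's list). */
    enum GuessedType { BOOLEAN, LONG, DOUBLE, STRING }

    /** Best guess so far for each field name seen in any document. */
    private final Map<String, GuessedType> guesses = new LinkedHashMap<>();

    /** Map an already-parsed Java value to a candidate type. */
    static GuessedType classify(Object value) {
        if (value instanceof Boolean) return GuessedType.BOOLEAN;
        if (value instanceof Integer || value instanceof Long) return GuessedType.LONG;
        if (value instanceof Number) return GuessedType.DOUBLE;
        return GuessedType.STRING;
    }

    /** Pick the least common denominator of two candidate types. */
    static GuessedType widen(GuessedType a, GuessedType b) {
        if (a == b) return a;
        boolean bothNumeric =
                (a == GuessedType.LONG || a == GuessedType.DOUBLE)
             && (b == GuessedType.LONG || b == GuessedType.DOUBLE);
        // LONG + DOUBLE fits in DOUBLE; any other mix falls back to STRING.
        return bothNumeric ? GuessedType.DOUBLE : GuessedType.STRING;
    }

    /** Study one document's values; the document itself is not kept or indexed. */
    void observe(Map<String, ?> doc) {
        for (Map.Entry<String, ?> e : doc.entrySet()) {
            if (e.getValue() == null) continue; // nothing to learn from a missing value
            guesses.merge(e.getKey(), classify(e.getValue()), BestFitTypeTracker::widen);
        }
    }

    /** On commit, hand back the guesses (a real processor would add schema fields here). */
    Map<String, GuessedType> commit() {
        Map<String, GuessedType> result = new LinkedHashMap<>(guesses);
        guesses.clear();
        return result;
    }
}
```

In this sketch a field that shows up as a long in one doc and a double in another ends up as DOUBLE, and anything that mixes numbers with text ends up as STRING. How the real factory would rank types is left open by the issue.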
So to learn, you'd send docs using whatever handler/format you want (json, xml, extraction, etc...) with an update.chain=my.datatype.learning.processor.chain request param ... and once you've sent a bunch and given it a lot of variety to see, you send a commit so it creates the schema, and then you re-index your docs for real w/o that special chain.
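To make the request sequence concrete, here is a small sketch that only builds the three kinds of update URLs involved. The update.chain and commit request params are the ones named above. The LearningPassUrls class and the base URL in the usage note are made up for illustration, and no HTTP call is made.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds update URLs for the learning pass and the real pass. */
final class LearningPassUrls {

    /**
     * @param coreBaseUrl base URL of the core/collection (assumed, e.g. a local Solr)
     * @param chain       update chain name, or null to use the default chain
     * @param commit      whether to ask for a commit on this request
     */
    static String updateUrl(String coreBaseUrl, String chain, boolean commit) {
        StringBuilder url = new StringBuilder(coreBaseUrl).append("/update");
        char sep = '?';
        if (chain != null) {
            url.append(sep).append("update.chain=")
               .append(URLEncoder.encode(chain, StandardCharsets.UTF_8));
            sep = '&';
        }
        if (commit) {
            url.append(sep).append("commit=true");
        }
        return url.toString();
    }
}
```

With an assumed base of http://localhost:8983/solr/mycollection, the learning docs would go to updateUrl(base, "my.datatype.learning.processor.chain", false), the schema-creating commit to the same call with commit set to true, and the real re-index to updateUrl(base, null, false).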
...not mentioned originally: this factory could also default to assuming fields should be single valued, unless/until it sees multiple values in a doc that it samples.
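The single-valued default could be tracked with something as small as the sketch below: a field counts as single valued until some sampled doc carries more than one value for it. MultiValuedTracker is a hypothetical name, and treating any java.util.Collection with more than one element as "multiple values" is an assumption about how the doc's values would be represented.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: remembers which fields have ever been seen with multiple values. */
final class MultiValuedTracker {

    /** Fields that had more than one value in at least one sampled document. */
    private final Set<String> multiValued = new TreeSet<>();

    /** Inspect one document; only the field names are remembered, not the values. */
    void observe(Map<String, ?> doc) {
        for (Map.Entry<String, ?> e : doc.entrySet()) {
            Object value = e.getValue();
            if (value instanceof Collection && ((Collection<?>) value).size() > 1) {
                multiValued.add(e.getKey());
            }
        }
    }

    /** Fields default to single valued; the flag only ever flips one way. */
    boolean isMultiValued(String field) {
        return multiValued.contains(field);
    }
}
```

Once a field has been flagged it stays multi valued, even if every later doc has a single value for it, since the schema has to fit all of the sampled docs.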
Issue Links
- is related to: SOLR-6016 Failure indexing exampledocs with example-schemaless mode (Resolved)
- relates to: SOLR-11741 Offline training mode for schema guessing (Open)