[SOLR-6939] UpdateProcessor to buffer & sample documents and then batch create necessary fields - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

spun off of an idea in ~~SOLR-6016~~...

We could add a SchemaGeneratorHandler which would generate the "best" schema.

You wouldn't need/want a handler for this – you'd just need an UpdateProcessorFactory to use in place of RunUpdateProcessorFactory that would look at the datatypes of the fields in each document w/o doing any indexing and pick the least common denominator.
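A minimal sketch of what "pick the least common denominator" could mean, assuming values have already been converted to typed Java objects earlier in the chain. The class name `TypeWidening`, the `Guess` enum, and the widening rules are hypothetical illustrations, not anything defined in Solr or in this issue:

```java
import java.util.List;

/** Hypothetical sketch: widen a guessed field type as more sample values arrive. */
final class TypeWidening {
    /** Candidate types (illustrative labels, not Solr fieldType names). */
    enum Guess { BOOLEAN, LONG, DOUBLE, STRING }

    /** Guess the narrowest type that can represent a single value. */
    static Guess guess(Object value) {
        if (value instanceof Boolean) return Guess.BOOLEAN;
        if (value instanceof Integer || value instanceof Long) return Guess.LONG;
        if (value instanceof Number) return Guess.DOUBLE;
        return Guess.STRING;
    }

    /** Least common denominator of two guesses. */
    static Guess widen(Guess a, Guess b) {
        if (a == b) return a;
        // LONG and DOUBLE can both be stored as DOUBLE; any other mix falls back to STRING.
        boolean bothNumeric = (a == Guess.LONG || a == Guess.DOUBLE)
                && (b == Guess.LONG || b == Guess.DOUBLE);
        return bothNumeric ? Guess.DOUBLE : Guess.STRING;
    }

    /** Fold widen() over every sampled value of one field. */
    static Guess bestFit(List<?> samples) {
        Guess result = null;
        for (Object v : samples) {
            Guess g = guess(v);
            result = (result == null) ? g : widen(result, g);
        }
        return result; // null when no samples were seen
    }
}
```

No document is indexed here; the only state that matters is the running guess per field.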

So then you'd have a chain with all of your normal update processors, including the TypeMapping processors configured with the precedence orders, locales, and format strings you want – and at the end you'd have your BestFitSchemaGeneratorUpdateProcessorFactory that would look at all those docs, study their values, and throw them away – until a commit comes along, at which point it makes all the under-the-hood schema field addition calls.
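The "study, throw away, act on commit" behaviour could be sketched roughly as below. This is a standalone illustration that deliberately avoids Solr's classes: `SchemaSampler`, its methods, and the type labels are all hypothetical, and a real processor would issue schema field additions on commit rather than return a map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: sample documents, keep only per-field type guesses, act on commit. */
final class SchemaSampler {
    // field name -> current type guess (illustrative labels, not Solr fieldType names)
    private final Map<String, String> typeByField = new LinkedHashMap<>();

    /** Inspect one document's fields, update the guesses, and keep nothing else. */
    void sample(Map<String, Object> doc) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String seen = typeOf(e.getValue());
            // Simplified rule: conflicting guesses fall back to "string".
            typeByField.merge(e.getKey(), seen,
                    (old, cur) -> old.equals(cur) ? old : "string");
        }
    }

    /** On commit: hand back the proposed field definitions and reset the sampler. */
    Map<String, String> commit() {
        Map<String, String> proposed = new LinkedHashMap<>(typeByField);
        typeByField.clear();
        return proposed;
    }

    private static String typeOf(Object v) {
        if (v instanceof Boolean) return "boolean";
        if (v instanceof Integer || v instanceof Long) return "long";
        if (v instanceof Number) return "double";
        return "string";
    }
}
```

The documents themselves are never retained, so memory use grows with the number of distinct fields rather than the number of sampled docs.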

So to learn, you'd send docs using whatever handler/format you want (json, xml, extraction, etc...) with an update.chain=my.datatype.learning.processor.chain request param ... and once you've sent a bunch and given it a lot of variety to see, you send a commit so it creates the schema, and then you re-index your docs for real w/o that special chain.
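As a rough illustration of that three-step workflow, the sketch below only builds the request URLs a client might use. The helper `LearningWorkflow.updateUrl`, the base URL, and the collection name `docs` are assumptions made for the example; only the `update.chain` and `commit` request parameters come from Solr itself:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: build /update URLs for the learn, commit, and re-index steps. */
final class LearningWorkflow {
    static String updateUrl(String baseUrl, String collection, String chain, boolean commit) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append('/').append(collection).append("/update");
        String sep = "?";
        if (chain != null) {
            // Route the request through the named update processor chain.
            url.append(sep).append("update.chain=")
               .append(URLEncoder.encode(chain, StandardCharsets.UTF_8));
            sep = "&";
        }
        if (commit) {
            url.append(sep).append("commit=true");
        }
        return url.toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";          // assumed local install
        String chain = "my.datatype.learning.processor.chain";
        System.out.println(updateUrl(base, "docs", chain, false)); // 1. send sample docs
        System.out.println(updateUrl(base, "docs", chain, true));  // 2. commit: schema is created
        System.out.println(updateUrl(base, "docs", null, false));  // 3. re-index w/o the chain
    }
}
```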

...not mentioned originally: this factory could also default to assuming fields should be single valued, unless/until it sees multiple values in a doc that it samples.
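The single-valued default could be tracked with one flag per field that flips only when a sampled doc carries more than one value. A minimal sketch, assuming multi-valued fields arrive as a Java `Collection`; the class `CardinalityTracker` is hypothetical:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: assume single-valued until a sampled doc shows multiple values. */
final class CardinalityTracker {
    private final Map<String, Boolean> multiValued = new HashMap<>();

    void sample(Map<String, Object> doc) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            boolean many = e.getValue() instanceof Collection
                    && ((Collection<?>) e.getValue()).size() > 1;
            // Once a field has been seen with multiple values, it stays multi-valued.
            multiValued.merge(e.getKey(), many, Boolean::logicalOr);
        }
    }

    /** Fields never sampled are reported as single-valued, matching the default. */
    boolean isMultiValued(String field) {
        return multiValued.getOrDefault(field, false);
    }
}
```

The weakness of this default is the one the description hints at: a field is only as multi-valued as the sample shows, so the docs sent during learning need enough variety.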

Issue Links

is related to

SOLR-6016 Failure indexing exampledocs with example-schemaless mode

Resolved

relates to

SOLR-11741 Offline training mode for schema guessing

Open

People

Assignee: Unassigned

Reporter: Chris M. Hostetter

Votes: 1

Watchers: 4

Dates

Created: 09/Jan/15 06:49

Updated: 08/Jan/18 20:00