Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
Description
TokenStreamComponents for PreAnalyzedField is currently recreated from scratch for every field value.
This is currently necessary because the implementation has no a priori knowledge of the schema or TokenStream it is deserializing: Attributes are implicit in the serialized token stream, and token Attributes are lazily instantiated in incrementToken().
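For illustration, a minimal sketch of the lazy pattern described above (hypothetical Tok class and stream name, not the actual PreAnalyzedField parser): PayloadAttribute joins the AttributeSource only when the first payload-carrying token is deserialized, so consumers inspecting the stream at instantiation time cannot know it will appear.
{code:java}
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical stand-in for one deserialized token.
final class Tok {
  final String term;
  final byte[] payload; // null if the serialized form carried no payload
  Tok(String term, byte[] payload) {
    this.term = term;
    this.payload = payload;
  }
}

// Sketch only: PayloadAttribute is added to the AttributeSource lazily,
// when a token that carries a payload is first encountered.
final class LazyAttributeTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final Iterator<Tok> tokens;

  LazyAttributeTokenStream(List<Tok> tokens) {
    this.tokens = tokens.iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!tokens.hasNext()) {
      return false;
    }
    clearAttributes();
    Tok tok = tokens.next();
    termAtt.setEmpty().append(tok.term);
    if (tok.payload != null) {
      // Lazily instantiated on first use -- the crux of the issue:
      addAttribute(PayloadAttribute.class).setPayload(new BytesRef(tok.payload));
    }
    return true;
  }
}
{code}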
Reusing TokenStreamComponents with the current implementation would, at a minimum, cause problems at index time, because indexing components (e.g., FieldInvertState) cache Attributes keyed per AttributeSource. For instance, if the first field encountered specifies no value for PayloadAttribute, a null value would be cached for that attribute under the corresponding AttributeSource. If that AttributeSource were then reused for a field that does specify a PayloadAttribute, indexing components would consult the stale cached null, and that payload (and all subsequent payloads) would be silently ignored (not indexed).
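The failure mode can be modeled with a simplified, hypothetical consumer (the real per-AttributeSource caching lives in FieldInvertState and related indexing components):
{code:java}
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.AttributeSource;

// Hypothetical model of the caching described above: attribute
// references are looked up once per AttributeSource and then trusted
// for as long as that same source keeps arriving.
final class CachingConsumer {
  private AttributeSource cachedSource;
  private PayloadAttribute payloadAtt; // may legitimately be cached as null

  void setAttributeSource(AttributeSource source) {
    if (source != cachedSource) {
      cachedSource = source;
      // If the first field seen had no payloads, null is cached here...
      payloadAtt = source.hasAttribute(PayloadAttribute.class)
          ? source.getAttribute(PayloadAttribute.class)
          : null;
    }
    // ...and on reuse of the same source the stale null survives, so a
    // later field's payloads are never consulted: silently not indexed.
  }

  boolean indexesPayloads() {
    return payloadAtt != null;
  }
}
{code}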
This is not exactly broken at present, but I gather it is an unorthodox implementation of TokenStream, and the current workaround of disabling TokenStreamComponents reuse necessarily adds object-creation and GC overhead.
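The workaround amounts to something like the following sketch (not the exact shipped code): an Analyzer.ReuseStrategy that never caches components, forcing fresh TokenStreamComponents for every field value.
{code:java}
import org.apache.lucene.analysis.Analyzer;

// Sketch of the "disable reuse" workaround: never hand back cached
// components, so createComponents() runs for every tokenStream() call.
final class NoReuseStrategy extends Analyzer.ReuseStrategy {
  @Override
  public Analyzer.TokenStreamComponents getReusableComponents(
      Analyzer analyzer, String fieldName) {
    return null; // never reuse
  }

  @Override
  public void setReusableComponents(
      Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
    // Intentionally discard: nothing is cached, which is safe here but
    // adds the per-value allocation and GC overhead noted above.
  }
}
{code}
An Analyzer constructed with this strategy (Analyzer exposes a protected constructor taking a ReuseStrategy) rebuilds its components on every request, which is exactly the overhead described above.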
For reference (and see LUCENE-8610), the TokenStream API says:
To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation.
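Under that contract, a compliant variant of the earlier sketch would add every Attribute it might ever populate up front, so the attribute set is fixed before the first incrementToken() call and per-AttributeSource caching stays valid across reuse (Tok as in the sketch above):
{code:java}
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Sketch of the documented contract: both Attributes are added at
// instantiation, whether or not any given token carries a payload.
final class EagerAttributeTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final Iterator<Tok> tokens;

  EagerAttributeTokenStream(List<Tok> tokens) {
    this.tokens = tokens.iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!tokens.hasNext()) {
      return false;
    }
    clearAttributes();
    Tok tok = tokens.next();
    termAtt.setEmpty().append(tok.term);
    // Tokens without a payload simply leave the attribute cleared:
    payloadAtt.setPayload(tok.payload == null ? null : new BytesRef(tok.payload));
    return true;
  }
}
{code}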
Issue Links
- supercedes: LUCENE-8610 NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes (Resolved)