Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
Description
TokenStreamComponents for PreAnalyzedField is currently recreated from scratch for every field value.
This is currently necessary because the implementation has no a priori knowledge of the schema or TokenStream it is deserializing: Attributes are implicit in the serialized token stream, and token Attributes are lazily instantiated in incrementToken().
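For illustration, a minimal sketch of the lazy pattern described above (hypothetical Tok class and stream name, not the actual PreAnalyzedField parser): PayloadAttribute joins the AttributeSource only when the first payload-carrying token is deserialized, so consumers inspecting the stream at instantiation time cannot know it will appear.
{code:java}
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical stand-in for one deserialized token.
final class Tok {
  final String term;
  final byte[] payload; // null if the serialized form carried no payload
  Tok(String term, byte[] payload) {
    this.term = term;
    this.payload = payload;
  }
}

// Sketch only: PayloadAttribute is added to the AttributeSource lazily,
// when a token that carries a payload is first encountered.
final class LazyAttributeTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final Iterator<Tok> tokens;

  LazyAttributeTokenStream(List<Tok> tokens) {
    this.tokens = tokens.iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!tokens.hasNext()) {
      return false;
    }
    clearAttributes();
    Tok tok = tokens.next();
    termAtt.setEmpty().append(tok.term);
    if (tok.payload != null) {
      // Lazily instantiated on first use -- the crux of the issue:
      addAttribute(PayloadAttribute.class).setPayload(new BytesRef(tok.payload));
    }
    return true;
  }
}
{code}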
Reusing TokenStreamComponents with the current implementation would, at a minimum, cause problems at index time, because indexing components (e.g., FieldInvertState) cache Attributes keyed per AttributeSource. For instance, if the first field encountered specifies no value for PayloadAttribute, a null value would be cached for that attribute under the corresponding AttributeSource. If that AttributeSource were then reused for a field that does specify a PayloadAttribute, indexing components would consult the stale cached null, and that payload (and all subsequent payloads) would be silently ignored (not indexed).
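The failure mode can be modeled with a simplified, hypothetical consumer (the real per-AttributeSource caching lives in FieldInvertState and related indexing components):
{code:java}
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.AttributeSource;

// Hypothetical model of the caching described above: attribute
// references are looked up once per AttributeSource and then trusted
// for as long as that same source keeps arriving.
final class CachingConsumer {
  private AttributeSource cachedSource;
  private PayloadAttribute payloadAtt; // may legitimately be cached as null

  void setAttributeSource(AttributeSource source) {
    if (source != cachedSource) {
      cachedSource = source;
      // If the first field seen had no payloads, null is cached here...
      payloadAtt = source.hasAttribute(PayloadAttribute.class)
          ? source.getAttribute(PayloadAttribute.class)
          : null;
    }
    // ...and on reuse of the same source the stale null survives, so a
    // later field's payloads are never consulted: silently not indexed.
  }

  boolean indexesPayloads() {
    return payloadAtt != null;
  }
}
{code}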
This is not exactly broken at present, but I gather it is an unorthodox implementation of TokenStream, and the current workaround of disabling TokenStreamComponents reuse necessarily adds object-creation and GC overhead.
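The workaround amounts to something like the following sketch (not the exact shipped code): an Analyzer.ReuseStrategy that never caches components, forcing fresh TokenStreamComponents for every field value.
{code:java}
import org.apache.lucene.analysis.Analyzer;

// Sketch of the "disable reuse" workaround: never hand back cached
// components, so createComponents() runs for every tokenStream() call.
final class NoReuseStrategy extends Analyzer.ReuseStrategy {
  @Override
  public Analyzer.TokenStreamComponents getReusableComponents(
      Analyzer analyzer, String fieldName) {
    return null; // never reuse
  }

  @Override
  public void setReusableComponents(
      Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
    // Intentionally discard: nothing is cached, which is safe here but
    // adds the per-value allocation and GC overhead noted above.
  }
}
{code}
An Analyzer constructed with this strategy (Analyzer exposes a protected constructor taking a ReuseStrategy) rebuilds its components on every request, which is exactly the overhead described above.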
For reference (and see LUCENE-8610), the TokenStream API says:
To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation.
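Under that contract, a compliant variant of the earlier sketch would add every Attribute it might ever populate up front, so the attribute set is fixed before the first incrementToken() call and per-AttributeSource caching stays valid across reuse (Tok as in the sketch above):
{code:java}
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Sketch of the documented contract: both Attributes are added at
// instantiation, whether or not any given token carries a payload.
final class EagerAttributeTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final Iterator<Tok> tokens;

  EagerAttributeTokenStream(List<Tok> tokens) {
    this.tokens = tokens.iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!tokens.hasNext()) {
      return false;
    }
    clearAttributes();
    Tok tok = tokens.next();
    termAtt.setEmpty().append(tok.term);
    // Tokens without a payload simply leave the attribute cleared:
    payloadAtt.setPayload(tok.payload == null ? null : new BytesRef(tok.payload));
    return true;
  }
}
{code}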
Issue Links
- supercedes: LUCENE-8610 NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes (Resolved)