Solr / SOLR-13077

PreAnalyzedField TokenStreamComponents should be reusable


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels: None

    Description

      TokenStreamComponents for PreAnalyzedField is currently recreated from scratch for every field value.

      This is necessary at the moment because the current implementation has no a priori knowledge of the schema/TokenStream that it is deserializing: Attributes are implicit in the serialized token stream, and token Attributes are lazily instantiated in incrementToken().

      Reuse of TokenStreamComponents with the current implementation would at a minimum cause problems at index time, when Attributes are cached in indexing components (e.g., FieldInvertState), keyed per AttributeSource. For instance, if the first field encountered has no value specified for PayloadAttribute, a null value would be cached for that PayloadAttribute for the corresponding AttributeSource. If that AttributeSource were to be reused for a field that does specify a PayloadAttribute, indexing components would "consult" the cached null value, and the payload (and all subsequent payloads) would be silently ignored (not indexed).
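      The failure mode described above can be illustrated with a toy model. This is only a sketch: `Source` and `Consumer` are hypothetical stand-ins (not Lucene classes) for an AttributeSource and for an indexing component that caches its attribute lookup per source, and the lazy instantiation is collapsed into a single `load` step for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: Source and Consumer are hypothetical stand-ins, not Lucene classes.
final class ReuseSketch {

    /** Stand-in for an AttributeSource whose attributes appear only when a value needs them. */
    static final class Source {
        private final Map<String, String[]> attributes = new HashMap<>();

        /** Returns the attribute holder, or null if it was never instantiated. */
        String[] getAttribute(String name) {
            return attributes.get(name);
        }

        /** Deserializes one field value, lazily creating the payload attribute if needed. */
        void load(String payloadOrNull) {
            if (payloadOrNull != null) {
                attributes.computeIfAbsent("payload", k -> new String[1])[0] = payloadOrNull;
            }
        }
    }

    /** Stand-in for an indexing component that caches its attribute lookup per source. */
    static final class Consumer {
        private Source lastSource;
        private String[] cachedPayload;
        final List<String> indexed = new ArrayList<>();

        void index(Source source) {
            if (source != lastSource) { // the lookup happens once per distinct source
                lastSource = source;
                cachedPayload = source.getAttribute("payload");
            }
            indexed.add(cachedPayload == null ? null : cachedPayload[0]);
        }
    }

    /** Indexes a value without a payload, then a value with a payload. */
    static List<String> run(boolean reuse) {
        Consumer consumer = new Consumer();
        Source first = new Source();
        first.load(null);
        consumer.index(first);

        Source second = reuse ? first : new Source();
        second.load("p1");
        consumer.index(second);
        return consumer.indexed;
    }

    public static void main(String[] args) {
        System.out.println("reused source: " + run(true));  // payload "p1" is dropped
        System.out.println("fresh source:  " + run(false)); // payload "p1" is indexed
    }
}
```

      In this model, reusing the source yields `[null, null]` because the consumer keeps consulting the null it cached on first sight, while a fresh source per value yields `[null, p1]`.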

      This is not exactly broken currently, but I gather it's an unorthodox implementation of TokenStream, and the current workaround of disabling TokenStreamComponents reuse necessarily adds to object creation and GC overhead.

      For reference (and see LUCENE-8610), the TokenStream API says:

      "To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation."
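      As a rough illustration of that contract, the sketch below declares every attribute at construction time, so a consumer that caches its lookup still sees values when the source is reused. It is a toy model, not a proposed patch: `Source` and `Consumer` are hypothetical stand-ins (not Lucene classes), and it assumes the set of attributes could somehow be known up front, which the issue does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: Source and Consumer are hypothetical stand-ins, not Lucene classes.
final class EagerAttributesSketch {

    /** Stand-in for an AttributeSource that declares all of its attributes at construction. */
    static final class Source {
        private final Map<String, String[]> attributes = new HashMap<>();

        Source(String... attributeNames) {
            for (String name : attributeNames) {
                attributes.put(name, new String[1]); // instantiated up front, never later
            }
        }

        String[] getAttribute(String name) {
            return attributes.get(name);
        }

        /** Deserializes one field value into the pre-declared holders. */
        void load(String payloadOrNull) {
            for (String[] holder : attributes.values()) {
                holder[0] = null; // clear state left over from the previous value
            }
            if (payloadOrNull != null) {
                String[] payload = attributes.get("payload");
                if (payload == null) {
                    throw new IllegalStateException("payload attribute was not declared");
                }
                payload[0] = payloadOrNull;
            }
        }
    }

    /** Stand-in for an indexing component that caches its attribute lookup per source. */
    static final class Consumer {
        private Source lastSource;
        private String[] cachedPayload;
        final List<String> indexed = new ArrayList<>();

        void index(Source source) {
            if (source != lastSource) { // the lookup happens once per distinct source
                lastSource = source;
                cachedPayload = source.getAttribute("payload");
            }
            indexed.add(cachedPayload == null ? null : cachedPayload[0]);
        }
    }

    /** Reuses a single source for three values: no payload, a payload, no payload. */
    static List<String> run() {
        Consumer consumer = new Consumer();
        Source reused = new Source("payload");
        for (String payload : new String[] {null, "p1", null}) {
            reused.load(payload);
            consumer.index(reused);
        }
        return consumer.indexed;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

      Because the holder exists before the consumer's first lookup, the cached reference stays valid across values; the trade-off in this sketch is that `load` must clear the holders so state does not leak from one value to the next.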

            People

              Assignee: Unassigned
              Reporter: Michael Gibney (magibney)
              Votes: 0
              Watchers: 4
