Lucene - Core / LUCENE-6987

Clarify TokenStream workflow documentation


    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New


      On SOLR-4619, rcmuir noted:

      According to TokenStream's class javadocs:

      The workflow of the new TokenStream API is as follows:

      1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
      2. The consumer calls reset().
      3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.

      So we have consumers (such as QueryBuilder) doing things out of order: they do step 3 before they do step 2.

      My question is, can we detect this in tests? If MockAnalyzer can enforce it, it is easier to fix it consistently everywhere. One idea: what if MockTokenizer deferred initializing its attributes until reset()? It's not going to be the best (we need to tie it into its state machine logic somehow for that), but it might be an easy step.
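
      The deferred-initialization idea can be illustrated with a toy model. This is only a sketch under stated assumptions: StrictMockSource, ToyTermAttribute and StrictMockSourceDemo are hypothetical names invented for illustration, they use no Lucene classes, and they are not how MockTokenizer is actually implemented. The sketch only shows how a test double could reject a consumer that asks for attributes before calling reset().

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for an attribute; hypothetical, NOT Lucene's CharTermAttribute.
final class ToyTermAttribute {
    String term = "";
}

// Hypothetical test double: attributes do not exist until reset() has been called,
// so a consumer that fetches them too early fails fast.
final class StrictMockSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();
    private boolean resetCalled = false;

    void reset() {
        resetCalled = true;
    }

    <T> T getAttribute(Class<T> type) {
        if (!resetCalled) {
            throw new IllegalStateException(
                "attribute " + type.getSimpleName() + " requested before reset()");
        }
        // Lazily create one shared instance per attribute type.
        Object attr = attributes.computeIfAbsent(type, StrictMockSource::newInstance);
        return type.cast(attr);
    }

    private static Object newInstance(Class<?> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(e);
        }
    }
}

final class StrictMockSourceDemo {
    // Returns true if the consumer's call order was accepted by the mock.
    static boolean consume(boolean resetFirst) {
        StrictMockSource source = new StrictMockSource();
        try {
            if (resetFirst) {
                source.reset();                           // step 2 of the documented workflow
            }
            source.getAttribute(ToyTermAttribute.class);  // step 3 of the documented workflow
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("reset() first accepted: " + consume(true));
        System.out.println("attributes first accepted: " + consume(false));
    }
}
```

      In this model a consumer following the documented order passes, while one that retrieves attributes first is rejected immediately, which is the kind of test-time detection the comment asks about.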

      Also, the majority of TokenFilters (which basically serve as consumers too) are doing step 3 before step 2 today. Most of them are just assigning to final variables in their constructor.
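
      A toy model may help show why capturing attributes in the constructor works in practice. The sketch below assumes that an attribute instance is created once and then mutated in place for each token, so a reference stored in a final field stays valid across reset(). ToySharedTermAttribute, ToySource and ToyUpperCaseFilter are hypothetical names for illustration only; none of this is Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-in for a term attribute; hypothetical, not a Lucene class.
final class ToySharedTermAttribute {
    String term = "";
}

// Toy token source: owns a single attribute instance and overwrites it per token.
final class ToySource {
    private final ToySharedTermAttribute termAtt = new ToySharedTermAttribute();
    private String[] tokens = new String[0];
    private int pos = 0;

    ToySharedTermAttribute termAttribute() {
        return termAtt;
    }

    void reset(String... newTokens) {
        tokens = newTokens;
        pos = 0;
    }

    boolean incrementToken() {
        if (pos >= tokens.length) {
            return false;
        }
        termAtt.term = tokens[pos++];
        return true;
    }
}

// Toy filter that stores the attribute in a final field at construction time,
// i.e. it performs "step 3" before anyone has called reset() ("step 2").
final class ToyUpperCaseFilter {
    private final ToySource input;
    private final ToySharedTermAttribute termAtt;

    ToyUpperCaseFilter(ToySource input) {
        this.input = input;
        this.termAtt = input.termAttribute();
    }

    void reset(String... tokens) {
        input.reset(tokens);
    }

    boolean incrementToken() {
        if (!input.incrementToken()) {
            return false;
        }
        termAtt.term = termAtt.term.toUpperCase(Locale.ROOT);
        return true;
    }

    // Convenience helper for the demo: reset, then drain all tokens into a list.
    List<String> collect(String... tokens) {
        reset(tokens);
        List<String> out = new ArrayList<>();
        while (incrementToken()) {
            out.add(termAtt.term);
        }
        return out;
    }
}
```

      Because the filter and the source share one mutable attribute object, the early reference keeps seeing current token data, which is consistent with the observation that constructor-time assignment is the common pattern.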

      So something is off: we have to go one of two ways. Either we fix the documentation to put step 3 before step 2 [...], or we make a massive change to tons of tokenizers (making them more complex and less efficient).
      But I think we have to do something. At the very least we should fix the docs to be clear: they need to reflect reality.




            Assignee: Unassigned
            Reporter: Steven Rowe (sarowe)