Hello, I am working on WordDelimiterFilter and I have a question: how do we want custom attributes to work here?
This affects the performance of the filter under the new TokenStream API, because it determines when (and whether) we have to save/restore state.
Here are two alternatives:
Alternative #1 (most performant): custom attributes from the original term will only apply to words with no delimiters or, in the case of words with delimiters, only to the 'original' token output with the 'preserveOriginal' option. This is easiest to understand in my opinion, and would perform the best. It's arguable that if you split a term into 10 subwords, applying these attributes to all 10 subwords may no longer make sense.
Alternative #2 (least performant): custom attributes from the original term will only apply to non-injected terms. This means that if a word is split into 10 tokens, all 10 subword tokens, but not their concatenations, also carry the attributes derived from the original term. If preserveOriginal is on, the original token carries the attributes as well.
Alternative #3: ??? your ideas?
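To make the difference between the first two alternatives concrete, here is a rough sketch in plain Java. It does not use the Lucene API: the class WdfAttributeSketch, the Token class, the Policy enum and the split method are all hypothetical names made up for illustration, custom attributes are modelled as a simple map, and only '-' is treated as a delimiter. In the real filter, carrying attributes over to a subword would roughly correspond to restoring the captured state for that token, which is where the extra cost of Alternative #2 would come from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Lucene code: shows which output tokens would keep
// the custom attributes of the original term under each alternative.
class WdfAttributeSketch {

    // ORIGINAL_ONLY = Alternative #1, NON_INJECTED = Alternative #2
    enum Policy { ORIGINAL_ONLY, NON_INJECTED }

    // A token is just its text plus a map standing in for custom attributes.
    static final class Token {
        final String text;
        final Map<String, String> attrs;

        Token(String text, Map<String, String> attrs) {
            this.text = text;
            this.attrs = attrs;
        }

        @Override
        public String toString() {
            return text + attrs;
        }
    }

    // Assumption: '-' is the only delimiter and the concatenation of all
    // subwords is always injected (similar in spirit to catenateAll).
    static List<Token> split(String term, Map<String, String> attrs,
                             Policy policy, boolean preserveOriginal) {
        String[] parts = term.split("-");
        List<Token> out = new ArrayList<>();

        // Words with no delimiters keep their attributes under both policies.
        if (parts.length <= 1) {
            out.add(new Token(term, attrs));
            return out;
        }

        // The preserved original token keeps its attributes under both policies.
        if (preserveOriginal) {
            out.add(new Token(term, attrs));
        }

        // Subwords keep the attributes only under Alternative #2.
        Map<String, String> subwordAttrs =
                policy == Policy.NON_INJECTED ? attrs : Map.<String, String>of();
        for (String part : parts) {
            out.add(new Token(part, subwordAttrs));
        }

        // The injected concatenation never gets the attributes.
        out.add(new Token(String.join("", parts), Map.<String, String>of()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = Map.of("custom", "value");
        System.out.println(split("foo-bar", attrs, Policy.ORIGINAL_ONLY, true));
        System.out.println(split("foo-bar", attrs, Policy.NON_INJECTED, true));
    }
}
```

Under ORIGINAL_ONLY only the preserved "foo-bar" token keeps the map; under NON_INJECTED the subwords "foo" and "bar" keep it too, while the injected "foobar" stays bare in both cases.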
In my opinion, we should shoot for maximum performance. I view this filter as somewhat like a tokenizer, and custom attributes would in general be applied after WDF, because trying to apply them before WDF and expecting them to make sense afterwards will be confusing. But it does not really matter.