Lucene - Core / LUCENE-7267

Field with an explicit TokenStream must be tokenized and then uses the default Analyzer offset gaps

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved

    Description

      This took me somewhat by surprise. We have fairly complex code that uses multivalued fields with explicit token streams (which provide their own offset data).

      It was surprising to see that offsets for subsequent values were shifted by 1 compared to what was explicitly provided in the OffsetAttribute. A bit of debugging showed this code inside PerField.invert:

            if (analyzed) {
              invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
              invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
            }
      

      A field with an explicit token stream must still be declared as tokenized, so PerField assumes the field's tokens came from an analyzer (when in fact they didn't):

            final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
      

      While the default position increment gap is 0, the default offset gap is 1, which causes the shift.
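      To make the arithmetic concrete, here is a minimal plain-Java sketch of the accumulation described above. It is not Lucene code: the class OffsetGapSketch and its methods are hypothetical, and it assumes the gap is the only thing added to the running base offset between values (the real indexer may add more, which this model ignores).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the gap accumulation; not part of Lucene.
public class OffsetGapSketch {
    // Default gaps as stated in the report: position increment gap 0, offset gap 1.
    static final int DEFAULT_POSITION_INCREMENT_GAP = 0;
    static final int DEFAULT_OFFSET_GAP = 1;

    // Mirrors the quoted condition: tokenized and an analyzer is configured.
    static boolean analyzed(boolean tokenized, boolean hasAnalyzer) {
        return tokenized && hasAnalyzer;
    }

    // For each value of a multivalued field, returns the start offset that would be
    // recorded for a token whose explicit start offset is explicitStartOffsets[i].
    // Simplifying assumption: only the offset gap advances the running base offset.
    static List<Integer> recordedStartOffsets(int[] explicitStartOffsets,
                                              boolean analyzed,
                                              int offsetGap) {
        List<Integer> recorded = new ArrayList<>();
        int base = 0;
        for (int start : explicitStartOffsets) {
            recorded.add(base + start);
            if (analyzed) {
                base += offsetGap; // applied after every value, explicit stream or not
            }
        }
        return recorded;
    }

    public static void main(String[] args) {
        int[] explicit = {0, 10, 20};
        // Tokenized field with an analyzer present: later values drift by the gap.
        System.out.println(recordedStartOffsets(explicit, analyzed(true, true), DEFAULT_OFFSET_GAP));
        // No analyzer: offsets pass through unchanged.
        System.out.println(recordedStartOffsets(explicit, analyzed(true, false), DEFAULT_OFFSET_GAP));
    }
}
```

      Under these assumptions the first call prints [0, 11, 22] and the second prints [0, 10, 20], so the second value is off by 1 and the drift grows by one gap per additional value.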

      Thoughts?

          People

            Assignee: Unassigned
            Reporter: Dawid Weiss (dweiss)