Lucene - Core
  LUCENE-579

TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      If you add multiple values for a field with term vector positions and offsets enabled, and one of the values ends with a non-term character, then the offsets for the terms from subsequent values are wrong. For example (note the '.' in the first value):

      IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);

      Document doc = new Document();
      doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
      doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

      writer.addDocument(doc);
      writer.optimize();
      writer.close();

      IndexSearcher searcher = new IndexSearcher(directory);
      Hits hits = searcher.search(new MatchAllDocsQuery());

      Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
          new QueryScorer(new TermQuery(new Term("", "camera")), searcher.getIndexReader(), ""));

      for (int i = 0; i < hits.length(); ++i) {
        TermPositionVector v = (TermPositionVector) searcher.getIndexReader().getTermFreqVector(
            hits.id(i), "");

        // Rebuild the text the offsets should point into by concatenating the
        // stored values, with a ' ' after each one.
        StringBuilder str = new StringBuilder();
        for (String s : hits.doc(i).getValues("")) {
          str.append(s);
          str.append(" ");
        }
        System.out.println(str);

        TokenStream tokenStream = TokenSources.getTokenStream(v, false);

        String[] terms = v.getTerms();
        int[] freq = v.getTermFrequencies();

        for (int j = 0; j < terms.length; ++j) {
          System.out.print(terms[j] + ":" + freq[j] + ":");

          int[] pos = v.getTermPositions(j);
          System.out.print(Arrays.toString(pos));

          // Print the text each stored offset actually points at.
          TermVectorOffsetInfo[] offset = v.getOffsets(j);
          for (int k = 0; k < offset.length; ++k) {
            System.out.print(":");
            System.out.print(str.substring(offset[k].getStartOffset(), offset[k].getEndOffset()));
          }
          System.out.println();
        }
      }

      searcher.close();

      If I run the above I get:
      one:1:[0]:one
      two:1:[1]: tw

      Note that the offsets for the second term are off by 1.

      It seems that the length of the stored value is not taken into account when calculating the offsets for the terms of the next value.
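
      To spell out the arithmetic (my own check, using the same space-separated concatenation as the snippet above):

      String text = "one." + " " + "two" + " ";   // the reconstructed text, "one. two "
      System.out.println(text.substring(4, 7));   // prints " tw"  <- the offsets Lucene stored
      System.out.println(text.substring(5, 8));   // prints "two"  <- the offsets that would be correct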

      I noticed this problem when using the highlighter contrib package, which can make use of term vectors for highlighting. I also noticed that the offset for the second string is +1 past the end of the previous value, so when concatenating the field values to pass to the highlighter I had to append a ' ' character after each string, which is quite useful but not documented anywhere.
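
      In code, the workaround looks roughly like this (a sketch only; the fragment count is arbitrary):

      // Rebuild the text with a single ' ' after each stored value so that it
      // lines up with the term vector offsets, then highlight against it.
      StringBuilder text = new StringBuilder();
      for (String s : hits.doc(i).getValues("")) {
        text.append(s).append(' ');
      }
      TokenStream tokenStream = TokenSources.getTokenStream(v, false);
      String[] fragments = highlighter.getBestFragments(tokenStream, text.toString(), 3);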

      1. offsets.patch
        1 kB
        Andrew Duffy

        Issue Links

          Activity

          Keiron McCammon created issue -
          Shahan Khatchadourian added a comment -

          DocumentWriter seems to be the culprit, adding 1 to the previous token's endOffset. It may not be possible to provide token offsets that "undo" this operation, since it is not possible to determine the order in which tokens are handled: they are grouped by field, which doesn't necessarily correspond to document order. This would also interfere with custom synonym tokens, since custom token offsets are no longer guaranteed.

          I suggest that there be a flag in Fieldable or IndexWriter that allows exactly the provided offsets to be stored rather than increased by one. There do not seem to be any sanity checks on offset values when reading/writing the term vector.

          A current workaround to this issue is to store the correct startOffset, but leave the endOffset as -1. This has the effect of undoing the +1 of the previous token's endOffset but prevents endOffset information from being available without retokenizing/reparsing the original document.
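
          A minimal sketch of how that workaround could be wired up as a TokenFilter (pre-2.9 Token API; the class name is mine), assuming the field is indexed from a caller-supplied TokenStream so the provided start offsets are stored exactly:

          import java.io.IOException;
          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;

          public final class NoEndOffsetFilter extends TokenFilter {
            public NoEndOffsetFilter(TokenStream input) {
              super(input);
            }

            public Token next(Token reusableToken) throws IOException {
              Token t = input.next(reusableToken);
              if (t != null) {
                t.setEndOffset(-1);   // keep the provided startOffset, drop the endOffset
              }
              return t;
            }
          }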

          Shahan Khatchadourian made changes -
          Link: This issue relates to LUCENE-713 [ LUCENE-713 ]
          Shahan Khatchadourian added a comment -

          The documentation is also not correct with respect to the current offset implementation.

          Grant Ingersoll added a comment -

          Can you provide a unit test for this?

          Andrew Duffy added a comment - edited

          I've attached a patch to 2.4's DocInverterPerField.java that fixes this. The problem is in line 160, which stores the starting offset for the next value of the same field:

          • If a field value has delimiter text after its last token, that text is ignored.
          • If there is no extra delimiter text after the last token, the offsets are off by +1 for the tokens in the second value, +2 for the third value, and so on.
          • The problem is hidden when there is exactly one delimiter character after each value.

          The patch removes the +1 completely and uses the length of the string to adjust offsets for fields with a string value. Fields with Reader or TokenStream values can't easily be fixed, but they can't be stored either, so they are much less likely to affect anyone.
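
          For what it's worth, a toy illustration of the difference (my own names and numbers, not the actual DocInverterPerField code): the 2.4 code advances the offset base for the next value by lastTokenEnd + 1, while the patch advances it by the length of the string value.

          public class OffsetBaseDemo {
            public static void main(String[] args) {
              // SimpleAnalyzer emits the token "one" at [0,3) for each of these values.
              String[] values = { "one", "one.", "one.!" };
              int lastTokenEnd = 3;
              for (String value : values) {
                int oldBase = lastTokenEnd + 1;     // trailing delimiter text ignored, +1 assumed
                int patchedBase = value.length();   // trailing delimiter text counted, no +1
                System.out.println("\"" + value + "\": old=" + oldBase + " patched=" + patchedBase);
              }
            }
          }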

          Andrew Duffy made changes -
          Attachment: offsets.patch [ 12396950 ]
          Michael McCandless added a comment -

          Thank you for the patch Andrew.

          I think this issue is a dup of LUCENE-1448, where the plan is to effectively add a getFinalOffset() method to TokenStream, which for the core/contrib analyzers would in fact default to the total number of characters read from their Reader inputs.
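
          A rough sketch of that idea (my own sketch, not the API that was eventually committed for LUCENE-1448): a token stream that remembers how many characters it consumed, so the indexer can advance the offset base by the true length of the value, e.g. 4 for "one." even though the last token ends at 3.

          import org.apache.lucene.analysis.TokenStream;

          // Sketch only: not an existing Lucene class.
          public abstract class FinalOffsetTokenStream extends TokenStream {
            protected int finalOffset;

            /** Total number of characters read from the underlying Reader. */
            public int getFinalOffset() {
              return finalOffset;
            }
          }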

          Andrew Duffy added a comment -

          It is a duplicate of LUCENE-1448; the fix proposed in that issue will address the problem in a very comprehensive way.

          Michael McCandless made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Duplicate [ 3 ]
          Mark Thomas made changes -
          Workflow: jira [ 12372435 ] → Default workflow, editable Closed status [ 12563587 ]
          Mark Thomas made changes -
          Workflow: Default workflow, editable Closed status [ 12563587 ] → jira [ 12584316 ]

            People

            • Assignee:
              Unassigned
            • Reporter:
              Keiron McCammon
            • Votes:
              2
            • Watchers:
              3
