[LUCENE-579] TermPositionVector offsets incorrect if indexed field has multiple values and one ends with non-term chars - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.9
Fix Version/s: None
Component/s: modules/analysis
Labels: None

Description

If you add multiple values for a field with term vector positions and offsets enabled, and one of the values ends with a non-term character, then the offsets for the terms from subsequent values are wrong. For example (note the '.' in the first value):

// Index one document with two values for the same field; the first value ends with '.'
IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("", "one.", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("", "two", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc);
writer.optimize();
writer.close();

IndexSearcher searcher = new IndexSearcher(directory);
Hits hits = searcher.search(new MatchAllDocsQuery());

Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(),
    new QueryScorer(new TermQuery(new Term("", "camera")), searcher.getIndexReader(), ""));

for (int i = 0; i < hits.length(); ++i) {
    TermPositionVector v = (TermPositionVector) searcher.getIndexReader().getTermFreqVector(
        hits.id(i), "");

    // Rebuild the text the offsets should refer to: each stored value followed by a space
    StringBuilder str = new StringBuilder();
    for (String s : hits.doc(i).getValues("")) {
        str.append(s);
        str.append(" ");
    }
    System.out.println(str);

    TokenStream tokenStream = TokenSources.getTokenStream(v, false);

    String[] terms = v.getTerms();
    int[] freq = v.getTermFrequencies();

    // Print term:frequency:[positions]:text-at-offsets for every term
    for (int j = 0; j < terms.length; ++j) {
        System.out.print(terms[j] + ":" + freq[j] + ":");

        int[] pos = v.getTermPositions(j);
        System.out.print(Arrays.toString(pos));

        TermVectorOffsetInfo[] offset = v.getOffsets(j);
        for (int k = 0; k < offset.length; ++k) {
            System.out.print(":");
            System.out.print(str.substring(offset[k].getStartOffset(), offset[k].getEndOffset()));
        }
        System.out.println();
    }
}

searcher.close();

If I run the above I get:
one:1:[0]:one
two:1:[1]: tw

Note that the offsets for the second term are off by 1.

It seems that the length of the stored value is not taken into account when calculating the offsets for the terms of the next value.

I noticed this problem when using the highlight contrib package, which can make use of term vectors for highlighting. I also noticed that the offset for the second string is +1 from the end of the previous value, so when concatenating the field's values to pass to the highlighter I had to append a ' ' character after each string. This is quite useful, but not documented anywhere.
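The arithmetic behind the " tw" result can be reproduced without Lucene. The sketch below is a simulation only: the class `OffsetSketch`, its methods, and the rule "next base = end offset of the last token + 1" are assumptions inferred from the output above, not code taken from Lucene. It splits each value into runs of letters (roughly what SimpleAnalyzer does, minus lowercasing) and compares that inferred rule with a base that advances by the full value length + 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone simulation; no Lucene classes are involved.
class OffsetSketch {

    // Returns {start, end} offsets for every letter-run token across all values.
    // useValueLength == false: base advances by (end of last token + 1), the
    //   behaviour inferred from the report.
    // useValueLength == true: base advances by (value length + 1), which matches
    //   values joined with a single space after each one.
    static List<int[]> offsets(String[] values, boolean useValueLength) {
        List<int[]> out = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int lastEnd = 0;
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetter(value.charAt(i))) {
                    i++;
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetter(value.charAt(i))) {
                    i++;
                }
                out.add(new int[] {base + start, base + i});
                lastEnd = i;
            }
            base += (useValueLength ? value.length() : lastEnd) + 1;
        }
        return out;
    }

    // Joins the values with a trailing space each (as in the report) and
    // returns the text covered by the offsets of the given token.
    static String slice(String[] values, boolean useValueLength, int tokenIndex) {
        StringBuilder joined = new StringBuilder();
        for (String v : values) {
            joined.append(v).append(' ');
        }
        int[] o = offsets(values, useValueLength).get(tokenIndex);
        return joined.substring(o[0], o[1]);
    }

    public static void main(String[] args) {
        String[] values = {"one.", "two"};
        System.out.println("[" + slice(values, false, 1) + "]"); // prints "[ tw]"
        System.out.println("[" + slice(values, true, 1) + "]");  // prints "[two]"
    }
}
```

With the inferred rule, "two" gets offsets 4..7, which land on " tw" in "one. two "; advancing by the value length instead gives 5..8 and the expected "two".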

Attachments

offsets.patch (1 kB), attached by Andrew Duffy, 31/Dec/08 03:27

Issue Links

relates to: LUCENE-713 File Formats Documentation is not correct for Term Vectors (Resolved)

People

Assignee: Unassigned

Reporter: Keiron McCammon

Votes: 2

Watchers: 3

Dates

Created: 26/May/06 04:18

Updated: 28/Aug/22 11:27

Resolved: 31/Dec/08 16:09