[LUCENE-6034] MemoryIndex should be able to wrap TermVector Terms - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.0, 6.0
Component/s: modules/highlighter
Labels: None
Lucene Fields: New, Patch Available

Description

The default highlighter has a "WeightedSpanTermExtractor" that uses MemoryIndex for certain queries: basically phrases, SpanQueries, and the like. For large amounts of text, this aspect of highlighting is time-consuming and uses a fair amount of memory. It also consumes memory by wrapping the tokenStream in a CachingTokenFilter in this case. But if the underlying TokenStream actually comes from TokenSources (wrapping TermVector Terms), all of this is needless! Furthermore, MemoryIndex doesn't support payloads.

The patch here has 3 aspects to it:

- Internal refactoring of MemoryIndex to simplify it by maintaining the fields in a sorted state using a TreeMap. This reduced the LOC for this file, even with the other features I added. It also puts the FieldInfo on the Info, so there is one less data structure to keep around. I suppose if there is a huge variety of fields in MemoryIndex, the aggregated N*Log(N) field lookup could add up, but that seems very unlikely. I also brought in MemoryIndexNormDocValues as a simple anonymous inner class; it's super-simple after all and not worth having in a separate file.
- New MemoryIndex.addField(String fieldName, Terms) method. In this case, MemoryIndex provides the supporting wrappers around the underlying Terms so that it appears as an index. In so doing, MemoryIndex supports payloads for such fields.
- WeightedSpanTermExtractor now detects TokenSources' wrapping of Terms and supplies the Terms to MemoryIndex.
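As a rough illustration of the first two points, here is a minimal sketch that uses only the Java standard library. It is not Lucene code: `ToyMemoryIndex`, `Info`, `fieldNames`, and `positions` are hypothetical names, and a `SortedMap` of term to positions stands in for term vector `Terms`. It only shows fields kept sorted in a `TreeMap`, the field number stored on the per-field `Info`, and an `addField` overload that wraps existing sorted terms rather than re-indexing tokens.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy single-document index; a hypothetical stand-in, not Lucene's MemoryIndex.
final class ToyMemoryIndex {

    // Per-field state. The field number lives here, mirroring the idea of
    // keeping FieldInfo on Info instead of in a separate data structure.
    static final class Info {
        final int fieldNumber;
        final SortedMap<String, List<Integer>> terms; // term -> positions

        Info(int fieldNumber, SortedMap<String, List<Integer>> terms) {
            this.fieldNumber = fieldNumber;
            this.terms = terms;
        }
    }

    // TreeMap keeps field names sorted at all times, so no separate
    // sorted copy of the fields has to be maintained.
    private final TreeMap<String, Info> fields = new TreeMap<>();

    // Tokenized path: builds a fresh sorted term map from the tokens.
    void addField(String name, List<String> tokens) {
        TreeMap<String, List<Integer>> terms = new TreeMap<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            terms.computeIfAbsent(tokens.get(pos), t -> new ArrayList<>()).add(pos);
        }
        put(name, terms);
    }

    // Wrapping path: reuses an already-sorted term view (assumed here to
    // play the role of term vector Terms) without copying it.
    void addField(String name, SortedMap<String, List<Integer>> existingTerms) {
        put(name, existingTerms);
    }

    private void put(String name, SortedMap<String, List<Integer>> terms) {
        if (fields.containsKey(name)) {
            throw new IllegalArgumentException("field already added: " + name);
        }
        fields.put(name, new Info(fields.size(), terms));
    }

    // Field names come back in sorted order straight from the TreeMap.
    List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }

    // Positions of a term in a field, or an empty list if either is absent.
    List<Integer> positions(String field, String term) {
        Info info = fields.get(field);
        if (info == null) {
            return Collections.<Integer>emptyList();
        }
        List<Integer> found = info.terms.get(term);
        return found == null ? Collections.<Integer>emptyList() : found;
    }
}
```

In this sketch the wrapping overload does no per-token work, which is the saving the description is after when the terms already exist in sorted form.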

Attachments

- LUCENE-6034.patch (30/Oct/14 16:13, 27 kB, David Smiley)
- LUCENE-6034.patch (01/Dec/14 04:57, 32 kB, David Smiley)
- LUCENE-6034.patch (01/Dec/14 13:04, 32 kB, David Smiley)
- LUCENE-6034.patch (03/Dec/14 13:27, 13 kB, David Smiley)
- LUCENE-6034.patch (05/Dec/14 03:52, 39 kB, David Smiley)
- LUCENE-6034_Simplify_MemoryIndex.patch (03/Dec/14 13:17, 21 kB, David Smiley)

Issue Links

requires: LUCENE-6031 TokenSources optimization, avoid sort (Closed)

People

Assignee: David Smiley

Reporter: David Smiley

Votes: 1

Watchers: 5

Dates

Created: 30/Oct/14 16:11

Updated: 28/Aug/22 14:18

Resolved: 05/Dec/14 15:27