[LUCENE-1812] Static index pruning by in-document term frequency (Carmel pruning) - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.6
Component/s: modules/other
Labels: None
Lucene Fields: New, Patch Available

Description

This module provides tools that produce a reduced copy of the input indexes by removing postings data for terms whose in-document frequency is below a specified threshold. The net effect is a much smaller index that, for common types of queries, returns nearly identical top-N results to the original index, but with increased performance.

Optionally, stored values and term vectors can also be removed. This functionality is largely independent of term pruning, so it can be used on its own (by setting the term frequency threshold to 1).

As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: phrase recall in particular deteriorates significantly at higher threshold values.
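The effect of the threshold can be illustrated with a toy model. The sketch below is not the code from the patch and does not use Lucene; the class `TfPruningSketch` and its in-memory postings map (term -> document id -> in-document term frequency) are assumptions made for illustration only. It keeps a posting only when its term frequency is at least the threshold, so a threshold of 1 leaves the postings untouched and higher values shrink them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical toy model of term-frequency pruning; not part of the patch. */
class TfPruningSketch {

    /** Returns a copy of the postings that keeps only entries with tf >= threshold. */
    static Map<String, Map<Integer, Integer>> prune(
            Map<String, Map<Integer, Integer>> postings, int threshold) {
        Map<String, Map<Integer, Integer>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> term : postings.entrySet()) {
            Map<Integer, Integer> kept = new LinkedHashMap<>();
            for (Map.Entry<Integer, Integer> posting : term.getValue().entrySet()) {
                if (posting.getValue() >= threshold) {
                    kept.put(posting.getKey(), posting.getValue());
                }
            }
            // A term whose postings were all pruned disappears from the index.
            if (!kept.isEmpty()) {
                out.put(term.getKey(), kept);
            }
        }
        return out;
    }

    /** Total number of (term, document) postings. */
    static int size(Map<String, Map<Integer, Integer>> postings) {
        int n = 0;
        for (Map<Integer, Integer> docs : postings.values()) {
            n += docs.size();
        }
        return n;
    }

    /** Small made-up sample: three terms over three documents. */
    static Map<String, Map<Integer, Integer>> sample() {
        Map<String, Map<Integer, Integer>> p = new LinkedHashMap<>();
        Map<Integer, Integer> lucene = new LinkedHashMap<>();
        lucene.put(0, 3);
        lucene.put(1, 1);
        Map<Integer, Integer> index = new LinkedHashMap<>();
        index.put(0, 1);
        index.put(1, 2);
        index.put(2, 1);
        Map<Integer, Integer> pruning = new LinkedHashMap<>();
        pruning.put(2, 4);
        p.put("lucene", lucene);
        p.put("index", index);
        p.put("pruning", pruning);
        return p;
    }

    public static void main(String[] args) {
        for (int threshold = 1; threshold <= 3; threshold++) {
            System.out.println("threshold=" + threshold
                    + " postings=" + size(prune(sample(), threshold)));
        }
    }
}
```

In this sample the posting count drops from 6 to 3 to 2 as the threshold goes from 1 to 3. Any document that mentions a term only once is no longer findable through that term at threshold 2, which is the recall loss described above.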

The primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and to store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching.

NOTE: If the input index is optimized (i.e. it doesn't contain deletions), then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document IDs, so that they stay in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed and quickly retrieved on demand from the original index using the same internal document ID.

Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is as follows: first a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold.
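The precedence rule can be sketched as a small lookup. This is a minimal illustration under stated assumptions, not the code from the patch: the class `ThresholdResolver` and its `resolve` method are hypothetical names, and only the thresholds map, its key formats and defaultThreshold come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of the patch: picks the threshold for one term. */
class ThresholdResolver {
    // Keys are either a field name ("body") or a term in field:text format ("body:lucene").
    private final Map<String, Integer> thresholds;
    private final int defaultThreshold;

    ThresholdResolver(Map<String, Integer> thresholds, int defaultThreshold) {
        this.thresholds = new HashMap<>(thresholds);
        this.defaultThreshold = defaultThreshold;
    }

    /** Per-term value first, then per-field value, then the default. */
    int resolve(String field, String text) {
        Integer perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm;
        }
        Integer perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }

    public static void main(String[] args) {
        Map<String, Integer> thresholds = new HashMap<>();
        thresholds.put("body", 3);        // per-field threshold
        thresholds.put("body:lucene", 1); // per-term threshold
        ThresholdResolver resolver = new ThresholdResolver(thresholds, 2);

        System.out.println(resolver.resolve("body", "lucene")); // per-term entry applies
        System.out.println(resolver.resolve("body", "index"));  // per-field entry applies
        System.out.println(resolver.resolve("title", "index")); // default applies
    }
}
```

With these sample values, the term body:lucene is never pruned (threshold 1), other terms in body need a frequency of at least 3 under the keep-if-not-below reading of the threshold, and every other field falls back to the default of 2.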

A command-line tool (PruningTool) is provided for convenience. At the moment it doesn't support all of the functionality available through the API.

Attachments

pruning.patch, 30/Jan/12 18:30, 92 kB, Doron Cohen
pruning.patch, 23/Jan/12 15:31, 89 kB, Doron Cohen
pruning.patch, 12/Aug/10 12:12, 80 kB, Doron Cohen
pruning.patch, 18/May/10 15:18, 59 kB, Andrzej Bialecki
pruning.patch, 02/Nov/09 14:36, 54 kB, Andrzej Bialecki
pruning.patch, 15/Aug/09 14:58, 30 kB, Andrzej Bialecki

Issue Links

is blocked by: LEGAL-78 Contribution of code that uses a contributor's patent (Closed)

People

Assignee: Doron Cohen

Reporter: Andrzej Bialecki

Votes: 4

Watchers: 8

Dates

Created: 15/Aug/09 14:57

Updated: 28/Aug/22 12:06

Resolved: 25/Mar/12 16:51