[LUCENE-1458] Further steps towards flexible indexing - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 4.0-ALPHA
Fix Version/s: 4.0-ALPHA
Component/s: core/index
Labels: None

Lucene Fields: New

Description

I attached a very rough checkpoint of my current patch, to get early
feedback. All tests pass, though back compat tests don't pass due to
changes to package-private APIs plus certain bugs in tests that
happened to work (e.g. calling TermPositions.nextPosition() too many
times, which the new API asserts against).

[Aside: I think, when we commit changes to package-private APIs such
that back-compat tests don't pass, we could go back, make a branch on
the back-compat tag, commit changes to the tests to use the new
package private APIs on that branch, then fix nightly build to use the
tip of that branch?]

There's still plenty to do before this is committable! This is a
rather large change:

Switches to a new, more efficient terms dict format. This still
uses tii/tis files, but the tii only stores term & long offset
(not a TermInfo). At seek points, tis encodes term & freq/prox
offsets absolutely instead of with deltas. Also, tis/tii
are structured by field, so we don't have to record field number
in every term.

On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
-> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).

RAM usage when loading terms dict index is significantly less
since we only load an array of offsets and an array of String (no
more TermInfo array). It should be faster to init too.

This part is basically done.
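
As a rough illustration of why the index needs only two parallel arrays, here is a minimal sketch of a seek-point lookup. The class and method names are hypothetical and are not taken from the patch, and it assumes plain String.compareTo ordering for terms.

```java
import java.util.Arrays;

/** Hypothetical in-memory terms index: two parallel arrays, no TermInfo array. */
final class TermsIndexSketch {
    private final String[] indexTerms; // sorted terms stored at seek points
    private final long[] tisOffsets;   // absolute offset of each seek point in the main terms file

    TermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
        if (indexTerms.length != tisOffsets.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.indexTerms = indexTerms.clone();
        this.tisOffsets = tisOffsets.clone();
    }

    /**
     * Returns the offset of the last seek point whose term is <= target,
     * or -1 if target sorts before the first indexed term.
     */
    long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            pos = -pos - 2; // (insertion point) - 1, i.e. the preceding seek point
        }
        return pos < 0 ? -1L : tisOffsets[pos];
    }
}
```

A reader would then seek the main terms file to the returned offset and scan forward to the exact term.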

Introduces a modular reader codec that strongly decouples terms dict
from docs/positions readers. EG there is no more TermInfo used
when reading the new format.

There's nice symmetry now between reading & writing in the codec
chain – the current docs/prox format is captured in:
```
FormatPostingsTermsDictWriter/Reader
FormatPostingsDocsWriter/Reader (.frq file) and
FormatPostingsPositionsWriter/Reader (.prx file).
```
This part is basically done.
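
To make the decoupling concrete, here is a minimal sketch in which the terms dict knows only a per-term offset and hands decoding to whichever docs reader is plugged in. All names are hypothetical, and the count-plus-deltas layout is a toy encoding chosen for the example, not the actual .frq format.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical docs reader: decodes a term's doc IDs given only an offset. */
interface DocsReaderSketch {
    int[] readDocs(long offset);
}

/** Hypothetical terms dict: maps term -> offset and delegates all decoding. */
final class TermsDictSketch {
    private final Map<String, Long> termToOffset = new HashMap<>();
    private final DocsReaderSketch docsReader;

    TermsDictSketch(DocsReaderSketch docsReader) {
        this.docsReader = docsReader;
    }

    void add(String term, long offset) {
        termToOffset.put(term, offset);
    }

    /** Returns the doc IDs for a term, or an empty array if the term is unknown. */
    int[] docs(String term) {
        Long offset = termToOffset.get(term);
        return offset == null ? new int[0] : docsReader.readDocs(offset);
    }
}

/** Toy reader: data[offset] holds a count, followed by that many doc ID deltas. */
final class DeltaDocsReaderSketch implements DocsReaderSketch {
    private final int[] data;

    DeltaDocsReaderSketch(int[] data) {
        this.data = data;
    }

    @Override
    public int[] readDocs(long offset) {
        int start = (int) offset;
        int count = data[start];
        int[] docs = new int[count];
        int doc = 0;
        for (int i = 0; i < count; i++) {
            doc += data[start + 1 + i]; // accumulate deltas into absolute doc IDs
            docs[i] = doc;
        }
        return docs;
    }
}
```

Swapping in a different DocsReaderSketch implementation changes the postings encoding without touching TermsDictSketch, which is the kind of modularity the codec chain aims for.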

Introduces a new "flex" API for iterating through the fields,
terms, docs and positions:
```
FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
```
This replaces TermEnum/TermDocs/TermPositions. SegmentReader emulates
the old API on top of the new API to keep back-compat.
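
The enumeration chain can be pictured with the following sketch. The interface names follow the chain above, but every method signature here is an assumption made for illustration and is not the patch's actual API. The backing store is a nested sorted map (field -> term -> doc -> positions), and an IllegalStateException stands in for the assertion against reading too many positions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

final class FlexWalkSketch {
    // Hypothetical signatures; only the four-level nesting comes from the issue text.
    interface PositionsEnum { int next(); }
    interface DocsEnum {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int next();
        int freq();
        PositionsEnum positions();
    }
    interface TermsEnum { String next(); DocsEnum docs(); }
    interface FieldProducer { Iterable<String> fields(); TermsEnum terms(String field); }

    /** Toy producer over field -> term -> doc -> positions. */
    static FieldProducer inMemory(
            SortedMap<String, SortedMap<String, SortedMap<Integer, int[]>>> index) {
        return new FieldProducer() {
            public Iterable<String> fields() { return index.keySet(); }
            public TermsEnum terms(String field) {
                Iterator<Map.Entry<String, SortedMap<Integer, int[]>>> terms =
                        index.get(field).entrySet().iterator();
                return new TermsEnum() {
                    private SortedMap<Integer, int[]> current; // set by next()
                    public String next() {
                        if (!terms.hasNext()) return null; // null signals end of terms
                        Map.Entry<String, SortedMap<Integer, int[]>> e = terms.next();
                        current = e.getValue();
                        return e.getKey();
                    }
                    public DocsEnum docs() { // call next() first
                        Iterator<Map.Entry<Integer, int[]>> docs = current.entrySet().iterator();
                        return new DocsEnum() {
                            private int[] positions;
                            public int next() {
                                if (!docs.hasNext()) return NO_MORE_DOCS;
                                Map.Entry<Integer, int[]> e = docs.next();
                                positions = e.getValue();
                                return e.getKey();
                            }
                            public int freq() { return positions.length; }
                            public PositionsEnum positions() {
                                int[] p = positions;
                                int[] cursor = {0};
                                return () -> {
                                    if (cursor[0] >= p.length) {
                                        throw new IllegalStateException(
                                                "position read more than freq() times");
                                    }
                                    return p[cursor[0]++];
                                };
                            }
                        };
                    }
                };
            }
        };
    }

    /** Walks fields -> terms -> docs -> positions and records every posting. */
    static List<String> dump(FieldProducer producer) {
        List<String> out = new ArrayList<>();
        for (String field : producer.fields()) {
            TermsEnum terms = producer.terms(field);
            for (String term = terms.next(); term != null; term = terms.next()) {
                DocsEnum docs = terms.docs();
                for (int doc = docs.next(); doc != DocsEnum.NO_MORE_DOCS; doc = docs.next()) {
                    PositionsEnum positions = docs.positions();
                    for (int i = 0; i < docs.freq(); i++) {
                        out.add(field + ":" + term + " doc=" + doc + " pos=" + positions.next());
                    }
                }
            }
        }
        return out;
    }
}
```

An emulation of the old API could be layered on a walker like dump(), pulling from the new enumerators and exposing the results through the old-style methods.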

Next steps:

Plug in new codecs (pulsing, pfor) to exercise the modularity /
fix any hidden assumptions.

Expose new API out of IndexReader, deprecate old API but emulate
old API on top of new one, switch all core/contrib users to the
new API.

Maybe switch to AttributeSources as the base class for TermsEnum,
DocsEnum, PostingsEnum – this would give readers API flexibility
(not just index-file-format flexibility). EG if someone wanted
to store a payload at the term-doc level instead of the
term-doc-position level, they could just add a new attribute.

Test performance & iterate.
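
As a rough illustration of the attribute idea from the next steps, the sketch below shows a type-keyed attribute container that an enumerator could extend, so that a per-document payload becomes one more attribute rather than a change to the enumerator's interface. This is a simplified stand-in and not Lucene's AttributeSource; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical type-keyed attribute container (not Lucene's AttributeSource). */
class AttributeBagSketch {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    /** Registers an attribute instance under its type. */
    <T> void addAttribute(Class<T> type, T instance) {
        attributes.put(type, instance);
    }

    /** Returns the attribute registered for the type, or null if absent. */
    <T> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }

    boolean hasAttribute(Class<?> type) {
        return attributes.containsKey(type);
    }
}

/** Hypothetical attribute carrying a payload at the term-doc level. */
final class DocPayloadSketch {
    byte[] bytes = new byte[0];
}
```

A consumer that understands DocPayloadSketch asks for it by type; consumers that do not simply never look it up.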

Attachments


- LUCENE-1458_rotate.patch (30/Nov/09 00:28, 4 kB, Robert Muir)
- LUCENE-1458_sortorder_bwcompat.patch (24/Nov/09 00:44, 3 kB, Robert Muir)
- LUCENE-1458_termenum_bwcompat.patch (23/Nov/09 19:03, 1 kB, Robert Muir)
- LUCENE-1458.patch (13/Oct/09 16:11, 883 kB, Mark Miller)
- LUCENE-1458.patch (13/Oct/09 05:54, 878 kB, Mark Miller)
- LUCENE-1458.patch (09/Oct/09 22:46, 909 kB, Michael McCandless)
- LUCENE-1458.patch (07/Oct/09 16:01, 895 kB, Michael McCandless)
- LUCENE-1458.patch (06/Oct/09 15:06, 886 kB, Michael McCandless)
- LUCENE-1458.patch (06/Oct/09 04:06, 1024 kB, Mark Miller)
- LUCENE-1458.patch (05/Oct/09 23:58, 1015 kB, Mark Miller)
- LUCENE-1458.patch (12/Aug/09 10:13, 360 kB, Michael Busch)
- LUCENE-1458.patch (24/Feb/09 14:22, 370 kB, Michael McCandless)
- LUCENE-1458.patch (21/Nov/08 11:40, 263 kB, Michael McCandless)
- LUCENE-1458.patch (18/Nov/08 22:10, 188 kB, Michael McCandless)
- LUCENE-1458.patch (18/Nov/08 15:41, 167 kB, Michael McCandless)
- LUCENE-1458.patch (18/Nov/08 10:32, 116 kB, Michael McCandless)
- LUCENE-1458.tar.bz2 (05/Oct/09 12:19, 1.93 MB, Michael McCandless)
- LUCENE-1458.tar.bz2 (02/Oct/09 00:32, 1.94 MB, Michael McCandless)
- LUCENE-1458.tar.bz2 (25/Sep/09 17:58, 1.84 MB, Michael McCandless)
- LUCENE-1458.tar.bz2 (23/Sep/09 17:10, 1.83 MB, Michael McCandless)
- LUCENE-1458.tar.bz2 (12/Sep/09 16:41, 1.82 MB, Michael McCandless)
- LUCENE-1458.tar.bz2 (11/Sep/09 13:49, 1.83 MB, Michael McCandless)
- LUCENE-1458.tar.bz2 (04/Sep/09 00:10, 1.80 MB, Michael McCandless)
- LUCENE-1458-back-compat.patch (05/Oct/09 12:19, 22 kB, Michael McCandless)
- LUCENE-1458-back-compat.patch (02/Oct/09 00:32, 22 kB, Michael McCandless)
- LUCENE-1458-back-compat.patch (25/Sep/09 17:58, 16 kB, Michael McCandless)
- LUCENE-1458-back-compat.patch (23/Sep/09 17:10, 16 kB, Michael McCandless)
- LUCENE-1458-back-compat.patch (12/Sep/09 16:41, 15 kB, Michael McCandless)
- LUCENE-1458-back-compat.patch (11/Sep/09 13:49, 15 kB, Michael McCandless)
- LUCENE-1458-DocIdSetIterator.patch (03/Dec/09 14:36, 22 kB, Uwe Schindler)
- LUCENE-1458-DocIdSetIterator.patch (03/Dec/09 14:27, 21 kB, Uwe Schindler)
- LUCENE-1458-MTQ-BW.patch (02/Dec/09 00:08, 2 kB, Uwe Schindler)
- LUCENE-1458-NRQ.patch (01/Dec/09 07:54, 12 kB, Uwe Schindler)
- UnicodeTestCase.patch (23/Nov/09 11:28, 2 kB, Robert Muir)
- UnicodeTestCase.patch (23/Nov/09 04:08, 2 kB, Robert Muir)

Issue Links

relates to

LUCENE-2025 Ability to turn off the store for an index

Patch Available
