Issue Details (XML | Word | Printable)

Key: LUCENE-1458
Type: New Feature New Feature
Status: Open Open
Priority: Minor Minor
Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 1
Watchers: 6
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Further steps towards flexible indexing

Created: 18/Nov/08 10:31 AM   Updated: Today 03:08 PM
Component/s: Index
Affects Version/s: 2.9
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1458-back-compat.patch 2009-10-05 12:19 PM Michael McCandless 22 kB
Text File Licensed for inclusion in ASF works LUCENE-1458-back-compat.patch 2009-10-02 12:32 AM Michael McCandless 22 kB
Text File Licensed for inclusion in ASF works LUCENE-1458-back-compat.patch 2009-09-25 05:58 PM Michael McCandless 16 kB
Text File Licensed for inclusion in ASF works LUCENE-1458-back-compat.patch 2009-09-23 05:10 PM Michael McCandless 16 kB
Text File Licensed for inclusion in ASF works LUCENE-1458-back-compat.patch 2009-09-12 04:41 PM Michael McCandless 15 kB
Text File Licensed for inclusion in ASF works LUCENE-1458-back-compat.patch 2009-09-11 01:49 PM Michael McCandless 15 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-10-13 04:11 PM Mark Miller 883 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-10-13 05:54 AM Mark Miller 878 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-10-09 10:46 PM Michael McCandless 909 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-10-07 04:01 PM Michael McCandless 895 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-10-06 03:06 PM Michael McCandless 886 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-10-06 04:06 AM Mark Miller 1024 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-10-05 11:58 PM Mark Miller 1015 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-08-12 10:13 AM Michael Busch 360 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2009-02-24 02:22 PM Michael McCandless 370 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2008-11-21 11:40 AM Michael McCandless 263 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2008-11-18 10:10 PM Michael McCandless 188 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2008-11-18 03:41 PM Michael McCandless 167 kB
Text File Licensed for inclusion in ASF works LUCENE-1458.patch 2008-11-18 10:32 AM Michael McCandless 116 kB
File Licensed for inclusion in ASF works LUCENE-1458.tar.bz2 2009-10-05 12:19 PM Michael McCandless 1.93 MB
File Licensed for inclusion in ASF works LUCENE-1458.tar.bz2 2009-10-02 12:32 AM Michael McCandless 1.94 MB
File Licensed for inclusion in ASF works LUCENE-1458.tar.bz2 2009-09-25 05:58 PM Michael McCandless 1.84 MB
File Licensed for inclusion in ASF works LUCENE-1458.tar.bz2 2009-09-23 05:10 PM Michael McCandless 1.83 MB
File Licensed for inclusion in ASF works LUCENE-1458.tar.bz2 2009-09-12 04:41 PM Michael McCandless 1.82 MB
File Licensed for inclusion in ASF works LUCENE-1458.tar.bz2 2009-09-11 01:49 PM Michael McCandless 1.83 MB
File Licensed for inclusion in ASF works LUCENE-1458.tar.bz2 2009-09-04 12:10 AM Michael McCandless 1.80 MB
Text File Licensed for inclusion in ASF works UnicodeTestCase.patch 2009-11-23 11:28 AM Robert Muir 2 kB
Text File Licensed for inclusion in ASF works UnicodeTestCase.patch 2009-11-23 04:08 AM Robert Muir 2 kB
Issue Links:
Reference
 

Lucene Fields: New


 Description  « Hide
I attached a very rough checkpoint of my current patch, to get early
feedback. All tests pass, though back compat tests don't pass due to
changes to package-private APIs plus certain bugs in tests that
happened to work (eg call TermPostions.nextPosition() too many times,
which the new API asserts against).

[Aside: I think, when we commit changes to package-private APIs such
that back-compat tests don't pass, we could go back, make a branch on
the back-compat tag, commit changes to the tests to use the new
package private APIs on that branch, then fix nightly build to use the
tip of that branch?o]

There's still plenty to do before this is committable! This is a
rather large change:

  • Switches to a new more efficient terms dict format. This still
    uses tii/tis files, but the tii only stores term & long offset
    (not a TermInfo). At seek points, tis encodes term & freq/prox
    offsets absolutely instead of with deltas delta. Also, tis/tii
    are structured by field, so we don't have to record field number
    in every term.
    .
    On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
    -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
    .
    RAM usage when loading terms dict index is significantly less
    since we only load an array of offsets and an array of String (no
    more TermInfo array). It should be faster to init too.
    .
    This part is basically done.
  • Introduces modular reader codec that strongly decouples terms dict
    from docs/positions readers. EG there is no more TermInfo used
    when reading the new format.
    .
    There's nice symmetry now between reading & writing in the codec
    chain – the current docs/prox format is captured in:
    FormatPostingsTermsDictWriter/Reader
    FormatPostingsDocsWriter/Reader (.frq file) and
    FormatPostingsPositionsWriter/Reader (.prx file).

    This part is basically done.

  • Introduces a new "flex" API for iterating through the fields,
    terms, docs and positions:
    FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum

    This replaces TermEnum/Docs/Positions. SegmentReader emulates the
    old API on top of the new API to keep back-compat.

Next steps:

  • Plug in new codecs (pulsing, pfor) to exercise the modularity /
    fix any hidden assumptions.
  • Expose new API out of IndexReader, deprecate old API but emulate
    old API on top of new one, switch all core/contrib users to the
    new API.
  • Maybe switch to AttributeSources as the base class for TermsEnum,
    DocsEnum, PostingsEnum – this would give readers API flexibility
    (not just index-file-format flexibility). EG if someone wanted
    to store payload at the term-doc level instead of
    term-doc-position level, you could just add a new attribute.
  • Test performance & iterate.


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Mark Miller added a comment - 18/Nov/08 02:40 PM
Hmmm...I think something is missing - FormatPostingsPositionsReader?

Michael McCandless added a comment - 18/Nov/08 03:41 PM
Woops, sorry... I was missing a bunch of files. Try this one?

Marvin Humphrey added a comment - 18/Nov/08 07:52 PM
The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us rely upon the operating system's virtual memory and avoid caching in process memory altogether?

Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block, we'd need a status byte indicating whether the block contains a continuation from the previous block in order to handle cases where term length exceeds the block size.

For Lucy/KinoSearch our plan would be to mmap() on the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek – scanning through the primary term dictionary – involves comparatively more processor power than this, the increased costs would be acceptable.

Advantages:

  • Multiple forks can all share the same system buffer, reducing per-process memory footprint.
  • The cost to read in the index term dictionary during IndexReader startup drops to zero.
  • The OS caches for the index term dictionaries can either be allowed to warm naturally, or can be nudged into virtual memory via e.g. "cat /path/to/index/*.tis > /dev/null".

Michael McCandless added a comment - 18/Nov/08 08:25 PM

Can we design a format that allows us rely upon the operating system's virtual memory and avoid caching in process memory altogether?

Interesting! I've been wondering what you're up to over on KS, Marvin

I'm not sure it'll be a win in practice: I'm not sure I'd trust the
OS's IO cache to "make the right decisions" about what to cache. Plus
during that binary search the IO system is loading whole pages into
the IO cache, even though you'll only peak at the first few bytes of
each.

We could also explore something in-between, eg it'd be nice to
genericize MultiLevelSkipListWriter so that it could index arbitrary
files, then we could use that to index the terms dict. You could
choose to spend dedicated process RAM on the higher levels of the skip
tree, and then tentatively trust IO cache for the lower levels.

I'd like to eventually make the TermsDict index pluggable so one could
swap in different indexers like this (it's not now).


Michael McCandless added a comment - 18/Nov/08 10:10 PM

[Attached patch]

To test whether the new pluggable codec approach is flexible enough, I
coded up "pulsing" (described in detail in
http://citeseer.ist.psu.edu/cutting90optimizations.html), where
freq/prox info is inlined into the terms dict if the term freq is < N.

It was wonderfully simple I just had to create a reader & a writer,
and then switch the places that read (SegmentReader) and write
(SegmentMerger, FreqProxTermsWriter) to use the new pulsing codec
instead of the default one.

The pulsing codec can "wrap" any other codec, ie, when a term is
written, if the term's freq is < N, then it's inlined into the terms
dict with the pulsing writer, else it's fed to the other codec for it
to do whatever it normally would. The two codecs are strongly
decoupled, so we can mix & match pulsing with other codecs like pfor.

All tests pass with this pulsing codec.

As a quick test I indexed first 1M docs from Wikipedia, with N=2 (ie
terms that occur only in one document are inlined into the terms
dict). 5.4M terms get inlined (only 1 doc) and 2.2M terms are not (>
1 doc). The final size of the index (after optimizing) was a bit
smaller with pulsing (1120 MB vs 1131 MB).


Michael Busch added a comment - 18/Nov/08 10:18 PM
I'll look into this patch soon.

Just wanted to say: I'm really excited about the progress here, this is cool stuff!
Great job...


Marvin Humphrey added a comment - 19/Nov/08 12:19 AM
> I'm not sure I'd trust the OS's IO cache to "make the right decisions" about what to cache.

In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard – in which case we won't have to worry about the OS swapping out those pages.

I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems – if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised.

In other words...

  • I trust the OS to do a decent enough job on underpowered systems.
  • High-powered systems should strive to avoid swapping entirely. To aid in that endeavor, we minimize per-process RAM consumption by maximizing our use of mmap and treating the system IO cache backing buffers as interprocess shared memory.

More on designing with modern virtual memory in mind at <http://varnish.projects.linpro.no/wiki/ArchitectNotes>.

> Plus during that binary search the IO system is loading whole pages into
> the IO cache, even though you'll only peak at the first few bytes of each.

I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer.

But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that.

Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.

> We could also explore something in-between, eg it'd be nice to
> genericize MultiLevelSkipListWriter so that it could index arbitrary
> files, then we could use that to index the terms dict. You could
> choose to spend dedicated process RAM on the higher levels of the skip
> tree, and then tentatively trust IO cache for the lower levels.

That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths.

> I'd like to eventually make the TermsDict index pluggable so one could
> swap in different indexers like this (it's not now).

If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.


Michael Busch added a comment - 19/Nov/08 12:37 AM

We could also explore something in-between, eg it'd be nice to
genericize MultiLevelSkipListWriter so that it could index arbitrary
files, then we could use that to index the terms dict.

Hmm, +1 for generalizing the MultiLevelSkipListWriter/Reader so that we can re-use it for different (custom) posting-list formats easily.
However, I'm not so sure if it's the right approach for a dictionary. A skip list is optimized for skipping forward (as the name says), so excellent for positing lists, which are always read from "left to right".
However, in the term dictionary you do a binary search for the lookup term. So something like a B+Tree would probably work better. Then you can decide similar to the MultiLevelSkipListWriter how many of the upper levels you want to keep in memory and control memory consumption.


Michael McCandless added a comment - 19/Nov/08 10:28 AM

So something like a B+Tree would probably work better.

I agree, btree is a better fit, though we don't need insertion & deletion operations since each segment is write once.


Michael McCandless added a comment - 19/Nov/08 01:12 PM

In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard - in which case we won't have to worry about the OS swapping out those pages.

I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems - if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised.

In other words...

  • I trust the OS to do a decent enough job on underpowered systems.
  • High-powered systems should strive to avoid swapping entirely. To aid in that endeavor, we minimize per-process RAM consumption by maximizing our use of mmap and treating the system IO cache backing buffers as interprocess shared memory.

These are the two extremes, but, I think most common are all the apps
in between. Take a large Jira instance, where the app itself is also
consuming alot of RAM, doing alot of its own IO, etc., where perhaps
searching is done infrequently enough relative to other operations
that the OS may no longer think the pages you hit for the terms index
are hot enough to keep around.

More on designing with modern virtual memory in mind at <http://varnish.projects.linpro.no/wiki/ArchitectNotes>.

This is a good read, but I find it overly trusting of VM.

How can the VM system possibly make good decisions about what to swap
out? It can't know if a page is being used for terms dict index,
terms dict, norms, stored fields, postings. LRU is not a good policy,
because some pages (terms index) are far far more costly to miss than
others.

From Java we have even more ridiculous problems: sometimes the OS
swaps out garbage... and then massive swapping takes place when GC
runs, swapping back in the garbage only to then throw it away. Ugh!

I think we need to aim for consistency: a given search should not
suddenly take 10 seconds because the OS decided to swap out a few
critical structures (like the term index). Unfortunately we can't
really achieve that today, especially from Java.

I've seen my desktop OS (Mac OS X 10.5.5, based on FreeBSD) make
stupid VM decisions: if I run something that does a single-pass
through many GB of on-disk data (eg re-encoding a video), it then
swaps out the vast majority of my apps even though I have 6 GB RAM. I
hit tons (many seconds) of swapping just switching back to my mail
client. It's infuriating. I've seen Linux do the same thing, but at
least Linux let's you tune this behavior ("swappiness"); I had to
disable swapping entirely on my desktop.

Similarly, when a BG merge is burning through data, or say backup
kicks off and moves many GB, or the simple act of iterating through a
big postings list, the OS will gleefully evict my terms index or norms
in order to populate its IO cache with data it will need again for a
very long time.

I bet the VM system fails to show graceful degradation: if I don't
have enough RAM to hold my entire index, then walking through postings
lists will evict my terms index and norms, making all searches slower.

In the ideal world, an IndexReader would be told how much RAM to use.
It would spend that RAM wisely, eg first on the terms index, second on
norms, third maybe on select column-stride fields, etc. It would pin
these pages so the OS couldn't swap them out (can't do this from
java... though as a workaround we could use a silly thread). Or, if
the OS found itself tight on RAM, it would ask the app to free things
up instead of blindly picking pages to swap out, which does not happen
today.

From Java we could try using WeakReference but I fear the
communication from the OS -> JRE is too weak. IE I'd want my
WeakReference cleared only when the OS is threatening to swap out my
data structure.

> Plus during that binary search the IO system is loading whole pages into
> the IO cache, even though you'll only peak at the first few bytes of each.

I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer.

But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that.

I'm not convinced this'll be a win in practice. You are now paying an
even higher overhead cost for each "check" of your binary search,
especially with something like pulsing which inlines more stuff into
the terms dict. I agree it's simpler, but I think that's trumped by
the performance hit.

In Lucene java, the concurrency model we are aiming for is a single
JVM sharing a single instance of IndexReader. I do agree, if fork()
is the basis of your concurrency model then sharing pages becomes
critical. However, modern OSs implement copy-on-write sharing of VM
pages after a fork, so that's another good path to sharing?

Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.

Have you tried any actual tests swapping these approaches in as your
terms index impl? Tests of fully hot and fully cold ends of the
spectrum would be interesting, but also tests where a big segment
merge or a backup is running in the background...

That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks.

That's a nice goal. Our biggest cost in Lucene is warming the
FieldCache, used for sorting, function queries, etc. Column-stride
fields should go a ways towards improving this.

It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths.

I think if we make the pluggable API simple, and capture the
complexity inside each impl, such that it can be well tested in
isolation, it's acceptable.

If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.

In my approach here, the blob is opaque to the terms dict reader: it
simply seeks to the right spot in the tis file, and then asks the
codec to decode the entry. TermsDictReader is entirely unaware of
what/how is stored there.


Marvin Humphrey added a comment - 21/Nov/08 01:48 AM
> Take a large Jira instance, where the app itself is also
> consuming alot of RAM, doing alot of its own IO, etc., where perhaps
> searching is done infrequently enough relative to other operations
> that the OS may no longer think the pages you hit for the terms index
> are hot enough to keep around.

Search responsiveness is already compromised in such a situation, because we
can all but guarantee that the posting list files have already been evicted
from cache. If the box has enough RAM for the large JIRA instance including
the Lucene index, search responsiveness won't be a problem. As soon as you
start running a little short on RAM, though, there's no way to stop infrequent
searches from being sluggish.

Nevertheless, the terms index isn't that big in comparison to, say, the size
of a posting list for a common term, so the cost of re-heating it isn't
astronomical in the grand scheme of things.

> Similarly, when a BG merge is burning through data, or say backup kicks off
> and moves many GB, or the simple act of iterating through a big postings
> list, the OS will gleefully evict my terms index or norms in order to
> populate its IO cache with data it will need again for a very long time.

When that background merge finishes, the new files will be hot. So, if we
open a new IndexReader right away and that IndexReader uses mmap() to get at
the file data, new segments be responsive right away.

Even better, any IO caches for old segments used by the previous IndexReader
may still be warm. All of this without having to decompress a bunch of stream
data into per-process data structures at IndexReader startup.

The terms index could indeed get evicted some of the time on busy systems, but
the point is that the system IO cache usually works in our favor, even under
load.

As far as backup daemons blowing up everybody's cache, that's stupid,
pathological behavior: <http://kerneltrap.org/node/3000#comment-8573>. Such
apps ought to be calling madvise(ptr, len, MADV_SEQUENTIAL) so that the kernel
knows it can recycle the cache pages as soon as they're cleared.

>> But hey, we can simplify even further! How about dispensing with the index
>> file? We can just divide the main dictionary file into blocks and binary
>> search on that.
>
> I'm not convinced this'll be a win in practice. You are now paying an
> even higher overhead cost for each "check" of your binary search,
> especially with something like pulsing which inlines more stuff into
> the terms dict. I agree it's simpler, but I think that's trumped by
> the performance hit.

I'm persuaded that we shouldn't do away with the terms index. Even if we're
operating on a dedicated search box with gobs of RAM, loading entire cache
pages when we only care about the first few bytes of each is poor use of
memory bandwidth. And, just in case the cache does get blown, we'd like to
keep the cost of rewarming down.

Nathan Kurz and I brainstormed this subject in a phone call this morning, and
we came up with a three-file lexicon index design:

  • A file which is a solid stack of 64-bit file pointers into the lexicon
    index term data. Term data UTF-8 byte length can be determined by
    subtracting the current pointer from the next one (or the file length at
    the end).
  • A file which is contains solid UTF-8 term content. (No string lengths, no
    file pointers, just character data.)
  • A file which is a solid stack of 64-bit file pointers into the primary
    lexicon.

Since the integers are already expanded and the raw UTF-8 data can be compared
as-is, those files can be memory-mapped and used as-is for binary search.

> In Lucene java, the concurrency model we are aiming for is a single JVM
> sharing a single instance of IndexReader.

When I mentioned this to Nate, he remarked that we're using the OS kernel like
you're using the JVM.

We don't keep a single IndexReader around, but we do keep the bulk of its data
cached so that we can just slap a cheap wrapper around it.

> I do agree, if fork() is the basis of your concurrency model then sharing
> pages becomes critical. However, modern OSs implement copy-on-write sharing
> of VM pages after a fork, so that's another good path to sharing?

Lucy/KS can't enforce that, and we wouldn't want to. It's very convenient to
be able to launch a cheap search process.

> Have you tried any actual tests swapping these approaches in as your
> terms index impl?

No – changing something like this requires a lot of coding, so it's better to
do thought experiments first to winnow down the options.

> Tests of fully hot and fully cold ends of the
> spectrum would be interesting, but also tests where a big segment
> merge or a backup is running in the background...

>> That doesn't meet the design goals of bringing the cost of opening/warming
>> an IndexReader down to near-zero and sharing backing buffers among
>> multiple forks.
>
> That's a nice goal. Our biggest cost in Lucene is warming the FieldCache, used
> for sorting, function queries, etc.

Exactly. It would be nice to add a plug-in indexing component that writes sort
caches to files that can be memory mapped at IndexReader startup. There would
be multiple files: both a solid array of 32-bit integers mapping document
number to sort order, and the field cache values. Such a component would
allow us to move the time it takes to read in a sort cache from
IndexReader-startup-time to index-time.

Hmm, maybe we can conflate this with a column-stride field writer and require
that sort fields have a fixed width?

> In my approach here, the blob is opaque to the terms dict reader: it
> simply seeks to the right spot in the tis file, and then asks the
> codec to decode the entry. TermsDictReader is entirely unaware of
> what/how is stored there.

Sounds good. Basically, a hash lookup.

In KS, the relevant IndexReader methods no longer take a Term object. (In
fact, there IS no Term object any more – KinoSearch::Index::Term has been
removed.) Instead, they take a string field and a generic "Obj".

Lexicon*
SegReader_lexicon(SegReader *self, const CharBuf *field, Obj *term)

{ return (Lexicon*)LexReader_Lexicon(self->lex_reader, field, term); }

I suppose we genericize this by adding a TermsDictReader/LexReader argument to
the IndexReader constructor? That way, someone can supply a custom subclass
that knows how to decode custom dictionary files.


Michael McCandless added a comment - 21/Nov/08 11:40 AM

OK I created another codec, SepCodec (for lack of a better name) that
stores doc & frq & skip in 3 separate files (vs 1 for Lucene today),
as well as positions & payloads in 2 separate files (vs 1 for Lucene
today).

The code is still messy – lots of nocommits all over the place. I'm
still iterating.

Finally, this gets us one step closer to using PFOR! With this codec,
the .frq, .doc and .prx are now "pure" streams of ints.

This codec was more interesting because it adds new files to the file
format, which required fixing the various interesting places where we
assume which file extensions belong to a segment.

In this patch I also created a PostingCodec class, with the 3
subclasses (so far):

  • DefaultCodec: new terms dict format, but same back-compatible
    prx/frq format
  • PulsingCodec: new terms dict format, but inlines rare terms into
    terms dict
  • SepCodec: new terms dict format, and splits doc/frq/skip into
    3 separate files, and prox/payload into 2 separate files

By editing the PostingCodec.getCodec method you can switch all tests
to use each codec; all tests pass using each codec.

I built the 1M Wikipedia index, using SepCodec. Here's the ls -l:

-rw-rw-rw-  1 mike  admin    4000004 Nov 20 17:16 _0.fdt
-rw-rw-rw-  1 mike  admin    8000004 Nov 20 17:16 _0.fdx
-rw-rw-rw-  1 mike  admin  303526787 Nov 20 17:34 _n.doc
-rw-rw-rw-  1 mike  admin         33 Nov 20 17:30 _n.fnm
-rw-rw-rw-  1 mike  admin  220470670 Nov 20 17:34 _n.frq
-rw-rw-rw-  1 mike  admin    3000004 Nov 20 17:34 _n.nrm
-rw-rw-rw-  1 mike  admin  651670377 Nov 20 17:34 _n.prx
-rw-rw-rw-  1 mike  admin          0 Nov 20 17:30 _n.pyl
-rw-rw-rw-  1 mike  admin   84963104 Nov 20 17:34 _n.skp
-rw-rw-rw-  1 mike  admin     666999 Nov 20 17:34 _n.tii
-rw-rw-rw-  1 mike  admin   87551274 Nov 20 17:34 _n.tis
-rw-rw-rw-  1 mike  admin         20 Nov 20 17:34 segments.gen
-rw-rw-rw-  1 mike  admin         64 Nov 20 17:34 segments_2

Some initial observations for SepCodec:

  • Merging/optimizing was noticeably slower... I think there's some
    pending inefficiency in my changes, but it could also simply be
    that having to step through 3 (.frq, .doc, .prx) files instead of
    2 (.frq, .prx) for each segment is that much more costly. (With
    payloads it'd be 4 files instead of 2).
  • Net index size is quite a bit larger (1300 MB vs 1139 MB), I think
    because we are not efficiently encoding the frq=1 case anymore.
    PFOR should fix that.
  • Skip data is just about as large as the terms dict, which
    surprises me (I had intuitively expected it to be smaller I
    guess).

Michael McCandless added a comment - 22/Nov/08 07:11 PM

Nevertheless, the terms index isn't that big in comparison to, say, the size
of a posting list for a common term, so the cost of re-heating it isn't
astronomical in the grand scheme of things.

Be careful: it's the seeking that kills you (until we switch to SSDs
at which point perhaps most of this discussion is moot!). Even though
the terms index net size is low, if re-heating the spots you touch
incurs 20 separate page misses, you lose.

Potentially worse than the terms index are norms, if the search hits
alot of docs.

> Take a large Jira instance...

Search responsiveness is already compromised in such a situation, because we
can all but guarantee that the posting list files have already been evicted
from cache. If the box has enough RAM for the large JIRA instance including
the Lucene index, search responsiveness won't be a problem. As soon as you
start running a little short on RAM, though, there's no way to stop infrequent
searches from being sluggish.

If the term index and norms are pinned (or happen to still be hot), I
would expect most searches to be OK with this "in the middle" use case
because the number of seeks you'll hit should be well contained
(assuming your posting list isn't unduly fragmented by the
filesystem). Burning through the posting list is a linear scan.
Queries that simply hit too many docs will always be slow anyways.

I think at both extremes (way too litle RAM and tons of RAM) both
approaches (pinned in RAM vs mmap'd) should perfom the same. It's the
cases in between where I think letting VM decide whether critical
things (terms index, norms) get to stay hot is dangerous.

The terms index could indeed get evicted some of the time on busy systems, but
the point is that the system IO cache usually works in our favor, even under
load.

I think you're just more trusting of the IO/VM system. I think LRU is
a poor metric.

As far as backup daemons blowing up everybody's cache, that's stupid,
pathological behavior: <http://kerneltrap.org/node/3000#comment-8573>. Such
apps ought to be calling madvise(ptr, len, MADV_SEQUENTIAL) so that the kernel
knows it can recycle the cache pages as soon as they're cleared.

Excellent! If only more people knew about this. And, if only we
could do this from javaland. EG SegmentMerger should do this for all
segment data it's reading & writing.

Nathan Kurz and I brainstormed this subject in a phone call this morning, and
we came up with a three-file lexicon index design:

I don't fully understand this approach. Would the index file pointers
point into the full lexicon's packed utf8 file, or a separate "only
terms in the index" packed utf8 file?

We currently materialize individual Strings when we load our index,
which is bad because of the GC cost, added RAM overhead (& swapping)
and because for iso8859-1 only terms we are using 2X the space over
utf8. So I'd love to eventually do something similar (in RAM) for
Lucene.

> Have you tried any actual tests swapping these approaches in as your
> terms index impl?

No - changing something like this requires a lot of coding, so it's better to
do thought experiments first to winnow down the options.

Agreed. But once you've got the mmap-based solution up and running
it'd be nice to meaure net time doing terms lookup / norms reading,
for a variety of search use cases, and plot that on a histogram.

When I mentioned this to Nate, he remarked that we're using the OS kernel like
you're using the JVM.

True!

Lucy/KS can't enforce that, and we wouldn't want to. It's very convenient to
be able to launch a cheap search process.

It seems like the ability to very quickly launch brand new searchers
is/has become a strong design goal of Lucy/KS. What's the driver
here? Is it for near-realtime search? (Which I think may be better
achieved by having IndexWriter export a reader, rather than using IO
system as the intermediary).

If we fix terms index to bulk load arrays (it's not now) then the cost
of loading norms & terms index on instantiating a reader should be
fairly well contained, though not as near zero as Lucy/KS will be.

> That's a nice goal. Our biggest cost in Lucene is warming the
> FieldCache, used for sorting, function queries, etc.

Exactly. It would be nice to add a plug-in indexing component that
writes sort caches to files that can be memory mapped at IndexReader
startup. There would be multiple files: both a solid array of 32-bit
integers mapping document number to sort order, and the field cache
values. Such a component would allow us to move the time it takes to
read in a sort cache from IndexReader-startup-time to index-time.

Except I would have IndexReader use its RAM budget to pick & choose
which of these will be hot, and which would be mmap'd.

Hmm, maybe we can conflate this with a column-stride field writer
and require that sort fields have a fixed width?

Yes I think column-stride fields writer should write the docID -> ord
part of StringIndex to disk, and MultiRangeQuery in LUCENE-1461 would
then use it. With enumerated type of fields (far fewer unique terms
than docs), bit packing will make them compact.

In KS, the relevant IndexReader methods no longer take a Term
object. (In fact, there IS no Term object any more -
KinoSearch::Index::Term has been removed.) Instead, they take a
string field and a generic "Obj".

But you must at least require these Obj's to know how to compareTo one
another? Does this mean using per-field custom sort ordering
(collator) is straightforward for KS?

I suppose we genericize this by adding a TermsDictReader/LexReader
argument to the IndexReader constructor? That way, someone can
supply a custom subclass that knows how to decode custom dictionary
files.

Right; that's what let me create the PulsingCodec here.

The biggest problem with the "load important stuff into RAM" approach,
of course, is we can't actually pin VM pages from java, which means
the OS will happily swap out my RAM anyway, at which point of course
we should have used mmap. Though apparently at least Windows has an
option to "optimize for services" (= "don't swap out my RAM" I think)
vs "optimize for applications", and Linux lets you tune swappiness.
But both are global.


Marvin Humphrey added a comment - 24/Nov/08 08:41 PM
> Nathan Kurz and I brainstormed this subject in a phone call this morning, and
> we came up with a three-file lexicon index design:
>
> I don't fully understand this approach. Would the index file pointers
> point into the full lexicon's packed utf8 file, or a separate "only
> terms in the index" packed utf8 file?

Just the index terms (i.e. every 128th term). We're trying to fake up an
array of strings without having to load anything into process memory. The
comparison would go something like this:

/* self->text_lengths, self->char_data, and self->lex_file_ptrs are all
   * memory mapped buffers.
   */
  while (hi >= lo) {
    const i32_t mid        = lo + ((hi - lo) / 2);
    const i64_t offset     = self->text_lengths[mid];
    const i64_t mid_len    = self->text_lengths[mid + 1] - offset;
    char *const mid_text   = self->char_data + offset;
    const i32_t comparison = StrHelp_string_diff(target_text, target_len, 
                                                 mid_text, mid_len);
    if      (comparison < 0) { hi = mid - 1; }
    else if (comparison > 0) { lo = mid + 1; }
    else { 
      result = mid; 
      break;
    }
  }
  offset_into_main_lexicon = self->lex_file_ptrs[result]
  ...

However, perhaps some sort of a B-tree with string prefix compression would be
better, as per recent suggestions.


Marvin Humphrey added a comment - 24/Nov/08 11:18 PM
>> Hmm, maybe we can conflate this with a column-stride field writer
>> and require that sort fields have a fixed width?
>
> Yes I think column-stride fields writer should write the docID -> ord
> part of StringIndex to disk, and MultiRangeQuery in LUCENE-1461 would
> then use it. With enumerated type of fields (far fewer unique terms
> than docs), bit packing will make them compact.

How do you plan on dealing with the ord values changing as segments get
added? The addition of a single document triggers the rewriting of the
entire mapping.

I was planning on having SortCacheWriter write the out the docID -> ord
mapping, but with the understanding that there was a relatively high cost so
the module couldn't be core. The idea was to take the cost of iterating over
the field caches during IndexReader startup, move that to index time, and write
out a file that could be memory mapped and shared among multiple search apps.

In theory, if we were to have only per-segment docID -> ord maps, we could
perform inter-segment collation the same way that it's handled at the
MultiSearcher level – by comparing the original strings. It wouldn't be that
expensive in the grand scheme of things, because most of the work would be
done by comparing ord values within large segments.

Unfortunately, that won't work because segment boundaries are hidden from
Scorers.

>> In KS, the relevant IndexReader methods no longer take a Term
>> object. (In fact, there IS no Term object any more -
>> KinoSearch::Index::Term has been removed.) Instead, they take a
>> string field and a generic "Obj".
>
> But you must at least require these Obj's to know how to compareTo one
> another?

Yes.

> Does this mean using per-field custom sort ordering (collator) is
> straightforward for KS?

That's one objective. The implementation is incomplete.

Another objective is to allow non-string term types, e.g. TimeStamp,
Float... Hmm... how about FixedWidthText?


Marvin Humphrey added a comment - 24/Nov/08 11:27 PM
>> I suppose we genericize this by adding a TermsDictReader/LexReader
>> argument to the IndexReader constructor? That way, someone can
>> supply a custom subclass that knows how to decode custom dictionary
>> files.
>
> Right; that's what let me create the PulsingCodec here.

I'm running into an OO design problem because of the SegmentReader/MultiReader
bifurcation. If IndexReader were an ordinary class, and we expected all of
its component parts to perform their own collation of data from multiple
segments, then the API for overriding individual components would be
straightforward:

reader = new IndexReader(termsDictReader, postingsReader, fieldsReader);

We can't do that, though, because there's logic in IndexReader.open() which
guards against race conditions with regards to file deletion and index
modification, and the initialization of the auxiliary reader components would
happen outside those guards – possibly resulting in sub-components within an
IndexReader object reading from different versions of the index.

Using setters a la reader.setTermsDictReader(termsDictReader) is problematic
for the same reason.

Are factory methods the only way to handle adding or replacing components
within IndexReader?

KS forces people to subclass Schema to define their index, but up till now
there hasn't been anything that would affect the complement of major
sub-components within IndexReader or InvIndexer (=IndexWriter). I suppose
Schema is the right place to put stuff like this, but it seems a lot more
elaborate than the factory method which returns the index's default Analyzer.


Marvin Humphrey added a comment - 25/Nov/08 12:34 AM

> Be careful: it's the seeking that kills you (until we switch to SSDs
> at which point perhaps most of this discussion is moot!). Even though
> the terms index net size is low, if re-heating the spots you touch
> incurs 20 separate page misses, you lose.

Perhaps for such situations, we can make it possible to create custom
HotLexiconReader or HotIndexReader subclasses that slurp term index files and
what-have-you into process memory. Implementation would be easy, since we can
just back the InStreams with malloc'd RAM buffers rather than memory mapped
system buffers.

Consider the tradeoffs. On the one hand, if we rely on memory mapped buffers,
busy systems may experience sluggish search after long lapses in a worst case
scenario. On the other hand, reading a bunch of stuff into process memory
makes IndexReader a lot heavier, with large indexes imposing consistently
sluggish startup and a large RAM footprint on each object.

> It seems like the ability to very quickly launch brand new searchers
> is/has become a strong design goal of Lucy/KS. What's the driver
> here? Is it for near-realtime search?

Near-realtime search is one of the motivations. But lightweight IndexReaders
are more convenient in all sorts of ways.

Elaborate pre-warming rituals are necessary with heavy IndexReaders whenever
indexes get modified underneath a persistent search service. This is
certainly a problem when you are trying to keep up with real-time insertions,
but it is also a problem with batch updates or optimization passes.

With lightweight IndexReaders, you can check whether the index has been
modified as requests come in, launch a new Searcher if it has, then deal with
the request after a negligible delay. You have to warm the system io caches
when the service starts up ("cat /path/to/index/* > /dev/null"), but after
that, there's no more need for background warming.

Lightweight IndexReaders can also be sprinkled liberally around source code in
a way that heavy IndexReaders cannot. For instance, each thread in a
multi-threaded server can have its own Searcher.

Launching cheap search processes is also important when writing tools akin to
the Unix command line 'locate' app. The first time you invoke locate it's
slow, but subsequent invocations are nice and quick. You can only mimic that
with a lightweight IndexReader.

And so on... The fact that segment data files are never modified once written
makes the Lucene/Lucy/KS file format design particularly well suited for
memory mapping and sharing via the system buffers. In addition to the reasons
cited above, intuition tells me that this is the right design decision and
that there will be other opportunities not yet anticipated. I don't see how Lucy
can deny such advantages to most users for the sake of those few for whom
term dictionary cache eviction proves to be a problem, especially when we can
offer those users a remedy.

> The biggest problem with the "load important stuff into RAM" approach,
> of course, is we can't actually pin VM pages from java, which means
> the OS will happily swap out my RAM anyway, at which point of course
> we should have used mmap.

We can't realistically pin pages from C, either, at least on Unixen. Modern
Unixen offer the mlock() command, but it has a crucial limitation – you have to
run it as root.

Also, there aren't any madvise() flags that hint to the OS that the mapped
region should stay hot. The closest thing is MADV_WILLNEED, which
communicates "this will be needed soon" – not "keep this around".


Michael McCandless added a comment - 25/Nov/08 11:23 AM

Just the index terms (i.e. every 128th term). We're trying to fake up an
array of strings without having to load anything into process memory. The
comparison would go something like this:

OK this makes sense. We could do something similar in Lucene. Not
creating String objects is nice. I wonder in practice how much time
we are "typically" spending loading the terms index...

However, perhaps some sort of a B-tree with string prefix compression would be
better, as per recent suggestions.

B-tree or FST/trie or ... something.

Actually: I just realized the terms index need not store all suffixes
of the terms it stores. Only unique prefixes (ie a simple letter
trie, not FST). Because, its goal is to simply find the spot in the
main lexicon file to seek to and then scan from. This makes it even
smaller!

Though, if we want to do neat things like respelling, wildcard/prefix
searching, etc., which reduce to graph-intersection problems, we would
need the suffix and we would need the entire lexicon (not just every
128th index term) compiled into the FST.


Michael McCandless added a comment - 25/Nov/08 11:43 AM

How do you plan on dealing with the ord values changing as segments get
added? The addition of a single document triggers the rewriting of the
entire mapping.

...

Unfortunately, that won't work because segment boundaries are hidden from
Scorers.

This is a big challenge – presenting a merged docID->ord map for a
MultiSegmentReader is very costly.

I think, just like we are pushing for column-stride / FieldCache to be
"per segment" instead of one big merged array, we should move in the
same direction for searching?

Ie, if one did all searching with MultiSearcher, it should work well.
Each segment uses its pre-computed (during indexing) docID->ord
mapping. Merge-sorting the results from each searcher ought to be low
cost since you only need to lookup the string values for the top N
docs (though care must be taken to not incur N seeks for this... eg
perhaps each reader, on hitting a doc that makes it into the pqueue,
should then seek&load the String value from column-stride store?). An
optimized index wouldn't need to read any of the actual string values
since no results merging is needed.

For the RangeFilter impl in LUCENE-1461 (which'd use the docID->order
per segment, using MultiSearcher), string values are never needed.

> Does this mean using per-field custom sort ordering (collator) is
> straightforward for KS?

That's one objective. The implementation is incomplete.

Another objective is to allow non-string term types, e.g. TimeStamp,
Float... Hmm... how about FixedWidthText?

Neat!


Michael McCandless added a comment - 25/Nov/08 11:52 AM

If IndexReader were an ordinary class, and we expected all of
its component parts to perform their own collation of data from multiple
segments, then the API for overriding individual components would be
straightforward:

reader = new IndexReader(termsDictReader, postingsReader, fieldsReader);

We can't do that, though, because there's logic in IndexReader.open() which
guards against race conditions with regards to file deletion and index
modification, and the initialization of the auxiliary reader components would
happen outside those guards - possibly resulting in sub-components within an
IndexReader object reading from different versions of the index.

I think you "just" have to have "index version data" that's
collectively read/written, atomically, and is then used to init all
the components. This is what segments_N is in Lucene (and I think
"Schema" is in KS/Lucy?): it contains all details that all
sub-components need.

If init'ing each sub-component is then costly (opening files,
slurping things in, etc.) its OK because they are all still loading a
consistent commit point.


Michael McCandless added a comment - 25/Nov/08 01:20 PM

> Be careful: it's the seeking that kills you (until we switch to SSDs
> at which point perhaps most of this discussion is moot!). Even though
> the terms index net size is low, if re-heating the spots you touch
> incurs 20 separate page misses, you lose.

Perhaps for such situations, we can make it possible to create custom
HotLexiconReader or HotIndexReader subclasses that slurp term index files and
what-have-you into process memory. Implementation would be easy, since we can
just back the InStreams with malloc'd RAM buffers rather than memory mapped
system buffers.

Consider the tradeoffs. On the one hand, if we rely on memory mapped buffers,
busy systems may experience sluggish search after long lapses in a worst case
scenario. On the other hand, reading a bunch of stuff into process memory
makes IndexReader a lot heavier, with large indexes imposing consistently
sluggish startup and a large RAM footprint on each object.

I think this is a fabulous solution. If you make things so pluggable
that you can choose to swap in "mmap this thing" vs "slurp in this
thing" and it's the same interface presented to the consumer, then, we
don't need to resolve this debate now Put both out in the field and
gather data...

Elaborate pre-warming rituals are necessary with heavy IndexReaders whenever
indexes get modified underneath a persistent search service. This is
certainly a problem when you are trying to keep up with real-time insertions,
but it is also a problem with batch updates or optimization passes.

With lightweight IndexReaders, you can check whether the index has been
modified as requests come in, launch a new Searcher if it has, then deal with
the request after a negligible delay. You have to warm the system io caches
when the service starts up ("cat /path/to/index/* > /dev/null"), but after
that, there's no more need for background warming.

Well ...that cat command can be deadly for a large index, too? You've
replaced the elaborate pre-warming ritual (= run certain queries that
you know will populate variou caches) with a cat command that doesn't
distinguish what's important (norms, terms index, certain docID->ord
maps, certain column-stride-fields, etc.) from what's less important.

Lightweight IndexReaders can also be sprinkled liberally around source code in
a way that heavy IndexReaders cannot. For instance, each thread in a
multi-threaded server can have its own Searcher.

Launching cheap search processes is also important when writing tools akin to
the Unix command line 'locate' app. The first time you invoke locate it's
slow, but subsequent invocations are nice and quick. You can only mimic that
with a lightweight IndexReader.

This is indeed nice. I think the two approaches boil down to "pay up
front & reuse" (Lucene, slurping) vs "pay as you go & discard"
(KS/Lucy, mmap'ing).

And so on... The fact that segment data files are never modified once written
makes the Lucene/Lucy/KS file format design particularly well suited for
memory mapping and sharing via the system buffers. In addition to the reasons
cited above, intuition tells me that this is the right design decision and
that there will be other opportunities not yet anticipated. I don't see how Lucy
can deny such advantages to most users for the sake of those few for whom
term dictionary cache eviction proves to be a problem, especially when we can
offer those users a remedy.

[BTW the ZFS filesystem gets many of its nice properties for the same
reason – "write once", at the file block level.]

Lucene java takes advantage of that 'write once' nature during
IndexReader.reopen(). If we can finally push FieldCache, norms,
docID->ord to be per-reader then the reopen of a MultiSearcher should
be alot better than it is today.

> The biggest problem with the "load important stuff into RAM" approach,
> of course, is we can't actually pin VM pages from java, which means
> the OS will happily swap out my RAM anyway, at which point of course
> we should have used mmap.

We can't realistically pin pages from C, either, at least on Unixen. Modern
Unixen offer the mlock() command, but it has a crucial limitation - you have to
run it as root.

Also, there aren't any madvise() flags that hint to the OS that the mapped
region should stay hot. The closest thing is MADV_WILLNEED, which
communicates "this will be needed soon" - not "keep this around".

Alas.

The only fallback is gross system-level tunings ("swappiness" on Linux
and "Adjust for best performance of: Programs/System Cache" on Windows
Server 2003, at least).

Or also a silly "keep warm" thread...


Marvin Humphrey added a comment - 25/Nov/08 02:47 PM
> I think you "just" have to have "index version data" that's
> collectively read/written, atomically, and is then used to init all
> the components. This is what segments_N is in Lucene (and I think
> "Schema" is in KS/Lucy?): it contains all details that all
> sub-components need.

The equivalent to segments_N in KinoSearch is snapshot_N.meta, which is
encoded as JSON. There's a KinoSearch::Index::Snapshot class that's
responsible for reading/writing it.

KinoSearch::Schema is for defining your index: global field properties,
default Analyzer, etc. It's similar to Solr's schema.xml, but implemented as
an abstract class that users are required to subclass. Translated to Java,
the subclassing might look something like this:

class MySchema extends Schema {
    class URLField extends TextField {
        boolean analyzed() { return false; }
        boolean indexed() { return false; }
    }

    void initFields() {
      addField("title", "text");
      addField("content", "text");
      addField("url", new URLField());
    }

    Analyzer analyzer() {
      return new PolyAnalyzer("en");
    }
  }

I anticipate that Lucy will adopt both Schema and Snapshot in some form, but
after discussion.

> If init'ing each sub-component is then costly (opening files,
> slurping things in, etc.) its OK because they are all still loading a
> consistent commit point.

So, something like this prospective Lucy code? (Lucy with Java bindings, that is.)

MySchema schema = new MySchema();
  Snapshot snapshot = new Snapshot((Schema)schema);
  snapShot.readSnapShot("/path/to/index");
  MyTermsDictReader termsDictReader = new MyTermsDictReader(schema, snapshot);
  IndexReader reader = new IndexReader(schema, snapshot, null, null,
                                       (TermsDictReader)termsDictReader);

What if index files get deleted out from under that code block? The user will
have to implement retry logic.


Michael McCandless added a comment - 25/Nov/08 03:51 PM

The equivalent to segments_N in KinoSearch is snapshot_N.meta, which is
encoded as JSON. There's a KinoSearch::Index::Snapshot class that's
responsible for reading/writing it.

KinoSearch::Schema is for defining your index: global field properties,
default Analyzer, etc. It's similar to Solr's schema.xml, but implemented as
an abstract class that users are required to subclass. Translated to Java,
the subclassing might look something like this:

OK got it.

What if index files get deleted out from under that code block? The
user will have to implement retry logic.

I would think this "openReader" method would live inside Lucy/KS, and
would in fact implement its own retry logic (to load the next snapshot
and try again). I must be missing some part of the question here...


Marvin Humphrey added a comment - 25/Nov/08 04:39 PM
> I would think this "openReader" method would live inside Lucy/KS, and
> would in fact implement its own retry logic (to load the next snapshot
> and try again). I must be missing some part of the question here...

If the retry code lives inside of IndexReader, then the only way to get the
IndexReader to use e.g. a subclassed TermsDictReader is to subclass
IndexReader and override a factory method:

class MyIndexReader extends IndexReader {
    TermsDictReader makeTermsDictReader() {
      return (TermsDictReader) new MyTermsDictReader(invindex, snapshot);
    }
  }

  InvIndex invindex = MySchema.open("/path/to/index");
  IndexReader reader = (IndexReader) new MyIndexReader(invindex);

I was hoping to avoid forcing the user to subclass IndexReader, but I think
the need for retry logic during open() precludes that possibility.


Michael McCandless added a comment - 25/Nov/08 05:45 PM

I was hoping to avoid forcing the user to subclass IndexReader, but I
think the need for retry logic during open() precludes that
possibility.

How about the caller provides a codec instance which when asked will
return a TermsDictReader "matching" the codec that had been used to
write the index?

Then open() implements the retry logic, asking the codec to load each
part of the index?

That's roughly the approach I'm taking here (on next iteriaton of the
patch, hopefully soon), though I'm only tackling the postings now (not
yet norms, stored fields, term vectors, fields infos).


Marvin Humphrey added a comment - 25/Nov/08 05:46 PM
>> We're trying to fake up an array of strings without having to load anything
>> into process memory.

> We could do something similar in Lucene. Not creating String objects is
> nice.

OK, assume that you slurp all three files. Here's the code from above, ported
from C to Java.

while (hi >= lo) {
  int  mid           = lo + ((hi - lo) / 2);
  long midTextOffset = textLengths[mid];
  long midTextLength = textLengths[mid + 1] - midTextOffset;
  int comparison     = StringHelper.compareUTF8Bytes(
                          targetUTF8Bytes, 0, targetLength, 
                          termUTF8bytes, midTextOffset, midTextLength);
  if      (comparison < 0) { hi = mid - 1; }
  else if (comparison > 0) { lo = mid + 1; }
  else { 
    result = mid; 
    break;
  }
}
long offsetIntoMainTermDict = mainTermDictFilePointers[result];
...

Other than the slurping, the only significant difference is the need for the
comparison routine to take a byte[] array and an offset, rather than a char*
pointer.

You can also use FileChannels to memory map this stuff, right? (Have to be
careful on 32-bit systems, though.)

> B-tree or FST/trie or ... something.

Much to my regret, my tree algorithm vocabulary is limited – I haven't spent
enough time coding such projects that I can intuit sophisticated solutions.
So I'll be counting on you, Jason Rutherglen, and Eks Dev to suggest
appropriate algorithms based on your experience.

Our segment-based inverted index term dictionary has a few defining
characteristics.

First, a lot of tree algorithms are optimized to a greater or lesser extent
for insertion speed, but we hardly care about that at all. We can spend all
the cycles we need at index-time balancing nodes within a segment, and once
the tree is written out, it will never be updated.

Second, when we are writing out the term dictionary at index-time, the raw
data will be fed into the writer in sorted order as iterated values, one
term/term-info pair at a time. Ideally, the writer would be able to serialize
the tree structure during this single pass, but it could also write a
temporary file during the terms iteration then write a final file afterwards.
The main limitation is that the writer will never be able to "see" all
terms at once as an array.

Third, at read-time we're going to have one of these trees per segment. We'd
really like to be able to conflate them somehow. KinoSearch actually
implements a MultiLexicon class which keeps SegLexicons in a PriorityQueue;
MultiLexicon_Next() advances the queue to the next unique term. However,
that's slow, unwieldy, and inflexible. Can we do better?

> Actually: I just realized the terms index need not store all suffixes
> of the terms it stores. Only unique prefixes (ie a simple letter
> trie, not FST). Because, its goal is to simply find the spot in the
> main lexicon file to seek to and then scan from. This makes it even
> smaller!

It would be ideal if we could separate the keys from the values and put all
the keys in a single file.

> Though, if we want to do neat things like respelling, wildcard/prefix
> searching, etc., which reduce to graph-intersection problems, we would
> need the suffix and we would need the entire lexicon (not just every
> 128th index term) compiled into the FST.

The main purpose of breaking out a separate index structure is to avoid binary
searching over the large primary file. There's nothing special about the
extra file – in fact, it's a drawback that it doesn't include all terms. If
we can jam all the data we need to binary search against into the front of the
file, but include the data for all terms in an infrequently-accessed tail, we
win.


Marvin Humphrey added a comment - 25/Nov/08 08:27 PM
> How about the caller provides a codec instance which when asked will
> return a TermsDictReader "matching" the codec that had been used to
> write the index?

OK, it makes sense to have the user access these capabilities via a single
handle at both index-time and search-time. However, for Lucy/KS, the handle
should definitely be specified via the Schema subclass rather than via
constructor argument.

"Codec" isn't really the right name for this, though. "IndexComponent",
maybe? Lucy would have three main index components by default:
LexiconComponent, PostingsComponent, StorageComponent.

// If Lucy's Schema class were implemented in Java instead of C...
abstract class Schema extends Obj {
  LexiconComponent lexiconComponent() { return new LexiconComponent(); }
  PostingsComponent postingsComponent() { return new PostingsComponent(); }
  StorageComponent storageComponent() { return new StorageComponent(); }
  ...
}

Auxiliary IndexComponents might include TermVectorsComponent,
SortCacheComponent, ColumnStrideComponent, RTreeComponent, etc.

Here's example code for overriding the default LexiconComponent:

// Implements term dictionary as a hash table with term texts as keys.
class HashLexiconComponent extends LexiconComponent {
  LexiconReader makeReader(InvIndex invindex, Snapshot snapshot) {
    SegInfos segInfos = Snapshot.getSegInfos();
    if (segInfos.size == 1) { 
      return (LexiconReader) new SegHashLexiconReader(invindex, snapshot);
    }
    else {
      return (LexiconReader) new MultiHashLexiconReader(invindex, snapshot);
    }
  }

  LexiconWriter makeWriter(InvIndex invindex, SegInfo segInfo) {
    return (LexiconWriter) new HashLexiconWriter(invindex, segInfo);
  }
}

// [User code]
class MySchema extends Schema {
  LexiconComponent lexiconComponent() {
    return (LexiconComponent) new HashLexiconComponent();
  }
  ...
}

Marvin Humphrey added a comment - 26/Nov/08 03:37 AM
> I think, just like we are pushing for column-stride / FieldCache to be
> "per segment" instead of one big merged array, we should move in the
> same direction for searching?

Algorithmically speaking, it would definitely help this specific task, and
that's a BIG FAT PLUS. This, plus memory mapping and writing the DocID -> ord
map at index-time, allows us to totally eliminate the current cost of loading
sort caches at IndexReader startup. The question is, how easy is it to
refactor our search OO hierarchy to support it?

If our goal is minimal impact to the current model, we worry only about the
TopFieldDocs search() method. We can hack in per-segment bookending via doc
number to the hit collection routine, initializing the TopFieldDocCollector
each segment (either creating a new one or popping all the collected docs).

But does it make sense to be more aggressive? Should Searchers run hit
collection against individual segments? Should Scorers only be compiled
against single segments?

Maybe so. I implemented pruning (early termination) in KS, and it had to be
done per segment. This is because you have to sort the documents within a
segment according to the primary criteria you want to prune on (typically doc
boost). I've since ripped out that code because it was adding too much
complexity, but maybe there would have been less complexity if segments were
closer to the foreground.


Marvin Humphrey added a comment - 26/Nov/08 03:40 AM
> Well ...that cat command can be deadly for a large index, too?

It will be costly for a large index, and it wouldn't be appropriate in all
cases. The use case I was thinking of was: dedicated server with gobs of RAM.
The index could either be updated often or not updated at all. Pre-existing
segments stay warm on such a box, and the writer would leave the latest
segment hot, so the cat command would only be needed once, at the startup of
the persistent service.


Michael McCandless added a comment - 26/Nov/08 11:22 AM

OK, assume that you slurp all three files. Here's the code from above, ported
from C to Java.

Looks good!

You can also use FileChannels to memory map this stuff, right? (Have to be
careful on 32-bit systems, though.)

Yes.

First, a lot of tree algorithms are optimized to a greater or lesser extent
for insertion speed, but we hardly care about that at all. We can spend all
the cycles we need at index-time balancing nodes within a segment, and once
the tree is written out, it will never be updated.

Right, neither inserts nor deletes matter to us.

Second, when we are writing out the term dictionary at index-time, the raw
data will be fed into the writer in sorted order as iterated values, one
term/term-info pair at a time. Ideally, the writer would be able to serialize
the tree structure during this single pass, but it could also write a
temporary file during the terms iteration then write a final file afterwards.
The main limitation is that the writer will never be able to "see" all
terms at once as an array.

Lucene differs from Lucy/KS in this. For Lucene, when flushing a
new segment, we can assume you can see all Terms in RAM at once. We
don't make use of this today (it's a simple iteration that's given to
the consumer), but we could. In Lucene, when RAM is full, we flush a
real segment (but KS flushes a "run" which I think is more of a raw
dump, ie, you don't build lexicon trees during that?).

However, for both Lucene and Lucy/KS, during merging one cannot assume
the entire lexicon can be in RAM at once. But then, during merging
you could in theory merge trees not expanded terms.

I think for starters at least we should stick with the simple
shared-prefix-compression we have today.

Third, at read-time we're going to have one of these trees per segment. We'd
really like to be able to conflate them somehow. KinoSearch actually
implements a MultiLexicon class which keeps SegLexicons in a PriorityQueue;
MultiLexicon_Next() advances the queue to the next unique term. However,
that's slow, unwieldy, and inflexible. Can we do better?

Continuing the move towards pushing searching closer to the segments
(ie, using MultiSearcher instead of MultiReader), I think we should
not try to conflate the terms dict?

It would be ideal if we could separate the keys from the values and put all
the keys in a single file.

Why not inline the value with the key? The pointer to the value just
consumes extra space. I think "value" in this context is the long
offset into the main terms dict file, which then stores the "real
[opaque] value" for each term.

> Though, if we want to do neat things like respelling, wildcard/prefix
> searching, etc., which reduce to graph-intersection problems, we would
> need the suffix and we would need the entire lexicon (not just every
> 128th index term) compiled into the FST.

The main purpose of breaking out a separate index structure is to avoid binary
searching over the large primary file. There's nothing special about the
extra file - in fact, it's a drawback that it doesn't include all terms. If
we can jam all the data we need to binary search against into the front of the
file, but include the data for all terms in an infrequently-accessed tail, we
win.

And... if your terms index is in RAM, to minimize its net size and
decode cost on loading.


Michael McCandless added a comment - 26/Nov/08 11:31 AM

OK, it makes sense to have the user access these capabilities via a single
handle at both index-time and search-time. However, for Lucy/KS, the handle
should definitely be specified via the Schema subclass rather than via
constructor argument.

"Codec" isn't really the right name for this, though. "IndexComponent",
maybe? Lucy would have three main index components by default:
LexiconComponent, PostingsComponent, StorageComponent.

Well, maybe both? Ie, each of these IndexComponents could have many
different codecs to write/read the data to/from the index. So when I
implement PostingsComponent, when writing a segment I could choose my
own codec; when reading it, I retrieve the matching codec to decode
it.

Subclassing Schema seems like the right approach.


Michael McCandless added a comment - 26/Nov/08 11:38 AM

> I think, just like we are pushing for column-stride / FieldCache to be
> "per segment" instead of one big merged array, we should move in the
> same direction for searching?

Algorithmically speaking, it would definitely help this specific task, and
that's a BIG FAT PLUS. This, plus memory mapping and writing the DocID -> ord
map at index-time, allows us to totally eliminate the current cost of loading
sort caches at IndexReader startup. The question is, how easy is it to
refactor our search OO hierarchy to support it?

If our goal is minimal impact to the current model, we worry only about the
TopFieldDocs search() method. We can hack in per-segment bookending via doc
number to the hit collection routine, initializing the TopFieldDocCollector
each segment (either creating a new one or popping all the collected docs).

But does it make sense to be more aggressive? Should Searchers run hit
collection against individual segments? Should Scorers only be compiled
against single segments?

Maybe so. I implemented pruning (early termination) in KS, and it had to be
done per segment. This is because you have to sort the documents within a
segment according to the primary criteria you want to prune on (typically doc
boost). I've since ripped out that code because it was adding too much
complexity, but maybe there would have been less complexity if segments were
closer to the foreground.

I think the plus's are substantial here. Not having to materialize
one massive array of norms, of FieldCache/column-stride values, of
docID->ord values, is very important because these are at least linear
cost (more for the docID->ord) in # docs in the index. Reopening a
searcher on a large index is very costly in Lucene now because of
these materializations.

We need to think more about the tradeoffs here...


Michael McCandless added a comment - 26/Nov/08 11:45 AM

> Well ...that cat command can be deadly for a large index, too?

It will be costly for a large index, and it wouldn't be appropriate in all
cases. The use case I was thinking of was: dedicated server with gobs of RAM.
The index could either be updated often or not updated at all. Pre-existing
segments stay warm on such a box, and the writer would leave the latest
segment hot, so the cat command would only be needed once, at the startup of
the persistent service.

Ahh OK. But that cat command is basically just a different, more
global, implemenation of "warming".

So eg you'd still need to coordinate so that the new searcher isn't
used until warming finishes, right? In Lucene, since warming is explicit
and under direct programmatic control, we know when warming is done.
I guess you could also do a system call to do the cat command,
blocking cutover to the new searcher until it completes.


Marvin Humphrey added a comment - 26/Nov/08 02:04 PM
> Not having to materialize one massive array of norms, of
> FieldCache/column-stride values, of docID->ord values, is very important
> because these are at least linear cost (more for the docID->ord) in # docs
> in the index. Reopening a searcher on a large index is very costly in Lucene
> now because of these materializations.
>
> We need to think more about the tradeoffs here...

Let's continue the discussion of segment-centric searching on java-dev, since it it's
only tangentially related to flexible indexing.


Marvin Humphrey added a comment - 26/Nov/08 02:38 PM
> So eg you'd still need to coordinate so that the new searcher isn't
> used until warming finishes, right?

...

> I guess you could also do a system call to do the cat command,
> blocking cutover to the new searcher until it completes.

Warming is only needed once, at search service startup. The idea is to get
the whole index into the system IO cache.

Once all segment data is in the IO cache, we assume that it stays there,
because this is a beefy dedicated search box with more than enough RAM to fit
the entire index/shard.

Say that we add a new segment to the index, either by running an
index-writing process locally, or via rsync. (Assume for the purposes of
argument that the local indexing process doesn't require much RAM – which
is true with KS – and so won't have the side effect of nudging existing
segments out of IO cache.)

Now, say that our search service checks at the beginning of each request to
see if the index has been modified. If it has, it opens a new searcher from
scratch – which takes almost no time, because we're memory mapping rather
than slurping.

while (newRequest()) {
  if (indexHasBeenUpdated()) {
    searcher = new IndexSearcher("/path/to/index");
  }
  ...
}

After an abrupt cutover to the new searcher, we process the search request.
Is the new search sluggish in any way? No, because all the segments used
by the new searcher are "hot". Older segments are hot because they were
in use by the prior searcher, and the new segment is hot because it was
just written.

Therefore, we don't need to worry about managing cutover to a new searcher.
We can just discard the old one and replace it with the new one.


Marvin Humphrey added a comment - 26/Nov/08 05:18 PM
> Well, maybe both? Ie, each of these IndexComponents could have many
> different codecs to write/read the data to/from the index. So when I
> implement PostingsComponent, when writing a segment I could choose my
> own codec; when reading it, I retrieve the matching codec to decode
> it.

Yes, both – that sounds good. However, I'm not sure whether you're proposing
the creation of a class named "Codec", which I think we should avoid unless
all of our "codecs" can descend from it. So: PostingsCodec, TermsDictCodec
(or LexiconCodec, for Lucy/KS), and so on would be base classes.

> Subclassing Schema seems like the right approach.

Groovy. How are you going to handle it in Lucene? I think you just have to
require the end user to be consistent about supplying the necessary arguments
to the IndexReader and IndexWriter constructors.

How do we handle auxiliary IndexComponents? I've long wanted to implement an
RTreeComponent for geographic searching, so I'll use that as an example.

At index-time, I think we just create an array of SegDataWriter objects and
feed each document to each writer in turn. The SegDataWriter abstract base
class will define all the necessary abstract methods: addDoc(),
addSegment(SegReader) (for Lucy/KS), various commands related to merging (for
Lucene), finish()/close(), and so on. RTreeWriter would simply subclass
SegDataWriter.

At search-time, things get a little trickier. Say we hand our Searcher object
an RTreeRadiusQuery. At some point, the RTreeRadiusQuery will need to be
compiled to an RTreeRadiusScorer, which will involve accessing an RTreeReader
which presumably resides within an IndexReader. However, right now,
IndexReader hides all of its inner readers and provides access through
specific methods, e.g. IndexReader.document(int docNum), which ultimately
hands off to FieldsReader internally. This model doesn't scale with the
addition of arbitrary IndexComponents.

The only thing I can thing of is an IndexReader.getReader(String name) method.


Michael McCandless added a comment - 26/Nov/08 05:57 PM

Warming is only needed once, at search service startup.

Ahh, got it. Lucene must warm for each reopened searcher (though that warming cost will eventually be in proportion to what's changed in the index), but KS/Lucy should be fine doing zero warming except for the very first searcher startup (eg after rebooting the machine).


Michael McCandless added a comment - 26/Nov/08 06:10 PM

> So: PostingsCodec, TermsDictCodec (or LexiconCodec, for Lucy/KS), and
> so on would be base classes.

Right: separate codec base classes for each component. Back to the
video analogy: a typical video has a "audio" component and a "video"
component. AudioCodec would be the base class for all the various
audio codecs, and likewise for VideoCodec.

> I think you just have to require the end user to be consistent about
> supplying the necessary arguments to the IndexReader and IndexWriter
> constructors.

Right.

> How do we handle auxiliary IndexComponents? I've long wanted to implement an
> RTreeComponent for geographic searching, so I'll use that as an example.

> At index-time, I think we just create an array of SegDataWriter objects and
> feed each document to each writer in turn.

I think that's right. In Lucene we now have an indexing chain
(package private), so that you can "tap in" at whatever point is
appropriate – you could handle the whole doc yourself (like
SegDataWriter); you could be fed one field at a time; you could tap in
after inversion so you get one token at a time, etc.

> At search-time, things get a little trickier.
> ...
> The only thing I can thing of is an IndexReader.getReader(String
> name) method.

I haven't thought enough about how to handle this at search time.
IR.getReader seems fine, though, you'd need to open each
IndexComponent up front inside the retry loop, right?


Marvin Humphrey added a comment - 26/Nov/08 08:50 PM
> In Lucene we now have an indexing chain
> (package private), so that you can "tap in" at whatever point is
> appropriate - you could handle the whole doc yourself (like
> SegDataWriter); you could be fed one field at a time; you could tap in
> after inversion so you get one token at a time, etc.

That's pretty nice. It occurred to me to try something like that, but I got a
little lost.

The fact that the Doc object in KS uses the host language's native hashtable
and string implementations for field data complicates an already complicated
matter. It's hard to abstract out access to field data so that the KS/Lucy
core, which knows nothing about the host language, can see it, yet still
maintain peak performance in the addDoc() loop.

In any case, I don't anticipate intractable implementation troubles with
adding IndexComponents at index-time.

> IR.getReader seems fine, though, you'd need to open each
> IndexComponent up front inside the retry loop, right?

Sure, startup's easy. I think we just add Schema.auxiliaryComponents(),
which returns an array of IndexComponents. The default would be to return
null or an empty array, but subclasses could override it.

Where we have problems, though, is with remote searching or multi-searching.
You can't ask a Searchable for its inner IndexReader, because it might not
have one. That means that you can't "see" information pertaining to a custom
IndexComponent until you're at the level of the individual machine –
aggregate information, like docFreq across an entire collection spanning
multiple indexes, wouldn't be available to searches which use custom
components.

The only remedy would be to subclass all your Searchables – the local
IndexSearcher, the RemoteSearchable that wraps it, and the MultiSearcher that
aggregates results – to drill down into the correct IndexReader and pass data
back up the chain. Basically, you'd have to duplicate e.g. the call chain
that fetches documents.


Michael McCandless added a comment - 24/Feb/09 02:22 PM

New patch attached (still plenty more to do...):

  • Updated to current trunk (747391).
  • All tests pass, but back-compat tests don't compile...
  • Switched the new "4d iteration API" (Fields -> Terms -> Docs ->
    Positions) to subclass AttributeSource; this way codecs can add in
    their own attrs.
  • Added PostingsCodecs class, that holds all PostingCodec instances
    your index may make use of, and changed segments_N format to
    record which codec was used per segment. So, an index can have
    mixed codecs (though for a single IndexWriter session, the same
    codec is used when writing new segments).
  • I cutover TermScorer to use the new API; I still need to cutover
    other queries, segment merging, etc.

Michael McCandless added a comment - 12/Mar/09 12:51 PM
Clearing fix version.

Michael Busch added a comment - 12/Aug/09 10:13 AM
I took Mike's latest patch and updated it to current trunk.
It applies cleanly and compiles fine.

Some test cases fail. The problem is in SegmentReader in termsIndexIsLoaded() and loadTermsIndex(). I'll take a look tomorrow, I need to understand the latest changes we made in the different IndexReaders better (and now it's getting quite late here...)


Michael McCandless added a comment - 12/Aug/09 04:47 PM
Thanks for modernizing the patch Michael! I'll get back to this one soon... I'd really love to get PForDelta working as a codec. It's a great test case since it's block-based, ie, very different from the other codecs.

Michael Busch added a comment - 27/Aug/09 07:51 PM

Switches to a new more efficient terms dict format.

This is nice! Maybe we should break this whole issue into smaller pieces? We could start with the dictionary. The changes you made here are really cool already. We could further separate the actual TermsDictReader from the terms index with a clean API (I think you put actually a TODO comment into your patch). Then we can have different terms index implementations in the future, e.g. one that uses a tree. We could also make SegmentReader a bit cleaner: if opened just for merging it would not create a terms index reader at all; only if cloned for an external reader we would instantiate the terms index lazily. Currently this is done by setting the divisor to -1.


Michael Busch added a comment - 27/Aug/09 11:29 PM
In the current patch the choice of the Codec is index-wide, right? So I can't specify different codecs for different fields. Please correct me if I'm wrong.

Michael Busch added a comment - 28/Aug/09 04:06 AM
Oups, didn't want to steal this from you, Mike. Wanted to hit the "Watch" button instead...

Michael McCandless added a comment - 28/Aug/09 09:36 AM

Maybe we should break this whole issue into smaller pieces? We could start with the dictionary. The changes you made here are really cool already.

Yeah the issue is very large now. I'll think about how to break it
up.

I agree: the new default terms dict codec is a good step forward.
Rather than load a separate TermInfo instance for every indexed term
(costly in object overhead, and, because we store Term[] as well we
are wasting space storing many duplicate String field pointers in a
row), we only store the String and the long offset into the index file
as two arrays. It's a sizable memory savings for indexes with many
terms.

This was a nice side-effect of genericizing things, because the
TermInfo class had to be made private to the codec since it's storing
things like proxOffset, freqOffset, etc., which is particular to how
the Lucene's default codec stores postings.

But, it's somewhat tricky to break out only this change... eg it's
also coupled with the change to strongly separate field from term
text, and, to remove TermInfo reliance. Ie, the new terms dict has a
separate per-field class, and within that per-field class it has the
String[] termText and long[] index offsets. I guess we could make a
drop-in class that tries to emulate TermInfosReader/SegmentTermEnum
even though it separates into per-field, internally.

We could further separate the actual TermsDictReader from the terms index with a clean API (I think you put actually a TODO comment into your patch).

Actually the whole terms dict writing/reading is itself pluggable, so
your codec could provide its own. Ie, Lucene "just" needs a
FieldsConsumer (for writing) and a FieldsProducer (for reading).

But it sounds like you're proposing making a strong decoupling of
terms index from terms dict?

Then we can have different terms index implementations in the future, e.g. one that uses a tree.

+1

Or, an FST. FST is more compelling than tree since it also compresses
suffixes. FST is simply a tree in the front plus a tree in the back
(in reverse), where the "output" of a given term's details appears in
the middle, on an edge that is "unique" to each term, as you traverse
the graph.

We could also make SegmentReader a bit cleaner: if opened just for merging it would not create a terms index reader at all; only if cloned for an external reader we would instantiate the terms index lazily. Currently this is done by setting the divisor to -1.

Right. Somehow we should genericize the "I don't need the terms
index at all" when opening a SegmentReader. Passing -1 is sort of
hackish. Though I do prefer passing up front your intentions, rather
than loading lazily (LUCENE-1609).

We could eg pass "requirements" when asking the codec for the terms
dict reader. EG if I don't state that RANDOM_ACCESS is required (and
only say LINEAR_SCAN) then the codec internally can make itself more
efficient based on that.

In the current patch the choice of the Codec is index-wide, right? So I can't specify different codecs for different fields. Please correct me if I'm wrong.

The Codec is indeed index-wide, however, because the field vs term
text are strongly separated, it's completely within a Codec's control
to return a different reader/writer for different fields. So it ought
to work fine... eg one in theory could make a "PerFieldCodecWrapper".
But, I haven't yet tried this with any codecs. It would make a good
test case though... I'll write down to make a test case for this.

Also, it's fine if an index has used different codecs over time when
writing, as long as when reading you provide a PostingsCodecs
instance that's able to [correctly] retrieve those codecs to read those
segments.


Michael Busch added a comment - 31/Aug/09 08:45 AM

But it sounds like you're proposing making a strong decoupling of
terms index from terms dict?

Right.

Right. Somehow we should genericize the "I don't need the terms
index at all" when opening a SegmentReader. Passing -1 is sort of
hackish. Though I do prefer passing up front your intentions, rather
than loading lazily (LUCENE-1609).

I'm a bit confused. Doesn't the IndexWriter open SegmentReaders
usually with termsIndexDivisor=-1 for merge, and maybe later with
a termsIndexDivisor>0 when IndexWriter#getReader() is called?
That's what I meant with loading lazily.

I thought that's why it'd be good to separate the terms index from
the terms dict. For merge we'd open the dict reader only, and then
if getReader() is called we'd open the terms index reader and give
its reference to the dict reader.

I admit that I didn't follow the NRT changes as closely as I should
have, so I might be missing things here.


Michael Busch added a comment - 31/Aug/09 08:58 AM

The Codec is indeed index-wide, however, because the field vs term
text are strongly separated, it's completely within a Codec's control
to return a different reader/writer for different fields. So it ought
to work fine... eg one in theory could make a "PerFieldCodecWrapper".
But, I haven't yet tried this with any codecs. It would make a good
test case though... I'll write down to make a test case for this.

OK I see now. Did you think about possibly extending the field API
to specify the codec? And then to store the Codec name in the
fieldinfos (which we'd want to make extensible too, as briefly
discussed in LUCENE-1597) instead of the dictionary?


Michael McCandless added a comment - 31/Aug/09 09:18 AM

I'm a bit confused. Doesn't the IndexWriter open SegmentReaders
usually with termsIndexDivisor=-1 for merge, and maybe later with
a termsIndexDivisor>0 when IndexWriter#getReader() is called?
That's what I meant with loading lazily.

Right, it does. This is the one case (internal to Lucene, only) where
loading lazily is still necessary.

I thought that's why it'd be good to separate the terms index from
the terms dict. For merge we'd open the dict reader only, and then
if getReader() is called we'd open the terms index reader and give
its reference to the dict reader.

OK got it. I think this makes sense.

The separation in the current approach is already quite strong, in
that the terms dict writer/reader maintains its own String[] indexText
and long[] indexOffset and then "defers" to its child component just
what is stored in each terms dict entry. So each child can store
whatever it wants in the terms dict entry (eg the pulsing codec
inlines low-freq postings).

If we make pluggable how the indexText/indexOffset is stored/loaded in
memory/used, then we have a stronger separation/pluggability on the
index. EG even before FST for the index we should switch to blocks of
char[] instead of separate Strings, for indexText.


Michael Busch added a comment - 31/Aug/09 09:22 AM

EG even before FST for the index we should switch to blocks of
char[] instead of separate Strings, for indexText.

I totally agree. I made a similar change (from String objects to
char[] blocks) on some other code (not Lucene) and the savings
in memory and garbage collection were tremendous!


Michael McCandless added a comment - 04/Sep/09 12:10 AM
I attached a .tar.bz2 of src/* with my current state – too hard to
keep svn in sync / patchable right now. Changes:
  • Factored out the terms dict index, so it's now "pluggable" (though
    I've only created one impl, so far)
  • Cutover SegmentMerger to flex API
  • Changed terms to be stored in RAM as byte[] (not char[]), when
    reading. These are UTF8 bytes, but in theory eventually we could
    allow generic bytes here (there are not that many places that try
    to decode them as UTF8). I think this is a good step towards
    allowing generic terms. It also saves 50% RAM for simple ascii
    terms w/ the terms index.
  • Changed terms index to use shared byte[] blocks
  • Broke sources out into "codecs" subdir of oal.index. Right now I
    have "preflex" (only provides reader, to read old index format),
    "standard" (new terms dict & index, but otherwise same
    freq/prox/skip/payloads encoding), "pulsing" (inlines low-freq
    terms directly into terms dict) and "sep" (seperately stores docs,
    frq, prox, skip, payloads, as a pre-cursor to using pfor to encode
    doc/frq/prox).

The patch is very rough... core & core-test compile, but most tests
fail. It's very much still a work in progress...


Michael McCandless added a comment - 11/Sep/09 01:49 PM
New patch & src.tar.bz2 attached. All tests, including back-compat, pass.

There are still zillions of nocommits to resolve.

Some of the changes:

  • Got all tests to pass.
  • Separated out a non-enum Fields/Terms API.
  • Improved byte[] block allocation in the new terms index so that
    the blocks are shared across fields (important when there are
    zillions of fields each of which has few index terms)
  • Changed TermsEnum.docs() API to accept a new bit set interface
    (currently called Bits) skipDocs. This is towards eventual
    support for random access filters. I also added Bits
    IndexReader.getDeletedDocs().

Next step is to get the other codecs (sep, pulsing) to pass all tests,
then to make a pfor codec! I also need to perf test all of these
changes...


Yonik Seeley added a comment - 11/Sep/09 04:27 PM

Changed terms to be stored in RAM as byte[] (not char[]),

Yay! This will be important for NumericField too since it uses 7 bits per char and will probably account for the majority of terms in the index in many applications.

I attached a .tar.bz2 of src/* with my current state - too hard to keep svn in sync / patchable right now.

Could a git branch make things easier for mega-features like this?


Michael McCandless added a comment - 11/Sep/09 05:45 PM

Changed terms to be stored in RAM as byte[] (not char[]),

Yay! This will be important for NumericField too since it uses 7 bits per char and will probably account for the majority of terms in the index in many applications.

It's actually byte[] both in how the terms dict index stores the terms
in RAM (using shared byte[] blocks) and also in how terms are
represented throughout the flex API. EG TermsEnum API returns
a TermRef from its next() method. TermRef holds byte[]/offset/length.

Could a git branch make things easier for mega-features like this?

Maybe – though I don't have much experience w/ git. If people are
interested in working together on this then I think it'd be worth
exploring?


Michael McCandless added a comment - 12/Sep/09 04:41 PM
Attached patch.

All tests pass with all 3 codecs (standard = just like today's index format; pulsing = terms that occur in only 1 doc are inlined into terms dict; sep = separate files for doc, freq, prx, payload, skip data).


Jason Rutherglen added a comment - 14/Sep/09 12:17 AM
Mike,

Maybe a directed acyclic word graph would work well as an alternative dictionary implementation?


Michael McCandless added a comment - 14/Sep/09 09:24 AM

Maybe a directed acyclic word graph would work well as an alternative dictionary implementation?

I think that'd be great. In particular, an FST (DAG that shares prefix & suffix and "outputs" the per-term data in the middle of the graph) should be a good savings in most normal term distributions.

Flexible indexing makes the terms dict & terms dict index pluggable, so we are free to experiment with alternative impls. I've only taken some baby steps to improve on the current terms dict index (by switching to shared byte[] blocks, instead of a separate TermInfo / String instance per indexed term).


Michael McCandless added a comment - 23/Sep/09 05:10 PM
New patch attached. All tests pass.

I haven't quite made it to PForDelta yet, but it's very close!

The sep codec was the first step (uses separate files for doc, frq,
pos, payload, skip).

Then, in this patch, the big change was to create new
IntIndexInput/Output abstract classes, that only expose reading &
writing ints. I then fixed the sep codec to use this class for doc,
frq and pos files.

The trickiest part was abstracting away just what a "file pointer"
is. In Lucene we assume in many places this is the long file offset,
but I needed to change this to file-offset plus within-block-offset,
for int-block based files.

Once I did that, I created a FixedIntBlockIndexInput/Output, which
reads & writes the ints in blocks of a specified size. They are
abstract classes and require a subclass to do the actual encode/decode
of a given block. To test it I created a simple class that just
writes multiple vInts. All tests also pass with this newly added
("intblock") codec.

So the next step is to hook up PforDelta...


John Wang added a comment - 24/Sep/09 01:57 AM
This is awesome!
Feel free to take code from Kamikaze for the p4delta stuff.
The impl in Kamikaze assumes no decompression at load time, e.g. the Docset can be traversed in compressed form.

Michael McCandless added a comment - 24/Sep/09 10:11 AM

Feel free to take code from Kamikaze for the p4delta stuff.
The impl in Kamikaze assumes no decompression at load time, e.g. the Docset can be traversed in compressed form.

Thanks John. I've been starting with LUCENE-1410 for now, but we can easily swap in any p4 impl, or any other int compression method. All that's needed in the subclass is to implement encodeBlock (in the writer) and decodeBlock (in the reader). The intblock codec takes care of the rest.

Kamikaze looks like great stuff!

What variation on p4 is Kamikaze using?

Keeping the p4 data compressed is interesting... when you implement AND/OR/NOT on p4, do you have a shortcut that traverses the compressed form while applying the operator? Or do you do the full decode and then 2nd pass to apply the operator?


John Wang added a comment - 24/Sep/09 01:09 PM
Hi Mike:

We have been using Kamikaze in our social graph engine in addition to our search system. A person's network can be rather large, decompressing it in memory some network operation is not feasible for us, hence we made the requirement for the DocIdSetIterator to be able to walk to DocIdSet's P4Delta implementation in compressed form.

We do not decode the p4delta set and make a second pass for boolean set operations, we cannot afford it in both memory cost and latency. The P4Delta set adheres to the DocIdSet/Iterator api, and the And/Or/Not is performed on that level of abstraction using next() and skipTo methods.

-John


John Wang added a comment - 24/Sep/09 01:13 PM
Just a FYI: Kamikaze was originally started as our sandbox for Lucene contributions until 2.4 is ready. (we needed the DocIdSet/Iterator abstraction that was migrated from Solr)

It has three components:

1) P4Delta
2) Logical boolean operations on DocIdSet/Iterators (I have created a jira ticket and a patch for Lucene awhile ago with performance numbers. It is significantly faster than DisjunctionScorer)
3) algorithm to determine which DocIdSet implementations to use given some parameters, e.g. miniD,maxid,id count etc. It learns and adjust from the application behavior if not all parameters are given.

So please feel free to incorporate anything you see if or move it to contrib.


John Wang added a comment - 24/Sep/09 01:28 PM
Hi Uwe:

Thanks for the pointer to the isCacheable method. We will defn incorporate it.

-John


Michael McCandless added a comment - 25/Sep/09 05:58 PM
Attached patch. This includes the pfor impl from LUCENE-1410.

PforDelta is working! I added another codec (pfordelta). It uses the
sep codec to separately store freq, doc, pos, and then uses PforDelta
to encode the ints (as fixed-size blocks).

However, there are a couple test failures
(TestIndexWriter.testNegativePositions,
TestPositionIncrement.testPayloadsPos0) due to PforDelta not properly
encoding -1 (it's returned as 255). Lucene normally doesn't write
negative ints, except for the special case of a 0 position increment
in the initial token(s), in which case due to the bug in LUCENE-1542
we write a -1 if you've called IndexWriter.setAllowMinus1Position.
However, that's deprecated and will be removed shortly at which point
the pfordelta codec will pass all tests.


Michael McCandless added a comment - 26/Sep/09 12:52 PM
I wrote up first cut of the toplevel design of this patch, in the wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing.

John Wang added a comment - 26/Sep/09 03:32 PM
Hi Mike:

Truly awesome work!

Quick question, are codecs per index or per field? From the wiki, it seems to be per index, if so, is it possible to make it per field?

Thanks

-John


Michael McCandless added a comment - 27/Sep/09 02:19 PM
The codec is per segment. However, we ask the codec for
Terms/TermsEnum by fields, so it should be simple to make a Codec that
dispatches to field-specific Codecs.

Michael McCandless added a comment - 02/Oct/09 12:32 AM
Attached current patch. All tests pass:
  • Cutover merging to flex API.
  • Cutover FieldCache to flex API. This got tricky, because terms
    are now UTF8 byte[]. First, we have a back-compat issue (I
    changed FieldCache's parsers to take TermRef not String). Second,
    parsing float/double from byte[] is tricky. I just punted and
    made a new String(), and then called parseDouble/parseFloat, which
    is slow (but, NumericFields don't do this – they are easy to
    parse straight from byte[], I think). Net/net this should be
    faster loading the FieldCache now. Also, later we can make a
    String/StringIndex FieldCache variant that keeps things as byte[].
  • Cutover CheckIndex to flex API.
  • Removed the codec-owned extensions from IndexFileNames; added
    methods to quey a Codec for all file extensions it may write. As
    part of this there is a minor (I think) runtime change whereby
    Directory.copy or new RamDirectory(Directory) will now copy all
    files not just index-related files.

I'm now working towards getting this committable. While PforDelta
works, I think we should move its codec over to LUCENE-1410 and get it
working well, separately, after this is committed.

Still need to cutover more stuff (queries, AllTermDocs, etc.) to flex
API, get the ThreadLocal cache carried over, fix a bunch of nocommits,
remove debugging, do perf testing & fix issues, add some more tests,
etc.


Michael McCandless added a comment - 05/Oct/09 12:19 PM
New patch attached. All tests pass. The changes are mostly cutting
many things over to the flex API. Still many nocommits to address,
but I'm getting closer!

I haven't "svn up"d to all the recent the deprecations removals /
generics additions. Kinda dreading doing so I think I'll wait
until all deprecations are gone and then bite the bullet...

Cutting over all the MultiTermQuery subclasses was nice because all
the places where we get a TermEnum & iterate, checking if .field() is
still our field, are now cleaner because with the flex API the
TermsEnum you get is already only for your requested field.


Yonik Seeley added a comment - 05/Oct/09 02:59 PM
Sounding cool! I haven't had time to look at the code too much... but I j ust wanted to mention two features I've had in the back of my mind for a while that seem to have multiple use cases.

1) How many terms in a field?

  • If the tii/TermInfos were exposed, this could be estimated.
  • Perhaps this could just be stored in FieldInfos... should be easy to track during indexing?
  • MultiTermQuery could also use this to switch impls

2) Convert back and forth between a term number and a term.
Solr has code to do this... stores every 128th term in memory as an index, and uses that to convert back and forth. This is very much like the internals of TermInfos... would be nice to expose some of that.


John Wang added a comment - 05/Oct/09 03:48 PM
Hi Yonik:

These are indeed useful features. LUCENE-1922 addresses 1), perhaps, we can add 2) to the same issue to track?

Thanks

-John


Michael McCandless added a comment - 05/Oct/09 06:05 PM

1) How many terms in a field?

Actually I've already added this one (Terms.getUniqueTermCount), but I
didn't punch it through to IndexReader. I'll do that. The standard
codec (new "default" codec when writing segments) already records this
per field, so it's trivial to expose.

However, some impls may throw UOE (eg a composite IndexReader).

2) Convert back and forth between a term number and a term.

I agree this would be useful. I did have ord() in early iterations of
the TermsEnum API, but it wasn't fully implemented and I stripped it
when I switched to "just finish it already" mode We could think
about adding it back, though you'd also presumably need seek(int ord)
as well? (And docFreq(String field, int ord) sugar exposed in
IndexReader?).


Yonik Seeley added a comment - 05/Oct/09 06:27 PM

I agree this would be useful. I did have ord() in early iterations of the TermsEnum API, but it wasn't fully implemented and I stripped it when I switched to "just finish it already" mode

A "complete" implementation seems hard (i.e. across multiple segments also)... but it still seems useful even if it's only at the segment level. So perhaps just on SegmentTermEnum, and uses would have to cast to access?

Exposing the term index array (i.e. every 128th term) as an expert-subject-to-change warning would let people implement variants themselves at least.

you'd also presumably need seek(int ord)

Yep.


Grant Ingersoll added a comment - 05/Oct/09 09:59 PM
I haven't followed too closely (even though it is one of my favorite issues) but I figured while Yonik was throwing out ideas, I'd add that one of the obvious use cases for flexible indexing is altering scoring. One of the common statistics one needs to implement some more advanced scoring approaches is the average document length. Is this patch far enough along that I could take a look at it and think about how one might do this?

Mark Miller added a comment - 05/Oct/09 11:58 PM

I haven't "svn up"d to all the recent the deprecations removals / generics additions. Kinda dreading doing so '

Come on old man, stop clinging to emacs I've got a meditation technique for that

Sounds like some annoyance, and I think I made a comment there - and I'm a man of my word... or child of my word - take your pick.

To trunk. Since you likely have moved on, don't worry - this was good practice - I'll do it again sometime if you'd like. I may have mis merged something little or something. I went fairly quick (I think it took like 30 or 40 min - was hoping to do it faster, but eh - sometimes I like to grind).

I didn't really look at the code, but some stuff I noticed:

java 6 in pfor Arrays.copy

skiplist stuff in codecs still have package of index - not sure what is going on there - changed them

in IndexWriter:
+ // Mark: read twice?
segmentInfos.read(directory);
+ segmentInfos.read(directory, codecs);

Core tests pass, but I didn't wait for contrib or back compat.


Mark Miller added a comment - 06/Oct/09 04:06 AM
eh - even if you have moved on, if I'm going to put up a patch, might as well do it right - here is another:
  • removed a boatload of unused imports
  • removed DefaultSkipListWriter/Reader - I accidently put them back in
  • removed an unused field or two (not all)
  • paramaterized LegacySegmentMergeQueue.java
  • Fixed the double read I mentioned in previous comment in IndexWriter
  • TermRef defines an equals (that throws UOE) and not hashCode - early stuff I guess but odd since no class extends it. Added a hashCode that throws UOE anyway.
  • fixed bug in TermRangeTermsEnum: lowerTermRef = new TermRef(lowerTermText); to lowerTermRef = new TermRef(this.lowerTermText);
  • Fixed Remote contrib test to work with TermRef for fieldcache parser (since you don't include contrib in the tar)
  • Missed a StringBuffer to StringBuilder in MultiTermQuery.toString
  • had missed removing deprecated IndexReader.open(final Directory directory) and deprecated IndexReader.open(final IndexCommit commit)
  • Paramertized some stuff in ParrallelReader that made sense - what the heck
  • added a nocommit or two on unread fields with a comment that made it look like they were/will be used
  • Looks like SegmentTermPositions.java may have been screwy in last patch - ensure its now a deleted file - same with TermInfosWriter.java
  • You left getEnum(IndexReader reader) in the MultiTerm queries, but no in PrefixQuery - just checkin'.
  • Missed removing listAll from FileSwitchDirectory - gone
  • cleaned up some white space nothings in the patch
  • I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.
  • looks like i missed merging in a change to TestIndexWriter.java#assertNoUnreferencedFiles - done
  • doubled checked my merge work

core and contrib tests pass


Michael McCandless added a comment - 06/Oct/09 09:53 AM
Whoa thanks for the sudden sprint Mark!

Come on old man, stop clinging to emacs

Hey! I'm not so old But yeah I still cling to emacs. Hey, I know
people who still cling to vi!

I didn't really look at the code, but some stuff I noticed:

java 6 in pfor Arrays.copy

skiplist stuff in codecs still have package of index - not sure what is going on there - changed them

in IndexWriter:
+ // Mark: read twice?
segmentInfos.read(directory);
+ segmentInfos.read(directory, codecs);

Excellent catches! All of these are not right.

(since you don't include contrib in the tar)

Gak, sorry. I have a bunch of mods there, cutting over to flex API.

You left getEnum(IndexReader reader) in the MultiTerm queries, but no in PrefixQuery - just checkin'.

Woops, for back compat I think we need to leave it in (it's a
protected method), deprecated. I'll put it back if you haven't.

I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.

Eek, it shouldn't be – indeed it is. When did that happen? We
should fix this (separately from this issue!).

Do you have more fixes coming? If so, I'll let you sprint some more; else, I'll merge in, add contrib & back-compat branch, and post new patch! Thanks


Michael McCandless added a comment - 06/Oct/09 10:08 AM

One of the common statistics one needs to implement some more advanced scoring approaches is the average document length. Is this patch far enough along that I could take a look at it and think about how one might do this?

Well, thinking through how you'd do this... likely you'd want to store
the avg length (in tokens), eg as a single float per field per
segment, right? The natural place to store this would be in the
FieldInfos, I think?. Unfortunately, this patch doesn't yet add
extensibility to FieldInfos.

And you'd need a small customization to the indexing chain to
compute this when indexing new docs, which is already doable today
(though, package private).

But then on merging segments, you'd need an extensions point, which we
don't have today, to recompute the avg. Hmm: how would you handle
deleted docs? Would you want to go back to the field length for every
doc & recompute the average? (Which'd mean you need to per doc per
field length, not just the averages).

Unfortunately, this patch doesn't yet address things like customizing
what's stored in FieldInfo or SegmentInfo, nor customizing what
happens during merging (though it takes us a big step closer to this).
I think we need both of these to "finish" flexible indexing, but I'm
thinking at this point that these should really be tackled in followon
issue(s). This issue is already ridiculously massive.


Uwe Schindler added a comment - 06/Oct/09 11:43 AM

I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.

Eek, it shouldn't be - indeed it is. When did that happen? We
should fix this (separately from this issue!).

My fault, I removed it during the remove backwards tests on Saturday. If we do not remove DateTools/DateField for 3.0 (we may need to leave it in for index compatibility), I will restore, these tests, too. It's easy with TortoiseSVN and you can also preserve the history (using svn:mergeinfo prop).

I have this on my list when going forward with removing the old TokenStream API.


Michael McCandless added a comment - 06/Oct/09 12:01 PM

It's easy with TortoiseSVN and you can also preserve the history (using svn:mergeinfo prop).

Ahh – can you do this for TestBackwardsCompatibility? I restored it, but, lost all history. Thanks.


Uwe Schindler added a comment - 06/Oct/09 12:31 PM
Done. I also did it for the BW branch, but didn't create a tag yet. The next tag creation for the next bigger patch is enough (no need to do it now).

What I have done: svn copy from the older revision to the same path


Michael McCandless added a comment - 06/Oct/09 02:14 PM

What I have done: svn copy from the older revision to the same path

Excellent, thanks! It had a few problems (was still trying to deprecated APIs, some of which were gone) – I just committed fixes.


Yonik Seeley added a comment - 06/Oct/09 02:26 PM

likely you'd want to store the avg length (in tokens), eg as a single float per field per segment, right?

I think we might want to store fundamentals instead:

  • total number of tokens indexed for that field in the entire segment
  • total number of documents that contain the field in the entire segment

Both of these seem really easy to keep track of?
I also think we'd just ignore deleted docs (i.e. don't change the stats) just as idf does today.

The natural place to store this would be in the FieldInfos, I think?

yep.


Michael McCandless added a comment - 06/Oct/09 03:06 PM
Uber-patch attached: started from Mark's patch (thanks!), added my contrib & back-compat branch changes. All tests pass.

Also, I removed pfor from this issue. I'll attach the pfor codec to LUCENE-1410.

Note that I didn't use "svn move" in generating the patch, so that the patch can be applied cleanly. When it [finally] comes time to commit for real, I'll svn move so we preserve history.


Mark Miller added a comment - 07/Oct/09 01:01 AM

Hey! I'm not so old But yeah I still cling to emacs.

can you say both of those things in the same breath? Just how long did it take to get that phd...

I'd look it up and guestimate your age, but I think MIT still has my ip blocked from back when I was applying to colleges. So I'm going with the "uses emacs" guestimate.

Hey, I know people who still cling to vi!

vi is the only one I can half way use - I know 3 commands - edit mode, leave edit mode, and save. And every now and then I accidently delete a whole line. When I make a change that I don't want to save, I have to kill the power.

The patch is in a bit of an unpatchable state I think I know what editor to blame...Pico!

Our old friend, the $id is messing up WildcardTermEnum - no problem, I can fix that...

But also, NumericUtils is unpatched, Codec is missing, along with most of the classes from the codecs packages! This looks like my work

My only conclusion is that your one of those guys that can write the whole program once without even running it - and then it works perfectly on the first go. Thats the only way I can explain those classes in the wrong package previously as well No bug hunting tonight


Mark Miller added a comment - 07/Oct/09 01:15 AM - edited
nope - something else - looking through the patch I see the files I want - a second attempt at patching has gone over better.

A couple errors still, but stuff I think I can fix so that I can at least look over. False alarm. My patcher wonked out or something. I can resolve the few errors that popped up this time. Sweet.

edit

Just for reference - not sure what happened the first time - my patch preview looked the same both times (was only complaining about the $id), but completely failed on attempt one and worked on attempt two - the only issue now appears to be you have half switch deletedDocs to Bits from BitVector - but only have way, so its broken in a dozen places. Not sure what you are doing about size() and what not, so I'm just gonna read around.

edit

Yes - I found it - BitVector was supposed to implement Bits - which was in the patch ... this patch just did not want to apply. I guess it was right, but Eclipse just did not want it to take ...


Mark Miller added a comment - 07/Oct/09 02:02 AM
Bah - all this huffing an puffing over the patch and I'm too sick to stay up late anyway.

Have you started benching at all? I'm seeing like a 40-50% drop in same reader search benches with standard, sep, and pulsing. Like 80% with intblock.


Michael McCandless added a comment - 07/Oct/09 09:28 AM
Mark is there anything wrong w/ the patch? Did you get it working?

Have you started benching at all? I'm seeing like a 40-50% drop in same reader search benches with standard, sep, and pulsing. Like 80% with intblock.

I haven't but it sounds like you have! I'll get to it soon... but one thing I know is missing is the equivalent of the "terminfo cache" so that when a query 1) looks up docFreq of the term (to compute its weight), and 2) looks up the freq/prox offsets, that 2nd lookup is cached.

IntBlock is expected to be slow – it naively encodes one int at a time using vInt. Ie, it's just a "test" codec, meant to be the base for real block-based codecs like pfor.


Mark Miller added a comment - 07/Oct/09 12:01 PM

Mark is there anything wrong w/ the patch? Did you get it working?

I got it working - it didn't apply cleanly, but perhaps that was just me. It was a weird situation - I get a preview of whats going to happen with complaints, and it only complained about the $id issue in wildcardtermenum - the half the patch failed. A second attempt and it only complained about that again - but then it missed making BitVector implement Bits - could just be ghosts in my machine. I wouldn't worry about it till someone else complains. In any case, I got it working in my case by just fixing the $id issue and adding implements Bits to BitVector.


Mark Miller added a comment - 07/Oct/09 03:30 PM

I haven't but it sounds like you have!

Nothing serious Just began trying to understand the code a bit more, so started with playing around with the different Codecs. Which lead to just quickly trying out the micro bench with each of em.


Michael McCandless added a comment - 07/Oct/09 04:01 PM
New patch attached. All tests pass.

I simplified the TermsEnum.seek API, and added ord to the API. The
ord is a long, but the standard codec (and, I think, Lucene today)
internally use an int...


Yonik Seeley added a comment - 08/Oct/09 09:28 PM - edited
Another for theTermsEnum wishlist: the ability to seek to the term before the given term... useful for finding the largest value in a field, etc.

I imagine "at or before" semantics would also work (like the current semantics of TermEnum in reverse)


Mark Miller added a comment - 08/Oct/09 09:53 PM - edited
Okay, I just tried a toy cache with standard - its not perfect because the tests have a bunch that end up finding one doc short, and I don't turn off the cache for any reason (the old one terns it off when returning the segmenttermenum, but I didn't even try to understand that with the new stuff). But that appears to get the majority of the perf back. Went from about 3500 r/s to 7500 - the old is 8400.

This stuff is so cool by the way.

edit

whew - emphasis on toy - its hard to do this right with docsreader


Michael McCandless added a comment - 09/Oct/09 12:06 AM

its hard to do this right with docsreader

I was thinking something along the lines of adding a "captureState" to DocsProducer.Reader, that returns an opaque object, and then adding a corresponding seek that accepts that object. It would chain to the positions reader.

Then StandardTermsDictReader would hold the thread private cache, using this API.


Mark Miller added a comment - 09/Oct/09 12:47 AM
Well thats reassuring - I think I was on the right path then. I've got the thread private cache, and I was initially just capturing in's position so I could set it before calling readTerm after pulling from the cache - so I knew I had an issue with the positions reader in there too (the position of it in readTerm) - but didn't see the cleanest path to set and capture that without modifying the reader like you said - but I wasn't even sure I was on the right path, so thats about where I gave up

Your comment makes me feel a little less dumb about it all though.


Michael McCandless added a comment - 09/Oct/09 09:54 AM
No problem Please post the patch once you have it working! We'll need to implement captureState/seek for the other codes too. The pulsing case will be interesting since it's state will hold the actual postings for the low freq case.

BTW I think an interesting codec would be one that pre-loads postings into RAM, storing them uncompressed (eg docs/positions as simple int[]) or slightly compressed (stored as packed bits). This should be a massive performance win at the expense of sizable RAM consumption, ie it makes the same tradeoff as contrib/memory and contrib/instantiated.


Michael McCandless added a comment - 09/Oct/09 10:26 AM

Another for theTermsEnum wishlist: the ability to seek to the term before the given term... useful for finding the largest value in a field, etc.
I imagine "at or before" semantics would also work (like the current semantics of TermEnum in reverse)

Right now seek(TermRef seekTerm) stops at the earliest term that's >=
seekTerm.

It sounds like you're asking for a variant of seek that'd stop at the
latest term that's <= seekTerm?

How would you use this to seek to the last term in a field? With the
flex API, the TermsEnum only works with a single field's terms. So I
guess we'd need TermRef constants, eg TermRef.FIRST and TermRef.LAST,
that "act like" -infinity / +infinity.


Michael McCandless added a comment - 09/Oct/09 12:07 PM
Actually, FIRST/LAST could be achieved with seek-by-ord (plus getUniqueTermCount()). Though that'd only work for TermsEnum impls that support ords.

Yonik Seeley added a comment - 09/Oct/09 01:39 PM

How would you use this to seek to the last term in a field?

It's not just last in a field, since one may be looking for last out of any given term range (the highest value of a trie int is not the last value encoded in that field).
So if you had a trie based field, one would find the highest value via seekAtOrBefore(triecoded(MAXINT))

Actually, FIRST/LAST could be achieved with seek-by-ord (plus getUniqueTermCount()).

Ahhh... right, prev could be implemented like so:

int ord = seek(triecoded(MAXINT))).ord
seek(ord-1)

Though that'd only work for TermsEnum impls that support ords.

As long as ord is supported at the segment level, it's doable.


Mark Miller added a comment - 09/Oct/09 03:24 PM - edited
hmm - I think I'm close. Everything passes except for omitTermsTest, LazyProxTest, and for some odd reason the multi term tests. Getting close though.

My main concern at the moment is the state capturing. It seems I have to capture the state before readTerm in next() - but I might not use that state if there are multiple next calls before the hit. So thats a lot of wasted capturing. Have to deal with that somehow.

Doing things more correctly like this, the gain is much less significant. What really worries me is that my hack test was still slower than the old - and that skipped a bunch of necessary work, so its almost a better than best case here - I think you might need more gains elsewhere to get back up to speed.

edit

Hmm - still no equivalent of the cached enum for one I guess.
And at the least, since you only cache when the scan is great than one, you can at least skip one capture there...


Michael McCandless added a comment - 09/Oct/09 03:54 PM

It seems I have to capture the state before readTerm in next()

Wait, how come? It seems like we should only cache if we find exactly the requested term (ie, where we return SeekStatus.FOUND)? So you should only have to capture the state once, there?

Hmm I wonder whether we should also cache the seek(ord) calls?