|
[
Permlink
| « Hide
]
Mark Miller added a comment - 18/Nov/08 02:40 PM
Hmmm...I think something is missing - FormatPostingsPositionsReader?
Woops, sorry... I was missing a bunch of files. Try this one?
The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us rely upon the operating system's virtual memory and avoid caching in process memory altogether?
Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block, we'd need a status byte indicating whether the block contains a continuation from the previous block in order to handle cases where term length exceeds the block size. For Lucy/KinoSearch our plan would be to mmap() on the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek – scanning through the primary term dictionary – involves comparatively more processor power than this, the increased costs would be acceptable. Advantages:
Interesting! I've been wondering what you're up to over on KS, Marvin I'm not sure it'll be a win in practice: I'm not sure I'd trust the We could also explore something in-between, eg it'd be nice to I'd like to eventually make the TermsDict index pluggable so one could [Attached patch] To test whether the new pluggable codec approach is flexible enough, I It was wonderfully simple The pulsing codec can "wrap" any other codec, ie, when a term is All tests pass with this pulsing codec. As a quick test I indexed first 1M docs from Wikipedia, with N=2 (ie I'll look into this patch soon.
Just wanted to say: I'm really excited about the progress here, this is cool stuff! > I'm not sure I'd trust the OS's IO cache to "make the right decisions" about what to cache.
In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard – in which case we won't have to worry about the OS swapping out those pages. I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems – if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised. In other words...
More on designing with modern virtual memory in mind at <http://varnish.projects.linpro.no/wiki/ArchitectNotes > Plus during that binary search the IO system is loading whole pages into I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer. But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that. Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case. > We could also explore something in-between, eg it'd be nice to That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. > I'd like to eventually make the TermsDict index pluggable so one could If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.
Hmm, +1 for generalizing the MultiLevelSkipListWriter/Reader so that we can re-use it for different (custom) posting-list formats easily.
I agree, btree is a better fit, though we don't need insertion & deletion operations since each segment is write once.
These are the two extremes, but, I think most common are all the apps
This is a good read, but I find it overly trusting of VM. How can the VM system possibly make good decisions about what to swap From Java we have even more ridiculous problems: sometimes the OS I think we need to aim for consistency: a given search should not I've seen my desktop OS (Mac OS X 10.5.5, based on FreeBSD) make Similarly, when a BG merge is burning through data, or say backup I bet the VM system fails to show graceful degradation: if I don't In the ideal world, an IndexReader would be told how much RAM to use. From Java we could try using WeakReference but I fear the
I'm not convinced this'll be a win in practice. You are now paying an In Lucene java, the concurrency model we are aiming for is a single
Have you tried any actual tests swapping these approaches in as your
That's a nice goal. Our biggest cost in Lucene is warming the
I think if we make the pluggable API simple, and capture the
In my approach here, the blob is opaque to the terms dict reader: it > Take a large Jira instance, where the app itself is also
> consuming alot of RAM, doing alot of its own IO, etc., where perhaps > searching is done infrequently enough relative to other operations > that the OS may no longer think the pages you hit for the terms index > are hot enough to keep around. Search responsiveness is already compromised in such a situation, because we Nevertheless, the terms index isn't that big in comparison to, say, the size > Similarly, when a BG merge is burning through data, or say backup kicks off When that background merge finishes, the new files will be hot. So, if we Even better, any IO caches for old segments used by the previous IndexReader The terms index could indeed get evicted some of the time on busy systems, but As far as backup daemons blowing up everybody's cache, that's stupid, >> But hey, we can simplify even further! How about dispensing with the index I'm persuaded that we shouldn't do away with the terms index. Even if we're Nathan Kurz and I brainstormed this subject in a phone call this morning, and
Since the integers are already expanded and the raw UTF-8 data can be compared > In Lucene java, the concurrency model we are aiming for is a single JVM When I mentioned this to Nate, he remarked that we're using the OS kernel like We don't keep a single IndexReader around, but we do keep the bulk of its data > I do agree, if fork() is the basis of your concurrency model then sharing Lucy/KS can't enforce that, and we wouldn't want to. It's very convenient to > Have you tried any actual tests swapping these approaches in as your No – changing something like this requires a lot of coding, so it's better to > Tests of fully hot and fully cold ends of the >> That doesn't meet the design goals of bringing the cost of opening/warming Exactly. It would be nice to add a plug-in indexing component that writes sort Hmm, maybe we can conflate this with a column-stride field writer and require > In my approach here, the blob is opaque to the terms dict reader: it Sounds good. Basically, a hash lookup. In KS, the relevant IndexReader methods no longer take a Term object. (In Lexicon* I suppose we genericize this by adding a TermsDictReader/LexReader argument to OK I created another codec, SepCodec (for lack of a better name) that The code is still messy – lots of nocommits all over the place. I'm Finally, this gets us one step closer to using PFOR! With this codec, This codec was more interesting because it adds new files to the file In this patch I also created a PostingCodec class, with the 3
By editing the PostingCodec.getCodec method you can switch all tests I built the 1M Wikipedia index, using SepCodec. Here's the ls -l: -rw-rw-rw- 1 mike admin 4000004 Nov 20 17:16 _0.fdt -rw-rw-rw- 1 mike admin 8000004 Nov 20 17:16 _0.fdx -rw-rw-rw- 1 mike admin 303526787 Nov 20 17:34 _n.doc -rw-rw-rw- 1 mike admin 33 Nov 20 17:30 _n.fnm -rw-rw-rw- 1 mike admin 220470670 Nov 20 17:34 _n.frq -rw-rw-rw- 1 mike admin 3000004 Nov 20 17:34 _n.nrm -rw-rw-rw- 1 mike admin 651670377 Nov 20 17:34 _n.prx -rw-rw-rw- 1 mike admin 0 Nov 20 17:30 _n.pyl -rw-rw-rw- 1 mike admin 84963104 Nov 20 17:34 _n.skp -rw-rw-rw- 1 mike admin 666999 Nov 20 17:34 _n.tii -rw-rw-rw- 1 mike admin 87551274 Nov 20 17:34 _n.tis -rw-rw-rw- 1 mike admin 20 Nov 20 17:34 segments.gen -rw-rw-rw- 1 mike admin 64 Nov 20 17:34 segments_2 Some initial observations for SepCodec:
Be careful: it's the seeking that kills you (until we switch to SSDs Potentially worse than the terms index are norms, if the search hits
If the term index and norms are pinned (or happen to still be hot), I I think at both extremes (way too litle RAM and tons of RAM) both
I think you're just more trusting of the IO/VM system. I think LRU is
Excellent! If only more people knew about this. And, if only we
I don't fully understand this approach. Would the index file pointers We currently materialize individual Strings when we load our index,
Agreed. But once you've got the mmap-based solution up and running
True!
It seems like the ability to very quickly launch brand new searchers If we fix terms index to bulk load arrays (it's not now) then the cost
Except I would have IndexReader use its RAM budget to pick & choose
Yes I think column-stride fields writer should write the docID -> ord
But you must at least require these Obj's to know how to compareTo one
Right; that's what let me create the PulsingCodec here. The biggest problem with the "load important stuff into RAM" approach, > Nathan Kurz and I brainstormed this subject in a phone call this morning, and
> we came up with a three-file lexicon index design: > > I don't fully understand this approach. Would the index file pointers > point into the full lexicon's packed utf8 file, or a separate "only > terms in the index" packed utf8 file? Just the index terms (i.e. every 128th term). We're trying to fake up an /* self->text_lengths, self->char_data, and self->lex_file_ptrs are all * memory mapped buffers. */ while (hi >= lo) { const i32_t mid = lo + ((hi - lo) / 2); const i64_t offset = self->text_lengths[mid]; const i64_t mid_len = self->text_lengths[mid + 1] - offset; char *const mid_text = self->char_data + offset; const i32_t comparison = StrHelp_string_diff(target_text, target_len, mid_text, mid_len); if (comparison < 0) { hi = mid - 1; } else if (comparison > 0) { lo = mid + 1; } else { result = mid; break; } } offset_into_main_lexicon = self->lex_file_ptrs[result] ... However, perhaps some sort of a B-tree with string prefix compression would be >> Hmm, maybe we can conflate this with a column-stride field writer
>> and require that sort fields have a fixed width? > > Yes I think column-stride fields writer should write the docID -> ord > part of StringIndex to disk, and MultiRangeQuery in > then use it. With enumerated type of fields (far fewer unique terms > than docs), bit packing will make them compact. How do you plan on dealing with the ord values changing as segments get I was planning on having SortCacheWriter write the out the docID -> ord In theory, if we were to have only per-segment docID -> ord maps, we could Unfortunately, that won't work because segment boundaries are hidden from >> In KS, the relevant IndexReader methods no longer take a Term Yes. > Does this mean using per-field custom sort ordering (collator) is That's one objective. The implementation is incomplete. Another objective is to allow non-string term types, e.g. TimeStamp, >> I suppose we genericize this by adding a TermsDictReader/LexReader
>> argument to the IndexReader constructor? That way, someone can >> supply a custom subclass that knows how to decode custom dictionary >> files. > > Right; that's what let me create the PulsingCodec here. I'm running into an OO design problem because of the SegmentReader/MultiReader reader = new IndexReader(termsDictReader, postingsReader, fieldsReader);
We can't do that, though, because there's logic in IndexReader.open() which Using setters a la reader.setTermsDictReader(termsDictReader) is problematic Are factory methods the only way to handle adding or replacing components KS forces people to subclass Schema to define their index, but up till now > Be careful: it's the seeking that kills you (until we switch to SSDs Perhaps for such situations, we can make it possible to create custom Consider the tradeoffs. On the one hand, if we rely on memory mapped buffers, > It seems like the ability to very quickly launch brand new searchers Near-realtime search is one of the motivations. But lightweight IndexReaders Elaborate pre-warming rituals are necessary with heavy IndexReaders whenever With lightweight IndexReaders, you can check whether the index has been Lightweight IndexReaders can also be sprinkled liberally around source code in Launching cheap search processes is also important when writing tools akin to And so on... The fact that segment data files are never modified once written > The biggest problem with the "load important stuff into RAM" approach, We can't realistically pin pages from C, either, at least on Unixen. Modern Also, there aren't any madvise() flags that hint to the OS that the mapped
OK this makes sense. We could do something similar in Lucene. Not
B-tree or FST/trie or ... something. Actually: I just realized the terms index need not store all suffixes Though, if we want to do neat things like respelling, wildcard/prefix
This is a big challenge – presenting a merged docID->ord map for a I think, just like we are pushing for column-stride / FieldCache to be Ie, if one did all searching with MultiSearcher, it should work well. For the RangeFilter impl in
Neat!
I think you "just" have to have "index version data" that's If init'ing each sub-component is then costly (opening files,
I think this is a fabulous solution. If you make things so pluggable
Well ...that cat command can be deadly for a large index, too? You've
This is indeed nice. I think the two approaches boil down to "pay up
[BTW the ZFS filesystem gets many of its nice properties for the same Lucene java takes advantage of that 'write once' nature during
Alas. The only fallback is gross system-level tunings ("swappiness" on Linux Or also a silly "keep warm" thread... > I think you "just" have to have "index version data" that's
> collectively read/written, atomically, and is then used to init all > the components. This is what segments_N is in Lucene (and I think > "Schema" is in KS/Lucy?): it contains all details that all > sub-components need. The equivalent to segments_N in KinoSearch is snapshot_N.meta, which is KinoSearch::Schema is for defining your index: global field properties, class MySchema extends Schema { class URLField extends TextField { boolean analyzed() { return false; } boolean indexed() { return false; } } void initFields() { addField("title", "text"); addField("content", "text"); addField("url", new URLField()); } Analyzer analyzer() { return new PolyAnalyzer("en"); } } I anticipate that Lucy will adopt both Schema and Snapshot in some form, but > If init'ing each sub-component is then costly (opening files, So, something like this prospective Lucy code? (Lucy with Java bindings, that is.) MySchema schema = new MySchema(); Snapshot snapshot = new Snapshot((Schema)schema); snapShot.readSnapShot("/path/to/index"); MyTermsDictReader termsDictReader = new MyTermsDictReader(schema, snapshot); IndexReader reader = new IndexReader(schema, snapshot, null, null, (TermsDictReader)termsDictReader); What if index files get deleted out from under that code block? The user will
OK got it.
I would think this "openReader" method would live inside Lucy/KS, and > I would think this "openReader" method would live inside Lucy/KS, and
> would in fact implement its own retry logic (to load the next snapshot > and try again). I must be missing some part of the question here... If the retry code lives inside of IndexReader, then the only way to get the class MyIndexReader extends IndexReader { TermsDictReader makeTermsDictReader() { return (TermsDictReader) new MyTermsDictReader(invindex, snapshot); } } InvIndex invindex = MySchema.open("/path/to/index"); IndexReader reader = (IndexReader) new MyIndexReader(invindex); I was hoping to avoid forcing the user to subclass IndexReader, but I think
How about the caller provides a codec instance which when asked will Then open() implements the retry logic, asking the codec to load each That's roughly the approach I'm taking here (on next iteriaton of the >> We're trying to fake up an array of strings without having to load anything
>> into process memory. > We could do something similar in Lucene. Not creating String objects is OK, assume that you slurp all three files. Here's the code from above, ported while (hi >= lo) { int mid = lo + ((hi - lo) / 2); long midTextOffset = textLengths[mid]; long midTextLength = textLengths[mid + 1] - midTextOffset; int comparison = StringHelper.compareUTF8Bytes( targetUTF8Bytes, 0, targetLength, termUTF8bytes, midTextOffset, midTextLength); if (comparison < 0) { hi = mid - 1; } else if (comparison > 0) { lo = mid + 1; } else { result = mid; break; } } long offsetIntoMainTermDict = mainTermDictFilePointers[result]; ... Other than the slurping, the only significant difference is the need for the You can also use FileChannels to memory map this stuff, right? (Have to be > B-tree or FST/trie or ... something. Much to my regret, my tree algorithm vocabulary is limited – I haven't spent Our segment-based inverted index term dictionary has a few defining First, a lot of tree algorithms are optimized to a greater or lesser extent Second, when we are writing out the term dictionary at index-time, the raw Third, at read-time we're going to have one of these trees per segment. We'd > Actually: I just realized the terms index need not store all suffixes It would be ideal if we could separate the keys from the values and put all > Though, if we want to do neat things like respelling, wildcard/prefix The main purpose of breaking out a separate index structure is to avoid binary > How about the caller provides a codec instance which when asked will
> return a TermsDictReader "matching" the codec that had been used to > write the index? OK, it makes sense to have the user access these capabilities via a single "Codec" isn't really the right name for this, though. "IndexComponent", // If Lucy's Schema class were implemented in Java instead of C... abstract class Schema extends Obj { LexiconComponent lexiconComponent() { return new LexiconComponent(); } PostingsComponent postingsComponent() { return new PostingsComponent(); } StorageComponent storageComponent() { return new StorageComponent(); } ... } Auxiliary IndexComponents might include TermVectorsComponent, Here's example code for overriding the default LexiconComponent: // Implements term dictionary as a hash table with term texts as keys. class HashLexiconComponent extends LexiconComponent { LexiconReader makeReader(InvIndex invindex, Snapshot snapshot) { SegInfos segInfos = Snapshot.getSegInfos(); if (segInfos.size == 1) { return (LexiconReader) new SegHashLexiconReader(invindex, snapshot); } else { return (LexiconReader) new MultiHashLexiconReader(invindex, snapshot); } } LexiconWriter makeWriter(InvIndex invindex, SegInfo segInfo) { return (LexiconWriter) new HashLexiconWriter(invindex, segInfo); } } // [User code] class MySchema extends Schema { LexiconComponent lexiconComponent() { return (LexiconComponent) new HashLexiconComponent(); } ... } > I think, just like we are pushing for column-stride / FieldCache to be
> "per segment" instead of one big merged array, we should move in the > same direction for searching? Algorithmically speaking, it would definitely help this specific task, and If our goal is minimal impact to the current model, we worry only about the But does it make sense to be more aggressive? Should Searchers run hit Maybe so. I implemented pruning (early termination) in KS, and it had to be > Well ...that cat command can be deadly for a large index, too?
It will be costly for a large index, and it wouldn't be appropriate in all
Looks good!
Yes.
Right, neither inserts nor deletes matter to us.
Lucene differs from Lucy/KS in this. For Lucene, when flushing a However, for both Lucene and Lucy/KS, during merging one cannot assume I think for starters at least we should stick with the simple
Continuing the move towards pushing searching closer to the segments
Why not inline the value with the key? The pointer to the value just
And... if your terms index is in RAM, to minimize its net size and
Well, maybe both? Ie, each of these IndexComponents could have many Subclassing Schema seems like the right approach.
I think the plus's are substantial here. Not having to materialize We need to think more about the tradeoffs here...
Ahh OK. But that cat command is basically just a different, more So eg you'd still need to coordinate so that the new searcher isn't > Not having to materialize one massive array of norms, of
> FieldCache/column-stride values, of docID->ord values, is very important > because these are at least linear cost (more for the docID->ord) in # docs > in the index. Reopening a searcher on a large index is very costly in Lucene > now because of these materializations. > > We need to think more about the tradeoffs here... Let's continue the discussion of segment-centric searching on java-dev, since it it's > So eg you'd still need to coordinate so that the new searcher isn't
> used until warming finishes, right? ... > I guess you could also do a system call to do the cat command, Warming is only needed once, at search service startup. The idea is to get Once all segment data is in the IO cache, we assume that it stays there, Say that we add a new segment to the index, either by running an Now, say that our search service checks at the beginning of each request to while (newRequest()) { if (indexHasBeenUpdated()) { searcher = new IndexSearcher("/path/to/index"); } ... } After an abrupt cutover to the new searcher, we process the search request. Therefore, we don't need to worry about managing cutover to a new searcher. > Well, maybe both? Ie, each of these IndexComponents could have many
> different codecs to write/read the data to/from the index. So when I > implement PostingsComponent, when writing a segment I could choose my > own codec; when reading it, I retrieve the matching codec to decode > it. Yes, both – that sounds good. However, I'm not sure whether you're proposing > Subclassing Schema seems like the right approach. Groovy. How are you going to handle it in Lucene? I think you just have to How do we handle auxiliary IndexComponents? I've long wanted to implement an At index-time, I think we just create an array of SegDataWriter objects and At search-time, things get a little trickier. Say we hand our Searcher object The only thing I can thing of is an IndexReader.getReader(String name) method.
Ahh, got it. Lucene must warm for each reopened searcher (though that warming cost will eventually be in proportion to what's changed in the index), but KS/Lucy should be fine doing zero warming except for the very first searcher startup (eg after rebooting the machine).
Right: separate codec base classes for each component. Back to the
Right.
I think that's right. In Lucene we now have an indexing chain
I haven't thought enough about how to handle this at search time. > In Lucene we now have an indexing chain
> (package private), so that you can "tap in" at whatever point is > appropriate - you could handle the whole doc yourself (like > SegDataWriter); you could be fed one field at a time; you could tap in > after inversion so you get one token at a time, etc. That's pretty nice. It occurred to me to try something like that, but I got a The fact that the Doc object in KS uses the host language's native hashtable In any case, I don't anticipate intractable implementation troubles with > IR.getReader seems fine, though, you'd need to open each Sure, startup's easy. I think we just add Schema.auxiliaryComponents(), Where we have problems, though, is with remote searching or multi-searching. The only remedy would be to subclass all your Searchables – the local New patch attached (still plenty more to do...):
I took Mike's latest patch and updated it to current trunk.
It applies cleanly and compiles fine. Some test cases fail. The problem is in SegmentReader in termsIndexIsLoaded() and loadTermsIndex(). I'll take a look tomorrow, I need to understand the latest changes we made in the different IndexReaders better (and now it's getting quite late here...) Thanks for modernizing the patch Michael! I'll get back to this one soon... I'd really love to get PForDelta working as a codec. It's a great test case since it's block-based, ie, very different from the other codecs.
This is nice! Maybe we should break this whole issue into smaller pieces? We could start with the dictionary. The changes you made here are really cool already. We could further separate the actual TermsDictReader from the terms index with a clean API (I think you put actually a TODO comment into your patch). Then we can have different terms index implementations in the future, e.g. one that uses a tree. We could also make SegmentReader a bit cleaner: if opened just for merging it would not create a terms index reader at all; only if cloned for an external reader we would instantiate the terms index lazily. Currently this is done by setting the divisor to -1. In the current patch the choice of the Codec is index-wide, right? So I can't specify different codecs for different fields. Please correct me if I'm wrong.
Oups, didn't want to steal this from you, Mike. Wanted to hit the "Watch" button instead...
Yeah the issue is very large now. I'll think about how to break it I agree: the new default terms dict codec is a good step forward. This was a nice side-effect of genericizing things, because the But, it's somewhat tricky to break out only this change... eg it's
Actually the whole terms dict writing/reading is itself pluggable, so But it sounds like you're proposing making a strong decoupling of
+1 Or, an FST. FST is more compelling than tree since it also compresses
Right. Somehow we should genericize the "I don't need the terms We could eg pass "requirements" when asking the codec for the terms
The Codec is indeed index-wide, however, because the field vs term Also, it's fine if an index has used different codecs over time when
Right.
I'm a bit confused. Doesn't the IndexWriter open SegmentReaders I thought that's why it'd be good to separate the terms index from I admit that I didn't follow the NRT changes as closely as I should
OK I see now. Did you think about possibly extending the field API
Right, it does. This is the one case (internal to Lucene, only) where
OK got it. I think this makes sense. The separation in the current approach is already quite strong, in If we make pluggable how the indexText/indexOffset is stored/loaded in
I totally agree. I made a similar change (from String objects to I attached a .tar.bz2 of src/* with my current state – too hard to
keep svn in sync / patchable right now. Changes:
The patch is very rough... core & core-test compile, but most tests New patch & src.tar.bz2 attached. All tests, including back-compat, pass.
There are still zillions of nocommits to resolve. Some of the changes:
Next step is to get the other codecs (sep, pulsing) to pass all tests,
Yay! This will be important for NumericField too since it uses 7 bits per char and will probably account for the majority of terms in the index in many applications.
Could a git branch make things easier for mega-features like this?
It's actually byte[] both in how the terms dict index stores the terms
Maybe – though I don't have much experience w/ git. If people are Attached patch.
All tests pass with all 3 codecs (standard = just like today's index format; pulsing = terms that occur in only 1 doc are inlined into terms dict; sep = separate files for doc, freq, prx, payload, skip data). Mike,
Maybe a directed acyclic word graph would work well as an alternative dictionary implementation?
I think that'd be great. In particular, an FST (DAG that shares prefix & suffix and "outputs" the per-term data in the middle of the graph) should be a good savings in most normal term distributions. Flexible indexing makes the terms dict & terms dict index pluggable, so we are free to experiment with alternative impls. I've only taken some baby steps to improve on the current terms dict index (by switching to shared byte[] blocks, instead of a separate TermInfo / String instance per indexed term). New patch attached. All tests pass.
I haven't quite made it to PForDelta yet, but it's very close! The sep codec was the first step (uses separate files for doc, frq, Then, in this patch, the big change was to create new The trickiest part was abstracting away just what a "file pointer" Once I did that, I created a FixedIntBlockIndexInput/Output, which So the next step is to hook up PforDelta...
Thanks John. I've been starting with LUCENE-1410 for now, but we can easily swap in any p4 impl, or any other int compression method. All that's needed in the subclass is to implement encodeBlock (in the writer) and decodeBlock (in the reader). The intblock codec takes care of the rest. Kamikaze looks like great stuff! What variation on p4 is Kamikaze using? Keeping the p4 data compressed is interesting... when you implement AND/OR/NOT on p4, do you have a shortcut that traverses the compressed form while applying the operator? Or do you do the full decode and then 2nd pass to apply the operator? Hi Mike:
We have been using Kamikaze in our social graph engine in addition to our search system. A person's network can be rather large, decompressing it in memory some network operation is not feasible for us, hence we made the requirement for the DocIdSetIterator to be able to walk to DocIdSet's P4Delta implementation in compressed form. We do not decode the p4delta set and make a second pass for boolean set operations, we cannot afford it in both memory cost and latency. The P4Delta set adheres to the DocIdSet/Iterator api, and the And/Or/Not is performed on that level of abstraction using next() and skipTo methods. -John Just a FYI: Kamikaze was originally started as our sandbox for Lucene contributions until 2.4 is ready. (we needed the DocIdSet/Iterator abstraction that was migrated from Solr)
It has three components: 1) P4Delta So please feel free to incorporate anything you see if or move it to contrib. Attached patch. This includes the pfor impl from LUCENE-1410.
PforDelta is working! I added another codec (pfordelta). It uses the However, there are a couple test failures I wrote up first cut of the toplevel design of this patch, in the wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing
The codec is per segment. However, we ask the codec for
Terms/TermsEnum by fields, so it should be simple to make a Codec that dispatches to field-specific Codecs. Attached current patch. All tests pass:
I'm now working towards getting this committable. While PforDelta Still need to cutover more stuff (queries, AllTermDocs, etc.) to flex New patch attached. All tests pass. The changes are mostly cutting
many things over to the flex API. Still many nocommits to address, but I'm getting closer! I haven't "svn up"d to all the recent the deprecations removals / Cutting over all the MultiTermQuery subclasses was nice because all Sounding cool! I haven't had time to look at the code too much... but I j ust wanted to mention two features I've had in the back of my mind for a while that seem to have multiple use cases.
1) How many terms in a field?
2) Convert back and forth between a term number and a term. Hi Yonik:
These are indeed useful features. LUCENE-1922 addresses 1), perhaps, we can add 2) to the same issue to track? Thanks -John
Actually I've already added this one (Terms.getUniqueTermCount), but I However, some impls may throw UOE (eg a composite IndexReader).
I agree this would be useful. I did have ord() in early iterations of
A "complete" implementation seems hard (i.e. across multiple segments also)... but it still seems useful even if it's only at the segment level. So perhaps just on SegmentTermEnum, and uses would have to cast to access? Exposing the term index array (i.e. every 128th term) as an expert-subject-to-change warning would let people implement variants themselves at least.
Yep. I haven't followed too closely (even though it is one of my favorite issues) but I figured while Yonik was throwing out ideas, I'd add that one of the obvious use cases for flexible indexing is altering scoring. One of the common statistics one needs to implement some more advanced scoring approaches is the average document length. Is this patch far enough along that I could take a look at it and think about how one might do this?
Come on old man, stop clinging to emacs Sounds like some annoyance, and I think I made a comment there - and I'm a man of my word... or child of my word - take your pick. To trunk. Since you likely have moved on, don't worry - this was good practice - I'll do it again sometime if you'd like. I may have mis merged something little or something. I went fairly quick (I think it took like 30 or 40 min - was hoping to do it faster, but eh - sometimes I like to grind). I didn't really look at the code, but some stuff I noticed: java 6 in pfor Arrays.copy skiplist stuff in codecs still have package of index - not sure what is going on there - changed them in IndexWriter: Core tests pass, but I didn't wait for contrib or back compat. eh - even if you have moved on, if I'm going to put up a patch, might as well do it right - here is another:
core and contrib tests pass Whoa thanks for the sudden sprint Mark!
Hey! I'm not so old
Excellent catches! All of these are not right.
Gak, sorry. I have a bunch of mods there, cutting over to flex API.
Woops, for back compat I think we need to leave it in (it's a
Eek, it shouldn't be – indeed it is. When did that happen? We Do you have more fixes coming? If so, I'll let you sprint some more; else, I'll merge in, add contrib & back-compat branch, and post new patch! Thanks
Well, thinking through how you'd do this... likely you'd want to store And you'd need a small customization to the indexing chain to But then on merging segments, you'd need an extensions point, which we Unfortunately, this patch doesn't yet address things like customizing
My fault, I removed it during the remove backwards tests on Saturday. If we do not remove DateTools/DateField for 3.0 (we may need to leave it in for index compatibility), I will restore, these tests, too. It's easy with TortoiseSVN and you can also preserve the history (using svn:mergeinfo prop). I have this on my list when going forward with removing the old TokenStream API.
Ahh – can you do this for TestBackwardsCompatibility? I restored it, but, lost all history. Thanks. Done. I also did it for the BW branch, but didn't create a tag yet. The next tag creation for the next bigger patch is enough (no need to do it now).
What I have done: svn copy from the older revision to the same path
Excellent, thanks! It had a few problems (was still trying to deprecated APIs, some of which were gone) – I just committed fixes.
I think we might want to store fundamentals instead:
Both of these seem really easy to keep track of?
yep. Uber-patch attached: started from Mark's patch (thanks!), added my contrib & back-compat branch changes. All tests pass.
Also, I removed pfor from this issue. I'll attach the pfor codec to LUCENE-1410. Note that I didn't use "svn move" in generating the patch, so that the patch can be applied cleanly. When it [finally] comes time to commit for real, I'll svn move so we preserve history.
can you say both of those things in the same breath? Just how long did it take to get that phd... I'd look it up and guestimate your age, but I think MIT still has my ip blocked from back when I was applying to colleges. So I'm going with the "uses emacs" guestimate.
vi is the only one I can half way use - I know 3 commands - edit mode, leave edit mode, and save. And every now and then I accidently delete a whole line. When I make a change that I don't want to save, I have to kill the power. The patch is in a bit of an unpatchable state Our old friend, the $id is messing up WildcardTermEnum - no problem, I can fix that... But also, NumericUtils is unpatched, Codec is missing, along with most of the classes from the codecs packages! This looks like my work My only conclusion is that your one of those guys that can write the whole program once without even running it - and then it works perfectly on the first go. Thats the only way I can explain those classes in the wrong package previously as well nope - something else - looking through the patch I see the files I want - a second attempt at patching has gone over better.
A couple errors still, but stuff I think I can fix so that I can at least look over. False alarm. My patcher wonked out or something. I can resolve the few errors that popped up this time. Sweet. edit Just for reference - not sure what happened the first time - my patch preview looked the same both times (was only complaining about the $id), but completely failed on attempt one and worked on attempt two - the only issue now appears to be you have half switch deletedDocs to Bits from BitVector - but only have way, so its broken in a dozen places. Not sure what you are doing about size() and what not, so I'm just gonna read around. edit Yes - I found it - BitVector was supposed to implement Bits - which was in the patch ... this patch just did not want to apply. I guess it was right, but Eclipse just did not want it to take ... Bah - all this huffing an puffing over the patch and I'm too sick to stay up late anyway.
Have you started benching at all? I'm seeing like a 40-50% drop in same reader search benches with standard, sep, and pulsing. Like 80% with intblock. Mark is there anything wrong w/ the patch? Did you get it working?
I haven't but it sounds like you have! I'll get to it soon... but one thing I know is missing is the equivalent of the "terminfo cache" so that when a query 1) looks up docFreq of the term (to compute its weight), and 2) looks up the freq/prox offsets, that 2nd lookup is cached. IntBlock is expected to be slow – it naively encodes one int at a time using vInt. Ie, it's just a "test" codec, meant to be the base for real block-based codecs like pfor.
I got it working - it didn't apply cleanly, but perhaps that was just me. It was a weird situation - I get a preview of whats going to happen with complaints, and it only complained about the $id issue in wildcardtermenum - the half the patch failed. A second attempt and it only complained about that again - but then it missed making BitVector implement Bits - could just be ghosts in my machine. I wouldn't worry about it till someone else complains. In any case, I got it working in my case by just fixing the $id issue and adding implements Bits to BitVector.
Nothing serious New patch attached. All tests pass.
I simplified the TermsEnum.seek API, and added ord to the API. The Another for theTermsEnum wishlist: the ability to seek to the term before the given term... useful for finding the largest value in a field, etc.
I imagine "at or before" semantics would also work (like the current semantics of TermEnum in reverse) Okay, I just tried a toy cache with standard - its not perfect because the tests have a bunch that end up finding one doc short, and I don't turn off the cache for any reason (the old one terns it off when returning the segmenttermenum, but I didn't even try to understand that with the new stuff). But that appears to get the majority of the perf back. Went from about 3500 r/s to 7500 - the old is 8400.
This stuff is so cool by the way. edit whew - emphasis on toy - its hard to do this right with docsreader
I was thinking something along the lines of adding a "captureState" to DocsProducer.Reader, that returns an opaque object, and then adding a corresponding seek that accepts that object. It would chain to the positions reader. Then StandardTermsDictReader would hold the thread private cache, using this API. Well thats reassuring - I think I was on the right path then. I've got the thread private cache, and I was initially just capturing in's position so I could set it before calling readTerm after pulling from the cache - so I knew I had an issue with the positions reader in there too (the position of it in readTerm) - but didn't see the cleanest path to set and capture that without modifying the reader like you said - but I wasn't even sure I was on the right path, so thats about where I gave up
Your comment makes me feel a little less dumb about it all though. No problem
BTW I think an interesting codec would be one that pre-loads postings into RAM, storing them uncompressed (eg docs/positions as simple int[]) or slightly compressed (stored as packed bits). This should be a massive performance win at the expense of sizable RAM consumption, ie it makes the same tradeoff as contrib/memory and contrib/instantiated.
Right now seek(TermRef seekTerm) stops at the earliest term that's >= It sounds like you're asking for a variant of seek that'd stop at the How would you use this to seek to the last term in a field? With the Actually, FIRST/LAST could be achieved with seek-by-ord (plus getUniqueTermCount()). Though that'd only work for TermsEnum impls that support ords.
It's not just last in a field, since one may be looking for last out of any given term range (the highest value of a trie int is not the last value encoded in that field).
Ahhh... right, prev could be implemented like so: int ord = seek(triecoded(MAXINT))).ord
As long as ord is supported at the segment level, it's doable. hmm - I think I'm close. Everything passes except for omitTermsTest, LazyProxTest, and for some odd reason the multi term tests. Getting close though.
My main concern at the moment is the state capturing. It seems I have to capture the state before readTerm in next() - but I might not use that state if there are multiple next calls before the hit. So thats a lot of wasted capturing. Have to deal with that somehow. Doing things more correctly like this, the gain is much less significant. What really worries me is that my hack test was still slower than the old - and that skipped a bunch of necessary work, so its almost a better than best case here - I think you might need more gains elsewhere to get back up to speed. edit Hmm - still no equivalent of the cached enum for one I guess.
Wait, how come? It seems like we should only cache if we find exactly the requested term (ie, where we return SeekStatus.FOUND)? So you should only have to capture the state once, there? Hmm I wonder whether we should also cache the seek(ord) calls? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||