|
[
Permlink
| « Hide
]
Eks Dev added a comment - 18/Jul/08 10:42 PM
first cut
Eks Dev made changes - 18/Jul/08 10:42 PM
Ok ok. I'll start working on adding a Filter as a clause to BooleanQuery. Will take some time though, there's a holiday coming up.
Thanks eks, that was fast – I think you set a new record!
The patch looks good, though we definitely need some solid unit tests One question I have: right now if a single field has mixed true/false With this patch we still store the *.prx bytes for a field with I'd also be curious about what gains in index size & filter
Michael McCandless made changes - 19/Jul/08 09:32 AM
Thanks Mike, with just a little bit more hand-holding we are going to be there
I think I have *.prx IO excluded in case omitTf==true, please have a look, this part is really not an easy one (*Merger). Also, now if a single field has mixed true/false for omitTf, I set it to true. One unit test is already there, basic use case works, but the test has to cover a bit more
Eks Dev made changes - 19/Jul/08 12:48 PM
Eks Dev made changes - 20/Jul/08 08:31 PM
OK good progress eks!
I started from your latest patch and made some further changes:
The one place I know of that will still waste bytes is the term dict Unfortunately, I think it's too big a change to try to fix this now; I
Michael McCandless made changes - 21/Jul/08 10:32 AM
Great, it is already more than I expected, even indexing is going to be somewhat faster.
I have tried your patch on smallish index with 8Mio documents and it worked on our regression test without problems. "The one place I know of that will still waste bytes is the term dict About this one, it would be nice not to store this as well, but I think the pointers are already reduced to one byte, as they are 0 for these cases (are they,?) So we have this benefit without expecting it And yes, more "column stride" is great, if you followed my comments on LUCENE-1278, that would mean we could easily "inline" very short postings into term dict (here I expect huge performance benefit, as skip() on another large file is going to be saved independent from omitTf(true)), without increase in size (or minimal) of tii (no locality penalty) If we follow Zipfian distribution, there is a lot of terms with postings shorter than e.g. 16 ... Thanks again for your support, without you this patch would be just another nice idea
Ahh, right. The delta between the proxPointers are written as vlong's. Since the delta will be zero it's now only 1 byte; only a bit worse than 0 bytes
Yes, this looks like it would be a win for cases that need to visit the postings for many small terms. I attached a new rev of the patch:
As a test, I indexed full wikipedia (~3.2 million docs) with this alg: analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker docs.file=/Volumes/External/lucene/wiki.txt doc.stored = false doc.term.vector = false doc.add.log.step=10000 max.field.length=2147483647 directory=FSDirectory autocommit=false compound=false doc.maker.forever = false work.dir=/lucene/work2 ram.flush.mb=64 - CreateIndex { "AddDocs" AddDoc > : * - CloseIndex RepSumByPrefRound AddDoc With tf's it takes 970 seconds and index size is 2.5 GB. Without tf's
Michael McCandless made changes - 24/Jul/08 04:01 PM
One more thing here: since the tiis are loaded into RAM, that unused proxPointer wastes 8 bytes for each indexed terms. For indices with alot of terms this can add up to alot of wasted ram. But still I think we should wait and fix this as part of flexible indexing, when we maybe refactor the TermInfos to be "column stride" instead. Attached patch that also includes fixes to fileformat.{xml,html,pdf}.
Michael McCandless made changes - 24/Jul/08 04:31 PM
we finished our tests
Index without omitTf() :
Queries under test:
Test scope, regression with 30k Queries on the same index with omitTf(true/false). Result:
-Indexing performance (FSDisk!) 13% faster Also, we compared omitTf(false) with this patch and lucene.jar without this patch, no changes whatsoever. From my perspective, this is good to go into production. At least for our usage of lucene, there are no differences with homitTf(true)... >One more thing here: since the tiis are loaded into RAM, that unused proxPointer wastes 8 bytes for each indexed terms. For indices with alot of terms this can add up to alot of wasted ram. But still I think we should wait and fix this as part of flexible indexing, when we maybe refactor the TermInfos to be "column stride" instead. I am more than happy with the results, no need to squeeze the last bit out of it right now. Mike, thanks again for the great work! OK that sounds like a healthy test.
Thank you for the sudden burst of effort to make this happen! So I think this is ready to commit. I'll wait a few days and then commit... I note a change to Fieldable... sigh... Back compatibility fails. Ugh.
Me thinks we should either rework Fieldable as we've previously discussed, or we mark it as being one of the very few classes in LUcene that is subject to change between releases. ouch! it is kind of getting personal between me and Fieldable
Due to Fieldable (things really important, at lest to me):
it would be possible somehow to make it at AbstractField levele and instanceoff at a few places, but I simply hate to do it (I will patch my local copy, this issue is worth to me... must branch off from the trunk for the first time, sigh) funny it is, I see no reason to have anything but AbstractField (Field/Fieldable are just redundant) Yeah, it's one of my biggest regrets in Lucene (yes, I am responsible for it), yet I firmly believe there is a way to do interfaces and abstracts in a proper way in Java.
We could make LazyField extend AbstractField, I think, but it's not clear, as there are some differences between the two, mostly around construction. I'd have to go back and review again. That being said, I still think if there is one place where we should allow breaking the back compat. contract, it is Fieldable! For every rule, there is an exception, right? I thinnk we could, w/ sufficient warning, tell people that we are changing the interface. I am willing to bet that the number of people that would be effected by that would be less than 10. So, please don't give up on this patch. I am totally 100% for it. I think it makes total sense to do. Another option is to speed up going towards 3.0 > I firmly believe there is a way to do interfaces and abstracts in a proper way in Java.
Personally, I've given up on interfaces for stuff with more than one method with at most one parameter. Ditch the interface and move on. that sound like consensus
in that case
no need to regret Grant, if you do nothing you make no mistakes... Interfaces are ok, as long as you can tell what they are going to be doing in next 5 years... this forces you to design "for the future"... something we cannot afford in so popular and complex libraries like lucene at places like Field. Abstract* is equally good design-abstraction... Proposal: from 3.0 on, I could very well live without it, until then, we cause 5 minutes work for people that implement Fieldable on their own and want to stay up to date with the trunk. It is fair deal for everyone and lucene moves forward... Sigh, I too missed that we broke back-compatibility.
But I agree: let's mark Fieldable interface as being allowed to change from release to release (consciously make an exception to back compatibility requirements). Let's also transition away from interface for Field, for 3.0 EG we last had discussions on this, here: OK, I think we should call a vote on it, as it is significant enough in my mind. I will write it up.
Michael McCandless made changes - 05/Aug/08 10:31 AM
Michael McCandless made changes - 05/Aug/08 05:15 PM
Michael McCandless made changes - 05/Aug/08 05:18 PM
Michael McCandless made changes - 11/Oct/08 12:49 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||