|
Michael McCandless had some nice ideas on how to make getValue() change performance penalty for legacy usage negligible, this patch includes them:
Hmm ... one problem is Fieldable is an interface, and this patch adds methods to the interface, which I believe breaks our backwards compatibility requirement.
I do not know for sure if this is something we could not live with. Adding new interface sounds equally bad, would work nicely, but I do not like it as it makes code harder to follow with too many interfaces ... I'll have another look at it to see if there is a way to do it without interface changes. Any ideas?
this one keeps addition of new methods localized to AbstractField, does not change Fieldable interface... it looks like it could work done this way with a few instanceof checks in FieldsWriter, This one has dependency on
it will not give you any benefit if you directly implement your Fieldable without extending AbstractField, therefore I would suggest to eventually change Fieldable to support all these methods that operate with offset/length. Or someone clever finds some way to change an interface without braking backwards compatibility OK I updated the patch:
Eks can you see if the changes look OK? Thanks. Alas, I'm not really happy with introducing this API at the
AbstractField level and not in Fieldable. It's awkward that we've deprecated binaryValue() in AbstractField and not in Fieldable. But, I think it's our only way forward with this issue without breaking backwards compatibility. In 3.0 I'd like to at least promote this API up into Fieldable, but What I'd really like to do in 3.0 is change Fieldable to not be an Question: could we simply move forward without Fieldable? Ie, I guess I don't really understand the need for Fieldable. In fact I >>Eks can you see if the changes look OK? Thanks.
It looks perfect, you have brought it to the "commit ready status" already. I will it try it on our production mirror a bit later today and report back if something goes wrong. >>I guess I don't really understand the need for Fieldable. In fact I I am with you 100% here, It looks to me as well that one concrete class could replace it all. But... maybe someone kicks-in with some god arguments why we have it that way.
all test pass, (as expected, only minor diff to take2 version. much like the initial version ) Thanks Eks – this looks good. I think it's ready to commit, once
Great Mike,
it gets better and better, i saw Now it remains only to make one nice patch that enables us to pass our own byte[] for retrieving stored fields during search. I was thinking along the lines of things you did in Analyzers. we could pool the same trick for this, eg. Field Document.getBinaryValue(String FIELD_NAME, Field destination); Field already has all access methods (get/set), the contract would be: If destination==null, new one will be created and returned, if not we use this one and returne the same object back. The method should check if byte[] is big enough, if not simple growth policy can be there. This way we avoid new byte[] each time you fetch stored field.. I did not look exactly at code now, but the last time I was looking into it it looked as quite simple to do something along these lines. Do you have some ideas how we could do it better? Just simple calculation in my case, thanks again to all "workers" behind this greet peace of software... PS: I need to find some time to peek at paul's work in LUVENE -1345 and my wish list will be complete, at least for now (at least until you get your magic with flexi index format done Mike,
This new patch includes take3 and adds the following: Fieldable Document.getStoredBinaryField(String name, byte[] scratch); where scratch param represents user byte buffer that will be used in case it is big enough, if not, it will be simply allocated like today. If scratch is used, you get the same object through Fieldable.getByteValue() for this to work, I added one new method in Fieldable the only interesting implementation is in LazyField The reason for this is in my previous comment this does not affect issues from take3 at all, but is dependant on it, as you need to know the length of byte[] you read. take3 remains good to commit, I just did not know how to make one isolated patch with only these changes without too much work in text editor Eks, could we instead add this to Field:
byte[] binaryValue(byte[] result) and then default the current binaryValue() to just call binaryValue(null)? And similar in Document add: byte[] getBinaryValue(String name, byte[] result) These would work the same way that TokenStream.next(Token result) works, ie, the method should try to use the result passed in, if it works, and return that; else it's free to allocate its own new byte[] and return it? And then only LazyField's implementation of binaryValue(byte[] result) would use byte[] result if it's large enough? Also ... it'd be nice to have a way to do this re-use in the non-lazy case. Ie somehow load a stored doc, but passing in your own Document result which would attempt to re-use the Field instances & byte[] for the binary fields. But we should open another issue to explore that...
byte[] binaryValue(byte[] result) this is exactly where I started, but then realized I am missing actual length we read in LazyField, without it you would have to relocate each time, except in case where your buffer length equals toRead in LazyField... simply, the question is, how the caller of byte[] getBinaryValue(String name, byte[] result) could know what is the length in this returned byte[] Am I missing something obvious?
Right. This also gets back to the fact that the Document you retrieve should probably be different than the Document that you get by loading the stored fields. Some sort of lower level callback interface to populate a Document might even eliminate the need for some of the FieldSelector stuff... or at least it would mostly be independent of the field reading code and users could create more advanced implementations.
Duh, right. Though: couldn't you just call document.getFieldable(name), and then call binaryValue(byte[] result) on that Fieldable, and then get the length from it (getBinaryLength()) too? (Trying to minimize API changes).
This sounds interesting... but how would you re-use your own byte[] with this approach?
sure, good tip, I this could work. No need to have this byte[]->Fieldable-byte[] loop, it confuses. I have attached patch that uses this approach. But I created getBinaryValue(byte[]) instead of binaryValue(byte[]) as we have binaryValue() as deprecated method (would be confusing as well). Not really tested, but looks simple enough Just thinking aloud OK this patch looks good! I plan to commit in a day or two.
I completely agree! OK, this one took a rather circuitous route but it is now committed! Thanks Eks.
how was it: "repetitio est mater studiorum"
thanks Mike! ----- Original Message ---- Send instant messages to your online friends http://uk.messenger.yahoo.com Eks Dev: out of curiosity, did you ever measure the before/after performance difference? If so, what numbers did you see?
sure we did, it's been a while we measured it so I do not have the real numbers at hand. But for both cases (indexing and fetching stored binary field) it showed up during profiling as the only easy quick-win(s) we could make . We index very short documents and indexing speed per thread before this patch was is in 7.5k documents/ second range, after it we run it with the patch over 9.5-10K/Second, sweet... for searching, I do not not remember the numbers, but it was surely above 5% range (try to allocate 12Mb in 6k objects per second as unnecessary addition and you will see it |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
some polish needed and probably more testing, TODOs:
should be enough to get us started.
the only "pseudo-issue" I see is that
public byte[] binaryValue(); now creates byte[] and copies content into it, reference to original array can be now fetched via getBinaryValue() method... this is to preserve compatibility as users expect compact, zero based array from this method and we keep offset/length in Field now
this is "pseudo issue" as users already should have a reference to this array, so this method is rather superfluous for end users.