Issue Details (XML | Word | Printable)

Key: LUCENE-1219
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael McCandless
Reporter: Eks Dev
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

support array/offset/ length setters for Field with binary data

Created: 11/Mar/08 11:32 AM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: Index
Affects Version/s: None
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1219.extended.patch 2008-08-09 01:15 PM Eks Dev 17 kB
Text File Licensed for inclusion in ASF works LUCENE-1219.extended.patch 2008-08-08 01:53 PM Eks Dev 18 kB
Text File Licensed for inclusion in ASF works LUCENE-1219.patch 2008-03-12 10:02 AM Eks Dev 8 kB
Text File Licensed for inclusion in ASF works LUCENE-1219.patch 2008-03-11 09:53 PM Eks Dev 8 kB
Text File Licensed for inclusion in ASF works LUCENE-1219.patch 2008-03-11 02:07 PM Eks Dev 8 kB
Text File Licensed for inclusion in ASF works LUCENE-1219.patch 2008-03-11 01:33 PM Eks Dev 8 kB
Text File Licensed for inclusion in ASF works LUCENE-1219.take2.patch 2008-03-13 09:34 AM Michael McCandless 15 kB
Text File Licensed for inclusion in ASF works LUCENE-1219.take3.patch 2008-08-03 09:03 PM Eks Dev 16 kB
Issue Links:
Blocker
 

Lucene Fields: Patch Available, New
Resolution Date: 18/Aug/08 10:32 AM


 Description  « Hide
currently Field/Fieldable interface supports only compact, zero based byte arrays. This forces end users to create and copy content of new objects before passing them to Lucene as such fields are often of variable size. Depending on use case, this can bring far from negligible performance improvement.

this approach extends Fieldable interface with 3 new methods
getOffset(); gettLenght(); and getBinaryValue() (this only returns reference to the array)



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Eks Dev added a comment - 11/Mar/08 01:22 PM
all tests pass with this patch.
some polish needed and probably more testing, TODOs:
  • someone pedantic should check if these new set / get methods should be named better
  • check if there are more places where this new feature cold/should be used, I think I have changed all of them but one place, direct subclass FieldForMerge in FieldsReader, this is the code I do not know so I did not touch it...
  • javadoc is poor

should be enough to get us started.

the only "pseudo-issue" I see is that
public byte[] binaryValue(); now creates byte[] and copies content into it, reference to original array can be now fetched via getBinaryValue() method... this is to preserve compatibility as users expect compact, zero based array from this method and we keep offset/length in Field now
this is "pseudo issue" as users already should have a reference to this array, so this method is rather superfluous for end users.


Eks Dev made changes - 11/Mar/08 01:22 PM
Field Original Value New Value
Attachment LUCENE-1219.patch [ 12377612 ]
Eks Dev made changes - 11/Mar/08 01:32 PM
Attachment LUCENE-1219.patch [ 12377612 ]
Eks Dev made changes - 11/Mar/08 01:33 PM
Attachment LUCENE-1219.patch [ 12377614 ]
Michael McCandless made changes - 11/Mar/08 01:37 PM
Assignee Michael McCandless [ mikemccand ]
Eks Dev added a comment - 11/Mar/08 02:07 PM
Michael McCandless had some nice ideas on how to make getValue() change performance penalty for legacy usage negligible, this patch includes them:
  • deprecates getValue() method
  • returns direct reference if offset==0 && length == data.length

Eks Dev made changes - 11/Mar/08 02:07 PM
Attachment LUCENE-1219.patch [ 12377616 ]
Michael McCandless added a comment - 11/Mar/08 06:40 PM
Hmm ... one problem is Fieldable is an interface, and this patch adds methods to the interface, which I believe breaks our backwards compatibility requirement.

Eks Dev added a comment - 11/Mar/08 08:48 PM
I do not know for sure if this is something we could not live with. Adding new interface sounds equally bad, would work nicely, but I do not like it as it makes code harder to follow with too many interfaces ... I'll have another look at it to see if there is a way to do it without interface changes. Any ideas?

Eks Dev added a comment - 11/Mar/08 09:53 PM
this one keeps addition of new methods localized to AbstractField, does not change Fieldable interface... it looks like it could work done this way with a few instanceof checks in FieldsWriter, This one has dependency on LUCENE-1217

it will not give you any benefit if you directly implement your Fieldable without extending AbstractField, therefore I would suggest to eventually change Fieldable to support all these methods that operate with offset/length. Or someone clever finds some way to change an interface without braking backwards compatibility


Eks Dev made changes - 11/Mar/08 09:53 PM
Attachment LUCENE-1219.patch [ 12377643 ]
Eks Dev made changes - 11/Mar/08 09:59 PM
Link This issue is blocked by LUCENE-1217 [ LUCENE-1217 ]
Eks Dev added a comment - 12/Mar/08 10:02 AM
latest patch updated to the trunk (Lucene-1217 is there. Michael you did not mark it as resolved.)

Eks Dev made changes - 12/Mar/08 10:02 AM
Attachment LUCENE-1219.patch [ 12377681 ]
Michael McCandless added a comment - 13/Mar/08 09:34 AM
OK I updated the patch:
  • Added a ctor to Field to create binary fields with length &
    offset
  • Added a test case
  • Regularized whitespace
  • Renamed things:
    getLength -> getBinaryLength
    getOffset -> getBinaryOffset
    dataOffset -> binaryOffset
    dataLength -> binaryLength

Eks can you see if the changes look OK? Thanks.


Michael McCandless made changes - 13/Mar/08 09:34 AM
Attachment LUCENE-1219.take2.patch [ 12377760 ]
Michael McCandless added a comment - 13/Mar/08 09:35 AM
Alas, I'm not really happy with introducing this API at the
AbstractField level and not in Fieldable. It's awkward that we've
deprecated binaryValue() in AbstractField and not in Fieldable. But,
I think it's our only way forward with this issue without breaking
backwards compatibility.

In 3.0 I'd like to at least promote this API up into Fieldable, but
even that is somewhat messy because I think in 3.0 we would then
deprecate binaryValue() and move these 3 new methods up from
AbstractField.

What I'd really like to do in 3.0 is change Fieldable to not be an
abstract base class instead.

Question: could we simply move forward without Fieldable? Ie,
deprecate Fieldable right now and state that the migration path is
"you should subclass from AbstractField"? I would leave "implements
Fieldable" in AbstractField now, but remove it in 3.0. As far as I
can tell, all uses of Fieldable in Lucene are also using
AbstractField.

I guess I don't really understand the need for Fieldable. In fact I
also don't really understand why we even needed to add AbstractField.
Why couldn't FieldForMerge and LazyField subclass Field? It's
somewhat awkward now because we have newly added APIs to Field, like
setValue, which probably should have been added to Fieldable.


Eks Dev added a comment - 13/Mar/08 11:43 AM
>>Eks can you see if the changes look OK? Thanks.
It looks perfect, you have brought it to the "commit ready status" already.
I will it try it on our production mirror a bit later today and report back if something goes wrong.

>>I guess I don't really understand the need for Fieldable. In fact I
also don't really understand why we even needed to add AbstractField.

I am with you 100% here, It looks to me as well that one concrete class could replace it all. But... maybe someone kicks-in with some god arguments why we have it that way.


Eks Dev added a comment - 03/Aug/08 09:03 PM
  • updated this patch to apply to trunk
  • implemented abstract getBinary***() methods in Fieldable, and removed a few ugly instanceof AbstractField from a few places (introduced by previous versions of this patch. This was there due to the assumption that Fieldable should stay unchanged...)

all test pass, (as expected, only minor diff to take2 version. much like the initial version )


Eks Dev made changes - 03/Aug/08 09:03 PM
Attachment LUCENE-1219.take3.patch [ 12387432 ]
Michael McCandless made changes - 05/Aug/08 10:30 AM
Link This issue is blocked by LUCENE-1349 [ LUCENE-1349 ]
Michael McCandless added a comment - 05/Aug/08 10:51 AM
Thanks Eks – this looks good. I think it's ready to commit, once LUCENE-1349 is in.

Eks Dev added a comment - 05/Aug/08 07:53 PM
Great Mike,
it gets better and better, i saw LUCENE-1340 committed. Thanks to you Grant, Doug and all others that voted for 1349 this happened so quickly. Trust me, these two issues are really making my life easier. I pushed decision to add new hardware to some future point (means, save customer's money now)... a few weeks later would be too late.

Now it remains only to make one nice patch that enables us to pass our own byte[] for retrieving stored fields during search. I was thinking along the lines of things you did in Analyzers.

we could pool the same trick for this, eg.

Field Document.getBinaryValue(String FIELD_NAME, Field destination);

Field already has all access methods (get/set),

the contract would be: If destination==null, new one will be created and returned, if not we use this one and returne the same object back. The method should check if byte[] is big enough, if not simple growth policy can be there. This way we avoid new byte[] each time you fetch stored field..

I did not look exactly at code now, but the last time I was looking into it it looked as quite simple to do something along these lines. Do you have some ideas how we could do it better?

Just simple calculation in my case,
average Hits count is around 200, for each hit we have to fetch one stored field where we do some post-processing, re-scoring and whatnot. Currently we run max 30 rq/second , with average document length of 2k you lend at 2K * 200 * 30 = 6000 object allocations per second totaling 12Mb ... only to get the data... I can imagine people with much longer documents (that would be typical lucene use case) where it gets worse... simply reducing gc() pressure with really small amount of work. I am sure this would have nice effects on some other use cases in lucene.

thanks again to all "workers" behind this greet peace of software...
eks

PS: I need to find some time to peek at paul's work in LUVENE -1345 and my wish list will be complete, at least for now (at least until you get your magic with flexi index format done


Eks Dev added a comment - 08/Aug/08 01:53 PM
Mike,
This new patch includes take3 and adds the following:

Fieldable Document.getStoredBinaryField(String name, byte[] scratch);

where scratch param represents user byte buffer that will be used in case it is big enough, if not, it will be simply allocated like today. If scratch is used, you get the same object through Fieldable.getByteValue()

for this to work, I added one new method in Fieldable
abstract Fieldable getBinaryField(byte[] scratch);

the only interesting implementation is in LazyField

The reason for this is in my previous comment

this does not affect issues from take3 at all, but is dependant on it, as you need to know the length of byte[] you read. take3 remains good to commit, I just did not know how to make one isolated patch with only these changes without too much work in text editor


Eks Dev made changes - 08/Aug/08 01:53 PM
Attachment LUCENE-1219.extended.patch [ 12387817 ]
Michael McCandless added a comment - 08/Aug/08 03:41 PM
Eks, could we instead add this to Field:

byte[] binaryValue(byte[] result)

and then default the current binaryValue() to just call binaryValue(null)?

And similar in Document add:

byte[] getBinaryValue(String name, byte[] result)

These would work the same way that TokenStream.next(Token result) works, ie, the method should try to use the result passed in, if it works, and return that; else it's free to allocate its own new byte[] and return it?

And then only LazyField's implementation of binaryValue(byte[] result) would use byte[] result if it's large enough?

Also ... it'd be nice to have a way to do this re-use in the non-lazy case. Ie somehow load a stored doc, but passing in your own Document result which would attempt to re-use the Field instances & byte[] for the binary fields. But we should open another issue to explore that...


Eks Dev added a comment - 08/Aug/08 08:08 PM

could we instead add this to Field:

byte[] binaryValue(byte[] result)

this is exactly where I started, but then realized I am missing actual length we read in LazyField, without it you would have to relocate each time, except in case where your buffer length equals toRead in LazyField... simply, the question is, how the caller of byte[] getBinaryValue(String name, byte[] result) could know what is the length in this returned byte[]

Am I missing something obvious?


Yonik Seeley added a comment - 08/Aug/08 08:25 PM

Also ... it'd be nice to have a way to do this re-use in the non-lazy case. Ie somehow load a stored doc, but passing in your own Document result which would attempt to re-use the Field instances & byte[] for the binary fields.

Right. This also gets back to the fact that the Document you retrieve should probably be different than the Document that you get by loading the stored fields.

Some sort of lower level callback interface to populate a Document might even eliminate the need for some of the FieldSelector stuff... or at least it would mostly be independent of the field reading code and users could create more advanced implementations.


Michael McCandless added a comment - 08/Aug/08 11:56 PM

realized I am missing actual length we read in LazyField

Duh, right.

Though: couldn't you just call document.getFieldable(name), and then call binaryValue(byte[] result) on that Fieldable, and then get the length from it (getBinaryLength()) too? (Trying to minimize API changes).

Some sort of lower level callback interface to populate a Document might even eliminate the need for some of the FieldSelector stuff... or at least it would mostly be independent of the field reading code and users could create more advanced implementations.

This sounds interesting... but how would you re-use your own byte[] with this approach?


Eks Dev added a comment - 09/Aug/08 01:15 PM

couldn't you just call document.getFieldable(name), and then call binaryValue(byte[] result) on that Fieldable, and then get the length from it (getBinaryLength()) too? (Trying to minimize API changes).

sure, good tip, I this could work. No need to have this byte[]->Fieldable-byte[] loop, it confuses. I have attached patch that uses this approach. But I created getBinaryValue(byte[]) instead of binaryValue(byte[]) as we have binaryValue() as deprecated method (would be confusing as well). Not really tested, but looks simple enough

Just thinking aloud
This is one nice feature, but I permanently had a feeling I do not understand this Field structures, roles and responsibilities Field/Fieldable/AbstractField hierarchy is really ripe for good re-factoring.This bigamy with index / search use cases makes things not really easy to follow, Hoss has right, we need some way to divorce RetrievedField from FieldToBeIndexed, they are definitely not the same, just very similar.


Eks Dev made changes - 09/Aug/08 01:15 PM
Attachment LUCENE-1219.extended.patch [ 12387875 ]
Michael McCandless added a comment - 14/Aug/08 10:10 AM
OK this patch looks good! I plan to commit in a day or two.

This is one nice feature, but I permanently had a feeling I do not understand this Field structures, roles and responsibilities Field/Fieldable/AbstractField hierarchy is really ripe for good re-factoring.This bigamy with index / search use cases makes things not really easy to follow, Hoss has right, we need some way to divorce RetrievedField from FieldToBeIndexed, they are definitely not the same, just very similar.

I completely agree!


Repository Revision Date User Message
ASF #686723 Mon Aug 18 10:31:03 UTC 2008 mikemccand LUCENE-1219: add Fieldable.getBinaryValue/Offset/Length reuse API
Files Changed
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/document/Fieldable.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/FieldsWriter.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/index/FieldsReader.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/index/TestIndexWriter.java
MODIFY /lucene/java/trunk/CHANGES.txt
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/document/Field.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/document/AbstractField.java

Michael McCandless added a comment - 18/Aug/08 10:32 AM
OK, this one took a rather circuitous route but it is now committed! Thanks Eks.

Michael McCandless made changes - 18/Aug/08 10:32 AM
Resolution Fixed [ 1 ]
Fix Version/s 2.4 [ 12312681 ]
Lucene Fields [Patch Available, New] [New, Patch Available]
Status Open [ 1 ] Resolved [ 5 ]
Eks Dev added a comment - 18/Aug/08 12:06 PM
how was it: "repetitio est mater studiorum"

thanks Mike!

----- Original Message ----

Send instant messages to your online friends http://uk.messenger.yahoo.com


Otis Gospodnetic added a comment - 18/Aug/08 06:58 PM
Eks Dev: out of curiosity, did you ever measure the before/after performance difference? If so, what numbers did you see?

Eks Dev added a comment - 19/Aug/08 07:50 AM

did you ever measure the before/after performance difference?

sure we did, it's been a while we measured it so I do not have the real numbers at hand. But for both cases (indexing and fetching stored binary field) it showed up during profiling as the only easy quick-win(s) we could make .

We index very short documents and indexing speed per thread before this patch was is in 7.5k documents/ second range, after it we run it with the patch over 9.5-10K/Second, sweet...

for searching, I do not not remember the numbers, but it was surely above 5% range (try to allocate 12Mb in 6k objects per second as unnecessary addition and you will see it


Michael McCandless made changes - 11/Oct/08 12:49 PM
Status Resolved [ 5 ] Closed [ 6 ]