  Lucene - Core / LUCENE-1219

Support array/offset/length setters for Field with binary data

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently the Field/Fieldable interface supports only compact, zero-based byte arrays. Since such fields are often of variable size, this forces end users to allocate new arrays and copy content into them before passing them to Lucene. Depending on the use case, avoiding these copies can bring a far from negligible performance improvement.

      This approach extends the Fieldable interface with three new methods:
      getOffset(), getLength(), and getBinaryValue() (the last only returns a reference to the array).
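As a rough illustration of the idea (the class below is invented for this sketch and is not actual Lucene source; the committed patch ultimately settled on the names getBinaryValue()/getBinaryOffset()/getBinaryLength()), a binary field value is just a possibly shared backing array plus an offset and a length:

```java
// Illustrative sketch only: a binary field value described by a
// (possibly shared) backing array plus offset/length, so callers
// need not copy into a compact, zero-based array first.
public class BinaryFieldValue {
    private final byte[] data;   // shared buffer; valid region is [offset, offset + length)
    private final int offset;
    private final int length;

    public BinaryFieldValue(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > data.length) {
            throw new IllegalArgumentException("invalid offset/length");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    /** Returns a reference to the backing array; no copy is made. */
    public byte[] getBinaryValue() { return data; }

    public int getBinaryOffset() { return offset; }

    public int getBinaryLength() { return length; }
}
```

A caller can then hand Lucene a slice of one large reusable buffer instead of allocating a fresh compact array per field.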

      1. LUCENE-1219.extended.patch
        17 kB
        Eks Dev
      2. LUCENE-1219.extended.patch
        18 kB
        Eks Dev
      3. LUCENE-1219.patch
        8 kB
        Eks Dev
      4. LUCENE-1219.patch
        8 kB
        Eks Dev
      5. LUCENE-1219.patch
        8 kB
        Eks Dev
      6. LUCENE-1219.patch
        8 kB
        Eks Dev
      7. LUCENE-1219.take2.patch
        15 kB
        Michael McCandless
      8. LUCENE-1219.take3.patch
        16 kB
        Eks Dev

        Issue Links

          Activity

          eksdev Eks Dev added a comment -

          All tests pass with this patch.
          Some polish is needed and probably more testing. TODOs:

          • someone pedantic should check whether these new set/get methods could be named better
          • check whether there are more places where this new feature could/should be used; I think I have changed all of them except one, the direct subclass FieldForMerge in FieldsReader. That is code I do not know, so I did not touch it...
          • javadoc is poor

          This should be enough to get us started.

          The only "pseudo-issue" I see is that
          public byte[] binaryValue() now creates a byte[] and copies the content into it; a reference to the original array can now be fetched via the getBinaryValue() method. This is done to preserve compatibility, since users expect a compact, zero-based array from this method and we now keep offset/length in Field.
          It is a "pseudo-issue" because users should already have a reference to this array, so the method is rather superfluous for end users.

          eksdev Eks Dev added a comment -

          Michael McCandless had some nice ideas on how to make the performance penalty of the getValue() change negligible for legacy usage; this patch includes them:

          • deprecates the getValue() method
          • returns a direct reference if offset == 0 && length == data.length
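The second point can be sketched like this (illustrative code, not the actual patch): the legacy compact accessor copies only when the valid region is not already the whole array, so existing callers that pass zero-based arrays pay nothing.

```java
// Illustrative sketch of the legacy-compatibility accessor: return the
// backing array directly when it is already compact and zero-based,
// otherwise copy the valid region into a fresh array.
public class LegacyBinaryAccessor {
    public static byte[] binaryValue(byte[] data, int offset, int length) {
        if (offset == 0 && length == data.length) {
            return data;  // fast path: no copy, no penalty for legacy usage
        }
        byte[] copy = new byte[length];
        System.arraycopy(data, offset, copy, 0, length);
        return copy;
    }
}
```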
          mikemccand Michael McCandless added a comment -

          Hmm ... one problem is that Fieldable is an interface, and this patch adds methods to the interface, which I believe breaks our backwards-compatibility requirement.

          eksdev Eks Dev added a comment -

          I do not know for sure whether this is something we could not live with. Adding a new interface sounds equally bad; it would work nicely, but I do not like it, as it makes the code harder to follow with too many interfaces ... I'll have another look at it to see if there is a way to do this without interface changes. Any ideas?

          eksdev Eks Dev added a comment -

          This one keeps the addition of the new methods localized to AbstractField and does not change the Fieldable interface... it looks like it could work done this way, with a few instanceof checks in FieldsWriter. This patch depends on LUCENE-1217.

          It will not give you any benefit if you implement Fieldable directly without extending AbstractField, so I would suggest eventually changing Fieldable to support all these methods that operate with offset/length. Or someone clever finds a way to change an interface without breaking backwards compatibility.

          eksdev Eks Dev added a comment -

          Latest patch updated to trunk. (LUCENE-1217 is in; Michael, you did not mark it as resolved.)

          mikemccand Michael McCandless added a comment -

          OK I updated the patch:

          • Added a ctor to Field to create binary fields with length &
            offset
          • Added a test case
          • Regularized whitespace
          • Renamed things:
            getLength -> getBinaryLength
            getOffset -> getBinaryOffset
            dataOffset -> binaryOffset
            dataLength -> binaryLength

          Eks can you see if the changes look OK? Thanks.

          mikemccand Michael McCandless added a comment -

          Alas, I'm not really happy with introducing this API at the
          AbstractField level and not in Fieldable. It's awkward that we've
          deprecated binaryValue() in AbstractField and not in Fieldable. But,
          I think it's our only way forward with this issue without breaking
          backwards compatibility.

          In 3.0 I'd like to at least promote this API up into Fieldable, but
          even that is somewhat messy because I think in 3.0 we would then
          deprecate binaryValue() and move these 3 new methods up from
          AbstractField.

          What I'd really like to do in 3.0 is change Fieldable to no longer be an
          interface, but an abstract base class instead.

          Question: could we simply move forward without Fieldable? Ie,
          deprecate Fieldable right now and state that the migration path is
          "you should subclass from AbstractField"? I would leave "implements
          Fieldable" in AbstractField now, but remove it in 3.0. As far as I
          can tell, all uses of Fieldable in Lucene are also using
          AbstractField.

          I guess I don't really understand the need for Fieldable. In fact I
          also don't really understand why we even needed to add AbstractField.
          Why couldn't FieldForMerge and LazyField subclass Field? It's
          somewhat awkward now because we have newly added APIs to Field, like
          setValue, which probably should have been added to Fieldable.

          eksdev Eks Dev added a comment -

          >>Eks can you see if the changes look OK? Thanks.
          It looks perfect; you have brought it to "commit ready" status already.
          I will try it on our production mirror a bit later today and report back if something goes wrong.

          >>I guess I don't really understand the need for Fieldable. In fact I
          also don't really understand why we even needed to add AbstractField.

          I am with you 100% here. It looks to me as well that one concrete class could replace it all. But... maybe someone will kick in with some good arguments for why we have it this way.

          eksdev Eks Dev added a comment -
          • updated this patch to apply to trunk
          • implemented the abstract getBinary***() methods in Fieldable, and removed a few ugly instanceof AbstractField checks from a few places (introduced by previous versions of this patch under the assumption that Fieldable should stay unchanged...)

          All tests pass (as expected; only minor diffs to the take2 version, much like the initial version).

          mikemccand Michael McCandless added a comment -

          Thanks Eks – this looks good. I think it's ready to commit, once LUCENE-1349 is in.

          eksdev Eks Dev added a comment -

          Great Mike,
          it gets better and better; I saw LUCENE-1340 committed. Thanks to you, Grant, Doug and all the others who voted for 1349, this happened so quickly. Trust me, these two issues are really making my life easier. I pushed the decision to add new hardware to some future point (meaning: it saves the customer's money now)... a few weeks later would have been too late.

          Now it only remains to make one nice patch that enables us to pass our own byte[] for retrieving stored fields during search. I was thinking along the lines of what you did in the Analyzers.

          We could pull the same trick for this, e.g.:

          Field Document.getBinaryValue(String FIELD_NAME, Field destination);

          Field already has all the access methods (get/set).

          The contract would be: if destination == null, a new one is created and returned; if not, we use it and return the same object back. The method should check whether the byte[] is big enough; if not, a simple growth policy can be applied. This way we avoid allocating a new byte[] each time you fetch a stored field.

          I did not look closely at the code just now, but the last time I looked into it, it seemed quite simple to do something along these lines. Do you have ideas on how we could do it better?

          Just a simple calculation in my case:
          the average hit count is around 200, and for each hit we have to fetch one stored field on which we do some post-processing, re-scoring and whatnot. Currently we run at most 30 requests/second; with an average document length of 2 KB you land at 200 * 30 = 6000 object allocations per second, totaling 12 MB... only to get the data... I can imagine people with much longer documents (that would be the typical Lucene use case) where it gets worse... simply reducing gc() pressure with a really small amount of work. I am sure this would have nice effects on some other Lucene use cases as well.
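The arithmetic in that calculation, spelled out (assuming "2k" means 2 KB = 2048 bytes per stored field):

```java
// Back-of-envelope check of the figures above: 200 hits per request at
// 30 requests/second means 6000 stored-field fetches per second; at
// roughly 2 KB each, that is about 12 MB of garbage created per second.
public class AllocationEstimate {
    public static long allocationsPerSecond(long hitsPerRequest, long requestsPerSecond) {
        return hitsPerRequest * requestsPerSecond;
    }

    public static long bytesPerSecond(long allocationsPerSecond, long bytesPerDoc) {
        return allocationsPerSecond * bytesPerDoc;
    }
}
```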

          Thanks again to all the "workers" behind this great piece of software...
          eks

          PS: I need to find some time to peek at Paul's work in LUCENE-1345 and my wish list will be complete, at least for now (at least until you get your magic with the flexible index format done).

          eksdev Eks Dev added a comment -

          Mike,
          this new patch includes take3 and adds the following:

          Fieldable Document.getStoredBinaryField(String name, byte[] scratch);

          where the scratch parameter is a user-supplied byte buffer that will be used if it is big enough; if not, a new one is simply allocated as today. If scratch is used, you get the same object back through Fieldable.getByteValue().

          For this to work, I added one new method in Fieldable:
          abstract Fieldable getBinaryField(byte[] scratch);

          The only interesting implementation is in LazyField.

          The reason for this is in my previous comment.

          This does not affect the issues from take3 at all, but it depends on take3, as you need to know the length of the byte[] you read. take3 remains good to commit; I just did not know how to make one isolated patch with only these changes without too much work in a text editor.

          mikemccand Michael McCandless added a comment -

          Eks, could we instead add this to Field:

          byte[] binaryValue(byte[] result)

          and then default the current binaryValue() to just call binaryValue(null)?

          And similar in Document add:

          byte[] getBinaryValue(String name, byte[] result)

          These would work the same way that TokenStream.next(Token result) works, i.e., the method should try to use the result passed in, if it can, and return that; otherwise it is free to allocate its own new byte[] and return it?

          And then only LazyField's implementation of binaryValue(byte[] result) would use byte[] result if it's large enough?

          Also ... it'd be nice to have a way to do this re-use in the non-lazy case. Ie somehow load a stored doc, but passing in your own Document result which would attempt to re-use the Field instances & byte[] for the binary fields. But we should open another issue to explore that...
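A minimal sketch of the reuse contract described here, with an invented stand-in class (the real LazyField reads the bytes from the index; here they are just held in memory for illustration):

```java
// Illustrative sketch of the buffer-reuse contract: try to fill the
// caller-supplied array; fall back to allocating only when it is null
// or too small.
public class ReusableBinaryReader {
    private final byte[] stored;  // stand-in for what LazyField would read

    public ReusableBinaryReader(byte[] stored) { this.stored = stored; }

    /** Shape of the proposed binaryValue(byte[] result). */
    public byte[] binaryValue(byte[] result) {
        byte[] target = (result != null && result.length >= stored.length)
                ? result                     // caller's buffer is big enough: reuse it
                : new byte[stored.length];   // otherwise allocate, as today
        System.arraycopy(stored, 0, target, 0, stored.length);
        return target;
    }

    /** Legacy no-arg form just delegates with a null buffer. */
    public byte[] binaryValue() { return binaryValue(null); }

    /** Number of valid bytes in the returned buffer. */
    public int getBinaryLength() { return stored.length; }
}
```

The returned array may be larger than the value it holds, which is exactly why the caller also needs getBinaryLength(), as the following comments work out.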

          eksdev Eks Dev added a comment -

          could we instead add this to Field:

          byte[] binaryValue(byte[] result)

          this is exactly where I started, but then I realized I am missing the actual length we read in LazyField. Without it you would have to reallocate each time, except when your buffer length equals toRead in LazyField. Simply put, the question is: how could the caller of byte[] getBinaryValue(String name, byte[] result) know the length of the data in the returned byte[]?

          Am I missing something obvious?

          yseeley@gmail.com Yonik Seeley added a comment -

          Also ... it'd be nice to have a way to do this re-use in the non-lazy case. Ie somehow load a stored doc, but passing in your own Document result which would attempt to re-use the Field instances & byte[] for the binary fields.

          Right. This also gets back to the fact that the Document you index should probably be different from the Document that you get back by loading the stored fields.

          Some sort of lower level callback interface to populate a Document might even eliminate the need for some of the FieldSelector stuff... or at least it would mostly be independent of the field reading code and users could create more advanced implementations.

          mikemccand Michael McCandless added a comment -

          realized I am missing actual length we read in LazyField

          Duh, right.

          Though: couldn't you just call document.getFieldable(name), and then call binaryValue(byte[] result) on that Fieldable, and then get the length from it (getBinaryLength()) too? (Trying to minimize API changes).
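The caller-side pattern this suggests can be sketched with simplified stand-in classes (SketchDocument/SketchField below are invented for illustration, not the real Lucene Document/Fieldable): fetch the Fieldable once, reuse your own buffer, and read the valid length from the same object.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins, only to show the calling pattern:
// one lookup, buffer reuse, then ask the field for the length.
class SketchField {
    private final byte[] stored;
    SketchField(byte[] stored) { this.stored = stored; }

    byte[] getBinaryValue(byte[] result) {
        byte[] target = (result != null && result.length >= stored.length)
                ? result : new byte[stored.length];
        System.arraycopy(stored, 0, target, 0, stored.length);
        return target;
    }
    int getBinaryLength() { return stored.length; }
}

class SketchDocument {
    private final Map<String, SketchField> fields = new HashMap<>();
    void add(String name, SketchField f) { fields.put(name, f); }
    SketchField getFieldable(String name) { return fields.get(name); }
}

class ReusePattern {
    // scratch survives across hits, so in the common case no allocation
    // happens per fetched stored field; the returned array may be scratch
    // itself, and getBinaryLength() on the field says how much is valid.
    static byte[] fetch(SketchDocument doc, String name, byte[] scratch) {
        return doc.getFieldable(name).getBinaryValue(scratch);
    }
}
```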

          Some sort of lower level callback interface to populate a Document might even eliminate the need for some of the FieldSelector stuff... or at least it would mostly be independent of the field reading code and users could create more advanced implementations.

          This sounds interesting... but how would you re-use your own byte[] with this approach?

          eksdev Eks Dev added a comment -

          couldn't you just call document.getFieldable(name), and then call binaryValue(byte[] result) on that Fieldable, and then get the length from it (getBinaryLength()) too? (Trying to minimize API changes).

          Sure, good tip, I think this could work. No need to have this byte[] -> Fieldable -> byte[] loop; it confuses. I have attached a patch that uses this approach. But I created getBinaryValue(byte[]) instead of binaryValue(byte[]), as binaryValue() is already a deprecated method (reusing the name would be confusing as well). Not really tested, but it looks simple enough.

          Just thinking aloud:
          this is one nice feature, but I have always had the feeling that I do not understand these Field structures, roles and responsibilities. The Field/Fieldable/AbstractField hierarchy is really ripe for a good refactoring. This bigamy between the index and search use cases makes things not really easy to follow. Hoss is right: we need some way to divorce RetrievedField from FieldToBeIndexed; they are definitely not the same, just very similar.

          mikemccand Michael McCandless added a comment -

          OK this patch looks good! I plan to commit in a day or two.

          This is one nice feature, but I have always had the feeling that I do not understand these Field structures, roles and responsibilities. The Field/Fieldable/AbstractField hierarchy is really ripe for a good refactoring. This bigamy between the index and search use cases makes things not really easy to follow. Hoss is right: we need some way to divorce RetrievedField from FieldToBeIndexed; they are definitely not the same, just very similar.

          I completely agree!

          mikemccand Michael McCandless added a comment -

          OK, this one took a rather circuitous route but it is now committed! Thanks Eks.

          eksdev Eks Dev added a comment -

          how was it: "repetitio est mater studiorum"

          thanks Mike!


          otis Otis Gospodnetic added a comment -

          Eks Dev: out of curiosity, did you ever measure the before/after performance difference? If so, what numbers did you see?

          eksdev Eks Dev added a comment -

          did you ever measure the before/after performance difference?

          Sure we did; it's been a while since we measured it, so I do not have the real numbers at hand. But for both cases (indexing and fetching a stored binary field) it showed up during profiling as the only easy quick win(s) we could make.

          We index very short documents, and indexing speed per thread before this patch was in the 7.5k documents/second range; with the patch we run over 9.5-10k/second, sweet...

          For searching I do not remember the numbers, but it was surely above the 5% range (try to allocate 12 MB in 6k objects per second as unnecessary overhead and you will see it).


            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              eksdev Eks Dev
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:

                Development