Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      This patch adds the ability to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications that make the new feature easier to use and more efficient.

      A payload is an array of bytes that can be stored inline in the ProxFile (.prx). This patch therefore provides low-level APIs to store and retrieve byte arrays in the posting lists efficiently.

      API and Usage
      ------------------------------
      The new class index.Payload is basically just a wrapper around a byte[] array, together with int variables for offset and length. A user therefore does not have to create a byte array for every payload, but can allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
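
      For illustration, a minimal sketch of this buffer reuse. It assumes a Payload(byte[] data, int offset, int length) constructor matching the offset/length design described above:

      // Sketch: one shared buffer for all payloads of a document
      // (assumes a Payload(byte[] data, int offset, int length) constructor).
      byte[] buffer = new byte[256];
      int used = 0;

      Payload first = new Payload(buffer, used, 4);   // 4 bytes at offset 0
      used += 4;
      Payload second = new Payload(buffer, used, 2);  // 2 bytes at offset 4
      used += 2;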

      In order to store payloads in the posting lists, one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
      /** Sets this Token's payload. */
      public void setPayload(Payload payload);

      /** Returns this Token's payload. */
      public Payload getPayload();
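
      As an example of how an analysis chain might produce such Tokens, here is a rough sketch of a TokenFilter that attaches a one-byte payload to every token. It is illustrative only: computeWeight() is a hypothetical, application-specific helper, the Payload constructor used is the assumed (byte[], offset, length) form, and the filter uses the Token-returning next() API current at the time of this patch.

      import java.io.IOException;
      import org.apache.lucene.analysis.Token;
      import org.apache.lucene.analysis.TokenFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.index.Payload;

      public class WeightPayloadFilter extends TokenFilter {

        public WeightPayloadFilter(TokenStream input) {
          super(input);
        }

        public Token next() throws IOException {
          Token token = input.next();
          if (token != null) {
            // attach a single-byte payload; the encoding is up to the application
            token.setPayload(new Payload(new byte[] { computeWeight(token) }, 0, 1));
          }
          return token;
        }

        private byte computeWeight(Token token) {
          return (byte) 1; // hypothetical: derive a weight from the token
        }
      }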

      In order to retrieve the data from the index, the interface TermPositions now offers two new methods:
      /** Returns the payload length of the current term position.
       * This is invalid until {@link #nextPosition()} is called for
       * the first time.
       *
       * @return length of the current payload in number of bytes
       */
      int getPayloadLength();

      /** Returns the payload data of the current term position.
       * This is invalid until {@link #nextPosition()} is called for
       * the first time.
       * This method must not be called more than once after each call
       * of {@link #nextPosition()}. However, payloads are loaded lazily,
       * so if the payload data for the current position is not needed,
       * this method may not be called at all for performance reasons.
       *
       * @param data the array into which the data of this payload is to be
       *             stored, if it is big enough; otherwise, a new byte[] array
       *             is allocated for this purpose.
       * @param offset the offset in the array into which the data of this payload
       *               is to be stored.
       * @return a byte[] array containing the data of this payload
       * @throws IOException
       */
      byte[] getPayload(byte[] data, int offset) throws IOException;
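
      A rough sketch of the read side, assuming an open IndexReader named reader (the field and term names are made up). Since payloads are loaded lazily, getPayload() is only called when the data is actually needed, so positions with uninteresting payloads are skipped cheaply:

      TermPositions tp = reader.termPositions(new Term("body", "lucene"));
      byte[] buffer = new byte[64];
      while (tp.next()) {
        int freq = tp.freq();
        for (int i = 0; i < freq; i++) {
          int position = tp.nextPosition();
          int length = tp.getPayloadLength();
          if (length > 0) {
            // may return a new, larger array if 'buffer' is too small
            byte[] data = tp.getPayload(buffer, 0);
            // use data[0..length-1] together with 'position' ...
          }
        }
      }
      tp.close();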

      Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). Previously there was only a writeBytes() method without an offset argument.
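
      For example, a payload slice of a shared buffer can now be written directly, without first copying it into a temporary array (a minimal sketch, not code from the patch):

      void writePayload(IndexOutput out, byte[] buffer, int offset, int length)
          throws IOException {
        out.writeVInt(length);                  // length first, then the raw bytes
        out.writeBytes(buffer, offset, length);
      }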

      Implementation details
      ------------------------------

      • One field bit in FieldInfos is used to indicate whether payloads are enabled for a field. The user does not have to enable payloads for a field; this is done automatically:
      • The DocumentWriter enables payloads for a field if one or more Tokens carry payloads.
      • The SegmentMerger enables payloads for a field during a merge if payloads are enabled for that field in one or more segments.
      • Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile do not change.
      • Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
      • Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. if the bit is not set, then the payload has the same length as the payload of the previous term occurrence.
      • In order to support skipping in the ProxFile, the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is used as well: the lowest bit of DocSkip indicates whether the payload length is stored for a SkipDatum or whether the length is the same as in the last SkipDatum.
      • Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. If the user calls getPayload(), then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is simply skipped.

      Changes of file formats
      ------------------------------

      • FieldInfos (.fnm)
        The layout of the .fnm file does not change; the only difference is that the sixth lowest-order bit (0x20) of the FieldBits is now used. If this bit is set, then payloads are enabled for the corresponding field.
      • ProxFile (.prx)
        ProxFile (.prx) --> <TermPositions>^TermCount
        TermPositions --> <Positions>^DocFreq
        Positions --> <PositionDelta, Payload?>^Freq
        Payload --> <PayloadLength?, PayloadData>
        PositionDelta --> VInt
        PayloadLength --> VInt
        PayloadData --> byte^PayloadLength

      For payloads disabled (unchanged):
      PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).

      For Payloads enabled:
      PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
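
      A sketch of the writer-side logic this encoding implies (illustrative only, not the patch's actual DocumentWriter code):

      void writePosition(IndexOutput out, int positionDelta, int payloadLength,
                         int lastPayloadLength, byte[] payload, int payloadOffset)
          throws IOException {
        if (payloadLength != lastPayloadLength) {
          out.writeVInt((positionDelta << 1) | 1); // odd: explicit PayloadLength follows
          out.writeVInt(payloadLength);
        } else {
          out.writeVInt(positionDelta << 1);       // even: same length as previous payload
        }
        out.writeBytes(payload, payloadOffset, payloadLength);
      }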

      • FreqFile (.frq)

      SkipDatum --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
      PayloadLength --> VInt

      For payloads disabled (unchanged):
      DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.

      For payloads enabled:
      DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.

      This encoding is space efficient for different use cases:

      • If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
      • If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
      • If only a few terms of a field have payloads, then we don't waste much space, because we benefit again from the same-length compression: the length zero only has to be stored once per term for the empty payloads.

      All unit tests pass.

      1. payloads.patch
        78 kB
        Michael Busch
      2. payloads.patch
        78 kB
        Michael Busch
      3. payloads.patch
        78 kB
        Michael Busch
      4. payload.patch
        93 kB
        Nicolas Lalevée
      5. payloads.patch
        76 kB
        Michael Busch

        Issue Links

          Activity

          Michael Busch created issue -
          Michael Busch made changes -
          Field Original Value New Value
          Attachment payloads.patch [ 12347603 ]
          Grant Ingersoll added a comment -

          Great patch, Michael, and something that will come in handy for a lot of people. I can vouch it applies cleanly and all the tests pass.

          Now I am not sure I am totally understanding everything just yet so the following is thinking aloud, but bear with me.

          One of the big unanswered questions (besides how this fits into the whole flexible indexing scheme as discussed on the Payloads and Flexible indexing threads on java-dev) at this point for me is: how do we expose/integrate this into the scoring side of the equation? It seems we would need some interfaces that hook into the scoring mechanism so that people can define what all these payloads are actually used for, or am I missing something? Yet the TermScorer takes in the TermDocs, so it doesn't yet have access to the payloads (although this is easily remedied since we have access to the TermPositions when we construct TermScorer.) Span Queries could easily be extended to include payload information since they use the TermPositions, which would be useful for post-processing algorithms.

           I can imagine an interface that would have to be set on the Query/Scorer (and inherited unless otherwise set???). The default implementation would be to ignore any payload, I suppose. We could also add a callback in the Similarity mechanism, something like:

          float calculatePayloadFactor(byte[] payload);
          or
          float calculatePayloadFactor(Term term, byte[] payload);

          Then this factor could be added/multiplied into the term score or whatever other scorers use it??????

          Is this making any sense?

          Michael Busch added a comment -

          > Great patch, Michael, and something that will come in handy for a lot of people. I can vouch it applies cleanly and all the tests pass.

          Cool, thanks for trying it out, Grant!

          > Now I am not sure I am totally understanding everything just yet so the following is thinking aloud, but bear with me.

          > One of the big unanswered questions (besides how this fits into the whole flexible indexing scheme as discussed on the Payloads and
          > Flexible indexing threads on java-dev) at this point for me is: how do we expose/integrate this into the scoring side of the equation? It seems
          > we would need some interfaces that hook into the scoring mechanism so that people can define what all these payloads are actually used
          > for, or am I missing something? Yet the TermScorer takes in the TermDocs, so it doesn't yet have access to the payloads (although this is
          > easily remedied since we have access to the TermPositions when we construct TermScorer.) Span Queries could easily be extended to
          > include payload information since they use the TermPositions, which would be useful for post-processing algorithms.

           I would say it really depends on the use case of the payloads. For example XML search: here payloads can be used to store depth information of terms. An extended Span class could then take the depth information into account for query evaluation. As you pointed out, the span classes already have easy access to the payloads since they use TermPositions, so implementing such a subclass should be fairly simple.

          > I can imagine an interface that you would have to be set on the Query/Scorer (and inherited unless otherwise set???). The default
          > implementation would be to ignore any payload, I suppose. We could also add a callback in the Similarity mechanism, something like:
          >
          > float calculatePayloadFactor(byte[] payload);
          > or
          > float calculatePayloadFactor(Term term, byte[] payload);
          >
          > Then this factor could be added/multiplied into the term score or whatever other scorers use it??????
          >
          > Is this making any sense?

           I believe the case you're describing here is per-term norms/boosts? Yah, I think this would work and you are right, the Scorers have to have access to TermPositions; TermDocs is not sufficient. So yes, it would be nice if TermScorer used TermPositions instead of TermDocs. I just opened LUCENE-761, which changes SegmentTermPositions to clone the proxStream lazily the first time nextPosition() is called. Then the costs for creating TermDocs and TermPositions are the same, and together with lazy prox skipping (LUCENE-687) there's no reason anymore not to use TermPositions.

          However, as currently discussed on java-dev, per-term boosts could also be part of a new posting format in the flexible index scheme and thus not stored in the payloads.

           So in general this patch doesn't yet add a new search feature to Lucene; it rather opens the door for new features in the future. The way to add such a new feature is then:
           1) Write an analyzer that provides the data necessary for the new feature and produces Tokens with payloads containing this data.
          2) Write/extend a Scorer that has access to TermPositions and makes use of the data in the payloads for matching or scoring or both.
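
           For instance, a rough sketch (illustrative only) of the kind of helper such a payload-aware Scorer could use to fold one-byte payloads into a per-document boost; decodeWeight() is a hypothetical, application-defined mapping:

           float payloadBoost(TermPositions tp) throws IOException {
             float boost = 1.0f;
             int freq = tp.freq();
             byte[] buffer = new byte[1];
             for (int i = 0; i < freq; i++) {
               tp.nextPosition();
               if (tp.getPayloadLength() == 1) {
                 byte[] data = tp.getPayload(buffer, 0);
                 boost *= decodeWeight(data[0]);
               }
             }
             return boost;
           }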

          Hoss Man made changes -
          Link This issue blocks LUCENE-761 [ LUCENE-761 ]
          Nicolas Lalevée made changes -
          Attachment payload.patch [ 12348592 ]
          Nicolas Lalevée added a comment -

           The patch I have just uploaded (payload.patch) is Michael's one (payloads.patch) with the customization of how payloads are written and read, exactly as I did for LUCENE-662. An IndexFormat is in fact a factory of PayloadWriter and PayloadReader, this index format being stored in the Directory instance.

           Note that I haven't changed the javadoc or the comments included in Michael's patch; it needs some cleanup if somebody is interested in committing it.
           And sorry for the name of the patch I have uploaded; it is a little bit confusing now, and I can't change its name. I will be more careful next time when naming my patch files.

          Michael Busch made changes -
          Link This issue blocks LUCENE-761 [ LUCENE-761 ]
          Michael Busch made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Grant Ingersoll added a comment -

          Nicolas,

           Are you implying your patch fits in with 662 (and needs to be applied after it), or is it just in the style of 662 but not dependent on it?

          Thanks,
          Grant

          Nicolas Lalevée added a comment -

          Grant>
           The patch I have proposed here has no dependency on LUCENE-662; I just "imported" some ideas from it and put them here. Since LUCENE-662 has evolved, the patches will probably conflict. The best one to use here is Michael's; I think it won't conflict with LUCENE-662. And if both are intended to be committed, then the best is to commit them separately and redo the work I have done in the provided patch (I remember that it was quite easy).

          Michael Busch added a comment -

          I'm attaching the new patch with the following changes:

          • applies cleanly on the current trunk
           • fixed a bug in FSDirectory which affected payloads longer than 1024 bytes, and extended the test case TestPayloads to cover this fix
           • added the following warning to the new APIs:
             "Warning: The status of the Payloads feature is experimental. The APIs
             introduced here might change in the future and will not be supported anymore
             in such a case. If you want to use this feature in a production environment
             you should wait for an official release."

          Another comment about an API change: In BufferedIndexOutput I changed the method
          protected abstract void flushBuffer(byte[] b, int len) throws IOException;
          to
          protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;

          which means that subclasses of BufferedIndexOutput won't compile anymore. I made this change for performance reasons: If a payload is longer than 1024 bytes (standard buffer size of BufferedIndexOutput) then it can be flushed efficiently to disk without having to perform array copies.

           Is this API change acceptable? Users who have custom subclasses of BufferedIndexOutput would have to change their classes for them to keep working.

          Michael Busch made changes -
          Attachment payloads.patch [ 12353080 ]
          Michael Busch added a comment -

          Attaching a new patch. The previous one didn't apply cleanly anymore after LUCENE-710 was committed.

          Michael Busch made changes -
          Attachment payloads.patch [ 12353231 ]
          Michael Busch added a comment -

           Another one! (The previous version didn't apply cleanly anymore after LUCENE-818 was committed; Mike is keeping me busy.)

          Grant, did you get a chance to review the patch? I would like to go ahead and commit it soon with the API warnings if nobody objects...

          Michael Busch made changes -
          Attachment payloads.patch [ 12353316 ]
          Grant Ingersoll added a comment -

           OK, I've applied the patch. All tests pass for me. I think it looks good. Have you run any benchmarks on it? I ran the standard one on the patched version and on trunk, in a totally unscientific test. In theory, the case with no payloads should perform very closely to the existing code, and this seems to be born out by me running the micro-standard (ant run-task in contrib/benchmark). Once we have this committed someone can take a crack at adding support to the benchmarker for payloads.

           Payload should probably be serializable.

           All in all, I think we could commit this, then adding the search/scoring capabilities like we've talked about. I like the documentation/comments you have added, very useful. (One of these days I will take on documenting the index package like I intend to, so what you've added will be quite helpful!) We will/may want to add in, for example, a PayloadQuery and derivatives and a QueryParser operator that supported searching in the payload, or possibly boosting if a certain term has a certain type of payload (not that I want anything to do with the QueryParser). Even beyond that, SpanPayloadQuery, etc. I will possibly have some cycles to actually write some code for these next week.

           Just throwing this out there, I'm not sure I really mean it or not, but: do you think it would be useful to consider restricting the size of the payload? I know, I know, as soon as we put a limit on it, someone will want to expand it, but I was thinking if we knew the size had a limit we could better control the performance and caching, etc. on the scoring/search side. I guess it is buyer beware, maybe we put some javadocs on this.

           Also, I started http://wiki.apache.org/lucene-java/Payloads as I think we will want to have some docs explaining why Payloads are useful in non-javadoc format.

           On a side note, have a look at http://wiki.apache.org/lucene-java/PatchCheckList to see if there is anything you feel you can add.

           --------------------------
           Grant Ingersoll
           Center for Natural Language Processing
           http://www.cnlp.org

           Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ

          Michael Busch added a comment -

          Grant Ingersoll commented on LUCENE-755:
          ----------------------------------------

          > OK, I've applied the patch. All tests pass for me. I think it looks
          > good. Have you run any benchmarks on it? I ran the standard one on
          > the patched version and on trunk, in a totally unscientific test. In
          > theory, the case with no payloads should perform very closely to the
          > existing code, and this seems to be born out by me running the micro-
          > standard (ant run-task in contrib/benchmark). Once we have this

          Grant, thank you for running the benchmarks!
          In case no payloads are used there is indeed no performance decrease to
          expect, because the file format does not change at all in that case.

          > committed someone can take a crack at adding support to the
          > benchmarker for payloads.

           Good point! This will help us find possible optimizations.

          > Payload should probably be serializable.

          Agreed. Will do ...

          > All in all, I think we could commit this, then adding the search/
          > scoring capabilities like we've talked about. I like the
          > documentation/comments you have added, very useful. (One of these
          > days I will take on documenting the index package like I intend to,
          > so what you've added will be quite helpful!) We will/may want to

           That's what I was planning to do as well... haven't had time yet. But
           good that there's another volunteer, so we can split the work.

          > add in, for example, a PayloadQuery and derivatives and a QueryParser
          > operator that supported searching in the payload, or possibly
          > boosting if a certain term has a certain type of payload (not that I
          > want anything to do with the QueryParser). Even beyond that,
          > SpanPayloadQuery, etc. I will possibly have some cycles to actually
          > write some code for these next week.

          Yes there are lots of things we could do. I was also thinking about
          providing a demo that uses payloads. Let's commit this first, then
          we can start working on these items...

          > Just throwing this out there, I'm not sure I really mean it or
          > not , but:
          > do you think it would be useful to consider restricting the size of
          > the payload? I know, I know, as soon as we put a limit on it,
          > someone will want to expand it, but I was thinking if we knew the
          > size had a limit we could better control the performance and caching,
          > etc. on the scoring/search side. I guess it is buyer beware, maybe
          > we put some javadocs on this.

          Hmm, I'm not sure if we should limit the size... since there are
          so many different use cases I wouldn't even know how to pick such
          a limit. However, if we discover later that a limit would be helpful
          to optimize things on the search side we could think about a limit
          parameter on field level, which would be easy to add if we introduce
          a schema and global field semantics with FI.

          > Also, I started http://wiki.apache.org/lucene-java/Payloads as I
          > think we will want to have some docs explaining why Payloads are
          > useful in non-javadoc format.

          Cool, that will be helpful!

          > On a side note, have a look at http://wiki.apache.org/lucene-java/
          > PatchCheckList to see if there is anything you feel you can add.

          Thanks for reviewing this so thoroughly, Grant! I will commit it soon!

          Michael Busch added a comment -

          I just committed this. Payload is serializable now.

          Michael Busch made changes -
          Status In Progress [ 3 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Fix Version/s 2.2 [ 12312328 ]
          Grant Ingersoll made changes -
          Link This issue relates to LUCENE-834 [ LUCENE-834 ]
          Michael Busch made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Workflow jira [ 12392924 ] Default workflow, editable Closed status [ 12562509 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12562509 ] jira [ 12583478 ]

            People

             • Assignee: Michael Busch
             • Reporter: Michael Busch
             • Votes: 3
             • Watchers: 1

              Dates

              • Created:
                Updated:
                Resolved:

                Development