Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that payloads have been implemented, it will be good to make them searchable via one or more Query mechanisms. See http://wiki.apache.org/lucene-java/Payload_Planning for some background information and https://issues.apache.org/jira/browse/LUCENE-755 for the issue that started it all.

      Attachments

      1. btq.fix.patch
        12 kB
        Grant Ingersoll
      2. boosting.term.query.patch
        28 kB
        Grant Ingersoll

        Issue Links

          Activity

          Grant Ingersoll created issue -
          Grant Ingersoll made changes -
          Link: This issue is related to LUCENE-755 [ LUCENE-755 ]
          Grant Ingersoll added a comment -

          First draft at a BoostingTermQuery, which is based on the SpanTermQuery and can be used for boosting the score of a term based on what is in the payload (for things like weighting terms higher according to their font size or part of speech).

          A couple of classes that were previously package-level are now public; they have been marked as public for derivation (subclassing) purposes only.

          See the CHANGES.xml for some more details.

          I believe all tests still pass.
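
          For illustration, a minimal sketch of how a boost could be attached as a payload at index time so that BoostingTermQuery can read it at search time. The filter name is hypothetical and the calls assume the 2.2-era Token/Payload analysis API:

            import java.io.IOException;

            import org.apache.lucene.analysis.Token;
            import org.apache.lucene.analysis.TokenFilter;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.index.Payload;

            /** Hypothetical filter: attaches a one-byte boost payload to every token. */
            public class BoostPayloadFilter extends TokenFilter {
              private final byte boost;

              public BoostPayloadFilter(TokenStream input, byte boost) {
                super(input);
                this.boost = boost;
              }

              public Token next() throws IOException {
                Token t = input.next();
                if (t != null) {
                  // A custom Similarity.scorePayload() can decode this byte at search time.
                  t.setPayload(new Payload(new byte[] { boost }));
                }
                return t;
              }
            }

          A field analyzed with such a filter could then be searched with new BoostingTermQuery(new Term("body", "lucene")), letting the payload contribute to the score.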

          Grant Ingersoll made changes -
          Attachment: boosting.term.query.patch [ 12353577 ]
          Grant Ingersoll made changes -
          Status: Open [ 1 ] → In Progress [ 3 ]
          Grant Ingersoll added a comment -

          I should add, one open question is how big the array in the BoostingSpanScorer should be preallocated. I set it to 256, but I would imagine it could be smaller? I'm not sure. It probably should be configurable, but I didn't go that route. I would think, for practical purposes, that payloads should be kept small for the most part; otherwise performance will most likely suffer. What do others think? Have you seen papers/applications where the engine is storing large amounts of data on a per-term basis?

          I could see where it might be useful to write VInts, etc. to the payload. Perhaps a refactoring of some of the writing/reading methods to allow for their reuse would be useful. Just thinking out loud...
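
          As a rough sketch of the VInt idea (the helper class is made up, but the byte layout matches Lucene's VInt format: 7 data bits per byte, high bit set when more bytes follow):

            /** Hypothetical helper for packing small non-negative ints into a payload. */
            public class PayloadVInt {

              /** Encodes value into buf starting at offset; returns the number of bytes written. */
              public static int writeVInt(int value, byte[] buf, int offset) {
                int pos = offset;
                while ((value & ~0x7F) != 0) {
                  buf[pos++] = (byte) ((value & 0x7F) | 0x80);
                  value >>>= 7;
                }
                buf[pos++] = (byte) value;
                return pos - offset;
              }

              /** Decodes a VInt from buf starting at offset. */
              public static int readVInt(byte[] buf, int offset) {
                byte b = buf[offset++];
                int value = b & 0x7F;
                for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                  b = buf[offset++];
                  value |= (b & 0x7F) << shift;
                }
                return value;
              }
            }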

          Grant Ingersoll added a comment -

          I committed this patch, plus added in some more documentation. Everything is still marked experimental.

          SVN revision: 523302.

          Michael Busch added a comment -

          Hi Grant,

          cool that you started implementing queries that make use of payloads! I have a question about this one: BoostingTermQuery only takes the payload of the first term position into account for scoring. Could you explain why you implemented it this way? Shouldn't we rather compute the average of the payload values of all positions?

          Grant Ingersoll added a comment -

          Uh, because I always forget about multiple terms per position? Mea culpa.

          Average sounds good, or, would it be better to deliver all the payloads at a given term and let the implementation decide?

          Michael Busch added a comment -

          I think averaging should be the default, but you are right, it would be nice if it were possible to alter this, maybe via subclassing? I wouldn't recommend gathering all payloads and sending them to the Similarity; memory consumption would then be proportional to posting list size.

          Marvin Humphrey added a comment -

          Would it be so bad if memory consumption was proportional to posting list size? True, special consideration might be necessary for large documents if payloads were large, and if you have any Query subclasses that rewrite themselves to a zillion subqueries, each of which maintains its own TermPositions subclass instance, that could pose difficulties. What are some other problematic scenarios?

          Michael Busch added a comment -

          Yes, I was mainly thinking about large documents. I think in general memory consumption during search should depend on query complexity, not on the actual index.
          Besides, I don't see much benefit in gathering all payloads up front and processing them thereafter (maybe I'm overlooking something?). What about having a method in BoostingTermScorer like:

          protected float calculateTermBoost(TermPositions tp);

          which implements averaging by default but can be overridden by subclasses? An optimized implementation might, for example, read only the first x% of position payloads for large docs and estimate the boost, for performance reasons.
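
          A minimal sketch of such a hook, assuming a single-byte boost payload per position and the 2.2-era TermPositions payload accessors (none of this is committed code):

            // Averages the per-position payload boosts for the current document.
            // Subclasses could override this to sample only a prefix of the positions.
            protected float calculateTermBoost(TermPositions tp) throws IOException {
              int freq = tp.freq();
              float sum = 0.0f;
              byte[] buf = new byte[1];
              for (int i = 0; i < freq; i++) {
                tp.nextPosition();
                if (tp.isPayloadAvailable() && tp.getPayloadLength() == 1) {
                  tp.getPayload(buf, 0);
                  sum += buf[0];
                } else {
                  sum += 1.0f; // no payload at this position: treat it as a neutral boost
                }
              }
              return freq == 0 ? 1.0f : sum / freq;
            }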

          Marvin Humphrey added a comment -

          Averaging is how I've got this implemented by default in KS. However, all positions and boosts get read in at once. TermDocs/TermPositions has been replaced by PostingList, which goes doc by doc. The type of Posting assigned to each field (MatchPosting, ScorePosting, RichPosting, eventually PayloadPosting) determines how much gets read in. (KS doesn't have any queries that rewrite(), so it's only large docs that are an issue.)

          I haven't yet worked out the mechanics of per-position boosts and phrase queries.

          Grant Ingersoll added a comment -

          Fixed the issue with only loading one payload per term and added a unit test for it. The unit test uses the multiField field, which contains a repeat of each set of terms per document.

          Updated the docs on the Similarity to indicate what offset and length are used for.
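
          As a small, hedged illustration of what those offset and length arguments mean to a scorePayload implementation (the helper name and single-byte encoding are assumptions): the payload bytes may arrive in a shared, reused buffer, so only payload[offset .. offset+length-1] belongs to the current position.

            static float decodeBoost(byte[] payload, int offset, int length) {
              if (payload == null || length == 0) {
                return 1.0f; // no payload for this position: neutral boost
              }
              // Assumes the one-byte boost encoding used in the earlier indexing sketch.
              return (float) payload[offset];
            }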

          Grant Ingersoll made changes -
          Attachment: btq.fix.patch [ 12355983 ]
          Grant Ingersoll added a comment -

          I applied and committed this patch.

          Mark Miller made changes -
          Status: In Progress [ 3 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Fix Version/s: 2.2 [ 12312328 ]
          Mark Thomas made changes -
          Workflow: jira [ 12399791 ] → Default workflow, editable Closed status [ 12562974 ]
          Mark Thomas made changes -
          Workflow: Default workflow, editable Closed status [ 12562974 ] → jira [ 12583849 ]
          Uwe Schindler made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]

            People

            • Assignee:
              Grant Ingersoll
            • Reporter:
              Grant Ingersoll
            • Votes:
              0
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:
