Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      It will be nice to have access to payloads when doing SpanQuerys.

      See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and http://www.gossamer-threads.com/lists/lucene/java-dev/51134

      Current API, added to Spans.java is below. I will try to post a patch as soon as I can figure out how to make it work for unordered spans (I believe I have all the other cases working).

       /**
         * Returns the payload data for the current span.
         * This is invalid until {@link #next()} is called for
         * the first time.
         * This method must not be called more than once after each call
         * of {@link #next()}. However, payloads are loaded lazily,
         * so if the payload data for the current position is not needed,
         * this method may not be called at all for performance reasons.<br>
         * <br>
         * <p><font color="#FF0000">
         * WARNING: The status of the <b>Payloads</b> feature is experimental.
         * The APIs introduced here might change in the future and will not be
         * supported anymore in such a case.</font>
         *
         * @return a List of byte arrays containing the data of this payload
         * @throws IOException
         */
        // TODO: Remove warning after API has been finalized
        List/*<byte[]>*/ getPayload() throws IOException;
      
        /**
         * Checks if a payload can be loaded at this position.
         * <p/>
         * Payloads can only be loaded once per call to
         * {@link #next()}.
         * <p/>
         * <p><font color="#FF0000">
         * WARNING: The status of the <b>Payloads</b> feature is experimental.
         * The APIs introduced here might change in the future and will not be
         * supported anymore in such a case.</font>
         *
         * @return true if there is a payload available at this position that can be loaded
         */
        // TODO: Remove warning after API has been finalized
        public boolean isPayloadAvailable();
      
      1. LUCENE-1001.patch
        56 kB
        Mark Miller
      2. LUCENE-1001.patch
        58 kB
        Grant Ingersoll
      3. LUCENE-1001.patch
        55 kB
        Mark Miller
      4. LUCENE-1001.patch
        55 kB
        Mark Miller
      5. LUCENE-1001.patch
        46 kB
        Mark Miller
      6. LUCENE-1001.patch
        43 kB
        Mark Miller
      7. LUCENE-1001.patch
        44 kB
        Grant Ingersoll
      8. LUCENE-1001.patch
        43 kB
        Grant Ingersoll
      9. LUCENE-1001-fix.patch
        7 kB
        Mark Miller

        Issue Links

          Activity

          Hide
          Paul Elschot added a comment -

          Could you put this in a subclass of Spans?

          A little while ago I suggested to add a score() method to Spans to summarize the influence of subspans on the span score, and that would be another direction for a subclass.

          Show
          Paul Elschot added a comment - Could you put this in a subclass of Spans? A little while ago I suggested to add a score() method to Spans to summarize the influence of subspans on the span score, and that would be another direction for a subclass.
          Hide
          Grant Ingersoll added a comment -

          Do you mean in a separate interface? I suppose I should for backward compatibility so that people with existing Span implementations outside of Lucene aren't broken.

          Show
          Grant Ingersoll added a comment - Do you mean in a separate interface? I suppose I should for backward compatibility so that people with existing Span implementations outside of Lucene aren't broken.
          Hide
          Grant Ingersoll added a comment -

          This also blurs the line more between payloads and Spans, I suggest we drop the payloads package and move the BoostingTermQuery into the Spans package. I originally thought there would be more distinction between the two packages, but that seems less likely now.

          Since payloads are still marked as experimental, does anyone see a problem with moving it? I suppose it could be deprecated and copied since it has been released.

          Show
          Grant Ingersoll added a comment - This also blurs the line more between payloads and Spans, I suggest we drop the payloads package and move the BoostingTermQuery into the Spans package. I originally thought there would be more distinction between the two packages, but that seems less likely now. Since payloads are still marked as experimental, does anyone see a problem with moving it? I suppose it could be deprecated and copied since it has been released.
          Hide
          Paul Elschot added a comment -

          My mistake, I thought Spans was an abstract class, but it is an interface.
          That also means that Spans should not be changed at all, interfaces are forever.

          Something like this could do nicely:

          public interface PayloadSpans extends Spans

          { // as above }

          I'd prefer the payload "core" classes to stay in their own package search/payload
          because they may well turn out to be useful in other circumstances, for example
          as a way to avoid disjunctions with many terms.
          At the moment I have no preference for a package for the PayloadSpans above,
          it could be search.spans or search.payloads.

          Show
          Paul Elschot added a comment - My mistake, I thought Spans was an abstract class, but it is an interface. That also means that Spans should not be changed at all, interfaces are forever. Something like this could do nicely: public interface PayloadSpans extends Spans { // as above } I'd prefer the payload "core" classes to stay in their own package search/payload because they may well turn out to be useful in other circumstances, for example as a way to avoid disjunctions with many terms. At the moment I have no preference for a package for the PayloadSpans above, it could be search.spans or search.payloads.
          Hide
          Grant Ingersoll added a comment -

          public interface PayloadSpans extends Spans

          Unknown macro: { // as above }

          I think this is problematic too, unfortunately, since many spans actually contain other spans, so there is no way to safely cast even in the internal implementations.

          Alternative might be to add SpanQuery.getPayloadSpans() but that is ugly, too.

          I wish there was an equivalent way to deprecated that allowed one to tell people new methods are coming, but that won't break the existing interface. Semantics of it would need to be figured out, but it would be useful here to just be able to let people know that we want to add to the Spans interface, but they just get a warning when compiling until we make it official.

          I suppose the right thing to do if we really want this to work is to deprecate Spans and SpanQuery.getSpans() and introduce a new form of Spans (maybe as an abstract class this time?)

          Or am I missing something that provides a clearer way of doing this?

          Show
          Grant Ingersoll added a comment - public interface PayloadSpans extends Spans Unknown macro: { // as above } I think this is problematic too, unfortunately, since many spans actually contain other spans, so there is no way to safely cast even in the internal implementations. Alternative might be to add SpanQuery.getPayloadSpans() but that is ugly, too. I wish there was an equivalent way to deprecated that allowed one to tell people new methods are coming, but that won't break the existing interface. Semantics of it would need to be figured out, but it would be useful here to just be able to let people know that we want to add to the Spans interface, but they just get a warning when compiling until we make it official. I suppose the right thing to do if we really want this to work is to deprecate Spans and SpanQuery.getSpans() and introduce a new form of Spans (maybe as an abstract class this time?) Or am I missing something that provides a clearer way of doing this?
          Hide
          Paul Elschot added a comment -

          I think I would just go ahead and create a parallel class hierarchy starting from class PayloadSpanQuery with a getPayloadSpans() method, and try and use delegation to the various SpanQueries and their Spans as much as possible.
          That means you would end up with this a few times:

          return new PayLoadSpans ()

          { Spans delegatedTo = aFinalSpans; .... some Spans methods directly delegated... }

          ;

          When this delegation happens too often, it could even be factored out into its own superclass.
          If the delegation turns out to be a performance problem it might be inlined, but that would mean code duplication.

          For the rest, in case you need some existing SpanQuery private methods you could change them to package private, and move your classes to the search.spans package for that reason.

          Show
          Paul Elschot added a comment - I think I would just go ahead and create a parallel class hierarchy starting from class PayloadSpanQuery with a getPayloadSpans() method, and try and use delegation to the various SpanQueries and their Spans as much as possible. That means you would end up with this a few times: return new PayLoadSpans () { Spans delegatedTo = aFinalSpans; .... some Spans methods directly delegated... } ; When this delegation happens too often, it could even be factored out into its own superclass. If the delegation turns out to be a performance problem it might be inlined, but that would mean code duplication. For the rest, in case you need some existing SpanQuery private methods you could change them to package private, and move your classes to the search.spans package for that reason.
          Hide
          Grant Ingersoll added a comment -

          Paul,

          I think I have this implemented for the most part, but am having trouble with the NearSpansUnordered. Can you enlighten me to the use of the PriorityQueue in the file? Not saying its wrong, just saying I don't understand what's going on. Also wondering about how it relates to the ordered member and the first and last SpansCell.

          Thanks,
          Grant

          Show
          Grant Ingersoll added a comment - Paul, I think I have this implemented for the most part, but am having trouble with the NearSpansUnordered. Can you enlighten me to the use of the PriorityQueue in the file? Not saying its wrong, just saying I don't understand what's going on. Also wondering about how it relates to the ordered member and the first and last SpansCell. Thanks, Grant
          Hide
          Paul Elschot added a comment -

          I only wrote NearSpansOrdered. NearSpansOrdered is what is left of the original NearSpans after the ordered case was taken out, and after I specialized NearSpans the unordered case. That means I only simplified the original NearSpans.

          Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between".

          I think you'll only need to implement the Spans interface to use the same priority queue,
          and you'll have to come up with a way to collect all the payloads from your payload spans on a match.
          Probably fairly straightforward, but with Spans there is always unexpected fun waiting around the next corner

          Show
          Paul Elschot added a comment - I only wrote NearSpansOrdered. NearSpansOrdered is what is left of the original NearSpans after the ordered case was taken out, and after I specialized NearSpans the unordered case. That means I only simplified the original NearSpans. Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between". I think you'll only need to implement the Spans interface to use the same priority queue, and you'll have to come up with a way to collect all the payloads from your payload spans on a match. Probably fairly straightforward, but with Spans there is always unexpected fun waiting around the next corner
          Hide
          Grant Ingersoll added a comment -

          I have it implemented for the Ordered case (and all other Spans) but will need to dig in deeper for the unordered case. Worst case, I can pop the queue from min() to max, but I then have to reconstruct the queue and that doesn't seem good. Hopefully, I can dig in soon, as I would like to get this in 2.3 along with the other Payload stuff we are discussing since it sounds like that API is firming up.

          Show
          Grant Ingersoll added a comment - I have it implemented for the Ordered case (and all other Spans) but will need to dig in deeper for the unordered case. Worst case, I can pop the queue from min() to max, but I then have to reconstruct the queue and that doesn't seem good. Hopefully, I can dig in soon, as I would like to get this in 2.3 along with the other Payload stuff we are discussing since it sounds like that API is firming up.
          Hide
          Paul Elschot added a comment -

          So far, it's easier than that: when they match, they all match, so you only need to keep the input Spans around in List or whatever. Then use them all as a source for your payloads.

          Show
          Paul Elschot added a comment - So far, it's easier than that: when they match, they all match, so you only need to keep the input Spans around in List or whatever. Then use them all as a source for your payloads.
          Hide
          Grant Ingersoll added a comment -

          Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between".

          Would it be simpler to just use a SortedSet? Then we could iterate w/o losing the sort, right? Would this be faster since we wouldn't have to do the heap operations?

          Show
          Grant Ingersoll added a comment - Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between". Would it be simpler to just use a SortedSet? Then we could iterate w/o losing the sort, right? Would this be faster since we wouldn't have to do the heap operations?
          Hide
          Doug Cutting added a comment -

          > Would it be simpler to just use a SortedSet?

          TreeMap is slower than a PriorityQueue for this. With PriorityQueue, insertions and deletions do not allocate new objects. And, if some items are much more frequent than others, using adjustTop() instead of inserting and deleting makes merges run much faster, since most updates are then considerably faster than log.

          Show
          Doug Cutting added a comment - > Would it be simpler to just use a SortedSet? TreeMap is slower than a PriorityQueue for this. With PriorityQueue, insertions and deletions do not allocate new objects. And, if some items are much more frequent than others, using adjustTop() instead of inserting and deleting makes merges run much faster, since most updates are then considerably faster than log .
          Hide
          Grant Ingersoll added a comment -

          Sure, but how do I get access to the position payloads in the order that they occur in the PQ? I have to go and pop them all of the PQ or I need to maintain a separate PQ for the Payloads so that when I go to get a payload for a span, I can iterate over all the items by calling PQ.pop() but then I have to rebuild it again if getPayload is called again, right?

          I think I need to take a break and come back to this after some Turkey...

          Show
          Grant Ingersoll added a comment - Sure, but how do I get access to the position payloads in the order that they occur in the PQ? I have to go and pop them all of the PQ or I need to maintain a separate PQ for the Payloads so that when I go to get a payload for a span, I can iterate over all the items by calling PQ.pop() but then I have to rebuild it again if getPayload is called again, right? I think I need to take a break and come back to this after some Turkey...
          Hide
          Doug Cutting added a comment -

          > how do I get access to the position payloads in the order that they occur in the PQ?

          Why do you need them in that order? In the API you propose in the description of this issue, there's no clear association between the payloads returned and the query terms. So I don't yet see how the order of the payloads is useful.

          You could pretty easily return the list of payloads along with their positions by iterating through the list of sub-queries. The problem is that providing that sorted by position is expensive. Perhaps you could leave any such sorting, if required, to the application?

          Show
          Doug Cutting added a comment - > how do I get access to the position payloads in the order that they occur in the PQ? Why do you need them in that order? In the API you propose in the description of this issue, there's no clear association between the payloads returned and the query terms. So I don't yet see how the order of the payloads is useful. You could pretty easily return the list of payloads along with their positions by iterating through the list of sub-queries. The problem is that providing that sorted by position is expensive. Perhaps you could leave any such sorting, if required, to the application?
          Hide
          Grant Ingersoll added a comment -

          Yeah, I was thinking this as a possibility, but thought people may want to rely on the ordering, even if they mark it as unordered. I guess I will just have to properly document it, b/c I agree it is expensive to sort.

          I will submit a patch shortly.

          Show
          Grant Ingersoll added a comment - Yeah, I was thinking this as a possibility, but thought people may want to rely on the ordering, even if they mark it as unordered. I guess I will just have to properly document it, b/c I agree it is expensive to sort. I will submit a patch shortly.
          Hide
          Grant Ingersoll added a comment -

          First draft of a patch for this issue. Need to expand/double check the tests and cleanup a few things before committing, but wanted to get opinion on the API.

          Also, would like to see if NearSpansOrdered can be optimized a bit to not load the payloads in the case where the user doesn't ask for PayloadSpans. The other implementations shouldn't have this issue.

          Show
          Grant Ingersoll added a comment - First draft of a patch for this issue. Need to expand/double check the tests and cleanup a few things before committing, but wanted to get opinion on the API. Also, would like to see if NearSpansOrdered can be optimized a bit to not load the payloads in the case where the user doesn't ask for PayloadSpans. The other implementations shouldn't have this issue.
          Hide
          Grant Ingersoll added a comment -

          There is an issue w/ this patch related to unordered, overlapping spans that still needs to be fixed. Will try to get an updated patch out soon.

          Show
          Grant Ingersoll added a comment - There is an issue w/ this patch related to unordered, overlapping spans that still needs to be fixed. Will try to get an updated patch out soon.
          Hide
          Grant Ingersoll added a comment -

          Fixes the unordered problem. Still needs more testing, but I believe it is working

          Show
          Grant Ingersoll added a comment - Fixes the unordered problem. Still needs more testing, but I believe it is working
          Hide
          Grant Ingersoll added a comment -

          I don't think that patch is compatible w/ anything It was a
          rough sketch that never properly worked. I put it up there in the
          hopes that maybe someone would have some more insight to offer.

          --------------------------
          Grant Ingersoll

          Lucene Helpful Hints:
          http://wiki.apache.org/lucene-java/BasicsOfPerformance
          http://wiki.apache.org/lucene-java/LuceneFAQ

          Show
          Grant Ingersoll added a comment - I don't think that patch is compatible w/ anything It was a rough sketch that never properly worked. I put it up there in the hopes that maybe someone would have some more insight to offer. -------------------------- Grant Ingersoll Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
          Hide
          Mark Miller added a comment -

          Anyone still have a use case for this issue?

          Here is a patch that I think fixes the orderedspans problem - need to test further, but that may be the last piece on those parts.

          Beyond that, I think that a span uses only one clause to determine if a payload is available for the whole span - it seems to me we have to ask every clause.

          As far as the ordering of returned payloads, I don't see how they can be ordered by the user without having some info in the payload itself - I mean its just going to be a collection of byte arrays right? How could you order them? Seems at most you can say those payloads came from the given span and use them all.

          The more I look at spans the less I understand them I think <g> Its like repeating certain words over and over.

          Show
          Mark Miller added a comment - Anyone still have a use case for this issue? Here is a patch that I think fixes the orderedspans problem - need to test further, but that may be the last piece on those parts. Beyond that, I think that a span uses only one clause to determine if a payload is available for the whole span - it seems to me we have to ask every clause. As far as the ordering of returned payloads, I don't see how they can be ordered by the user without having some info in the payload itself - I mean its just going to be a collection of byte arrays right? How could you order them? Seems at most you can say those payloads came from the given span and use them all. The more I look at spans the less I understand them I think <g> Its like repeating certain words over and over.
          Hide
          Mark Miller added a comment -

          Without the absolute paths in the patch this time (get it together eclipse)

          Show
          Mark Miller added a comment - Without the absolute paths in the patch this time (get it together eclipse)
          Hide
          Mark Miller added a comment -

          Fixes the unorderedspan ispostionavailable issue for good measure.

          I think we have to give the payloads back unsorted - there are probably cases where you could just use all the payloads for a span rather than per term, so we might as well not incur a penalty there. If you need per term, you can put the position into the payload (pretty simple) and then just sort yourself.

          Show
          Mark Miller added a comment - Fixes the unorderedspan ispostionavailable issue for good measure. I think we have to give the payloads back unsorted - there are probably cases where you could just use all the payloads for a span rather than per term, so we might as well not incur a penalty there. If you need per term, you can put the position into the payload (pretty simple) and then just sort yourself.
          Hide
          Mark Miller added a comment -

          I'll correct it if I have anything to add later, but the new isPayloadAvailable on the unordered spans should start at min(), not first.

          Show
          Mark Miller added a comment - I'll correct it if I have anything to add later, but the new isPayloadAvailable on the unordered spans should start at min(), not first.
          Hide
          Mark Miller added a comment -

          Might as well keep striking while the iron's hot. This fixes the first to min() issue, I think it adds an isSpansAvailable call that had been commented out, adds a couple tests, tightens a test, and adds a new class that takes a single document and returns all of payloads from the terms that match a given Query object (so you could do a mix of span/query) - may or may not be useful someday.

          Show
          Mark Miller added a comment - Might as well keep striking while the iron's hot. This fixes the first to min() issue, I think it adds an isSpansAvailable call that had been commented out, adds a couple tests, tightens a test, and adds a new class that takes a single document and returns all of payloads from the terms that match a given Query object (so you could do a mix of span/query) - may or may not be useful someday.
          Hide
          Grant Ingersoll added a comment -

          TestBoostingTermQuery.testNoPayload now fails for me.

          Also noticed some extraneous System.out.println.

          Show
          Grant Ingersoll added a comment - TestBoostingTermQuery.testNoPayload now fails for me. Also noticed some extraneous System.out.println.
          Hide
          Mark Miller added a comment -

          I see what happened with the boosting test - the old patch didn't apply cleanly for me to the trunk for that test class - I looked for the issue but it looked clean so I left it to deal with later - turns out one or two lines were mangled. New patch that fixes that.

          Also turned off DEBUG for the payloads test class so that will drop the System.outs.

          Pretty clean now.

          Show
          Mark Miller added a comment - I see what happened with the boosting test - the old patch didn't apply cleanly for me to the trunk for that test class - I looked for the issue but it looked clean so I left it to deal with later - turns out one or two lines were mangled. New patch that fixes that. Also turned off DEBUG for the payloads test class so that will drop the System.outs. Pretty clean now.
          Hide
          Grant Ingersoll added a comment -

          Added to CHANGES.txt

          Made the SpanQuery.getPayloadSpans() method an empty implementation that returns null instead of being abstract, so as not to break anyone that extends SpanQuery. All Lucene span implementations override it.

          Also added in license headers.

          Plan to commit in a day or two

          Show
          Grant Ingersoll added a comment - Added to CHANGES.txt Made the SpanQuery.getPayloadSpans() method an empty implementation that returns null instead of being abstract, so as not to break anyone that extends SpanQuery. All Lucene span implementations override it. Also added in license headers. Plan to commit in a day or two
          Hide
          Mark Miller added a comment -

          Took one last look through for final cleanup - removed an unused import and some unneeded commented code.

          Show
          Mark Miller added a comment - Took one last look through for final cleanup - removed an unused import and some unneeded commented code.
          Hide
          Grant Ingersoll added a comment -

          Committed revision 687379.

          Show
          Grant Ingersoll added a comment - Committed revision 687379.
          Hide
          Jonathan Mamou added a comment -

          Hi,

          I use getPayloadSpans to get the spans (PayloadSpans spans = sq.getPayloadSpans(this.getIndexReader()) and for each span I extract the relevant payload using spans.getPayload() (after calling spans.next()).

          I have tested it for a SpanNearQuery, I have stored the offset in the payload (for testing).

          It seems to me that the payloads returned for the span (spans.getPayload()) does not match the current span (given by span.next()). In fact, the offset returned by the payload are different from the offset returned by span.start().
          It occurs when they are multiple occurrences of the terms.

          Are you aware of this bug?

          Thanks,

          Jonathan

          Show
          Jonathan Mamou added a comment - Hi, I use getPayloadSpans to get the spans (PayloadSpans spans = sq.getPayloadSpans(this.getIndexReader()) and for each span I extract the relevant payload using spans.getPayload() (after calling spans.next()). I have tested it for a SpanNearQuery, I have stored the offset in the payload (for testing). It seems to me that the payloads returned for the span (spans.getPayload()) does not match the current span (given by span.next()). In fact, the offset returned by the payload are different from the offset returned by span.start(). It occurs when they are multiple occurrences of the terms. Are you aware of this bug? Thanks, Jonathan
          Hide
          Mark Miller added a comment -

          So are you finding the payloads not at the right term if the terms match ie

          term1 term2 term1

          You might get the second term1's offsets for the first term1 and vice versa?

          Show
          Mark Miller added a comment - So are you finding the payloads not at the right term if the terms match ie term1 term2 term1 You might get the second term1's offsets for the first term1 and vice versa?
          Hide
          Jonathan Mamou added a comment -

          In my document, I have ".... term1 ... term1 term2 ..."
          offset of term1: 10, 20
          offset of term2: 21

          When running the query SpanNearQuery(term1 term2),
          according to span.start, I get 20
          while I get 10 and 21 when reading the payloads of the span.
          I would expect to get 20 and 21when reading the payloads of the span.

          Show
          Jonathan Mamou added a comment - In my document, I have ".... term1 ... term1 term2 ..." offset of term1: 10, 20 offset of term2: 21 When running the query SpanNearQuery(term1 term2), according to span.start, I get 20 while I get 10 and 21 when reading the payloads of the span. I would expect to get 20 and 21when reading the payloads of the span.
          Hide
          Jonathan Mamou added a comment -

          Hi,

          Here is the relevant code.

          I would expect to obtain
          10
          pos: 10
          pos: 11

          while I obtain
          10
          pos: 0
          pos: 11

          import java.io.StringReader;
          import java.util.Collection;
          import java.util.Iterator;

          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.TopDocs;
          import org.apache.lucene.search.spans.PayloadSpans;
          import org.apache.lucene.search.spans.SpanNearQuery;
          import org.apache.lucene.search.spans.SpanQuery;
          import org.apache.lucene.search.spans.SpanTermQuery;

          public class Test {

          public static void main (String args[]) throws Exception{
          IndexWriter writer = new IndexWriter(args[0], new
          TestPayloadAnalyzer(), IndexWriter.MaxFieldLength.LIMITED);
          Document doc = new Document();
          doc.add();new Field("content", new StringReader("a b c d e f g
          h i j a k")));
          writer.addDocument(doc);
          writer.close();

          IndexSearcher is = new IndexSearcher(args[0]);
          SpanTermQuery stq1 = new SpanTermQuery(new Term("content", "a"
          ));
          SpanTermQuery stq2 = new SpanTermQuery(new Term("content", "k"
          ));
          SpanQuery[] sqs =

          {stq1,stq2}

          ;
          SpanNearQuery snq = new SpanNearQuery(sqs,1,true);
          PayloadSpans spans = snq.getPayloadSpans(is.getIndexReader());

          TopDocs topDocs = is.search(snq,1);

          for (int i = 0; i < topDocs.scoreDocs.length; i++) {
          while) (spans.next()) {
          System.out.println(spans.start());
          Collection<byte[]> payloads = spans.getPayload();
          for (Iterator<byte[]> it = payloads.iterator();
          it.hasNext()

          { System.out.println(new String(it.next())); }

          }}
          }
          }
          }}

          -------------------------------------------------------------------------------------------------------------------------------------
          import java.io.IOException;
          import java.io.Reader;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.LowerCaseTokenizer;
          import org.apache.lucene.analysis.Token;
          import org.apache.lucene.analysis.TokenFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.index.Payload;

          public class TestPayloadAnalyzer extends Analyzer {

          public TokenStream tokenStream(String fieldName, Reader reader)

          { TokenStream result = new LowerCaseTokenizer(reader); result = new PayloadFilter(result, fieldName); return result; }

          }

          class PayloadFilter extends TokenFilter {
          String fieldName;
          int pos;

          public PayloadFilter(TokenStream input, String fieldName)

          { super(input); this.fieldName = fieldName; pos = 0; }

          public Token next() throws IOException {
          Token result = input.next();
          if (result != null)

          { String token = new String(result.termBuffer(), 0, result.termLength ()); result.setPayload(),new Payload(("pos: " + pos).getBytes())); pos += result.getPositionIncrement(); }

          return} result;
          }
          }

          Jonathan

          Show
          Jonathan Mamou added a comment - Hi, Here is the relevant code. I would expect to obtain 10 pos: 10 pos: 11 while I obtain 10 pos: 0 pos: 11 import java.io.StringReader; import java.util.Collection; import java.util.Iterator; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.Term; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.TopDocs; import org.apache.lucene.search.spans.PayloadSpans; import org.apache.lucene.search.spans.SpanNearQuery; import org.apache.lucene.search.spans.SpanQuery; import org.apache.lucene.search.spans.SpanTermQuery; public class Test { public static void main (String args[]) throws Exception{ IndexWriter writer = new IndexWriter(args [0] , new TestPayloadAnalyzer(), IndexWriter.MaxFieldLength.LIMITED); Document doc = new Document(); doc.add();new Field("content", new StringReader("a b c d e f g h i j a k"))); writer.addDocument(doc); writer.close(); IndexSearcher is = new IndexSearcher(args [0] ) ; SpanTermQuery stq1 = new SpanTermQuery(new Term("content", "a" )); SpanTermQuery stq2 = new SpanTermQuery(new Term("content", "k" )); SpanQuery[] sqs = {stq1,stq2} ; SpanNearQuery snq = new SpanNearQuery(sqs,1,true); PayloadSpans spans = snq.getPayloadSpans(is.getIndexReader()); TopDocs topDocs = is.search(snq,1); for (int i = 0; i < topDocs.scoreDocs.length; i++) { while) (spans.next()) { System.out.println(spans.start()); Collection<byte[]> payloads = spans.getPayload(); for (Iterator<byte[]> it = payloads.iterator(); it.hasNext() { System.out.println(new String(it.next())); } }} } } }} ------------------------------------------------------------------------------------------------------------------------------------- import java.io.IOException; import java.io.Reader; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.LowerCaseTokenizer; import org.apache.lucene.analysis.Token; import org.apache.lucene.analysis.TokenFilter; import org.apache.lucene.analysis.TokenStream; import org.apache.lucene.index.Payload; public class TestPayloadAnalyzer extends Analyzer { public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream result = new LowerCaseTokenizer(reader); result = new PayloadFilter(result, fieldName); return result; } } class PayloadFilter extends TokenFilter { String fieldName; int pos; public PayloadFilter(TokenStream input, String fieldName) { super(input); this.fieldName = fieldName; pos = 0; } public Token next() throws IOException { Token result = input.next(); if (result != null) { String token = new String(result.termBuffer(), 0, result.termLength ()); result.setPayload(),new Payload(("pos: " + pos).getBytes())); pos += result.getPositionIncrement(); } return} result; } } Jonathan
          Hide
          Mark Miller added a comment -

          Okay, I still understand like 2% of spans, but I think I have fixed the bug.

          After finding a match, but before finding a min match, we were pulling the payload - works fine when the match is the min match, but otherwise we actually have to wait to get the payload until we have crunched in on the min match. I had an idea of this before, and the code before I touched it tried to grab the payloads at this point - the problem is, in finding the min match, you've often advanced passed the term position of interest to find out there was no such min match. So you have to save the possible payload ahead of time, and either find a new one or use the possible saved one. Sucks to have to add extra loading, but at the moment I don't see how to do it differently (I admittedly can't see much in spans). Thats all partly a guess, partly probably true.

          Non the less, this patch handles the previous test cases, plus the bug case reported above. I have also added a modified version of the given test for the bug to the span battery of tests.

          Thanks Jonathan!

          Show
          Mark Miller added a comment - Okay, I still understand like 2% of spans, but I think I have fixed the bug. After finding a match, but before finding a min match, we were pulling the payload - works fine when the match is the min match, but otherwise we actually have to wait to get the payload until we have crunched in on the min match. I had an idea of this before, and the code before I touched it tried to grab the payloads at this point - the problem is, in finding the min match, you've often advanced passed the term position of interest to find out there was no such min match. So you have to save the possible payload ahead of time, and either find a new one or use the possible saved one. Sucks to have to add extra loading, but at the moment I don't see how to do it differently (I admittedly can't see much in spans). Thats all partly a guess, partly probably true. Non the less, this patch handles the previous test cases, plus the bug case reported above. I have also added a modified version of the given test for the bug to the span battery of tests. Thanks Jonathan!
          Hide
          Mark Miller added a comment -

          Whats the best procedure JIRA wise on this? Reopen this issue or start a new issue? If we reopen this issue, how is the bug fix tracked in changes?

          Show
          Mark Miller added a comment - Whats the best procedure JIRA wise on this? Reopen this issue or start a new issue? If we reopen this issue, how is the bug fix tracked in changes?
          Hide
          Grant Ingersoll added a comment -

          Open a new one.

          Show
          Grant Ingersoll added a comment - Open a new one.

            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Grant Ingersoll
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development