Issue Details (XML | Word | Printable)

Key: LUCENE-1001
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Grant Ingersoll
Reporter: Grant Ingersoll
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Add Payload retrieval to Spans

Created: 16/Sep/07 02:48 AM   Updated: 21/Nov/08 09:46 PM
Return to search "STDCXX MSVC issues"
Component/s: Search
Affects Version/s: None
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1001-fix.patch 2008-11-21 02:37 AM Mark Miller 7 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2008-08-19 09:43 PM Mark Miller 56 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2008-08-19 02:39 PM Grant Ingersoll 58 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2008-08-11 08:56 PM Mark Miller 55 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2008-08-11 01:22 AM Mark Miller 55 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2008-08-10 03:47 AM Mark Miller 46 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2008-08-09 05:02 PM Mark Miller 43 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2007-11-26 02:26 PM Grant Ingersoll 44 kB
Text File Licensed for inclusion in ASF works LUCENE-1001.patch 2007-11-25 05:52 PM Grant Ingersoll 43 kB
Issue Links:
Reference
 

Lucene Fields: Patch Available
Resolution Date: 20/Aug/08 04:07 PM


 Description  « Hide
It will be nice to have access to payloads when doing SpanQuerys.

See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and http://www.gossamer-threads.com/lists/lucene/java-dev/51134

Current API, added to Spans.java is below. I will try to post a patch as soon as I can figure out how to make it work for unordered spans (I believe I have all the other cases working).

 /**
   * Returns the payload data for the current span.
   * This is invalid until {@link #next()} is called for
   * the first time.
   * This method must not be called more than once after each call
   * of {@link #next()}. However, payloads are loaded lazily,
   * so if the payload data for the current position is not needed,
   * this method may not be called at all for performance reasons.<br>
   * <br>
   * <p><font color="#FF0000">
   * WARNING: The status of the <b>Payloads</b> feature is experimental.
   * The APIs introduced here might change in the future and will not be
   * supported anymore in such a case.</font>
   *
   * @return a List of byte arrays containing the data of this payload
   * @throws IOException
   */
  // TODO: Remove warning after API has been finalized
  List/*<byte[]>*/ getPayload() throws IOException;

  /**
   * Checks if a payload can be loaded at this position.
   * <p/>
   * Payloads can only be loaded once per call to
   * {@link #next()}.
   * <p/>
   * <p><font color="#FF0000">
   * WARNING: The status of the <b>Payloads</b> feature is experimental.
   * The APIs introduced here might change in the future and will not be
   * supported anymore in such a case.</font>
   *
   * @return true if there is a payload available at this position that can be loaded
   */
  // TODO: Remove warning after API has been finalized
  public boolean isPayloadAvailable();


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Paul Elschot added a comment - 16/Sep/07 08:49 AM
Could you put this in a subclass of Spans?

A little while ago I suggested to add a score() method to Spans to summarize the influence of subspans on the span score, and that would be another direction for a subclass.


Grant Ingersoll added a comment - 16/Sep/07 12:07 PM
Do you mean in a separate interface? I suppose I should for backward compatibility so that people with existing Span implementations outside of Lucene aren't broken.

Grant Ingersoll added a comment - 16/Sep/07 12:11 PM
This also blurs the line more between payloads and Spans, I suggest we drop the payloads package and move the BoostingTermQuery into the Spans package. I originally thought there would be more distinction between the two packages, but that seems less likely now.

Since payloads are still marked as experimental, does anyone see a problem with moving it? I suppose it could be deprecated and copied since it has been released.


Paul Elschot added a comment - 16/Sep/07 02:03 PM
My mistake, I thought Spans was an abstract class, but it is an interface.
That also means that Spans should not be changed at all, interfaces are forever.

Something like this could do nicely:

public interface PayloadSpans extends Spans {
// as above
}

I'd prefer the payload "core" classes to stay in their own package search/payload
because they may well turn out to be useful in other circumstances, for example
as a way to avoid disjunctions with many terms.
At the moment I have no preference for a package for the PayloadSpans above,
it could be search.spans or search.payloads.


Grant Ingersoll added a comment - 23/Oct/07 12:29 AM

public interface PayloadSpans extends Spans

Unknown macro: { // as above }

I think this is problematic too, unfortunately, since many spans actually contain other spans, so there is no way to safely cast even in the internal implementations.

Alternative might be to add SpanQuery.getPayloadSpans() but that is ugly, too.

I wish there was an equivalent way to deprecated that allowed one to tell people new methods are coming, but that won't break the existing interface. Semantics of it would need to be figured out, but it would be useful here to just be able to let people know that we want to add to the Spans interface, but they just get a warning when compiling until we make it official.

I suppose the right thing to do if we really want this to work is to deprecate Spans and SpanQuery.getSpans() and introduce a new form of Spans (maybe as an abstract class this time?)

Or am I missing something that provides a clearer way of doing this?


Paul Elschot added a comment - 23/Oct/07 05:17 PM
I think I would just go ahead and create a parallel class hierarchy starting from class PayloadSpanQuery with a getPayloadSpans() method, and try and use delegation to the various SpanQueries and their Spans as much as possible.
That means you would end up with this a few times:

return new PayLoadSpans () {
Spans delegatedTo = aFinalSpans;
.... some Spans methods directly delegated...
};

When this delegation happens too often, it could even be factored out into its own superclass.
If the delegation turns out to be a performance problem it might be inlined, but that would mean code duplication.

For the rest, in case you need some existing SpanQuery private methods you could change them to package private, and move your classes to the search.spans package for that reason.


Grant Ingersoll added a comment - 19/Nov/07 07:10 PM
Paul,

I think I have this implemented for the most part, but am having trouble with the NearSpansUnordered. Can you enlighten me to the use of the PriorityQueue in the file? Not saying its wrong, just saying I don't understand what's going on. Also wondering about how it relates to the ordered member and the first and last SpansCell.

Thanks,
Grant


Paul Elschot added a comment - 19/Nov/07 09:08 PM
I only wrote NearSpansOrdered. NearSpansOrdered is what is left of the original NearSpans after the ordered case was taken out, and after I specialized NearSpans the unordered case. That means I only simplified the original NearSpans.

Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between".

I think you'll only need to implement the Spans interface to use the same priority queue,
and you'll have to come up with a way to collect all the payloads from your payload spans on a match.
Probably fairly straightforward, but with Spans there is always unexpected fun waiting around the next corner


Grant Ingersoll added a comment - 19/Nov/07 09:16 PM
I have it implemented for the Ordered case (and all other Spans) but will need to dig in deeper for the unordered case. Worst case, I can pop the queue from min() to max, but I then have to reconstruct the queue and that doesn't seem good. Hopefully, I can dig in soon, as I would like to get this in 2.3 along with the other Payload stuff we are discussing since it sounds like that API is firming up.

Paul Elschot added a comment - 19/Nov/07 10:54 PM
So far, it's easier than that: when they match, they all match, so you only need to keep the input Spans around in List or whatever. Then use them all as a source for your payloads.

Grant Ingersoll added a comment - 20/Nov/07 03:03 PM

Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between".

Would it be simpler to just use a SortedSet? Then we could iterate w/o losing the sort, right? Would this be faster since we wouldn't have to do the heap operations?


Doug Cutting added a comment - 20/Nov/07 07:50 PM
> Would it be simpler to just use a SortedSet?

TreeMap is slower than a PriorityQueue for this. With PriorityQueue, insertions and deletions do not allocate new objects. And, if some items are much more frequent than others, using adjustTop() instead of inserting and deleting makes merges run much faster, since most updates are then considerably faster than log.


Grant Ingersoll added a comment - 20/Nov/07 10:46 PM
Sure, but how do I get access to the position payloads in the order that they occur in the PQ? I have to go and pop them all of the PQ or I need to maintain a separate PQ for the Payloads so that when I go to get a payload for a span, I can iterate over all the items by calling PQ.pop() but then I have to rebuild it again if getPayload is called again, right?

I think I need to take a break and come back to this after some Turkey...


Doug Cutting added a comment - 21/Nov/07 05:37 PM
> how do I get access to the position payloads in the order that they occur in the PQ?

Why do you need them in that order? In the API you propose in the description of this issue, there's no clear association between the payloads returned and the query terms. So I don't yet see how the order of the payloads is useful.

You could pretty easily return the list of payloads along with their positions by iterating through the list of sub-queries. The problem is that providing that sorted by position is expensive. Perhaps you could leave any such sorting, if required, to the application?


Grant Ingersoll added a comment - 25/Nov/07 02:20 PM
Yeah, I was thinking this as a possibility, but thought people may want to rely on the ordering, even if they mark it as unordered. I guess I will just have to properly document it, b/c I agree it is expensive to sort.

I will submit a patch shortly.


Grant Ingersoll added a comment - 25/Nov/07 05:52 PM
First draft of a patch for this issue. Need to expand/double check the tests and cleanup a few things before committing, but wanted to get opinion on the API.

Also, would like to see if NearSpansOrdered can be optimized a bit to not load the payloads in the case where the user doesn't ask for PayloadSpans. The other implementations shouldn't have this issue.


Grant Ingersoll added a comment - 26/Nov/07 12:06 PM
There is an issue w/ this patch related to unordered, overlapping spans that still needs to be fixed. Will try to get an updated patch out soon.

Grant Ingersoll added a comment - 26/Nov/07 02:26 PM
Fixes the unordered problem. Still needs more testing, but I believe it is working

Grant Ingersoll added a comment - 30/Apr/08 04:32 PM
I don't think that patch is compatible w/ anything It was a
rough sketch that never properly worked. I put it up there in the
hopes that maybe someone would have some more insight to offer.

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Mark Miller added a comment - 09/Aug/08 04:57 PM
Anyone still have a use case for this issue?

Here is a patch that I think fixes the orderedspans problem - need to test further, but that may be the last piece on those parts.

Beyond that, I think that a span uses only one clause to determine if a payload is available for the whole span - it seems to me we have to ask every clause.

As far as the ordering of returned payloads, I don't see how they can be ordered by the user without having some info in the payload itself - I mean its just going to be a collection of byte arrays right? How could you order them? Seems at most you can say those payloads came from the given span and use them all.

The more I look at spans the less I understand them I think <g> Its like repeating certain words over and over.


Mark Miller added a comment - 09/Aug/08 05:02 PM
Without the absolute paths in the patch this time (get it together eclipse)

Mark Miller added a comment - 10/Aug/08 03:47 AM
Fixes the unorderedspan ispostionavailable issue for good measure.

I think we have to give the payloads back unsorted - there are probably cases where you could just use all the payloads for a span rather than per term, so we might as well not incur a penalty there. If you need per term, you can put the position into the payload (pretty simple) and then just sort yourself.


Mark Miller added a comment - 10/Aug/08 11:46 PM
I'll correct it if I have anything to add later, but the new isPayloadAvailable on the unordered spans should start at min(), not first.

Mark Miller added a comment - 11/Aug/08 01:22 AM
Might as well keep striking while the iron's hot. This fixes the first to min() issue, I think it adds an isSpansAvailable call that had been commented out, adds a couple tests, tightens a test, and adds a new class that takes a single document and returns all of payloads from the terms that match a given Query object (so you could do a mix of span/query) - may or may not be useful someday.

Grant Ingersoll added a comment - 11/Aug/08 08:22 PM
TestBoostingTermQuery.testNoPayload now fails for me.

Also noticed some extraneous System.out.println.


Mark Miller added a comment - 11/Aug/08 08:56 PM
I see what happened with the boosting test - the old patch didn't apply cleanly for me to the trunk for that test class - I looked for the issue but it looked clean so I left it to deal with later - turns out one or two lines were mangled. New patch that fixes that.

Also turned off DEBUG for the payloads test class so that will drop the System.outs.

Pretty clean now.


Grant Ingersoll added a comment - 19/Aug/08 02:39 PM
Added to CHANGES.txt

Made the SpanQuery.getPayloadSpans() method an empty implementation that returns null instead of being abstract, so as not to break anyone that extends SpanQuery. All Lucene span implementations override it.

Also added in license headers.

Plan to commit in a day or two


Mark Miller added a comment - 19/Aug/08 09:43 PM
Took one last look through for final cleanup - removed an unused import and some unneeded commented code.

Grant Ingersoll added a comment - 20/Aug/08 04:07 PM
Committed revision 687379.

Jonathan Mamou added a comment - 16/Nov/08 01:15 PM
Hi,

I use getPayloadSpans to get the spans (PayloadSpans spans = sq.getPayloadSpans(this.getIndexReader()) and for each span I extract the relevant payload using spans.getPayload() (after calling spans.next()).

I have tested it for a SpanNearQuery, I have stored the offset in the payload (for testing).

It seems to me that the payloads returned for the span (spans.getPayload()) does not match the current span (given by span.next()). In fact, the offset returned by the payload are different from the offset returned by span.start().
It occurs when they are multiple occurrences of the terms.

Are you aware of this bug?

Thanks,

Jonathan


Mark Miller added a comment - 16/Nov/08 01:29 PM
So are you finding the payloads not at the right term if the terms match ie

term1 term2 term1

You might get the second term1's offsets for the first term1 and vice versa?


Jonathan Mamou added a comment - 16/Nov/08 02:12 PM
In my document, I have ".... term1 ... term1 term2 ..."
offset of term1: 10, 20
offset of term2: 21

When running the query SpanNearQuery(term1 term2),
according to span.start, I get 20
while I get 10 and 21 when reading the payloads of the span.
I would expect to get 20 and 21when reading the payloads of the span.


Jonathan Mamou added a comment - 17/Nov/08 07:11 PM
Hi,

Here is the relevant code.

I would expect to obtain
10
pos: 10
pos: 11

while I obtain
10
pos: 0
pos: 11

import java.io.StringReader;
import java.util.Collection;
import java.util.Iterator;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.PayloadSpans;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class Test {

public static void main (String args[]) throws Exception{
IndexWriter writer = new IndexWriter(args[0], new
TestPayloadAnalyzer(), IndexWriter.MaxFieldLength.LIMITED);
Document doc = new Document();
doc.add();new Field("content", new StringReader("a b c d e f g
h i j a k")));
writer.addDocument(doc);
writer.close();

IndexSearcher is = new IndexSearcher(args[0]);
SpanTermQuery stq1 = new SpanTermQuery(new Term("content", "a"
));
SpanTermQuery stq2 = new SpanTermQuery(new Term("content", "k"
));
SpanQuery[] sqs = {stq1,stq2};
SpanNearQuery snq = new SpanNearQuery(sqs,1,true);
PayloadSpans spans = snq.getPayloadSpans(is.getIndexReader());

TopDocs topDocs = is.search(snq,1);

for (int i = 0; i < topDocs.scoreDocs.length; i++) {
while) (spans.next()) {
System.out.println(spans.start());
Collection<byte[]> payloads = spans.getPayload();
for (Iterator<byte[]> it = payloads.iterator();
it.hasNext() { System.out.println(new String(it.next())); } }}
}
}
}}

-------------------------------------------------------------------------------------------------------------------------------------
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

public class TestPayloadAnalyzer extends Analyzer {

public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new LowerCaseTokenizer(reader);
result = new PayloadFilter(result, fieldName);
return result;
}
}

class PayloadFilter extends TokenFilter {
String fieldName;
int pos;

public PayloadFilter(TokenStream input, String fieldName) { super(input); this.fieldName = fieldName; pos = 0; }

public Token next() throws IOException {
Token result = input.next();
if (result != null) { String token = new String(result.termBuffer(), 0, result.termLength ()); result.setPayload(),new Payload(("pos: " + pos).getBytes())); pos += result.getPositionIncrement(); } return} result;
}
}

Jonathan


Mark Miller added a comment - 21/Nov/08 02:37 AM
Okay, I still understand like 2% of spans, but I think I have fixed the bug.

After finding a match, but before finding a min match, we were pulling the payload - works fine when the match is the min match, but otherwise we actually have to wait to get the payload until we have crunched in on the min match. I had an idea of this before, and the code before I touched it tried to grab the payloads at this point - the problem is, in finding the min match, you've often advanced passed the term position of interest to find out there was no such min match. So you have to save the possible payload ahead of time, and either find a new one or use the possible saved one. Sucks to have to add extra loading, but at the moment I don't see how to do it differently (I admittedly can't see much in spans). Thats all partly a guess, partly probably true.

Non the less, this patch handles the previous test cases, plus the bug case reported above. I have also added a modified version of the given test for the bug to the span battery of tests.

Thanks Jonathan!


Mark Miller added a comment - 21/Nov/08 11:48 AM
Whats the best procedure JIRA wise on this? Reopen this issue or start a new issue? If we reopen this issue, how is the bug fix tracked in changes?

Grant Ingersoll added a comment - 21/Nov/08 02:04 PM
Open a new one.