|
Do you mean in a separate interface? I suppose I should for backward compatibility so that people with existing Span implementations outside of Lucene aren't broken.
This also blurs the line more between payloads and Spans, I suggest we drop the payloads package and move the BoostingTermQuery into the Spans package. I originally thought there would be more distinction between the two packages, but that seems less likely now.
Since payloads are still marked as experimental, does anyone see a problem with moving it? I suppose it could be deprecated and copied since it has been released. My mistake, I thought Spans was an abstract class, but it is an interface.
That also means that Spans should not be changed at all, interfaces are forever. Something like this could do nicely: public interface PayloadSpans extends Spans { I'd prefer the payload "core" classes to stay in their own package search/payload
I think this is problematic too, unfortunately, since many spans actually contain other spans, so there is no way to safely cast even in the internal implementations. Alternative might be to add SpanQuery.getPayloadSpans() but that is ugly, too. I wish there was an equivalent way to deprecated that allowed one to tell people new methods are coming, but that won't break the existing interface. Semantics of it would need to be figured out, but it would be useful here to just be able to let people know that we want to add to the Spans interface, but they just get a warning when compiling until we make it official. I suppose the right thing to do if we really want this to work is to deprecate Spans and SpanQuery.getSpans() and introduce a new form of Spans (maybe as an abstract class this time?) Or am I missing something that provides a clearer way of doing this? I think I would just go ahead and create a parallel class hierarchy starting from class PayloadSpanQuery with a getPayloadSpans() method, and try and use delegation to the various SpanQueries and their Spans as much as possible.
That means you would end up with this a few times: return new PayLoadSpans () { When this delegation happens too often, it could even be factored out into its own superclass. For the rest, in case you need some existing SpanQuery private methods you could change them to package private, and move your classes to the search.spans package for that reason. Paul,
I think I have this implemented for the most part, but am having trouble with the NearSpansUnordered. Can you enlighten me to the use of the PriorityQueue in the file? Not saying its wrong, just saying I don't understand what's going on. Also wondering about how it relates to the ordered member and the first and last SpansCell. Thanks, I only wrote NearSpansOrdered. NearSpansOrdered is what is left of the original NearSpans after the ordered case was taken out, and after I specialized NearSpans the unordered case. That means I only simplified the original NearSpans.
Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between". I think you'll only need to implement the Spans interface to use the same priority queue, I have it implemented for the Ordered case (and all other Spans) but will need to dig in deeper for the unordered case. Worst case, I can pop the queue from min() to max, but I then have to reconstruct the queue and that doesn't seem good. Hopefully, I can dig in soon, as I would like to get this in 2.3 along with the other Payload stuff we are discussing since it sounds like that API is firming up.
So far, it's easier than that: when they match, they all match, so you only need to keep the input Spans around in List or whatever. Then use them all as a source for your payloads.
Would it be simpler to just use a SortedSet? Then we could iterate w/o losing the sort, right? Would this be faster since we wouldn't have to do the heap operations? > Would it be simpler to just use a SortedSet?
TreeMap is slower than a PriorityQueue for this. With PriorityQueue, insertions and deletions do not allocate new objects. And, if some items are much more frequent than others, using adjustTop() instead of inserting and deleting makes merges run much faster, since most updates are then considerably faster than log Sure, but how do I get access to the position payloads in the order that they occur in the PQ? I have to go and pop them all of the PQ or I need to maintain a separate PQ for the Payloads so that when I go to get a payload for a span, I can iterate over all the items by calling PQ.pop() but then I have to rebuild it again if getPayload is called again, right?
I think I need to take a break and come back to this after some Turkey... > how do I get access to the position payloads in the order that they occur in the PQ?
Why do you need them in that order? In the API you propose in the description of this issue, there's no clear association between the payloads returned and the query terms. So I don't yet see how the order of the payloads is useful. You could pretty easily return the list of payloads along with their positions by iterating through the list of sub-queries. The problem is that providing that sorted by position is expensive. Perhaps you could leave any such sorting, if required, to the application? Yeah, I was thinking this as a possibility, but thought people may want to rely on the ordering, even if they mark it as unordered. I guess I will just have to properly document it, b/c I agree it is expensive to sort.
I will submit a patch shortly. First draft of a patch for this issue. Need to expand/double check the tests and cleanup a few things before committing, but wanted to get opinion on the API.
Also, would like to see if NearSpansOrdered can be optimized a bit to not load the payloads in the case where the user doesn't ask for PayloadSpans. The other implementations shouldn't have this issue. There is an issue w/ this patch related to unordered, overlapping spans that still needs to be fixed. Will try to get an updated patch out soon.
Fixes the unordered problem. Still needs more testing, but I believe it is working
I don't think that patch is compatible w/ anything
rough sketch that never properly worked. I put it up there in the hopes that maybe someone would have some more insight to offer. -------------------------- Lucene Helpful Hints: Anyone still have a use case for this issue?
Here is a patch that I think fixes the orderedspans problem - need to test further, but that may be the last piece on those parts. Beyond that, I think that a span uses only one clause to determine if a payload is available for the whole span - it seems to me we have to ask every clause. As far as the ordering of returned payloads, I don't see how they can be ordered by the user without having some info in the payload itself - I mean its just going to be a collection of byte arrays right? How could you order them? Seems at most you can say those payloads came from the given span and use them all. The more I look at spans the less I understand them I think <g> Its like repeating certain words over and over. Without the absolute paths in the patch this time (get it together eclipse)
Fixes the unorderedspan ispostionavailable issue for good measure.
I think we have to give the payloads back unsorted - there are probably cases where you could just use all the payloads for a span rather than per term, so we might as well not incur a penalty there. If you need per term, you can put the position into the payload (pretty simple) and then just sort yourself. I'll correct it if I have anything to add later, but the new isPayloadAvailable on the unordered spans should start at min(), not first.
Might as well keep striking while the iron's hot. This fixes the first to min() issue, I think it adds an isSpansAvailable call that had been commented out, adds a couple tests, tightens a test, and adds a new class that takes a single document and returns all of payloads from the terms that match a given Query object (so you could do a mix of span/query) - may or may not be useful someday.
TestBoostingTermQuery.testNoPayload now fails for me.
Also noticed some extraneous System.out.println. I see what happened with the boosting test - the old patch didn't apply cleanly for me to the trunk for that test class - I looked for the issue but it looked clean so I left it to deal with later - turns out one or two lines were mangled. New patch that fixes that.
Also turned off DEBUG for the payloads test class so that will drop the System.outs. Pretty clean now. Added to CHANGES.txt
Made the SpanQuery.getPayloadSpans() method an empty implementation that returns null instead of being abstract, so as not to break anyone that extends SpanQuery. All Lucene span implementations override it. Also added in license headers. Plan to commit in a day or two Took one last look through for final cleanup - removed an unused import and some unneeded commented code.
Committed revision 687379.
Hi,
I use getPayloadSpans to get the spans (PayloadSpans spans = sq.getPayloadSpans(this.getIndexReader()) I have tested it for a SpanNearQuery, I have stored the offset in the payload (for testing). It seems to me that the payloads returned for the span (spans.getPayload()) does not match the current span (given by span.next()). In fact, the offset returned by the payload are different from the offset returned by span.start(). Are you aware of this bug? Thanks, Jonathan So are you finding the payloads not at the right term if the terms match ie
term1 term2 term1 You might get the second term1's offsets for the first term1 and vice versa? In my document, I have ".... term1 ... term1 term2 ..."
offset of term1: 10, 20 offset of term2: 21 When running the query SpanNearQuery(term1 term2), Hi,
Here is the relevant code. I would expect to obtain while I obtain import java.io.StringReader; import org.apache.lucene.document.Document; public class Test { public static void main (String args[]) throws Exception{ IndexSearcher is = new IndexSearcher(args[0]) TopDocs topDocs = is.search(snq,1); for (int i = 0; i < topDocs.scoreDocs.length; i++) { ------------------------------------------------------------------------------------------------------------------------------------- import org.apache.lucene.analysis.Analyzer; public class TestPayloadAnalyzer extends Analyzer { public TokenStream tokenStream(String fieldName, Reader reader) { class PayloadFilter extends TokenFilter { public PayloadFilter(TokenStream input, String fieldName) { super(input); this.fieldName = fieldName; pos = 0; } public Token next() throws IOException { Jonathan Okay, I still understand like 2% of spans, but I think I have fixed the bug.
After finding a match, but before finding a min match, we were pulling the payload - works fine when the match is the min match, but otherwise we actually have to wait to get the payload until we have crunched in on the min match. I had an idea of this before, and the code before I touched it tried to grab the payloads at this point - the problem is, in finding the min match, you've often advanced passed the term position of interest to find out there was no such min match. So you have to save the possible payload ahead of time, and either find a new one or use the possible saved one. Sucks to have to add extra loading, but at the moment I don't see how to do it differently (I admittedly can't see much in spans). Thats all partly a guess, partly probably true. Non the less, this patch handles the previous test cases, plus the bug case reported above. I have also added a modified version of the given test for the bug to the span battery of tests. Thanks Jonathan! Whats the best procedure JIRA wise on this? Reopen this issue or start a new issue? If we reopen this issue, how is the bug fix tracked in changes?
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A little while ago I suggested to add a score() method to Spans to summarize the influence of subspans on the span score, and that would be another direction for a subclass.