Issue Details

Key: LUCENE-1465
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Mark Miller
Reporter: Mark Miller
Votes: 0
Watchers: 0

Lucene - Java

NearSpansOrdered.getPayload does not return the payload from the minimum match span

Created: 21/Nov/08 09:44 PM   Updated: 25/Sep/09 04:23 PM
Component/s: Search
Affects Version/s: 2.4
Fix Version/s: 2.4.1, 2.9

Time Tracking:
Not Specified

File Attachments:
  File | Type | Uploaded | Author | Size
  LUCENE-1465.patch | Text file | 2008-11-24 11:33 PM | Mark Miller | 10 kB
  LUCENE-1465.patch | Text file | 2008-11-24 10:02 PM | Mark Miller | 8 kB
  LUCENE-1465.patch | Text file | 2008-11-24 08:16 PM | Mark Miller | 7 kB
  LUCENE-1465.patch | Text file | 2008-11-21 09:45 PM | Mark Miller | 7 kB
  Test.java | Java source file | 2008-12-01 02:16 PM | Jonathan Mamou | 3 kB
  All attachments are licensed for inclusion in ASF works.
Issue Links:
Reference
 

Lucene Fields: Patch Available, New
Resolution Date: 23/Feb/09 02:08 PM


Mark Miller added a comment - 21/Nov/08 09:45 PM
Fix + test

Mark Miller added a comment - 21/Nov/08 09:46 PM
See LUCENE-1001 for discussion of the bug.

Mark Miller added a comment - 24/Nov/08 07:33 PM
I plan on committing this soon. This is a real deal breaker if you are trying to use the new getPayload API with ordered nearspans.

The attached patch has Java 1.5 code in the test, which I'll remove.


Mark Miller added a comment - 24/Nov/08 10:02 PM
Bah. It's even worse than that. Even after you get down to a minimum match, it might not meet the slop requirement! You have to load the payloads and then dump them if the slop is not met.

I don't like all this extra payload loading. Come to think of it, even if you don't use getPayload, you're still paying for it! I don't have a way around it, but I don't like it. In this case, not only do you pay for loading, you also pay for loading the payloads of a bunch of possible matches that don't end up being a match!

Over a large index with lots of hits, it's a lot of payloads to load...

I haven't thought about any of it at a high level, but I think this has to be addressed somehow... maybe you have to turn on payload collecting first, or it doesn't do it? We need something...

But until then, I think this still has to be fixed, and we are loading them one way or another now... might as well add a few more "possible" wrong loads (this last patch added a couple as well) to make the behavior correct. The API is somewhat useless otherwise.
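
To make the cost described above concrete, here is a minimal sketch of the "shrink to a minimum match, then check slop" idea. It is not the code from the attached patches and does not use the Lucene API. The class, the Occurrence type, and the payloadsForMatch method are hypothetical names, and each term is assumed to occupy exactly one position. The point is only that payloads get gathered before the slop is known, so some of that work is thrown away.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MinMatchPayloadSketch {

    /** One term occurrence: a position plus the payload stored there (hypothetical type). */
    static final class Occurrence {
        final int position;
        final String payload;

        Occurrence(int position, String payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    /**
     * Anchors on one occurrence of the last term, then walks the earlier terms
     * right to left, picking for each the latest occurrence that precedes the
     * one chosen after it. Payloads are gathered along the way and thrown
     * away if the gaps add up to more than allowedSlop.
     */
    static List<String> payloadsForMatch(List<List<Occurrence>> earlierTerms,
                                         Occurrence last, int allowedSlop) {
        List<String> collected = new ArrayList<>();
        collected.add(last.payload);
        Occurrence next = last;
        int slop = 0;
        for (int i = earlierTerms.size() - 1; i >= 0; i--) {
            Occurrence best = null;
            for (Occurrence o : earlierTerms.get(i)) {   // assumed sorted by position
                if (o.position < next.position) {
                    best = o;
                }
            }
            if (best == null) {
                return Collections.emptyList();          // no ordered match at all
            }
            collected.add(0, best.payload);              // loaded before the slop is known
            slop += next.position - (best.position + 1); // gap between the two terms
            next = best;
        }
        // The payloads were already gathered; drop them if the match is too wide.
        return slop <= allowedSlop ? collected : Collections.emptyList();
    }
}
```

With term "a" at positions 0 and 3 and term "b" at position 5, the minimum match is (3, 5). A slop of 1 returns the payloads from positions 3 and 5, while a slop of 0 discards them after they were collected.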


Mark Miller added a comment - 24/Nov/08 11:33 PM
That still wasn't quite right. A third test and a third fix. I am pretty sure this solves it, but my previous concerns still concern me.

Mark Miller added a comment - 25/Nov/08 11:40 PM
Thanks Jonathan and Greg!

Jonathan Mamou added a comment - 01/Dec/08 02:16 PM
Hi
It seems that the fix does not cover the case where two terms are indexed at the same position.
I attach a sample program illustrating the issue. In it, each pair of terms is indexed at the same position.
Best regards,
Jonathan

Michael McCandless added a comment - 03/Dec/08 04:48 PM
Let's backport the fix to the 2.4 branch (for the eventual 2.4.1).

Mark Miller added a comment - 04/Dec/08 12:22 PM
What's involved in a backport - just commit it to the 2.4 branch and that's all?

Looks like I have to look into terms indexed at the same position first - I'll try to get to that soon.

- Mark

Michael McCandless added a comment - 04/Dec/08 01:13 PM

> What's involved in a backport - just commit it to the 2.4 branch and that's all?

Yup. "svn merge" works well as long as the code hasn't diverged much, e.g. running this in a 2.4 branch checkout:

svn merge -r(N-1):N https://svn.apache.org/repos/asf/lucene/java/trunk

where N was the revision committed to trunk.


Mark Miller added a comment - 18/Dec/08 03:37 AM
This is an odd one, Jonathan. It's actually for the unordered case (the others were for the ordered one). I am not exactly clear on what's going on yet.

When I look at the payloads coming back, it would seem we are getting 0,7,7 when we should get 6,7,7. When I look at the offsets for the spans that I get the payloads from, though, they appear correct. It seems to be returning the payloads from the right offsets, but somehow one of those payloads is from the term at position 0? Very odd. When I debug in, it does indeed look like the first match happens at index 6... but the term offsets are start: 2147483647, end: -2147483648. What the heck? This is going to take some more time...
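
For readers puzzled by the two offsets quoted above: they are exactly Integer.MAX_VALUE and Integer.MIN_VALUE. One common pattern that produces that pair is a running min/max tracker that is read before any value has been folded into it. The sketch below only illustrates that pattern. It is not a diagnosis of the Lucene code, and the Extent class is a hypothetical helper, not something from the library.

```java
public class SentinelExtentSketch {

    /** Tracks the smallest start and largest end seen so far (hypothetical helper). */
    static final class Extent {
        int start = Integer.MAX_VALUE; // 2147483647 until something is included
        int end = Integer.MIN_VALUE;   // -2147483648 until something is included

        /** Widens the extent so that it covers [s, e]. */
        void include(int s, int e) {
            start = Math.min(start, s);
            end = Math.max(end, e);
        }
    }
}
```

A fresh Extent reports start 2147483647 and end -2147483648; after include(6, 7) it reports 6 and 7.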


Jonathan Mamou added a comment - 22/Dec/08 01:07 PM
Mark, I would expect to get 0,0,3,6,7,7 and not only 6,7,7.

As you wrote, "a SpanAndQuery could easily be a SpanNearQuery if a huge distance was allowed." at http://www.gossamer-threads.com/lists/lucene/java-user/51983


Mark Miller added a comment - 22/Dec/08 01:11 PM
Hmmm... I think that's true, but that's for finding 'a hit' on a document, not for finding every possible sequence of spans that could cause a hit. Spans work by finding a minimum match, not by greedily finding every match (which is a different algorithm).
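
The difference between the two algorithms can be sketched on plain position arrays. This is a simplified illustration for two terms in order, not Lucene's span code. The class and method names are hypothetical, and both arrays are assumed to be sorted ascending with non-negative positions.

```java
import java.util.ArrayList;
import java.util.List;

public class MinimalVsAllMatchesSketch {

    /** Every ordered pair (a before b): the exhaustive enumeration. */
    static List<int[]> allOrderedPairs(int[] a, int[] b) {
        List<int[]> out = new ArrayList<>();
        for (int x : a) {
            for (int y : b) {
                if (x < y) {
                    out.add(new int[] {x, y});
                }
            }
        }
        return out;
    }

    /** Only the pairs that do not contain a tighter pair inside them. */
    static List<int[]> minimalOrderedPairs(int[] a, int[] b) {
        List<int[]> out = new ArrayList<>();
        for (int j = 0; j < b.length; j++) {
            int best = -1;                       // latest a before this b, if any
            for (int x : a) {
                if (x < b[j]) {
                    best = x;
                }
            }
            // An earlier b that falls after 'best' already forms a tighter pair.
            boolean tighterExists = j > 0 && b[j - 1] > best;
            if (best >= 0 && !tighterExists) {
                out.add(new int[] {best, b[j]});
            }
        }
        return out;
    }
}
```

For a = {0, 6} and b = {7, 9}, the exhaustive version yields four pairs, (0,7), (0,9), (6,7) and (6,9), while the minimal version yields only (6,7). Payloads gathered from a minimum match therefore cover fewer positions than an exhaustive enumeration would.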

Mark Miller added a comment - 23/Feb/09 02:08 PM
This has been backported to 2.4 and is resolved. The unresolved dangling issue is a separate issue involving a different class, and is being tracked with LUCENE-1542.