Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      SpanContainingQuery reducing a spans to where it is containing another spans.
      SpanContainedQuery reducing a spans to where it is contained in another spans.

      1. LUCENE-6083.patch
        34 kB
        Paul Elschot
      2. LUCENE-6083.patch
        35 kB
        Robert Muir
      3. LUCENE-6083.patch
        34 kB
        Paul Elschot
      4. LUCENE-6083.patch
        31 kB
        Paul Elschot
      5. LUCENE-6083.patch
        30 kB
        Paul Elschot
      6. LUCENE-6083.patch
        33 kB
        Paul Elschot
      7. LUCENE-6083.patch
        23 kB
        Paul Elschot
      8. LUCENE-6083.patch
        22 kB
        Paul Elschot

        Activity

        Hide
        Paul Elschot added a comment -

        This is for use with the positional joins module LUCENE-5627, but it is general enough for core.

        Show
        Paul Elschot added a comment - This is for use with the positional joins module LUCENE-5627 , but it is general enough for core.
        Hide
        Jason Gerlowski added a comment -

        Hi Paul.

        Looks like a neat addition. I have a few comments/suggestions. Please take with a grain of salt.

        1) There's some logic in ContainSpans that assigns one of the provided Spans to a 'sourceSpans' instance variable. Calls to ContainSpans.start(), .end(), .doc(), etc. then just call the associated method on 'sourceSpans'. This looks like it might be a good use case for FilterSpans (https://issues.apache.org/jira/browse/LUCENE-5933), which I think is available in trunk and 4x. Using FilterSpans might help get rid of some of the boilerplate here.

        2) It would be nice if there was a way to customize/change what it means for one Span to be "contained" in another. Personally, I agree with the definition of "containing" laid out in ContainSpans.toContained(), but I can definitely see how other people might have different ideas of what being "contained" means.

        Different potential semantics include:
        a) the current behavior: (container.start <= contained.start && contained.end <= container.end)
        b) no sharing start/end points: (container.start < contained.start && contained.end < container.end)
        c) spans can be partially-contained/overlapping: ((contained.start > container.start && contained.start < container.end) || (contained.end > container.start && contained.end < container.end))

        I think it'd be a good idea to allow support for any of the semantics above. Letting people choose (or even define) the "containing" semantics they want would make your patch even more flexible/powerful!

        (a) and (b) above could probably be supported by just adding a boolean option to the constructor for *Query. If being even more flexible makes sense, (a), (b), and (c) could probably all be supported by moving the ContainSpans.toContained() logic into a separate class that can get passed into *Query as a constructor arg. This would allow arbitrary "contained" semantics, as people can subclass the ContainedSpanMatchFinder (or whatever it would be called.).

        Thoughts?

        Show
        Jason Gerlowski added a comment - Hi Paul. Looks like a neat addition. I have a few comments/suggestions. Please take with a grain of salt. 1) There's some logic in ContainSpans that assigns one of the provided Spans to a 'sourceSpans' instance variable. Calls to ContainSpans.start(), .end(), .doc(), etc. then just call the associated method on 'sourceSpans'. This looks like it might be a good use case for FilterSpans ( https://issues.apache.org/jira/browse/LUCENE-5933 ), which I think is available in trunk and 4x. Using FilterSpans might help get rid of some of the boilerplate here. 2) It would be nice if there was a way to customize/change what it means for one Span to be "contained" in another. Personally, I agree with the definition of "containing" laid out in ContainSpans.toContained(), but I can definitely see how other people might have different ideas of what being "contained" means. Different potential semantics include: a) the current behavior: (container.start <= contained.start && contained.end <= container.end) b) no sharing start/end points: (container.start < contained.start && contained.end < container.end) c) spans can be partially-contained/overlapping: ((contained.start > container.start && contained.start < container.end) || (contained.end > container.start && contained.end < container.end)) I think it'd be a good idea to allow support for any of the semantics above. Letting people choose (or even define) the "containing" semantics they want would make your patch even more flexible/powerful! (a) and (b) above could probably be supported by just adding a boolean option to the constructor for *Query. If being even more flexible makes sense, (a), (b), and (c) could probably all be supported by moving the ContainSpans.toContained() logic into a separate class that can get passed into *Query as a constructor arg. This would allow arbitrary "contained" semantics, as people can subclass the ContainedSpanMatchFinder (or whatever it would be called.). Thoughts?
        Hide
        Paul Elschot added a comment -

        Jason,

        1) About FilterSpans: I actually started this from SpanNotQuery, so there may be even more boilerplate to get rid of.

        2) b) can also be had by manipulating one of the input spans, and SpanNotQuery allows passing pre/post offsets to be added before the comparisons. Maybe it is better to allow some offsets like SpanNotQuery.
        2) c), or something close that can be had now by using SpanNearQuery, and negative pre/post offsets also go in this direction.
        ContainedSpanMatchFinder would also work.
        I'm not sure in which direction to generalize.

        In the current patch the toContained() and toContaining() methods only use skipTo() on one of the input spans. It is more efficient to also use skipTo() on the other one, so I'll add that first.

        Show
        Paul Elschot added a comment - Jason, 1) About FilterSpans: I actually started this from SpanNotQuery, so there may be even more boilerplate to get rid of. 2) b) can also be had by manipulating one of the input spans, and SpanNotQuery allows passing pre/post offsets to be added before the comparisons. Maybe it is better to allow some offsets like SpanNotQuery. 2) c), or something close that can be had now by using SpanNearQuery, and negative pre/post offsets also go in this direction. ContainedSpanMatchFinder would also work. I'm not sure in which direction to generalize. In the current patch the toContained() and toContaining() methods only use skipTo() on one of the input spans. It is more efficient to also use skipTo() on the other one, so I'll add that first.
        Hide
        Paul Elschot added a comment -

        Skip on both container spans and contained spans

        Show
        Paul Elschot added a comment - Skip on both container spans and contained spans
        Hide
        Paul Elschot added a comment -

        Patch of 3 April 2015.
        This introduces a package private class ConjunctionSpans as a common superclass for NearSpans and ContainSpans, that also does the basic work for two phase doc id iteration.

        The -1 and Integer.MAX_VALUE positions have a special meaning now, so adding offsets would have added some complexity, and I avoided that for now.

        Show
        Paul Elschot added a comment - Patch of 3 April 2015. This introduces a package private class ConjunctionSpans as a common superclass for NearSpans and ContainSpans, that also does the basic work for two phase doc id iteration. The -1 and Integer.MAX_VALUE positions have a special meaning now, so adding offsets would have added some complexity, and I avoided that for now.
        Hide
        Robert Muir added a comment -

        I'll take a look Paul, thanks. I like the idea of factoring out the ConjunctionSpans.

        One idea: not really any logic is left in NearSpans after the patch. Maybe we should remove the abstraction and just store slop/query in Ordered/UnOrdered?

        Show
        Robert Muir added a comment - I'll take a look Paul, thanks. I like the idea of factoring out the ConjunctionSpans. One idea: not really any logic is left in NearSpans after the patch. Maybe we should remove the abstraction and just store slop/query in Ordered/UnOrdered?
        Hide
        Paul Elschot added a comment -

        not really any logic is left in NearSpans after the patch. Maybe we should remove the abstraction and just store slop/query in Ordered/UnOrdered?

        In case that improves performance or might reduce future maintenance, then yes.
        Otherwise the NearSpans here nicely shows what it is: a conjunction of spans for a query with an allowed slop.

        Show
        Paul Elschot added a comment - not really any logic is left in NearSpans after the patch. Maybe we should remove the abstraction and just store slop/query in Ordered/UnOrdered? In case that improves performance or might reduce future maintenance, then yes. Otherwise the NearSpans here nicely shows what it is: a conjunction of spans for a query with an allowed slop.
        Hide
        Paul Elschot added a comment -

        Patch of 4 April 2015. Resolves conflict after LUCENE-6388.

        Aside: in NearSpans*Ordered int lastEnd was removed, but it is still mentioned in a few comments.

        Show
        Paul Elschot added a comment - Patch of 4 April 2015. Resolves conflict after LUCENE-6388 . Aside: in NearSpans*Ordered int lastEnd was removed, but it is still mentioned in a few comments.
        Hide
        Paul Elschot added a comment - - edited

        The patch of 4 April 2015 has unintended recursion between nextDoc() and toMatchDoc() in ContainSpans: toMatchDoc() there should call conjunction.nextDoc() instead.

        Show
        Paul Elschot added a comment - - edited The patch of 4 April 2015 has unintended recursion between nextDoc() and toMatchDoc() in ContainSpans: toMatchDoc() there should call conjunction.nextDoc() instead.
        Hide
        Robert Muir added a comment -

        In SpanContainQuery.prepareConjunction I think the second null check should be:

            if (containedSpans == null) {
              return null;
            }
        

        I'm having a tough time understanding the difference between SpanContainedQuery(x, y) and SpanContainingQuery(y, x). What am I missing?

        Show
        Robert Muir added a comment - In SpanContainQuery.prepareConjunction I think the second null check should be: if (containedSpans == null ) { return null ; } I'm having a tough time understanding the difference between SpanContainedQuery(x, y) and SpanContainingQuery(y, x). What am I missing?
        Hide
        Paul Elschot added a comment -

        The check for containment is the same.
        The spans of SpanContained query is reduced to only those contained in the containing spans.
        The spans of SpanContainingQuery is reduced to only those containing the contained spans.

        The constructor arguments to both queries have the same order (container, contained), so I avoided using (x, y) and (y, x) above.

        I'll try and get my head around the defer trick in LUCENE-6393

        Show
        Paul Elschot added a comment - The check for containment is the same. The spans of SpanContained query is reduced to only those contained in the containing spans. The spans of SpanContainingQuery is reduced to only those containing the contained spans. The constructor arguments to both queries have the same order (container, contained), so I avoided using (x, y) and (y, x) above. I'll try and get my head around the defer trick in LUCENE-6393
        Hide
        Robert Muir added a comment -

        The check for containment is the same.
        The spans of SpanContained query is reduced to only those contained in the containing spans.
        The spans of SpanContainingQuery is reduced to only those containing the contained spans.

        The constructor arguments to both queries have the same order (container, contained), so I avoided using (x, y) and (y, x) above.

        OK, I'm just wondering if we need both ways to specify it. If we really do, could one just be a sugar implementation that works via Query.rewrite()?

        I'll try and get my head around the defer trick in LUCENE-6393

        Yeah the sneaky part is TermSpans. it doesnt implement the two-phase api (returns null). We might want to change this for consistency, just to make this stuff easier to reason about.

        But today if we are intersecting SpanFirstQuery(SpanTermQuery("foo"), 20) with a filter, we don't want the SpanFirstQuery logic to read any positions during the conjunction logic, until absolutely needed when matches() is called. This is safe because SpanPositionCheck(X) is always a subset of X, so we can just use X as our approximation.

        Show
        Robert Muir added a comment - The check for containment is the same. The spans of SpanContained query is reduced to only those contained in the containing spans. The spans of SpanContainingQuery is reduced to only those containing the contained spans. The constructor arguments to both queries have the same order (container, contained), so I avoided using (x, y) and (y, x) above. OK, I'm just wondering if we need both ways to specify it. If we really do, could one just be a sugar implementation that works via Query.rewrite()? I'll try and get my head around the defer trick in LUCENE-6393 Yeah the sneaky part is TermSpans. it doesnt implement the two-phase api (returns null). We might want to change this for consistency, just to make this stuff easier to reason about. But today if we are intersecting SpanFirstQuery(SpanTermQuery("foo"), 20) with a filter, we don't want the SpanFirstQuery logic to read any positions during the conjunction logic, until absolutely needed when matches() is called. This is safe because SpanPositionCheck(X) is always a subset of X, so we can just use X as our approximation.
        Hide
        Paul Elschot added a comment -

        I'm just wondering if we need both ways to specify it. If we really do, could one just be a sugar implementation that works via Query.rewrite()?

        That can be done by merging SpanContainingQuery and SpanContainedQuery into SpanContainQuery and adding a flag to the constructor.
        Merging the java code is no problem, but merging the javadocs is not going to be nice I think.

        Show
        Paul Elschot added a comment - I'm just wondering if we need both ways to specify it. If we really do, could one just be a sugar implementation that works via Query.rewrite()? That can be done by merging SpanContainingQuery and SpanContainedQuery into SpanContainQuery and adding a flag to the constructor. Merging the java code is no problem, but merging the javadocs is not going to be nice I think.
        Hide
        Paul Elschot added a comment - - edited

        Patch of 6 April 2015. Compared to the previous patch this:

        • moves the toMatchDoc() method from ContainSpans to ConjunctionSpans, and removes the recursion with nextDoc(),
        • moves boost setting to SpanContainQuery class only (prepare a little for query immutability),
        • improves the constructor of ContainSpans to have a sourceSpans argument instead of a flag indicating which of the two spans to use for start/end positions and payloads,
        • uses FilterSpans from LUCENE-6393.
        Show
        Paul Elschot added a comment - - edited Patch of 6 April 2015. Compared to the previous patch this: moves the toMatchDoc() method from ContainSpans to ConjunctionSpans, and removes the recursion with nextDoc(), moves boost setting to SpanContainQuery class only (prepare a little for query immutability), improves the constructor of ContainSpans to have a sourceSpans argument instead of a flag indicating which of the two spans to use for start/end positions and payloads, uses FilterSpans from LUCENE-6393 .
        Hide
        Paul Elschot added a comment -

        Patch of 7 April 2015. Compared to previous patch this removes toMatchDoc() in NearSpansOrdered and NearSpansUnordered.

        Show
        Paul Elschot added a comment - Patch of 7 April 2015. Compared to previous patch this removes toMatchDoc() in NearSpansOrdered and NearSpansUnordered.
        Hide
        Robert Muir added a comment -

        Updated patch to trunk. I added a few tests and tried my hand at some renames:

        The two queries are currently named SpanWithinQuery vs SpanContainingQuery in the patch. I think this makes it easier to differentiate (vs Containing/Contained).

        I also changed parameters to be "big" and "little" instead of contained and containing.

        Show
        Robert Muir added a comment - Updated patch to trunk. I added a few tests and tried my hand at some renames: The two queries are currently named SpanWithinQuery vs SpanContainingQuery in the patch. I think this makes it easier to differentiate (vs Containing/Contained). I also changed parameters to be "big" and "little" instead of contained and containing.
        Hide
        Paul Elschot added a comment -

        Patch of 28 April 2015.

        This starts from Robert's patch, removes getContainer/getContained methods in SpanContainQuery, and renames the arguments to the SpanWithinQuery constructor from container/contained to big/little.

        Show
        Paul Elschot added a comment - Patch of 28 April 2015. This starts from Robert's patch, removes getContainer/getContained methods in SpanContainQuery, and renames the arguments to the SpanWithinQuery constructor from container/contained to big/little.
        Hide
        Robert Muir added a comment -

        Thanks Paul, I think this one is ready to go. I'll wait a bit before committing.

        Show
        Robert Muir added a comment - Thanks Paul, I think this one is ready to go. I'll wait a bit before committing.
        Hide
        ASF subversion and git services added a comment -

        Commit 1676716 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1676716 ]

        LUCENE-6083: SpanWithinQuery and SpanContainingQuery

        Show
        ASF subversion and git services added a comment - Commit 1676716 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1676716 ] LUCENE-6083 : SpanWithinQuery and SpanContainingQuery
        Hide
        ASF subversion and git services added a comment -

        Commit 1676723 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1676723 ]

        LUCENE-6083: SpanWithinQuery and SpanContainingQuery

        Show
        ASF subversion and git services added a comment - Commit 1676723 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1676723 ] LUCENE-6083 : SpanWithinQuery and SpanContainingQuery
        Hide
        Robert Muir added a comment -

        Paul, thank you for the work here!

        Show
        Robert Muir added a comment - Paul, thank you for the work here!

          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development