Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

      Description

      A facility for querying nested documents in a Lucene index as outlined in http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

      1. LuceneNestedDocumentSupport.zip
        1.44 MB
        Mark Harwood
      2. LUCENE-2454.patch
        13 kB
        Michael McCandless
      3. LUCENE-2454.patch
        14 kB
        Paul Elschot

        Issue Links

          Activity

          Michael McCandless added a comment -

          Do you think there are any efficiencies to be gained on the document retrieve side of things if you know that the documents commonly being retrieved are physically nearby?

          Good question! I think OS level caching should mostly solve this?

          Michael McCandless added a comment -

          Duplicate of LUCENE-3171.

          Srinivas Raj added a comment -

          This is exactly what I am looking for, hope this becomes part of core.

          How to make this work with Lucene 3.2? I downloaded the zip file and I was able to run the test with Lucene 3.0, but I would like to use the addDocuments() method added in Lucene 3.2. The patches seem to be specific to Lucene 4.0.
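
          For reference, a minimal sketch (not taken from the attached zip; field names are illustrative) of indexing one resume block with the addDocuments() API from 3.2 - children first, parent last, which is the ordering the LUCENE-3171-style block join expects:

            import java.util.ArrayList;
            import java.util.List;
            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;

            // Child docs first, parent last: addDocuments() gives the block
            // contiguous docIDs, which the nested/join queries rely on.
            List<Document> block = new ArrayList<Document>();

            Document job = new Document();
            job.add(new Field("docType", "employment", Field.Store.YES, Field.Index.NOT_ANALYZED));
            job.add(new Field("skills", "mahout", Field.Store.YES, Field.Index.ANALYZED));
            block.add(job);

            Document resume = new Document();
            resume.add(new Field("docType", "resume", Field.Store.YES, Field.Index.NOT_ANALYZED));
            resume.add(new Field("name", "grant", Field.Store.YES, Field.Index.ANALYZED));
            block.add(resume);

            writer.addDocuments(block); // writer is an already-open IndexWriter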

          Mark Harwood added a comment -

          Could that work for your use case?

          Sounds like it, that's great
          Do you think there are any efficiencies to be gained on the document retrieve side of things if you know that the documents commonly being retrieved are physically nearby? I.e. an app will often retrieve a parent's fields and then those from child docs, which are required to be physically located adjacent to the parent's data. Would existing lower-level caching in Directory or the OS mean there's already a good chance of finding child data in cached blocks, or could a change to file structures and/or doc retrieve APIs radically boost parent-plus-child retrieve performance?

          Michael McCandless added a comment -

          I think the only thing 3171 may be missing from my original use cases then is that I can use multiple PerParentLimitedQueries in one query to get a limit of children of different types e.g. for each parent resume, max 10 results from "employment detail" children and max 10 results from "education background" children.

          I think LUCENE-3171 can handle this, or something very similar: the
          collector tracks all of the BlockJoinQuerys involved in the top query.

          So, you'd have 1 BJQ matching "employment detail" child docs and
          another matching "education bg" child docs. The BJC collects the
          top parent docs, then you can retrieve separate TopGroups for each
          BJQ.

          In the end you have a TopGroups for the "employment detail" child docs
          and another TopGroups for the "education bg" child docs.

          Could that work for your use case?
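
          A rough sketch of that flow, using the LUCENE-3171 class names (signatures from memory and may differ between patch versions; the two child queries are assumed to exist):

            // One join query per child type.
            Filter parents = new CachingWrapperFilter(new QueryWrapperFilter(
                new TermQuery(new Term("docType", "resume"))));
            BlockJoinQuery employment = new BlockJoinQuery(
                employmentChildQuery, parents, BlockJoinQuery.ScoreMode.Max);
            BlockJoinQuery education = new BlockJoinQuery(
                educationChildQuery, parents, BlockJoinQuery.ScoreMode.Max);

            BooleanQuery top = new BooleanQuery();
            top.add(employment, BooleanClause.Occur.SHOULD);
            top.add(education, BooleanClause.Occur.SHOULD);

            // One pass: the collector gathers the top parents...
            BlockJoinCollector c = new BlockJoinCollector(Sort.RELEVANCE, 10, true, false);
            searcher.search(top, c);

            // ...and afterwards yields a separate TopGroups per join query.
            TopGroups<Integer> employmentGroups =
                c.getTopGroups(employment, Sort.RELEVANCE, 0, 10, 0, true);
            TopGroups<Integer> educationGroups =
                c.getTopGroups(education, Sort.RELEVANCE, 0, 10, 0, true);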

          Mark Harwood added a comment -

          Sounds encouraging. I think the only thing 3171 may be missing from my original use cases then is that I can use multiple PerParentLimitedQueries in one query to get a limit of children of different types e.g. for each parent resume, max 10 results from "employment detail" children and max 10 results from "education background" children.
          Not sure that behaviour is possible in the collector-based approach?

          Michael McCandless added a comment -

          A common pattern is for solutions to ask for the best 11 children for the best parents and display only 10 - that way the app knows that for certain parents there is more data available (i.e. those with 11 matches) and can offer a "more" button to retrieve the extra children for parents of interest

          With LUCENE-3171, you should be able to just ask for 10 here, and then check if the TopDocs.totalHits is > 10 to decide whether to offer the "more" button.
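
          In code that might look like the following (topGroups obtained from the block join collector as sketched above; GroupDocs.totalHits is the per-parent count of matching children):

            for (GroupDocs<Integer> group : topGroups.groups) {
              boolean hasMore = group.totalHits > 10; // more children matched than shown
              // render group.scoreDocs and offer the "more" button when hasMore is true
            }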

          Michael McCandless added a comment -

          A variant of prevSetBit could take this largest known child as an argument to limit its search,

          I think we should not require the app to "know" the max number of children per parent? (Ie, we should just grow buffers, etc., on demand as we collect).

          I mean, if this information is easily available we could optimize for that case, but for some apps it's a good amount of work to record this and keep it updated, so I don't think it should be a required arg when creating the query/collectors, even though it's tempting.

          Michael McCandless added a comment -

          It uses 2 passes if you also want to collect child docs per parent

          I tend to work with distributed indexes so it involves a 2 pass op anyway - one to understand the best parents across the multiple shards first, then the PerParentLimitedQuery to ensure we only pay the retrieve costs for those parents that make the final cut.

          The distributed case can still be done single pass, using LUCENE-3171,
          ie each shard returns the top groups and then they are merged in the
          front. This should be substantially faster than doing a 2nd pass out
          to all shards.

          Also, we now have TopDocs.merge/TopGroups.merge to support this use
          case.

          This overlaps with the BlockJoinQuery of LUCENE-3171, this issue might even be closed as duplicate of that one. Which one is preferred?

          I think they are likely dups of one another and I agree we need to
          make sure all important use cases are covered.

          Apps commonly need to return a selection of both matching and non-matching children along with the "best" parents.

          LUCENE-3171 can do this as well, with the same approach as here, ie
          doing 2 passes with two different child queries.

          However, I think for both this issue and for LUCENE-3171, this means
          each child doc must have the parent's PK indexed against it, right?
          Ie, for that 2nd query you need some way to return all child docs
          under any of the top parents, so the child query is "parentID MUST be
          in XX, YY, ZZ" and "childDoc SHOULD XYZ".

          In fact, we could make this a single pass capability with LUCENE-3171
          and without requiring each child doc to index its parent PK, ie also
          pull & sort all other non-matching children under any top parent,
          because collection within each parent is done when you retrieve the
          TopGroups, but this can be a later enhancement.

          Mark Harwood added a comment -

          This overlaps with the BlockJoinQuery of LUCENE-3171, this issue might even be closed as duplicate of that one. Which one is preferred?

          We need to look at the likely use cases. 2454 was created to service a use case which I expect to be a very common pattern and I'm not sure if LUCENE-3171 satisfies this need. Apps commonly need to return a selection of both matching and non-matching children along with the "best" parents. Why? - it's a very similar rationale to the way that highlighting returns a summary of text - it doesn't just return the matched words, it also returns surrounding text as useful context when displaying results to users. However, some texts can be very large and there's a need to limit what context is brought back.
          If we apply this logic to 2454 we can see that for the top parents it is common to also want some non-matching children (e.g. for a resume return a person's employment history - not just the employments that matched the original search) but it is also necessary to summarize some parent's history (e.g. the contractor who listed a gazillion positions in his employment history needs summarising). A common pattern is for solutions to ask for the best 11 children for the best parents and display only 10 - that way the app knows that for certain parents there is more data available (i.e. those with 11 matches) and can offer a "more" button to retrieve the extra children for parents of interest. 2454 satisfies this use case as follows:

          1. Use a NestedDocumentQuery to get best parents with child criteria expressed as a "must"
          2. Use a PerParentLimitedQuery to get a selection of children per top parent, where membership of a top parent is a MUST clause (tested using the primary key) and the child criteria are used again, this time as a "SHOULD" clause, to relevance-rank the selection of children returned
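
          As a rough sketch in code (NestedDocumentQuery's constructor follows the patch attached here; the PerParentLimitedQuery shape and the parentPkQuery helper are assumed for illustration):

            // Pass 1: best parents, with the child criteria as a MUST clause.
            BooleanQuery childCriteria = new BooleanQuery();
            childCriteria.add(new TermQuery(new Term("skills", "mahout")), BooleanClause.Occur.MUST);
            TopDocs topParents = searcher.search(
                new NestedDocumentQuery(childCriteria, parentsFilter, scoreMode), 10);

            // Pass 2: per top parent, membership is a MUST (tested via primary key)
            // and the criteria come back as a SHOULD to rank the children; ask for
            // 11 per parent so the app can offer a "more" button after showing 10.
            BooleanQuery perParent = new BooleanQuery();
            perParent.add(parentPkQuery(topParents), BooleanClause.Occur.MUST); // hypothetical helper
            perParent.add(new TermQuery(new Term("skills", "mahout")), BooleanClause.Occur.SHOULD);
            TopDocs children = searcher.search(
                new PerParentLimitedQuery(perParent, parentsFilter, 11), 11 * 10); // shape assumed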

          It's worth considering this sort of use case carefully before making any code decisions.

          Paul Elschot added a comment -

          This overlaps with the BlockJoinQuery of LUCENE-3171, this issue might even be closed as duplicate of that one. Which one is preferred?

          On using prev/nextSetBit in a safe range, this safe range starts with the parent and ends with the largest known child. A variant of prevSetBit could take this largest known child as an argument to limit its search, and then from the return value one has either a new parent, or one is certain that the current parent is the right one. This would also limit the worst case number of inspected bits for the group to the group size.
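
          A naive sketch of the proposed variant (the real OpenBitSet code would scan word-at-a-time; floorIndex is the hypothetical new argument bounding the scan):

            /** Like prevSetBit(index), but gives up once the scan passes floorIndex. */
            public int prevSetBit(int index, int floorIndex) {
              for (int i = index; i >= floorIndex; i--) { // at most groupSize bits inspected
                if (get(i)) {
                  return i; // a parent within the safe range
                }
              }
              return -1; // no parent in range: the current parent is still the right one
            }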

          With or without that variant, I think it would be good to add a remark in the javadocs about the possible inefficiency of using OpenBitSet for larger group sizes. When the typical group size gets a lot bigger than the number of bits in a long, another implementation might be faster. This remark in the javadocs would allow us to wait for someone to come along with bigger group sizes and a real performance problem here.

          Mark Harwood added a comment -

          prevSetBit is called for each child doc

          You could call nextSetBit on the first child to know the "safe" range of child docs attributable to the same parent but you would be taking a gamble that this was worth the call i.e. there were many possible children per parent to be tested.

          It uses 2 passes if you also want to collect child docs per parent

          I tend to work with distributed indexes so it involves a 2 pass op anyway - one to understand the best parents across the multiple shards first, then the PerParentLimitedQuery to ensure we only pay the retrieve costs for those parents that make the final cut.

          I think it should use a PQ to find the lowest child to evict per parent doc?

          Careful object reuse would need to be factored in to avoid excessive GC - each parent would fill a PQ full of child-match object instances that could/should be reused in assessing the next parent

          Michael McCandless added a comment -

          Would modules/grouping meanwhile be a better place for this than lucene/contrib/queries?

          I think modules/join is the right place? When we factor out Solr's
          generic join impl it can go there too...

          I have some concerns about the current approach here (this is why I
          opened LUCENE-3171):

          • prevSetBit is called for each child doc, which is an O(N^2) cost
            (N = number of child docs for one parent) I think? Admittedly,
            "typically" N is probably small...
          • It uses 2 passes if you also want to collect child docs per
            parent
          • PerParentLimitedQuery is also O(N^2) cost, both on insert of a new
            child and on popping the child docs per group: I think it should
            use a PQ to find the lowest child to evict per parent doc? (A
            sketch follows at the end of this comment.)
          • I think "typically" an app will want to collect the top N groups
            (parent docs and their children), so it's more efficient to gather
            those top N and only in the end sort the each set of children
            per-parent? (This is similar to how 2nd pass grouping collector
            works).
          • PerParentLimitedQuery only supports relevance sort w/in each
            parent.
          • You don't get the parent/child structure back, from
            PerParentLimitedQuery (but now we have TopGroups which is a great
            match for representing each parent and its children).

          If you always only use PerParentLimitedQuery on the top parents from
          the first pass, eg you AND/filter it against those parent docs, then
          the O(N^2) cost is less severe since it'll have a small constant in
          front, but since it's a Query I imagine users will use it w/o that
          filter, which is bad... I think using a TopN Collector is a better match
          here.
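
          On the PQ eviction point in the list above, a minimal sketch of keeping the best N children per parent with a bounded min-heap (java.util.PriorityQueue here; Lucene's own PriorityQueue would make the object reuse Mark mentions above easier):

            import java.util.Comparator;
            import java.util.PriorityQueue;
            import org.apache.lucene.search.ScoreDoc;

            // Min-heap ordered by score: the weakest kept child sits on top.
            PriorityQueue<ScoreDoc> heap = new PriorityQueue<ScoreDoc>(limit,
                new Comparator<ScoreDoc>() {
                  public int compare(ScoreDoc a, ScoreDoc b) {
                    return Float.compare(a.score, b.score);
                  }
                });

            // For each child doc collected under the current parent:
            if (heap.size() < limit) {
              heap.add(new ScoreDoc(doc, score));
            } else if (score > heap.peek().score) {
              heap.poll();                       // evict the lowest child in O(log N)
              heap.add(new ScoreDoc(doc, score));
            }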

          Paul Elschot added a comment -

          The assert on the parent was an IllegalArgumentException in the previous patch.
          Such an unconditional exception would probably be better than an assert, because when asserts are switched off a mistake in the parent filter would not be detected early.

          Paul Elschot added a comment -

          Patch of 19 June 2011.

          In NestedDocumentQuery.java:
          Added use of prevSetBit with an assert for the presence of a parent.
          Added/changed rewrite/createWeight.
          Also added equals/hashCode.

          TestNestedDocument.java is unchanged in this patch and passes.

          In the patch both java files are in lucene/contrib/queries.

          There is still one nocommit for the enum for the score mode.

          Paul Elschot added a comment -

          With these rewrite and createWeight methods TestNestedDocumentQuery passes:

          +  @Override
          +  public Query rewrite(IndexReader reader) throws IOException {
          +    Query rewrittenChildQuery = childQuery.rewrite(reader);
          +    return (rewrittenChildQuery == childQuery) ? this
          +      : new NestedDocumentQuery(rewrittenChildQuery, parentsFilter, scoreMode);
          +  }
          +
          +  @Override
          +  public Weight createWeight(IndexSearcher searcher) throws IOException {
          +    return new NestedDocumentQueryWeight(childQuery.createWeight(searcher));
          +  }
          +
          

          I'll continue adding the use of prevSetBit.

          Would modules/grouping meanwhile be a better place for this than lucene/contrib/queries?

          Paul Elschot added a comment -

          At Query, the javadocs of both createWeight() and rewrite() start with a word of warning.
          I'll probably need at least a few days to wrap my head around it, so if anyone can provide more help in the meantime...

          Michael McCandless added a comment -

          NestedDocumentQuery already implements rewrite() by returning this, just as in 3171.

          I put another patch up on LUCENE-3171 earlier, fixing this...

          Maybe we should impl MTQ.createWeight, to throw a "you forgot to rewrite me first" exception?
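
          A sketch of that suggestion (a hypothetical override in MultiTermQuery; the base Query.createWeight already throws UnsupportedOperationException, per the trace below):

            @Override
            public Weight createWeight(IndexSearcher searcher) throws IOException {
              throw new UnsupportedOperationException(
                  "Query " + this + " was not rewritten; call rewrite(IndexReader) first");
            }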

          Paul Elschot added a comment - edited

          NestedDocumentQuery already implements rewrite() by returning this, just as in 3171.

          This is a more complete traceback of the exception:

              [junit] java.lang.UnsupportedOperationException: org.apache.lucene.search.NumericRangeQuery
              [junit] 	at org.apache.lucene.search.Query.createWeight(Query.java:91)
              [junit] 	at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:177)
              [junit] 	at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:358)
              [junit] 	at org.apache.lucene.search.nested.NestedDocumentQuery.createWeight(NestedDocumentQuery.java:65)
              [junit] 	at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:177)
              [junit] 	at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:358)
              [junit] 	at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:676)
              [junit] 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:292)
              [junit] 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:281)
              [junit] 	at org.apache.lucene.search.TestNestedDocumentQuery.testSimple(TestNestedDocumentQuery.java:92)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1414)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1332)
          

          Could BooleanWeight be the offender?

          Or should NestedDocumentQuery rewrite by rewriting its child, as you suggested?
          But rewrite to what then?

          Michael McCandless added a comment -

          I suspect the NestedDocumentQuery must impl rewrite, and rewrite the childQuery. I hit this on LUCENE-3171, too.

          Paul Elschot added a comment -

          Tried the current patch here to make use of prevSetBit, but ran into a problem with the query weight that could be related to LUCENE-3208.

          When fixing the patch here so that NestedDocumentQuery.java looks like this:

            public Weight createWeight(IndexSearcher searcher) throws IOException {
              return new NestedDocumentQueryWeight(childQuery.createWeight(searcher));
            }
          

          the TestNestedDocumentQuery from the patch here fails with an UnsupportedOperationException.

          After adding the class name in Query.java where this exception is constructed, the test fails with:

          UnsupportedOperationException: org.apache.lucene.search.NumericRangeQuery

          That probably means the above fix to the patch is wrong.
          Any comments on how to continue this?

          Paul Elschot added a comment - edited

          So one concern that is left is performance for parent testing.
          I'll open an issue for OpenBitSet.prevSetBit(), LUCENE-3179.

          Mark Harwood added a comment -

          Looking at the structure of the BooleanQuery, I would expect this to work correctly.

          I've found it to be robust so far - you just need to be clear about directing criteria at only one child or at potentially different children.
          The main challenge in using this functionality is allowing users to articulate the nuances of such queries, and LUCENE-3133 is a holding place for this.

          Under the covers, using the same cached filter for parent filters certainly helps with performance, and I typically wrap the ParentFilter tag in the XML queries with a <CachedFilter> tag to achieve this.

          Paul Elschot added a comment -

          Looking at the structure of the BooleanQuery, I would expect this to work correctly. The ParentsFilter on the unfiltered scorer of required term ("mahout") should return the docId of the parent ("resume") when the unfiltered scorer is at the document containing the required term.

          Paul Elschot added a comment -

          That is very nicely readable XML.

          The problem might occur when a document with an optional term occurs before a document in the same group with a required term.
          So the second question is the one for which the problem might occur.
          The score value for Grant's resume should then be higher than the score value for Sean's.
          Testing only for the set of expected results is not enough for this particular query.

          The problem might occur in another disguise when requiring both terms, and then the set of expected results is enough to test,
          but this is not as easily tested because one does not know beforehand the order in which the terms are going to be advance()d.
          The case with an optional term is simpler to test because the optional term is certain to be advance()d to compute the score value after the required term determines that there is a match (see ReqOptSumScorer.score()), and then to be certain of the correct advance() on the optional term one needs to test the score value.

          Mark Harwood added a comment -

          Below are 2 example tests searching employment resumes - both using the same optional and mandatory clauses but in subtly different ways.
          Question 1 is "who has Mahout skills and preferably used them at Lucid?" while the other question is "who has Mahout skills and preferably has been employed by Lucid?". The questions and the answers are different. Below is the XML test script I used to illustrate the data/queries used, define expected results and run as an executable test.
          Hopefully you can make sense of this:

          <?xml version="1.0" encoding="UTF-8"?>
          <?xml-stylesheet type="text/xsl" href="test.xsl"?>
          <Test description="NestedQuery tests">
          	<Data>
          		<Index name="ResumeIndex">
          			<Analyzers class="org.apache.lucene.analysis.WhitespaceAnalyzer">
          			</Analyzers>
          			<Shard name="shard1">
          				<!--  =============================================================== -->
          				<Document pk="1">
          					<Field name="name">grant</Field>
          					<Field name="docType">resume</Field>
          				</Document>
          				<!--  =============================================================== -->
          						<Document pk="2">
          							<Field name="employer">lucid</Field>
          							<Field name="docType">employment</Field>
          							<Field name="skills">java lucene</Field>
          						</Document>
          				<!--  =============================================================== -->
          						<Document pk="3">
          							<Field name="employer">somewhere else</Field>
          							<Field name="docType">employment</Field>
          							<Field name="skills">mahout and more mahout</Field>
          						</Document>
          				<!--  =============================================================== -->
          				<Document pk="4">
          					<Field name="name">sean</Field>
          					<Field name="docType">resume</Field>
          				</Document>
          				<!--  =============================================================== -->
          						<Document pk="5">
          							<Field name="employer">foo bar</Field>
          							<Field name="docType">employment</Field>
          							<Field name="skills">java</Field>
          						</Document>
          				<!--  =============================================================== -->
          						<Document pk="6">
          							<Field name="employer">some co</Field>
          							<Field name="docType">employment</Field>
          							<Field name="skills">mahout mahout and more mahout</Field>
          						</Document>
          			</Shard>
          		</Index>
          	</Data>
          	<Tests>
          		<Test description="Who knows Mahout and preferably used it *while employed at Lucid*?">
          			<Query>
          	            <NestedQuery> 
          	            	<!-- testing properties of individual child employment docs -->
          	               <Query>
          	                  <BooleanQuery>
          	                  		<Clause occurs="must">
          	                  			<TermsQuery fieldName="skills">mahout</TermsQuery>
          	                  		</Clause>
          	                  		<Clause occurs="should">
          	                  			<TermsQuery fieldName="employer">lucid</TermsQuery>
          	                  		</Clause>
          	                  </BooleanQuery>
          	               </Query>
          	               <ParentsFilter>	
          	                    <TermsFilter fieldName="docType">resume</TermsFilter>                  		 
          	               </ParentsFilter>	
          	            </NestedQuery>
          			</Query>
          			<ExpectedResults why="Grant's tenure at Lucid is overlooked for scoring purposes 
          			                       because it did not involve the required Mahout. Sean has more Mahout experience">
          							<Result fieldName="pk">4</Result>
          							<Result fieldName="pk">1</Result>
          			</ExpectedResults>
          		</Test>
          
          		<!-- ==================================================================================== -->
          		
          		<Test description="Different question - who knows Mahout and preferably has been employed by Lucid?">
          			<Query>
                          <BooleanQuery>
                            		<Clause occurs="must">
          				            <NestedQuery> 
          				            	<!-- testing properties of one child employment docs -->
          				               <Query>
          				                  	<TermsQuery fieldName="skills">mahout</TermsQuery>
          				               </Query>
          				               <ParentsFilter>	
          				                    <TermsFilter fieldName="docType">resume</TermsFilter>                  		 
          				               </ParentsFilter>	
          				            </NestedQuery>
                            		</Clause>
                            		<Clause occurs="should">
          				            	<!-- Another NestedQuery testing properties of *potentially different* child employment docs -->
          				            <NestedQuery> 
          				               <Query>
          		                  			<TermsQuery fieldName="employer">lucid</TermsQuery>
          				               </Query>
          				               <ParentsFilter>	
          				                    <TermsFilter fieldName="docType">resume</TermsFilter>                  		 
          				               </ParentsFilter>	
          				            </NestedQuery>
                            		</Clause>
                            	</BooleanQuery>
          			</Query>
          			<ExpectedResults why="Grant has the required Mahout skills plus the optional Lucid engagement">
          							<Result fieldName="pk">1</Result>
          							<Result fieldName="pk">4</Result>
          			</ExpectedResults>
          		</Test>
          		<!-- ==================================================================================== -->
          	</Tests>
          </Test>
          
          Paul Elschot added a comment - edited

          I finally had some time to start taking a look at the grouping module and again at the patch here.
          There is too much code there for me to come up with a test case soon.
          So please don't wait for me to commit this.

          An easy way to test this would be to have a boolean query with required term and an optional term,
          with the optional term occurring in a document group in a document before (i.e. with a lower docId than)
          a document in the same group with a required term.

          In case I run into this I'll open a separate issue.

          Michael McCandless added a comment -

          OK I opened LUCENE-3171 to explore the single-pass approach.

          Paul Elschot added a comment -

          Mark, this may occur when one child contains A and another contains B, and at least one scorer (the one for A and/or the one for B) uses advance() to get to the nested doc.
          I'll add a test case here, but that could take a few days.

          Mark Harwood added a comment -

          Mike - having thought about it, moving the per-parent grouping logic to a collector removes some of the functionality I am used to.
          I often use multiple PerParentLimitedQuery clauses in the same query to limit different aspects of the structures I query e.g. for each best parent document I may want the best 5 children of type A and the best 10 children of type B. Type "A" docs are of a different size and shape to type "B" - hence the need for different limits. I realise this can be done with multiple searches using different collectors each time but I use distributed servers that service all XML-based queries the same way. In these sorts of environments it is desirable to maintain a common query API (e.g. the XML syntax) and not introduce the need for special collectors to service different types of queries.

          Paul, I'm not sure of the exact details of your query (a parent with a child containing A and B, or a parent with a child containing A and a potentially different child containing B?). I suspect a test case would help.

          Paul Elschot added a comment -

          I see no test cases for required terms in a nested document.
          This may be non-trivial in that advance() should advance into the first doc of the nested group.
          For example, assume the parents p1 and p2 are the first docs in the nested docs, and that the query
          requires a and b to be present:

          docId
          0   p1
          1   a
          2   b
          3   p2
          4   b
          5   a
          

          In this situation, p2 may be missed when advance() on a required scorer for "b" is given docId 5 (containing "a")
          as a target. It should be given target docId 3 to advance into the nested doc p2 containing "a".
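
          A sketch of the safe way to advance in this example (assuming the OpenBitSet.prevSetBit() from LUCENE-3179; parentBits marks the parent docs, and scorerB is the required scorer for "b"):

            // Naively advancing the "b" scorer to target=5 skips doc 4 and loses p2.
            // Pull the target back to the start of target's group first:
            int parent = parentBits.prevSetBit(target);      // 3, i.e. p2
            int doc = scorerB.advance(parent);               // lands on 4, inside p2's group
            boolean inGroup = doc != DocIdSetIterator.NO_MORE_DOCS
                && parentBits.prevSetBit(doc) == parent;     // true: p2 is still a match candidate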

          I quickly read the code here, but I could not easily determine whether this is done correctly or not.
          Shall I add a test case here, or would it be better to open another issue after this one is closed, or can someone reassure me that this is not an issue?

          Michael McCandless added a comment -

          I'll need to check LUCENE-3129 for equivalence with PerParentLimitQuery. It's certainly a central part of what I typically deploy for nested queries - pass 1 is usually a NestedDocumentQuery to get the best parents and pass 2 uses PerParentLimitQuery to get the best children for these best parents.

          Hmm, so I wonder if we could do this in one pass? Ie, like grouping,
          if you indexed your docs as blocks, you can use the faster single-pass
          collector; but if you didn't, you can use the more general but slower
          and more-RAM-consuming two pass collector.

          It seems like we should be able to do something similar with joins,
          somehow... ie Solr's join impl is a start at the "fully general"
          two-pass solution.

          But I agree the "join child to parent" and then "grouping of child
          docs" go hand in hand for searching...

          What do you do for facet counting in these apps...? Post-grouping
          faceting also ties in here.

          Of course some apps can simply fetch ALL children for the top parents but in some cases summarising children is required

          Right...

          (note: this is potentially a great solution for performance issues on highlighting big docs e.g. entire books).

          I think it'd be compelling to index book/articles with each
          page/section/chapter being a new doc, and then group them under their
          book/article.

          I haven't benchmarked nextSetBit vs the existing "rewind" implementation but I imagine it may be quicker.

          I think it should be much faster – obs.nextSetBit looks heavily
          optimized, since it can operate a word at a time. Though, if the
          groups are smallish, so that nextSetBit is often maybe 2 or 3 bits
          away, I'm not sure it'd be faster...

Parent-followed-by-children seems more natural from a user's point of view, however.

          But is it really so bad to ask the app to put parent doc last?

          I mean, the docs have to be indexed w/ the new doc block APIs in IW
          anyway, which will often be eg a List<Document>, at which point
          putting parent last seems a minor imposition?

          Since this is an expert API I think it's OK to put [minor] impositions
          on its usage if this can simplify the impl / make it faster / less
          risky. That said, I'm not yet sure on the impl (single pass query +
          collector vs generic two-pass join that solr now has), so it's
          probably premature to worry about this...

          I guess you could always keep the parent-then-child insertion order but flip the bitset (then cache) for query execution if that was faster.

          True but this adds some hair into the impl (we must also "flip" coming
          back from nextSetBit)...

Benchmarking rewind vs nextSetBit vs flip-then-nextSetBit would reveal all.

          True, though it'd be best to do this in the context of the actual join impl...
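To make the comparison concrete, a minimal sketch of the two lookups being debated, with java.util.BitSet standing in for OpenBitSet (illustrative only, not the patch's code):

ParentLookup.java
import java.util.BitSet;

class ParentLookup {
  // Parent-first blocks: "rewind" from a child doc to its parent, i.e. the
  // nearest set bit at or before the child.
  static int parentFirst(BitSet parentBits, int childDoc) {
    return parentBits.previousSetBit(childDoc);
  }

  // Parent-last blocks: the parent is the next set bit at or after the
  // child, which nextSetBit can scan for a whole word at a time.
  static int parentLast(BitSet parentBits, int childDoc) {
    return parentBits.nextSetBit(childDoc);
  }
}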

          Mark Harwood added a comment -

          Thanks for the patch work, Mike. I'll need to check LUCENE-3129 for equivalence with PerParentLimitQuery. It's certainly a central part of what I typically deploy for nested queries - pass 1 is usually a NestedDocumentQuery to get the best parents and pass 2 uses PerParentLimitQuery to get the best children for these best parents. Of course some apps can simply fetch ALL children for the top parents but in some cases summarising children is required (note: this is potentially a great solution for performance issues on highlighting big docs e.g. entire books).

I haven't benchmarked nextSetBit vs the existing "rewind" implementation but I imagine it may be quicker. Parent-followed-by-children seems more natural from a user's point of view, however. I guess you could always keep the parent-then-child insertion order but flip the bitset (then cache it) for query execution if that was faster. Benchmarking rewind vs nextSetBit vs flip-then-nextSetBit would reveal all.

          Thomas - maintaining a strict order of parent/child docs is important and the recently-committed LUCENE-3112 should help with this.

          Thomas Guttesen added a comment -

          Hi.

          Great feature...
          I have some difficulties understanding the semantics/flow of document creation.
Do you have to add the parent and child levels in any particular sequence? Or can you insert all parents and then insert all child levels later?
The reason I ask is that in my case I am looking for a one-to-many relation style of insertion. I had hoped that I could add more child levels later.

          Cheers

          Michael McCandless added a comment -

          One idea: maybe the parent doc should be indexed last, in the block,
          instead of first? Ie, so that NestedDocumentQuery.getParentDoc
          could use obs.nextSetBit()?

          Michael McCandless added a comment -

          Attached patch, pulling out just NestedDocumentQuery from the zip
          file:

          • I put NestedDocumentQuery and its test case in contrib/queries,
            under oal.search.nested
          • Changed scoreMode to final int that's passed to ctor
• Brought everything up to current trunk (scoring is per-segment,
  etc.). This also moves parentBits (previously mutable on the
  original query) into the scorer
          • Simplified the test case – instead of parsing XHTML resumes it
            just directly builds up a toy example of parent+child docs.
          • Removed @author tags, fixed code style (indent to 2 spaces, {}s,
            newlines, etc.), etc.

          I put a few nocommits in that still need resolving...

          My patch doesn't include XMLQueryParser integration; I think we should
do this as a separate issue?

          I think PerParentLimitedQuery is the same functionality as grouping,
          right? Except, it only supports relevance withinGroup sort, but it
          operates only w/ a parent filter (whereas our grouping impls so far
          require a more-RAM-costly DocTermsIndex FC entry of the field you are
          grouping on). I think we should somehow merge this w/ the single pass
          grouping collector (LUCENE-3129)?

          Mark Harwood added a comment -

          I think this is the only core change necessary for this feature?

          Yup. A same-segment indexing guarantee is all that is required.

          Michael McCandless added a comment -

          I think this is a very important addition to Lucene, so let's get this
          done!

          I just opened LUCENE-3112, to add IW.add/updateDocuments, which would
atomically add Documents produced by an iterator, and ensure they all
          wind up in the same segment. I think this is the only core change
          necessary for this feature? Ie, all else can be built on top of Lucene
          once LUCENE-3112 is committed?
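For example, indexing one resume block atomically might look like this once LUCENE-3112 lands (field names and the helper class are illustrative, not from the patch):

BlockIndexing.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class BlockIndexing {
  // Adds all child docs plus the parent as one atomic block, so a flush
  // cannot split them across segments.
  static void addResume(IndexWriter writer, String resume, List<String> jobs) throws IOException {
    List<Document> block = new ArrayList<Document>();
    for (String job : jobs) {
      Document child = new Document();
      child.add(new Field("docType", "employmentRecord", Field.Store.NO, Field.Index.NOT_ANALYZED));
      child.add(new Field("description", job, Field.Store.YES, Field.Index.ANALYZED));
      block.add(child);
    }
    Document parent = new Document();
    parent.add(new Field("docType", "resume", Field.Store.NO, Field.Index.NOT_ANALYZED));
    parent.add(new Field("resume", resume, Field.Store.YES, Field.Index.ANALYZED));
    block.add(parent);
    writer.addDocuments(block); // the proposed API: all docs land contiguously in the same segment
  }
}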

          RynekMedyczny.pl added a comment -

Code like this ends up in trunk when there is consensus, so your support is welcome.

          Of course! How can we help you?

          While core Lucene adoption is a relatively simple technical task

We are eagerly waiting to see your work incorporated into Lucene Core!

          Mark Harwood added a comment -

I have not looked at this patch, so this comment may be off base.

          The slideshare deck gives a good overview: http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

As a simple Lucene-focused addition I'd prefer not to explore all the possible implications for Solr adoption here. The affected areas in Solr are extensive and would include schema definitions, query syntax, facets/filter caching, result-fetching, DIH, etc. Probably best discussed elsewhere.

          Ryan McKinley added a comment -

Solr, however, does introduce a schema and much more that assumes a "flat" model.

          In SOLR-1566 we could add a DocList as a field within a SolrDocument – this would at least allow the output format to return a nested structure.

I have not looked at this patch, so this comment may be off base.

          Mark Harwood added a comment -

          Lucene does not dictate a schema and so using this approach to index design/querying is not a problem with base Lucene.

Solr, however, does introduce a schema and much more that assumes a "flat" model. In the opening chapters of the "Solr 1.4 Enterprise Search Server" book the authors take the time to discuss the modelling limitations of this flat model and acknowledge this as an issue. The impact of adopting "nested" documents in Solr at this stage would be very large.
There may be ways you can overcome some of your issues without requiring nested documents (using phrase/span queries or combining tokens from multiple fields in Solr), but in my experience these are often poor alternatives if richer structures are important.

          Cheers
          Mark

          Jamal Natour added a comment -

          Mark,

For my project this is a must-have feature that could decide the adoption of SOLR. What do you think is the best way to help ensure this gets incorporated into SOLR?

          Thank you,
          Jamal

          Mark Harwood added a comment -

          Mark, do you have any plans for including this feature into the Lucene trunk?

          That is my intention in providing it here. I had to work hard to convince my employer to let me release this as open source in the interests of seeing it updated/tested as core Lucene APIs change - and hopefully receive some improved support in IndexWriter "flush" control.
Unfortunately it seems not everyone shares the pain when it comes to modelling richer data structures; many seem content with the "flat" model we have in Lucene today. Code like this ends up in trunk when there is consensus, so your support is welcome.

          While core Lucene adoption is a relatively simple technical task, Solr adoption feels like a much more disruptive change.

          RynekMedyczny.pl added a comment -

          Mark, do you have any plans for including this feature into the Lucene trunk?
          I think that this is a "must have" feature since tree structures are so common!
          Thank you in advance.

          Mark Harwood added a comment -

          Updated version with fix for Scorer.advance issue and added XMLQueryBuilder support

          Mark Harwood added a comment -

          I'm not sure the auto-repair is that trivial.
Let's say the parent docs are resumes and the nested child docs are employment positions (as in the attached example).
          An update may not just be adding a new "employment position" doc but editing an existing one, deleting an old one etc.
Your auto-updater is going to need to do a lot of work to figure out which existing docs need copying over from earlier segments and patching into the new segment along with the updated parts of the resume. This gets worse if we start to consider multiple levels of hierarchy.
          It all feels like a lot of work for the IndexWriter to take on?

          Paul Elschot added a comment -

          So the missing basic operation is a copy/append of a range of existing index docs.
          After that operation, the original docs can be deleted, but that is trivial.

          I'll have a look at IndexWriter for this over the coming days. Any quick hints?

          Mark Harwood added a comment -

          The intention is quite simple: allow a set of documents to be used to provide a single score value during query searching

          That's what the existing NestedDocumentQuery code attached to this issue already provides. As far as I am concerned the search side works fine and I have it installed in several live installations (along with a bug fix for "skip" that I must remember to upload here). "Parent" filters as you suggest benefit from caching and I typically use the XMLQueryParser with a <CachedFilter> tag to take care of that (I need to upload the XMLQueryParser extensions for this Nested stuff too).

          The new "intention" that I think you added in your last post was more complex and is related to indexing, not searching and introduced the idea that adding a new child doc on its own should somehow trigger some automated repair of the index contents. This repair would involve ensuring that related documents from previous adds would be reorganised such that all related documents still remained physically next to each other in the same segment.
          I don't think a custom choice of "MergePolicy" is the class to perform this operation - they are simply consulted as an advisor to pick which segments are ripe for a background merge operation conducted elsewhere. The more complex merge task you need to be performed here requires selective deletes of related docs from existing segments and addition of the same documents back into a new segment. This is a task I have always considered something the application code should do rather than relying on Lucene to second-guess what index reorganisation may be required. We could try make core Lucene understand and support parent/child relationships more fully but I'd settle for this existing approach with some added app-control over flushing as a first step.

          Paul Elschot added a comment -

          I think your proposal here is related to a new (to me) use case where clients can add a single new "child" document and the index automagically reorganises to assemble all prior related documents back into a structure where they are grouped as contiguous documents held in the same segment?

          Indeed.

          The first two fields match the ones I intended.
          The third field for the document type would be quite useful for searching, but it may not be necessary to maintain the document order.

          The intention is quite simple: allow a set of documents to be used to provide a single score value during query searching. AFAICT that fits most of the use cases described here.

          To allow conjunctions inside such a set, it is necessary to advance() a scorer into a set, and for that it might be better to put the set representative before the children. The document order would then be pre-order instead of post-order, which would not really make any difference in difficulty to keep the docs in order.
With the representative before the children, an extra operation (something like previousDocId()) would be needed on the iterator of the filter.

          I don't know about flushes during merging. One operation that would recur during index maintenance is appending a sequence of documents from one segment to another segment, see docs 1, 2 and 3 above.
          This is indeed what needs to be done when a new child is added, or when an existing one is changed, i.e. deleted and added.
          I'm not familiar with the merging code, but I would suppose something very close to appending a sequence of documents from an existing segment is already available. Anyway this is costly, but that is the price to pay.

          During searching, the term filters used for the node representatives might use some optimizations. Since one term filter is needed for every document scorer involved in searching the query and these term filters are all based on the same term, they could share index information, for example in a filter cache.
A bit set is not always optimal for such filters; perhaps a more tree-like structure could be more compact and faster. But bit sets could be used to get this going.

          The good news so far for me is that this seems to be feasible, thanks.

          Mark Harwood added a comment -

          Hi Paul,
          I'm not sure I currently have an issue with merges as they just concatenate established segments without interleaving their documents. This operation should retain the order that is crucial to maintaining the parent/child/grandchild relationships (unless something has changed in merge logic which would certainly be an issue!). My main cause for concern is robust control over flushes so parent/child docs don't end up being separated into different segments at the point of arbitrary flushes.

          I think your proposal here is related to a new (to me) use case where clients can add a single new "child" document and the index automagically reorganises to assemble all prior related documents back into a structure where they are grouped as contiguous documents held in the same segment? Please correct me if I am wrong.
          Previously I have always seen this need for reorganisation as an application's responsibility and a single child document addition required the app to delete the associated parent and all old child docs, then add a new batch of documents representing the parent, old children plus the new child addition. Given the implied deletes and inserts required to maintain relationship integrity that seems like an operation that needs to be done under the control of Lucene's transaction management APIs rather than some form of special MergePolicy which are really intended for background efficiency tidy-ups not integrity maintenance.

As for the fields you outline for merging, generally speaking in applications using NestedDocumentQuery and PerParentLimitedQuery I have found that for searching purposes I already need to store:
1) A globally unique ID as an indexed "primary key" field on the top-level container document
2) An indexed field holding the same unique ID as a "foreign key" on child documents
3) An indexed field indicating the document type, e.g. "root" or "resume" and "level1Child" or "employmentRecord"
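For illustration, that field layout might be built like this (names are examples only, not taken from the attached code):

FieldLayout.java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class FieldLayout {
  static Document parentDoc(String pk) {
    Document parent = new Document();
    parent.add(new Field("primaryKey", pk, Field.Store.YES, Field.Index.NOT_ANALYZED));   // 1)
    parent.add(new Field("docType", "resume", Field.Store.NO, Field.Index.NOT_ANALYZED)); // 3)
    return parent;
  }

  static Document childDoc(String parentPk) {
    Document child = new Document();
    child.add(new Field("parentKey", parentPk, Field.Store.NO, Field.Index.NOT_ANALYZED));         // 2)
    child.add(new Field("docType", "employmentRecord", Field.Store.NO, Field.Index.NOT_ANALYZED)); // 3)
    return child;
  }
}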

I could be a little confused about your intentions - maybe we should start with what problem we are trying to solve before addressing how we achieve it?

          Cheers
          Mark

          Paul Elschot added a comment -

          How about an implementation for strict hierarchies that uses two fields per document, in the following way:

          The two fields each contain a single (indexed) token that indicates the node in the nesting hierarchy, one field meaning that the document is a child of that node, and the other that the document is the representative of that node. Any number of levels could be allowed, but no cycles of course.
These fields are then used by a merge policy to keep the documents in post-order, that is, the children immediately followed by the representative for each node.
          Collecting scores at any node in the hierarchy could then be done by using term filters, one for each involved scorer, to provide the representative for the current doc by advancing.

          For example, in index order:

userDocId  nodeMemberField  nodeReprField
doc1       nodeA1           .
doc2       nodeA1           .
doc3       nodeA            nodeA1
doc4       nodeA2           .
doc5       nodeA2           .
doc6       nodeA            nodeA2

          The node representatives for scoring could then be obtained by a term filter for nodeA.

          I think this could work for the scoring part, basically along the lines of the code already posted here.

          Could someone with more experience in segment merge policies comment on this? This is quite restrictive for merging as the only freedom that is left in the document order is the order of the children for each node.

          For example, adding a leaf document doc7 for nodeA1 could result in the following index order:

doc4       nodeA2           .
doc5       nodeA2           .
doc6       nodeA            nodeA2
doc7       nodeA1           .
doc1       nodeA1           .
doc2       nodeA1           .
doc3       nodeA            nodeA1

          Mark Harwood added a comment -

I made a minor modification to your approach by making it do a "forward scan" instead of a reverse scan

          Interesting, but I'm not sure what guarantees Lucene will make about:

• Sequencing of calls on scorer.nextDoc (i.e. are calls to all scorers involved guaranteed to be in doc-insert order?)
          • Index-time merging of segments (i.e. are all segments merged together in an order that keeps the parent doc in one segment next to the child doc from the next segment?)

          That seems like a fragile set of dependencies.
Also, don't things get tricky when reporting matches from NestedDocumentQuery and PerParentLimitedQuery back to the collector? During the query process the IndexSearcher resets the docId context (Collector.setNextReader) as it moves from one Scorer/segment to another. If we are delaying the assessment/reporting of matches until we've crossed a segment boundary, it is too late to report a match on a child doc id from a previous segment, as the collector has already changed context. Unfortunately "PerParentLimitedQuery" needs to do this when selecting the "best" children for a single parent.

          Buddika Gajapala added a comment -

          Mark, that was fast

BTW, another scenario: when there is a lot of data, there is a possibility of the parent document and a matching child document landing in two different segments, causing some matches to be missed. I made a minor modification to your approach by making it do a "forward scan" instead of a reverse scan for parent docs, and having the parent document inserted AFTER the child docs. In case the parent doc is located outside the scope of the current segment, its docId is preserved at the Weight object level and nextDoc() is modified to check for that on the very first nextDoc call into the new segment.

          Mark Harwood added a comment -

          Maybe we should add an addDocuments call to IW? To add more than one document, "atomically", so that any flush must happen before or after them?

          That would be nice.
          Another way of modelling this would be to introduce Document.add(Document childDoc) but I think that is a more fundamental and wide-reaching change.

          Mark Harwood added a comment -

Updated package with a fix for the multi-segment issue and an improved JUnit test

          Michael McCandless added a comment -

          Maybe we should add an addDocuments call to IW? To add more than one document, "atomically", so that any flush must happen before or after them?

          Mark Harwood added a comment -

Attached JUnit test confirms the issue with multiple segments (thanks, Buddika).
          Previous tests masked the error.
          I'm looking into a fix now.

          Cheers,
          Mark

          Mark Harwood added a comment -

          The 2nd comment above talks about this and the need for Lucene to offer more control over flush policy.

it only matches the parent document accurately for the 1st segment. I think this is due to the way the parent docs are marked using a bit array for the ENTIRE index

But aren't filters held and evaluated within the context of each sub-reader? Are you sure the issue isn't limited to a parent/child combo that is split across segments?

          Buddika Gajapala added a comment -

I tried this solution and it works perfectly for smaller indexes (either a small number of Documents, or small Document sizes). However, for larger indexes that span multiple segments it only matches the parent document accurately for the 1st segment. I think this is because the parent docs are marked using a bit array for the ENTIRE index, but the actual traversal for matching criteria done by the Scorer is segment-by-segment (i.e. in the nextDoc() and advance() methods). Have you considered this situation?

          Mark Harwood added a comment -

          Can this help in searching over multiple child/nested documents?

          Yes, a typical use case is to use "NestedDocumentQuery" to fetch the top 10 parents then do a second query to fetch the children using a mandatory clause which lists the primary keys of the selected parents (assuming the children have an indexed field with the parent primary key).
          The "PerParentLimitedQuery" can be used to limit the number of child docs returned per parent if there are many e.g. pages in a book. Both these classes are in the zipped attachment to this issue.

          Amit Kulkarni added a comment -

This is an amazing feature!

Can this help in searching over multiple child/nested documents? Is there sample code available that demonstrates how to achieve this?

We have a requirement wherein search results need to carry fields from child documents. Can this be achieved?

          Mark Harwood added a comment -

I wonder if potentially many places in Solr would have to become aware of this scheme

          Supporting multiple doc types in this way can be a big modelling change. Not everyone needs it so I suspect that may be a hard sell to the Solr crowd.

          Another nice thing about the parallel index is that the idf relevancy factor stays clean

          I have an "IDF compensating" wrapper query that takes care of that. It wraps a child query to then wrap the Similarity class in use and adjusts IDF calculations to be based on the number of documents of the required type. I'll attach it here when I get a chance.

          Before I saw this issue, I was going to try something with SpanNearQuery and the masking-field variant.

I went through a similar thought process around using position info to make this stuff work. The child-doc approach used here seems the cleanest by far.

          David Smiley added a comment -

That's an interesting strategy. The size of these arrays is no big deal to me since there's only a couple of them. My concern with this strategy is that I wonder if potentially many places in Solr would have to become aware of this scheme, which might make this strategy untenable to implement even though it's theoretically sound.

          Another nice thing about the parallel index is that the idf relevancy factor stays clean since it will only consider "real" documents.

          I want to investigate these options closer ASAP since this feature you've implemented is something I need. Before I saw this issue, I was going to try something with SpanNearQuery and the masking-field variant.

          Mark Harwood added a comment -

          Yep, I can see an app with a thousand cached filters would have a problem with this impl as it stands.

          Maintaining parallel indexes always feels a little flaky to me, not least because of the loss of transactional integrity you can get from using a single index.

Would another approach be to make your cached filters document-type-specific? I.e. they would only hold numbers in the range of zero to number-of-docs-of-this-type.
          To use a cached doc ID in such a filter you would need to make use of mapping arrays to project the type-specific doc id numbers into global doc-id references and back.
Let's imagine an index with a mix of "A", "B" and "C" doc types organised as follows (docIds zero-based, to match the arrays below):
docId docType
===== =======
0     A
1     B
2     C
3     A
4     C
5     C

          The mapping arrays for docType "C" would look as follows

Bar.java
int[] globalDocIdToTypeCLookUp = {-1, -1, 0, -1, 1, 2}; // sparse, sized 0 -> num docs in overall index
int[] typeCToGlobalDocIdLookUp = {2, 4, 5};             // dense, sized 0 -> num type C docs in overall index

          Your cached filters would be created as follows:

Bar.java
OpenBitSet myTypeCBitSet = new OpenBitSet(numberOfTypeCDocs); // this line is hopefully where you save RAM!
// for all matching type C docs...
myTypeCBitSet.set(globalDocIdToTypeCLookUp[realDocId]);

          Your filters can then be used by dereferencing the child doc IDs as follows:

Bar.java
int typeCDocId = myTypeCBitSet.nextSetBit(typeCCursor); // typeCCursor = current scan position in the type C doc id space
int nextRealDocId = typeCToGlobalDocIdLookUp[typeCDocId];

Clearly the mapping arrays come at a cost of 4 bytes * num docs, which is non-trivial. The sparse globalDocIdToTypeCLookUp array shown here could be avoided by reading TermDocs and counting at cached-Filter-create time.

          David Smiley added a comment -

          35.7MB of RAM for every filter is a LOT compared to the 357KB I need now (100x). Presumably the facet intersections now take 100x as long too. I cache nearly a thousand of these per index (lots of faceting!) which is by the way just one Solr shard of many. No can do.

I wonder if it's plausible to consider a different implementation strategy employing a parallel index, with the child documents storing the document IDs of the parent index. I might even assume I need no more than 1000 child documents and thus index blank documents as filler, so that if I am looking at a child document with id 32005 then it is the 6th sub-entity belonging to parent document id 32. I know that document IDs are a bit transient so I know that some care would be needed to maintain this strategy. Thoughts?
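The docId arithmetic for that fixed-fanout scheme would at least be trivial (a sketch, assuming the 1000-children cap suggested above):

FixedFanout.java
class FixedFanout {
  static final int MAX_CHILDREN_PER_PARENT = 1000;

  // Child doc 32005 in the parallel index -> parent doc 32...
  static int parentDocId(int childDocId) {
    return childDocId / MAX_CHILDREN_PER_PARENT;
  }

  // ...and it is that parent's 6th sub-entity (ordinal 5).
  static int childOrdinal(int childDocId) {
    return childDocId % MAX_CHILDREN_PER_PARENT;
  }
}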

          Mark Harwood added a comment -

          Wow, this is absolutely awesome!

          Thanks. I've found that this certainly solves problems I previously couldn't address at all in standard Lucene.

The leading concern I have with this implementation is the number of documents in the index as it affects the size of filters

These filters can obviously be cached but you'll need one filter per level you "roll up" to. Assuming a 300M doc index and only rolling up matches to the root, that should only cost 300M / 8 bits per byte = 37.5MB of RAM. Index reloads should avoid the cost of completely rebuilding this filter nowadays because filters are cached at segment level and unchanged segments will retain their cached filters.
Perhaps a bigger concern is any norms arrays, which are allocated one BYTE (as opposed to one bit) per document in the index.

          and they don't share any fields with the parent.

          For parents with only 1 child document instance of a given type, these could be safely "rolled up" into the parent and stored in the same document.

          David Smiley added a comment -

Wow, this is absolutely awesome! This is one of the best enhancement requests for Lucene/Solr that I've seen, as it brings a real capability that is difficult or impossible to achieve without it.

The leading concern I have with this implementation is the growth in the number of documents in the index, as it affects the size of filters and perhaps other areas involving creating BitSets. I have a scenario in which the sub-documents number, on average, over 100 per primary document. These sub-documents are at least very small, and they don't share any fields with the parent. For a large-scale search situation, an index containing 3M Lucene documents now needs to store over 300M, and thus requires 100x the amount of RAM for filter caches that I require now. Thoughts?

          Earwin Burrfoot added a comment -

          I thiiiiink, here - LUCENE-2309

          Mark Harwood added a comment -

          - there was a discussion on narrowing indexing API to something stream-like

Any idea where that discussion was taking place? Happy to move flush-control discussions elsewhere if that's more appropriate.

          Earwin Burrfoot added a comment -

Both things can be combined, for sure. The new stream-like indexing API stuffs docs into IW and controls when flushes /can/ happen, while FlushPolicy decides if they actually /do/ happen, when they /can/.

          Earwin Burrfoot added a comment -

An alternate approach - there was a discussion on narrowing the indexing API to something stream-like, with Document becoming its default implementation. We can add some flush-boundary signalling methods, or a notion of composite documents, to that new API.

I like this approach more, as control does not spread out across different APIs/instances. You don't have to hold a reference to your policy in the indexing code, and you don't have to raise/lower flags in some remote code to signal things that are internal to yours.
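For what it's worth, here is a minimal sketch of what such a block-oriented call might look like - this is roughly the shape the idea later took as IndexWriter.addDocuments() in Lucene 3.2, which flushes the whole block into a single segment; the "type" field and its values are illustrative assumptions:

BlockIndexer.java

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class BlockIndexer {
    /** Children first, parent last: the block lands in one segment with
     *  contiguous doc ids, which is what join filters rely on. */
    static void addFamily(IndexWriter writer, List<Document> children,
                          Document parent) throws IOException {
        List<Document> block = new ArrayList<Document>(children);
        parent.add(new Field("type", "root",
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        block.add(parent);
        writer.addDocuments(block);
    }
}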

          Mark Harwood added a comment -

Robust use of this feature depends on careful management of segments, i.e. ensuring that all the documents making up a compound document are held in the same segment.

          Michael Busch suggested the introduction of a new "FlushPolicy" on IndexWriter to offer the required control. (see http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3C4BE5A14C.6040108@gmail.com%3E )
Sounds sensible to me, given that IndexWriter currently muddles two alternative flush policies in the one implementation, and it looks like we now need a third.

Is this the place to start the debate on "FlushPolicy"?
My guess is this change would involve:

• Deprecating/removing IndexWriter's setMaxBufferedDocs and setRAMBufferSizeMB.
• Providing a new "FlushPolicy" abstract class that is called with a "BufferContext" class holding the number of buffered docs + RAM usage; FlushPolicy is asked if flushing of various structures should be triggered given the context.
• Providing default implementations of FlushPolicy that are number-of-documents-based and RAM-based.
• Providing a special "NestedDocumentFlushPolicy" that can wrap any other policy (RAM/num docs) but only triggers flushes when application code has primed it to say a batch of related documents is completed (see the sketch below).

          Let me know where it's best to continue the thinking on these IndexWriter changes.
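To make the proposal concrete, here is a hypothetical sketch of what the API could look like - none of these classes exist in Lucene, and all names and signatures are invented for illustration only:

FlushPolicySketch.java

/** Snapshot of IndexWriter's buffered state, passed to the policy. */
class BufferContext {
    final int numBufferedDocs;
    final long ramBytesUsed;
    BufferContext(int numBufferedDocs, long ramBytesUsed) {
        this.numBufferedDocs = numBufferedDocs;
        this.ramBytesUsed = ramBytesUsed;
    }
}

abstract class FlushPolicy {
    /** Return true if IndexWriter should flush given the current buffer state. */
    abstract boolean shouldFlush(BufferContext context);
}

/** Wraps any other policy but vetoes flushes mid-way through a compound document. */
class NestedDocumentFlushPolicy extends FlushPolicy {
    private final FlushPolicy delegate;
    private volatile boolean batchComplete = true;

    NestedDocumentFlushPolicy(FlushPolicy delegate) {
        this.delegate = delegate;
    }

    /** Application code calls this around each batch of related documents. */
    void setBatchComplete(boolean complete) {
        this.batchComplete = complete;
    }

    @Override
    boolean shouldFlush(BufferContext context) {
        // Only allow a flush at a compound-document boundary.
        return batchComplete && delegate.shouldFlush(context);
    }
}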

          Mark Harwood added a comment -

          Initial attachment is code plus illustrative data/tests.
Fuller unit tests, build scripts, etc. to follow.


            People

• Assignee:
  Mark Harwood
• Reporter:
  Mark Harwood
• Votes:
  10
• Watchers:
  19
