SOLR-3076

Solr(Cloud) should support block joins

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 5.0
    • Component/s: None
    • Labels: None

      Description

      Lucene has the ability to do block joins; we should add it to Solr.

      NOTE: this issue mistakenly got listed in the 4.4 section of CHANGES.txt, but is first available in Solr 4.5.
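
      For context, a minimal sketch of the query-time API the Lucene join module already provides, and which this issue wants to surface in Solr (Lucene 4.x API; the type:parent marker field and the skill field are illustrative conventions, not defined by this issue):

      import org.apache.lucene.index.Term;
      import org.apache.lucene.search.CachingWrapperFilter;
      import org.apache.lucene.search.Filter;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.search.QueryWrapperFilter;
      import org.apache.lucene.search.TermQuery;
      import org.apache.lucene.search.join.ScoreMode;
      import org.apache.lucene.search.join.ToParentBlockJoinQuery;

      public class BlockJoinQuerySketch {
        // Builds the kind of query a Solr block-join parser would produce:
        // match child docs, then join up to their enclosing parent docs.
        static Query parentsOfMatchingChildren() {
          // A filter identifying ALL parent docs; caching it per segment yields
          // the fixed bitset the join query needs (see the discussion below).
          Filter parents = new CachingWrapperFilter(
              new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
          Query children = new TermQuery(new Term("skill", "java"));
          return new ToParentBlockJoinQuery(children, parents, ScoreMode.Max);
        }
      }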

      Attachments

      1. bjq-vs-filters-illegal-state.patch
        5 kB
        Mikhail Khludnev
      2. SOLR-3076.patch
        19 kB
        Michael McCandless
      3. bjq-vs-filters-backward-disi.patch
        11 kB
        Mikhail Khludnev
      4. parent-bjq-qparser.patch
        33 kB
        Mikhail Khludnev
      5. parent-bjq-qparser.patch
        33 kB
        Mikhail Khludnev
      6. solrconf-bjq-erschema-snippet.xml
        2 kB
        Mikhail Khludnev
      7. SOLR-3076.patch
        21 kB
        Michael McCandless
      8. SOLR-3076.patch
        20 kB
        Michael McCandless
      9. child-bjqparser.patch
        28 kB
        Mikhail Khludnev
      10. tochild-bjq-filtered-search-fix.patch
        10 kB
        Mikhail Khludnev
      11. SOLR-3076.patch
        10 kB
        Michael McCandless
      12. SOLR-3076.patch
        26 kB
        Mikhail Khludnev
      13. SOLR-3076.patch
        66 kB
        Mikhail Khludnev
      14. solrconfig.xml.patch
        4 kB
        Mikhail Khludnev
      15. Screen Shot 2012-07-17 at 1.12.11 AM.png
        99 kB
        Mikhail Khludnev
      16. 27M-singlesegment-histogram.png
        89 kB
        Mikhail Khludnev
      17. 27M-singlesegment.png
        57 kB
        Mikhail Khludnev
      18. SOLR-3076-childDocs.patch
        70 kB
        Vadim Kirilchuk
      19. dih-config.xml
        2 kB
        Vadim Kirilchuk
      20. dih-3076.patch
        24 kB
        Vadim Kirilchuk
      21. SOLR-7036-childDocs-solr-fork-trunk-patched
        5 kB
        Tom Burton-West
      22. SOLR-3076.patch
        75 kB
        Alan Woodward
      23. SOLR-3076.patch
        77 kB
        Tom Burton-West
      24. SOLR-3076.patch
        79 kB
        Yonik Seeley
      25. SOLR-3076.patch
        79 kB
        Yonik Seeley
      26. SOLR-3076.patch
        41 kB
        Yonik Seeley
      27. SOLR-3076.patch
        72 kB
        Yonik Seeley
      28. SOLR-3076.patch
        111 kB
        Yonik Seeley


          Activity

          Mikhail Khludnev added a comment -

          Hello,

          The parser itself is not a very big deal. I'm stuck on filtered search for ToParentBlockJoinQuery. I'm attaching a neat test for it.

          This originated at http://lucene.472066.n3.nabble.com/ToParentBlockJoinQuery-vs-filtered-search-td3717911.html

          Given Michael's resolution, I amended o.a.l.s.join.ToParentBlockJoinQuery.BlockJoinWeight.scorer(AtomicReaderContext, boolean, boolean, Bits):

          • the children query scorer obtains readerContext.reader.getLiveDocs() - all documents
          • the parent filter obtains acceptDocs to instantiate the parent DocIdSet.

          Now it fails in TestBlockJoin.testSimpleFilter(): at line 196 I use the parent filter as a filter for the search (it shouldn't influence the result), and I get an exception:

          java.lang.IllegalStateException: parentFilter must return FixedBitSet; got org.apache.lucene.search.BitsFilteredDocIdSet@19ccba
          at org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinWeight.scorer(ToParentBlockJoinQuery.java:197)

          it's caused by CachingWrapperFilter.getDocIdSet(AtomicReaderContext, Bits) and BitsFilteredDocIdSet.wrap(DocIdSet, Bits).

          (if you don't apply the changes to ToParentBlockJoinQuery, the assert fails with zero docs found - the parent filter is applied to the children query)

          It seems to me that either:

          • I did something absolutely wrong, or
          • I need to make ToParentBlockJoinQuery work with BitsFilteredDocIdSet - I guess it might be possible.
          Mikhail Khludnev added a comment -

          I just tried to pass readerContext.reader.getLiveDocs() to the parent filter; it leads to a filtering assert failure at TestBlockJoin.testSimpleFilter() line 205. So I'm considering making "ToParentBlockJoinQuery work with BitsFilteredDocIdSet".

          What do you think? Is it possible?

          Yonik Seeley added a comment -

          I think perhaps the biggest obstacle to overcome is the indexing side. We need to modify the transfer syntaxes to allow documents within documents, represent that in AddUpdateCommand/SolrInputDocument, and flow it through to finally call IndexWriter.addDocuments() to add them as a single block.
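
          To make the index-time side concrete, a minimal sketch against the raw Lucene API (field names are illustrative; per Lucene's block convention the parent document comes last):

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.document.StringField;
          import org.apache.lucene.index.IndexWriter;

          public class BlockIndexingSketch {
            // Adds one parent and its child as a single contiguous block.
            static void addBlock(IndexWriter writer) throws IOException {
              List<Document> block = new ArrayList<Document>();

              Document child = new Document();
              child.add(new StringField("skill", "java", Field.Store.NO));
              block.add(child);

              Document parent = new Document();
              parent.add(new StringField("type", "parent", Field.Store.YES));
              block.add(parent); // the parent must be the LAST doc in the block

              writer.addDocuments(block); // the block's doc IDs stay adjacent
            }
          }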

          Mikhail Khludnev added a comment -

          @Yonik,
          I agree that introducing IW.addDocuments() in DirectUpdHandler can be a kind of design challenge; for experimental purposes I set big flush limits and add docs sequentially in a single thread with overwrite=false. But now I'm stuck on fairly basic Lucene functionality - filtered search.

          Michael McCandless added a comment -

          Patch, fixing ToParent/ChildBJQ to properly handle the incoming
          acceptDocs. I added Mikhail's test, and added randomly deleted docs
          to the random test.

          Thanks for catching & raising this Mikhail!

          Mikhail Khludnev added a comment -

          Michael,

          First of all, it's not, you know, fair - I just attached my own approach, bjq-vs-filters-backward-disi.patch!
          Great idea to include the delete case.
          The only concern I have is that your patch makes BlockJoinQuery sources a little bit puzzling from my POV. I tried to reuse more Lucene code and introduce a rewindable DocIdSetIterator just to provide access to bitset.prevBitSet(). It can also be done by providing an extended form of the Bits interface (I don't know which of these is more promising).
          Also, I don't know how much the problem we are talking about is relevant to Solr - it looks like purely a Lucene issue. What should we do? Should we care?

          Michael McCandless added a comment -

          The only concern I have is that your patch makes BlockJoinQuery sources a little bit puzzling from my POV

          Well, that was the only way to "respect" the incoming acceptDocs, I think? Ie, BJQ must test itself against the acceptDocs.

          Your patch is interesting (creating a DocIdSetBackwardIterator), but that looks like a more invasive/larger change? So far it's only BJQ and block grouping that need this "rewindability", and requiring that the app provides a Filter producing a FixedBitSet for each segment (as CachingWrapperFilter will do) is not too imposing... maybe if more queries/filters require this we can add rewindability to the interface... but I think it's too big a change at this point? Maybe open a new issue to explore it?

          Also, I don't know how much the problem we are talking about is relevant to Solr - it looks like purely a Lucene issue. What should we do? Should we care?

          You're right this is mostly a Lucene issue so far, but I think we can just use this same issue number (I'll reference it in Lucene's changes), and then we can keep this issue open for fixing the solr side (to be able to index doc blocks, and then add BJQ query search-time support).

          Mikhail Khludnev added a comment -

          Michael,

          I agree with your points. Anyway, I suppose that filter search fix is useful for Lucene BJQ users.

          Okay. Here is my first BlockJoinQParser parent-bjq-qparser.patch.

          My problem is providing the modules/join dependency for Solr core. I tried to amend some Ant XML, but it doesn't compile. Please help me with Ant, I can't figure it out. AFAIK the problem is that modules/join/build/blabla-join.jar isn't built.

          /BlockJoinParentQParserPlugin.java:12: package org.apache.lucene.search.join does not exist
          

          the syntax is

          {!parent filter="parent:true"}child_name:b

          BJQ needs CachingWrapperFilter, which is not compatible/interchangeable with Solr's fq (top-level vs leaf readers, you know). I had to put it into a user-configured cache (a sketch of such a cache entry follows below), and also provide a parent-filter-only mode: when you omit the child query, it returns just the parent filter, to be used in fq.
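
          (For illustration, a user-configured cache entry along those lines might look like the snippet below in solrconfig.xml; the cache name and sizes are placeholders, not taken from the patch.)

          <!-- hypothetical: a user-defined cache to hold the cached parent filters -->
          <cache name="perSegFilter"
                 class="solr.LRUCache"
                 size="10"
                 initialSize="0"
                 autowarmCount="0"/>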

          Do you like the syntax? The approach? What's worth including in scope?
          Please help me fix the build!

          Robert Muir added a comment -

          Hi Mikhail, take a look at solr/common-build.xml and search for 'grouping',
          you will see how the 'grouping' module is tied in.

          you can add a similar thing for 'join'.

          Then you also edit lucene/contrib-build.xml (again just search for 'grouping')
          and make a similar entry for 'join'.

          I think this is right...

          Michael McCandless added a comment -

          Maybe instead of opening up end-user QP syntax to control block joins, there should be configuration that tells the query parser how to create the parent bits filter, which fields are in "child scope" vs "parent scope", etc.? This way, if a user does a fielded query where some fields are parent and some are child, the QP would know to create BJQ?
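
          (Purely to make that idea concrete, one hypothetical shape such configuration could take; every element and field name below is invented for illustration and appears in no patch here:)

          <!-- hypothetical: declare the block structure so the QP can infer
               the joins from the fields a query uses -->
          <blockJoin>
            <entity name="book" parentFilter="type:book">
              <fields>title, author</fields>
              <entity name="chapter">
                <fields>chapter_title, page_body</fields>
              </entity>
            </entity>
          </blockJoin>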

          Mikhail Khludnev added a comment -

          Robert,

          Thanks. I did mostly everything besides the last one:

          <target name="prep-lucene-jars"
                  depends="...

          The reattached parent-bjq-qparser.patch builds fine.

          Mikhail Khludnev added a comment - edited

          Michael,

          Do you mean something like attached solrconf-bjq-erschema-snippet.xml ?

          Hoss Man added a comment -

          Maybe instead of opening up end-user QP syntax to control block joins, there should be configuration that tells the query parser how to create the parent bits filter, which fields are in "child scope" vs "parent scope", etc.?

          wouldn't that still need to be a query-time option to be usable with the full capability of block join?

          from what i remember about block join:

          • you can have an arbitrary depth of parent->child->grandchild->etc...
          • there's nothing preventing parents and children from having the same fields

          ...correct?

          so wouldn't it be kind of limiting if those types of options were configuration that couldn't be done per query/filter? (ie: in this fq i want only docs whose parents are size:[0 TO 1000] but in this other fq i want docs who are themselves size:[10 TO 40] ... if perhaps "parents" are books and the children being queried are "chapters" and "size" is number of pages)

          Michael McCandless added a comment -

          Mikhail, I think something like that looks right? It expresses the join structure to the QP.

          Hoss, you're right, you can have the same field name on parent/child/grandchild/etc., so I agree, we will need something in the query syntax to disambiguate which is which for such cases. Though I would think this would be the exception not the rule (ie, I'd expect apps to name the fields "uniquely" across parent and child docs).

          Maybe there can be field aliases? Eg, book_page_count:[0 to 1000] and chapter_page_count[10:40], and the QP is told to map book_page_count -> parent:size and chapter_page_count -> child:size? Or maybe we let the user explicitly scope the field, eg chapter:size, book:size, book:title, etc. Not sure...

          Hoss Man added a comment -

          Maybe there can be field aliases? Eg, book_page_count:[0 to 1000] and chapter_page_count[10:40], and the QP is told to map book_page_count -> parent:size and chapter_page_count -> child:size? Or maybe we let the user explicitly scope the field, eg chapter:size, book:size, book:title, etc. Not sure...

          Hmmm... i kind of understand what you're saying; but the part i'm not understanding is: even if you had field aliasing like that, given some query string like...

            book_page_count:[0 TO 1000] and chapter_page_count[10 TO 40]
          

          ...how would the parser know whether the user was asking for the results to be "book documents" matching that criteria (1-1000 pages and containing at least one chapter child with 10-40 pages), or "chapter documents" matching that criteria (10-40 pages contained in a book of 1-1000 pages), or "page documents" (all pages contained in a chapter of 10-40 total pages, contained in a book of 1-1000 total pages)?

          I mean: it seems possible, and a QParser like that could totally support configuring those types of field mappings / hierarchy definitions in init params, but perhaps we should focus on the more explicit, direct-mapping QParser approach Mikhail has already started on for now, and consider that as an enhancement later? (especially since it's not clear how the indexing side will be managed/enforced – depending on how that shapes up, it might be possible for a QParser like you're describing, or perhaps all QParsers, to infer the field rules from the schema or some other configuration)

          I think the syntax in Mikhail's BlockJoinParentQParserPlugin looks great as a straightforward baseline implementation. The one straw-man suggestion i might toss out there for consideration would be to invert the use of the "filter" and "v" local params, so instead of...

          {!parent filter="parent:true"}child_name:b
          {!parent filter="parent:true"}
          

          ...it might be...

          {!parent of="child_names:b"}parent:true
          {!parent}parent:true
          

          ...people may find that easier to read as a way to understand that the final query will return "parent documents" constrained such that those parent documents have children matching the "of" query. The one thing i don't like about this "of" idea is that (compared to the "filter" param Mikhail uses) it might be more tempting for people to use something like...

          // WRONG! (i think)
          q={!parent of="child_names:b"}some_parent_field:foo
          

          ...when they mean to write something like this...

          q={!parent of="child_names:b"}some_query_that_identifies_the_set_of_all_parents
          fq=some_parent_field:foo
          

          ...because as i understand it, it's important for the "parentFilter" to identify all of the parent documents, even ones you may not want returned, so that the ToParentBlockJoinQuery knows how to identify the parent of each document (correct?)

          This type of user confusion is still possible with the syntax Mikhail's got, but i suspect it will be less likely — In any case, i wanted to put the idea out there.

          Given McCandless' supposition that the parent/child relationships are likely to be very consistent, not very deep, and not vary from query to query, one thing we could do to help mitigate this possible confusion would be:

          • make the "filter" param name much longer and verbose, ie: setOfAllParentsQuery
          • make the param optional, and have it default to something specified as an init param, ie: defaultSetOfAllParentsQuery
          • make the init param mandatory

          That way, in the common case people will configure things like...

          <queryParser name="parent" class="solr.BlockJoinParentQParserPlugin">
            <str name="defaultSetOfAllParentsQuery">type:parent</str>
          </queryParser>
          

          ..and their queries will be simple...

          q={!parent}              (all parent docs)
          q={!parent}foo:bar       (all parent docs that contain kid docs matching foo:bar)
          

          ...but it will still be possible for people with more complex use cases to do more complex things.

          Mikhail: some other minor feedback on the parts of your patch that i understood (note: my lack of understanding is not a fault of your patch, it's just that most of the block join stuff is very foreign to me)...

          • please prune down "solrconfig-bjqparser.xml" so it contains only the absolute minimum things you need for the test case; it makes it a lot easier for people to review the patch, and for users to understand what is necessary to utilize the features demoed in the test (we have a lot of old bloated solrconfig files in the test dir, but we're trying to stop doing that)
          • the test would be a bit easier to follow if you used different letters for the parent fields vs the child fields (abcdef, vs xyz for example)
          • it would be good to have tests verifying that nested parent queries work as expected, ie: that something like this works...
            q={!parent filter="type:book" v=$chapters}
            chapters=+chapter_title:Solr +_query_:{!parent filter="type:chapter" v=$pages}
            pages=page_body:BlockJoin
            
          • it would be good to have your tests introspect the cache after doing the query to make sure the number of inserts, lookups, and hits match what you expect.

          ...but like i said: all in all i think it's really good.

          Michael McCandless added a comment -

          Patch, for the 3.x backport. I backported only the test case at
          first, but it failed!

          Digging in, it failed because BJQ requires that the parent filter
          always include deleted docs. This is unfortunately not easy on 3.x,
          so I added a new (experimental) API, SR.termDocsWithDeletedDocs(Term),
          then created TermFilterNoDeletes (experimental), and fixed up the BJQ
          javadocs to note exactly how the parents filter must be created. Once
          I added that and fixed the test to use it, it passes again w/ no
          modifications to BJQ sources... phew!

          Michael McCandless added a comment -

          New patch: removed a leftover from FIR, moved the new terms filter to the join module (naming it RawTermsFilter), and renamed SR's new API to rawTermDocs.

          Michael McCandless added a comment -

          OK I've committed fixes to 3.6 and 4.0 for the BJQ bugs... let's leave this open to add BJQ support to Solr.

          Mikhail Khludnev added a comment -

          Michael,

          Thanks a lot for the fix! Please find replies to the points above:

          • I think it's worth starting from the essential ability - "opening up end-user QP syntax" - and delivering a full-fledged entity-relations schema later. I'm sure even this basic part will be challenging enough. After we have feedback on the essential approach, it will be clear what to include in the advanced one.
            • Although I propose to push it back, I have some thoughts about field-name clashes in an ER-schema. If we have one field interpreted differently between searches - first SIZE is a book entity attr, and then SIZE is a chapter's attr - it can be handled by having several schemas (top-level entities). We can specify which field layout to use per request.
            • There are cases where fields clash between BJQ tree levels in a single request. The vast majority of them can be resolved by distinguishing fields between entities, i.e. there is no single SIZE field for the book and chapter entities, but rather BOOK_SIZE and CHAPTER_SIZE, which are different due to their nature.
            • Even if that's not enough and you have an attribute which belongs to both entities at the same time, e.g. matte/glossy paper in our books sample, it seems like a UI app concern. Imagine you have a Paper Type field in a search form; what do you expect from it when searching by these two joined entities? I guess the expectation is disjunction - return books which have a matte cover OR matte sheets in any chapter. So in this case the app should build the following query: BJQ(books:true, PAPER:matte) OR (books:true AND PAPER:matte). But the disjunction hypothesis isn't universal: another app may require that books and their chapters are both glossy: BJQ(books:true, PAPER:matte) AND (books:true AND PAPER:matte). There is also room for index-time processing - take the parent attr term and push it down to the child. It's actually one of the nuances which I propose to defer for a while.
          • I'm covering the improvements for the test one by one. Stay tuned.
          • At the end is the sweet point - syntax.
            • I believe that keeping the body of the parameter (.v) fixed and mutating the values of the local params, as in {!parent of="child_names:b"}parent:true, is the way to confuse users. I checked the currently present join queries - they don't do so. Local params are fixed by the app, and the param body is mutated according to the user request (I know about the &localparam trick);
            • then, instead of considering what the "parent-filter-fallback" will look like, I propose to think about what the opposite form of join will look like (ToChildBlockJoinQuery - when you retrieve children whose parents are passed by the filter). In this case you still need to specify the parent filter:

              {!parent filter="parent:true"}child_name:b
              {!child filter="parent:true"}parent_name:b

              and the filter itself:

              {!parent filter="parent:true"}

              This syntax is open for misuse:

              {!child filter="NOT parent:true"}parent_name:b

          The only thing I can suggest is to make the syntax more verbose:

          {!parent which="parent:true"}child_name:b
          {!child of="parent:true"}parent_name:b
          {!parent which="parent:true"}

          Btw, don't you think that master/detail are more appropriate terms than parent/child?

          Mikhail Khludnev added a comment -

          Some news in the attached child-bjqparser.patch.

          Good stuff: I added a ToChild parser; the bad thing is that ToChildBJQ seems to have a filtered-search bug. See below.

          Hoss,
          I covered all your points about the tests. I also want to add coverage for deleted blocks.

          Syntax:

          • {!parent which="PARENT:true"}CHILD_PRICE:10
            {!child of="PARENT:true"}PARENT_PRICE:10

          • the cool EntityRelationships schema is descoped
          • I don't think that swapping the local param and the param body ({!parent of="child_names:b"}parent:true) is user-friendly
          • I'm still asking for using master/detail in favour of parent/child. WDYT?

          ToChildBJQ vs filtered search

          Michael,

          • I briefly looked into TestBlockJoin.testSimple(). It asserts filtered search for children by "skill":"foosball", but I can't find such a term in the data.
          • I added two plain asserts; one of them fails with an erroneous 0 doc count in the result. (testSimple fails after applying child-bjqparser.patch.)
          • TestBJQParser.testChildrenParser() is ignored in the patch. It looks like it fails for another reason. Please find my concerns in the @Ignore comment.
          Mikhail Khludnev added a comment -

          tochild-bjq-filtered-search-fix.patch covers filtered search with ToChildBJQ and has two fixes for it. Only the Lucene codebase is impacted.

          Michael,
          Please consider it for commit.

          Thanks

          Michael McCandless added a comment -

          Thanks Mikhail! These are real issues.

          I tweaked the patch a bit: I don't think we need to use .nextSetBit, because IndexSearcher/FilteredQuery will already use .advance if the filter looks sparse.

          Mikhail Khludnev added a comment -

          Committers,
          Is there a plan to commit the filtered ToChildBJQ fix from "13/Mar/12 01:04"? Btw, I guess 3.x is also impacted.

          And what about the parser itself? Do you like the syntax? What's the plan?

          Michael McCandless added a comment -

          Hi Mikhail, I've committed fixes for the filtering issues you found in ToChildBJQ, I think...? Are you still seeing issues?

          I'm unsure of the QP syntax for BJQ...

          Mikhail Khludnev added a comment -

          Michael,

          1. I've just seen your commits. Thanks.

          2. Do you agree with the overall approach of delivering a straightforward QP with explicit joining syntax? Or do you object and insist on the entity-relationship-schema approach?

          3. What is the level of uncertainty you have about the current QP syntax? What's your main concern, and what's the way to improve it?

          Mikhail Khludnev added a comment -

          Yonik,

          What's your vision of indexing blocks in Solr?

          1. I guess we need to introduce a new method UpdateHandler.addBlock() and an AddBlockCommand as well.
          2. I suppose we have two different approaches for supporting blocks in ContentStreamLoader descendants:
            • we can introduce a new tag in the input XML format, <docs>...</docs> or <block>...</block> (sketched below)
            • we can transfer List<SolrInputDoc>[] in javabin format
            • a different strategy is needed for the plain CSV format: we need to add a block-separator init parameter in solrconfig.xml, e.g. <str name="block.separator">type:resume</str>
          • as an alternative to all of the above, we can have just an UpdateRequestProcessor descendant which separates blocks by employing the same block.separator approach; in this case update handler support is required anyway, but it's really little effort and less invasive.

          WDYT?
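
          For illustration, the first option might look like this on the wire (the <docs> wrapper and the field names are hypothetical, per the open proposal above):

          <add>
            <docs> <!-- hypothetical wrapper marking one block -->
              <doc>
                <field name="type">job</field>
                <field name="title">engineer</field>
              </doc>
              <doc> <!-- parent doc last, matching Lucene's block convention -->
                <field name="type">resume</field>
                <field name="name">Alice</field>
              </doc>
            </docs>
          </add>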

          Mikhail Khludnev added a comment -

          Anybody, please! I need go/no-go feedback on the QP syntax and thoughts on the indexing part.

          Michael McCandless added a comment -

          2. Do you agree with the overall approach of delivering a straightforward QP with explicit joining syntax? Or do you object and insist on the entity-relationship-schema approach?

          3. What is the level of uncertainty you have about the current QP syntax? What's your main concern, and what's the way to improve it?

          Well, stepping back, my concern is still that I don't think there
          should be any QP syntax to express block joins. These are joins
          "determined" at indexing time, and compiled into the index, and so the
          only remaining query-time freedom is which fields you want to search
          against (something QP can already understand, ie field:text syntax).
          From that fields list the required joins are implied.

          I can't imagine users learning/typing the sort of syntax we are
          discussing here.

          It's true there are exceptional cases (Hoss's "size" field that's on
          both parent and child docs), but, that's the exception not the rule; I
          don't think we should design things (APIs, QP syntax) around exceptional
          cases. And, I think such an exception should be
          handled by some sort of field aliasing ("book_page_count" vs
          "chapter_page_count").

          For query-time join, which is fully flexible, I agree the QP must (and
          already does) include join syntax, ie be more like SQL, where you can
          express arbitrary on-the-fly joins.

          But, at the same time, the 'users' of Solr's QP syntax may not be the
          "end" user, ie, the app's front end may very well construct these
          complex join expressions.... and so it's really the developers of that
          search app writing these join queries. So perhaps it's fine to add
          crazy-expert syntax that end users would rarely use but search app
          developers might...?

          All this being said, I defer to Hoss (and other committers more
          experienced w/ Solr QP issues) here... if they all feel this added QP
          syntax makes sense then let's do it!

          Mikhail Khludnev added a comment -

          Michael,

          That's a great point about the contrast between QPs for query-time and index-time joins. I hadn't realized it before. Completely agree.

          So perhaps it's fine to add crazy-expert syntax that end users would rarely use but search app developers might...?

          That's actually what I'm talking about - let's start from it.

          Also, I don't assume that this "crazy expert syntax" is the termination stage for this issue. We can take a kind of modular approach here: an ER-schema module will translate a plain set of filter queries into a BJQ tree expressed in the "crazy expert syntax".

          Hoss Man added a comment -

          As i said before...

          ...perhaps we should focus on the more explicit, direct-mapping QParser approach Mikhail has already started on for now, and consider that (schema-driven implicit block joining) as an enhancement later? (especially since it's not clear how the indexing side will be managed/enforced...)

          what Mikhail's fleshed out here seems like a good starting point for users who are willing to deal with this at the "low" level (similar in expertness to the "raw" QParser), and would be usable today for people who take responsibility for indexing the blocks themselves.

          if/when we decide how to drive the indexing side, we can think about if/where/how to automagically hook blockjoin queries into "higher" level parsers like LuceneQParser and DismaxQueryParser

          Mikhail Khludnev added a comment -

          Hello,

          I'm attaching the patch; it includes the index support from SOLR-3535. A significant improvement is SolrJ support.

          solrconfig.xml.patch shows how to enable this magic for the "example" instance.

          I did the same benchmark as in http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/ with queries like:

          q=text_all:(autumn OR lie)&fq={!parent which=kind:body}acl:[5326 TO 5326]
          q=text_all:(presumably OR isolate OR resistance)&fq={!parent which=kind:body}acl:[1937 TO 1940]
          q=text_all:(nice OR junior OR great)&fq={!parent which=kind:body}acl:[33 TO 36]

          It gives me 0.3 sec avg latency on a MacBook Pro; see the attached solrmeter screenshot.

          Mikhail Khludnev added a comment -

          I repeated the join test with 27M docs (4.5M parent docs with 5 children each), optimized down to a single segment. You can see that it provides fairly consistent subsecond search (with rare spikes up to 9 sec, of course). Check the attached 27M-singlesegment*.png files.

          Pavel Goncharik added a comment -

          Hi,

          I'm wondering if the patches in this issue implement "block joins" for a single Solr shard only, or whether the implementation works correctly out of the box in distributed mode (SolrCloud)?

          Mikhail Khludnev added a comment - - edited

          Hi

I can't check it right now and can't remember how deeply I covered it, but I remember that there was a test, or at least some level of certainty, that nested docs are passed through DistributedUpdateProcessor using the top-level doc's primary key. Therefore the whole block should land in a particular shard. If this necessary condition is met, the query side should work out of the box.
You can also read about the performance benefits here:

          http://blog.griddynamics.com/2012/08/block-join-query-performs.html

          Vadim Kirilchuk added a comment - - edited

Hi everyone, I revisited the patch and am attaching another version of it (childDocs), which includes:

1) SolrInputDocument#getChildrenDocs modification (suggested by Hoss)
2) Distributed test
FullSolrCloudDistribCmdsTest#testIndexQueryDeleteHierarchical
3) Fixed the analyzer NPE

Feel free to check it and give feedback!

          Thx!

          Otis Gospodnetic added a comment -

Just skimmed the comments - looks like there may still be some things to work out. But doesn't this feel too important not to try to include in 4.2 or 5.0? If so, set the Fix Version so it doesn't get lost?

          Mikhail Khludnev added a comment -

Otis Gospodnetic for me it's 100% ready; otherwise let me know what needs to be improved.
I can't set the Fix Version because it's assigned to Grant Ingersoll.

          Grant Ingersoll added a comment -

          I'll try to look soon. Others should be able to set the fix version, if they have the appropriate rights.

          Erik Hatcher added a comment -

          well just set it then, Otis!

I put only 5.0 since this is going to be a fairly big change. Feel free to set it to 4.2 or greater if you'd like (and of course volunteer to do the work).

          Yonik Seeley added a comment -

          Honestly I normally pretty much ignore "fix version" for features... it's not how things get done (or scheduled) really.
          This feature is very important - there's no way it's getting lost.

          Vadim Kirilchuk added a comment -

          Hi everyone!

I am attaching a draft dih patch (and a sample dih-config) which adds support for hierarchical docs to dih.

It is far from complete, but it works for me and may be useful for others.

Any feedback is appreciated. Thanks.

          Jan Høydahl added a comment -

Setting 4.4 as the fix version since Mikhail wanted that, and I guess to encourage this happening before 5.0. Any released Lucene feature should be allowed to bubble up into Solr within a few point releases.

          What's lacking for this one now?

          Tom Burton-West added a comment -

I'd like to test this out with some real data and would like to use the XmlUpdateRequestHandler. Since SOLR-3535 was folded in here, I looked here to try to find the XML syntax to use. I couldn't tell from a quick read of the code what the XML syntax would be to actually add a parent and children. Would it be possible to add a test similar to solr/core/src/test/org.apache.solr.handler/XmlUpdateRequestHandlerTest?

I would assume that all the xml in XmlUpdateRequestHandlerTest could be replaced with the proper xml to index a block consisting of a parent and its children,

i.e. in the test replace:
          String xml =
          "<doc boost=\"5.5\">" +
          " <field name=\"id\" boost=\"2.2\">12345</field>" +
          " <field name=\"name\">kitten</field>" +
          " <field name=\"cat\" boost=\"3\">aaa</field>" +
          " <field name=\"cat\" boost=\"4\">bbb</field>" +
          " <field name=\"cat\" boost=\"5\">bbb</field>" +
          " <field name=\"ab\">a&b</field>" +
          "</doc>";

          with whatever xml is needed to index a block (parent and children).

          Vadim Kirilchuk added a comment - - edited

Tom, you can take a look at AddBlockUpdateTest; it contains several examples.
This is where SOLR-3076.patch and SOLR-3076-childDocs.patch differ greatly:

Here is the SOLR-3076.patch style (note: child documents go inside fields):

          <add>
            <doc>
              <field name='id'>1</field>
              <field name='type_s'>parent</field>
              <field name='make_name_s'>Ford</field>
              <field name="1th-subdocs">
                <doc> 
                    <field name='id'>2</field>
                    <field name='type_s'>child</field>
                    <field name='modelyear_name_s'>1992 Ford E150 Van 2WD</field>
                </doc>
                <doc> 
                    <field name='id'>3</field>
                    <field name='type_s'>child</field>
                    <field name='modelyear_name_s'>1993 Ford E150 Van 2WD</field>
                </doc>
              </field>
              
    <!-- 
        This was done for the case when you have more than 2 types in the relation: for example, Make->Model and Make->Factory.
        However, it is ok to just add all subdocs to the 1th-subdocs field.
    -->
              <field name="2th-subdocs"> 
                <doc> 
                    ...
                </doc>
              </field>
            </doc>
            
            <doc> 
                 ...
            </doc>
          </add>
          

Here is the SOLR-3076-childDocs.patch style (note: child documents are nested <doc> elements inside the parent <doc>):

          <add>
            <doc>
              <field name='id'>1</field>
              <field name='type_s'>parent</field>
              <field name='make_name_s'>Ford</field>
              <doc>
                <field name='id'>2</field>
                <field name='type_s'>child</field>
                <field name='modelyear_name_s'>1992 Ford E150 Van 2WD</field>
              </doc>
              <doc>
                <field name='id'>3</field>
                <field name='type_s'>child</field>
                <field name='modelyear_name_s'>1993 Ford E150 Van 2WD</field>
              </doc>
              
             <!-- 
                  2th-subdocs from previous example can easily go here.
              -->
              <doc>
                ...
              </doc>
            </doc>
          </add>
          

Also, I suggest you use SolrJ instead of plain xml; there are examples at https://github.com/griddynamics/solr-fork/blob/trunk-patched/solr/core/src/test/org/apache/solr/update/AddBlockUpdateTest.java - look at testSolrJXML().
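
For readers unfamiliar with that test, here is a minimal SolrJ sketch of the childDocs style (assuming the addChildDocument method this patch adds to SolrInputDocument; the server setup is omitted and the field values are illustrative):

  SolrInputDocument parent = new SolrInputDocument();
  parent.addField("id", "1");
  parent.addField("type_s", "parent");
  parent.addField("make_name_s", "Ford");

  SolrInputDocument child = new SolrInputDocument();
  child.addField("id", "2");
  child.addField("type_s", "child");
  child.addField("modelyear_name_s", "1992 Ford E150 Van 2WD");

  parent.addChildDocument(child);
  solrServer.add(parent); // the whole block is indexed contiguously, parent last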

          Tom Burton-West added a comment -

          Thanks Vadim,

I haven't used SolrJ, so I needed to translate to the XMLLoader/XML update handler.

I pulled your trunk-patched version and added an XMLLoader test to AddBlockUpdateTest. It's a brain-dead copy of the testSolrJXML test. I don't know if it is testing much, but at least for Solr users like me who are unfamiliar with SolrJ, it provides an executable example of the XML syntax currently being used.

p.s.
The attached patch is a git diff against your version. (I don't quite know how to make a correct patch against the right version of Solr trunk.)

          Alan Woodward added a comment -

          This patch updates the 12/10/12 patch to trunk. There are a bunch of test failures in the analyzing suggester suite, which is a bit odd. There's also a single test failure in AddBlockUpdateTest.testExceptionThrown, which I think is actually an error in the test (it seems to expect that a document with fieldtype errors in a subdoc would be added, instead of the entire block being rejected).

          I have a client who's keen to get this into trunk/4x soon. Would be good to get some momentum behind it.

          Vadim Kirilchuk added a comment -

          Thanks, Alan!

          There are a bunch of test failures in the analyzing suggester suite, which is a bit odd.

I will try to take a look over the weekend.

          There's also a single test failure.

          Right, inconsistent behavior was fixed at some point (if you look at the test it has comment about this), so the proper way is to change numFound=1 to numFound=0.

          Alan Woodward added a comment - - edited

          change numFound=1 to numFound=0

          The failure is elsewhere, in the *:* query - it's expecting to find 9 docs, but actually finds 8. But I guess this is the same change.

          Vadim Kirilchuk added a comment -

          There are a bunch of test failures in the analyzing suggester suite, which is a bit odd.

          Not reproduced for me.

The failure is elsewhere, in the *:* query - it's expecting to find 9 docs, but actually finds 8. But I guess this is the same change.

Right, this is about changing
assertQ(req("*:*"), "//*[@numFound='" + "abcDefgHx".length() + "']");
to
assertQ(req("*:*"), "//*[@numFound='" + "abcDefgH".length() + "']");

          Daniel Soriano added a comment -

          Hi everyone!
I'm interested in trying/testing this new feature, but I cannot figure out the right order to apply the patches, or whether all the files are required. Could anyone suggest how to do that?

          Thanks!

          Vadim Kirilchuk added a comment - - edited

          Hi Daniel,

          It's nice to hear that more and more people want this feature.

You need only the latest SOLR-3076 patch (thanks to Alan!): https://issues.apache.org/jira/secure/attachment/12585393/SOLR-3076.patch. You should apply it to current trunk. There will be one test failure, which is mentioned here (in AddBlockUpdateTest), but you can ignore it for a while.

          Daniel Soriano added a comment -

But the paths in the patch start with a/ and b/, and the patch gets applied into both subdirs. I don't understand how to apply this patch file to trunk. I have made two trunk copies named a and b, but --dry-run notices that changes are applied into both. What's the trick?

          Jan Høydahl added a comment -

          It's probably a git patch. Try patch -p1 as mentioned here http://wiki.apache.org/solr/HowToContribute#Working_With_Patches
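
Concretely, that workflow looks something like this (a sketch only; run from the top-level checkout, with the patch file downloaded from this issue's attachments):

  cd lucene-solr-trunk
  patch -p1 --dry-run < SOLR-3076.patch   # verify it applies cleanly first
  patch -p1 < SOLR-3076.patch

The -p1 strips the leading a/ and b/ path components that git-formatted patches add, so no duplicate source trees are needed.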

          Daniel Soriano added a comment -

That worked perfectly!! Thanks everyone. Block joins are running now.

          Tom Burton-West added a comment -

Patch against trunk (SVN style, patch -p0) that adds testXML(), which illustrates the XML block indexing syntax and exercises the XMLLoader.

          Otis Gospodnetic added a comment -

          This is issue #2-3 in terms of popularity. Does it work in SolrCloud-type setups?

          Vadim Kirilchuk added a comment - - edited

Otis, the patch has a test class, solr/core/src/test/org/apache/solr/cloud/FullSolrCloudDistribCmdsTest.java, with a method #testIndexQueryDeleteHierarchical which indexes, queries and then deletes hierarchical documents. However, it asserts only the sizes of parents, children and grandchildren with simple term queries (not bjq), so someone needs to check it manually or update the test.

By the way, what is the #1 issue? =)

          Michael McCandless added a comment -

          #1 issue is SOLR-64: http://jirasearch.mikemccandless.com/search.py?index=jira&chg=dds&text=&a1=status&a2=Open&page=0&searcher=2004&sort=voteCount&format=list&id=tixjnvz4fv8r&newText= (that's just sorting all open issues by vote count).

But that issue has been open for a very long time, so it's not really fair.

          Yonik Seeley added a comment -

          Starting to look at this now... hopefully we should be able to get something committed soonish.

          Vadim Kirilchuk added a comment -

          Yonik, it's great!

          Just keep in mind several improvements:

• make _childDocuments inside SolrInputDocument lazy instead of new ArrayList() in the constructor (see the sketch after this list).
• at JavaBinCodec there is no need for the SOLRINPUTDOC_CHILDS tag; it is easier to write "SOLRINPUTDOC docFieldsSize childrenNum" instead of "SOLRINPUTDOC docFieldsSize SOLRINPUTDOC_CHILDS childrenNum"
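
To illustrate the first point, a minimal sketch of the suggested lazy initialization (hypothetical fragment mirroring SolrInputDocument in the patch, not the patch itself):

  private List<SolrInputDocument> _childDocuments; // stays null until the first child is added

  public void addChildDocument(SolrInputDocument child) {
    if (_childDocuments == null) {
      _childDocuments = new ArrayList<SolrInputDocument>();
    }
    _childDocuments.add(child);
  }

This avoids allocating a list for the common case of flat documents with no children.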

          There is also a blueprint of dih support for this: https://issues.apache.org/jira/secure/attachment/12576960/dih-3076.patch
Maybe it will be better to move it to its own jira ticket.

There is no support for:

          • delete block
          • overwrite/update block
          • JSON

          I hope it helps.

          Yonik Seeley added a comment -

OK, now that the compound-file-related test failures have been nailed, I'm back to looking at this.

          Just keep in mind several improvements:

          Thanks, I'll check those suggestions out.
          As for the stuff that there is no support for, the list seems to be getting big as I look at it... I think I'll try for a more minimalistic approach and leave the other stuff for follow-on work if possible.

          Yonik Seeley added a comment -

          Here's the most recent patch updated to trunk. For some reason, the first test I tried (AddBlockUpdateTest) fails. I haven't looked into why just yet...

          Vadim Kirilchuk added a comment -

Right, one test fails because Alan didn't change it. You can find our discussion in the comments, but let me help you:

          There's also a single test failure in AddBlockUpdateTest.testExceptionThrown

          Right, inconsistent behavior was fixed at some point (if you look at the test it has comment about this), so the proper way is to change numFound=1 to numFound=0.

The failure is elsewhere, in the *:* query - it's expecting to find 9 docs, but actually finds 8. But I guess this is the same change.

Right, this is about changing
assertQ(req("*:*"), "//*[@numFound='" + "abcDefgHx".length() + "']");
to
assertQ(req("*:*"), "//*[@numFound='" + "abcDefgH".length() + "']");

          Yonik Seeley added a comment -

          Updated patch to fix test.

          Jack Krupansky added a comment -

          Some questions:

          1. Is there a reasonably high probability that this feature will make it into Solr 4.5?

          2. How would Solr block join compare with Elasticsearch "nested types"? Anything that ES has on that front that this issue would not address?

          See:
          http://www.elasticsearch.org/guide/reference/mapping/nested-type/

          3. Can individual child documents be updated, replaced, deleted, appended?

          Mikhail Khludnev added a comment -

2. How would Solr block join compare with Elasticsearch "nested types"? ...

ElasticSearch also provides some sort of "nested facets". However, I found them useless for real-life eCommerce faceting. Though, internally ES has enough infrastructure to provide correct item-level faceting; I feel it could be delivered there easily. It's worth clarifying what I mean by "real-life eCommerce faceting" - just check any site: if we have colored items nested under top-level products, then when we count item-level facets (e.g. color), we need to count the number of products that have any items of the particular color which also passed the item-level filter (e.g. size).

Another desired direction is a relations schema (see the discussion at the beginning); I'm not sure how ElasticSearch addresses it.

3. Can individual child documents be updated,...

You can only delete. I mean, you should be able to do so, but I don't remember whether it's covered by a test or not. Update/Replace/Append might be possible only after "stacked updates" are there (frankly speaking, never).

          Yonik Seeley added a comment -

It seems like the implementations of AddUpdateCommand and AddBlockCommand have almost everything in common (or should... such as handling reordered delete-by-queries, etc). For the most part, the only difference will be which IndexWriter method is finally called. I'm considering just modifying AddUpdateCommand instead of having a separate AddBlockCommand, but I was wondering about the reasoning behind a separate command.

          Mikhail Khludnev added a comment - - edited

Yonik Seeley it's a ginger cake!

          Yonik Seeley added a comment -

          So it seems like for updates, we need a common term to delete everything by. Unless someone thinks of a better name, I'll use _root_ (and the root document and all child documents will have this set to the id of the root document).
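
To make the mechanics concrete, a hedged sketch of the intended effect (the _root_ stamping would be internal to Solr; the field values here are illustrative): indexing a block whose root has id=1 would conceptually store

  id=2  type_s=child   _root_=1
  id=3  type_s=child   _root_=1
  id=1  type_s=parent  _root_=1

so overwriting or deleting the whole block reduces to a single delete term, _root_:1.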

          Vadim Kirilchuk added a comment -

Wdyt about "block_id" for the name?

Also, I have a weird idea: wdyt about figuring out the block based on the parent bit set? I.e., we have the following parent bitSet 000100010001, and we somehow figure out that we want to update the parent in the middle; it means that we actually want to delete everything between two parents in the bitset, so we don't need anything except the parent bitSet and the parent's position in it. I don't believe it is easy, but it is just a weird idea =)

          Mikhail Khludnev added a comment - - edited

Yonik Seeley I thought to address this question at SOLR-4836, but it would be great if you cover it altogether.
Please confirm my understanding: if <uniqueKey> is specified, then for every block the children obtain the value for the _root_ field from the parent's <uniqueKey> field. Hence the _root_ field is always used for the IW.updateDocs call.

btw, an uber-compact codec could be implemented for that _root_ field, which somehow ties in with Vadim's idea.

Vadim Kirilchuk introducing this as a separate case bloats the IndexWriter code. Presumably you could try to introduce a facility to update not by a term, as now, but by some kind of generic query. fwiw, the parent bitset doesn't really exist in IndexWriter, see below

          Yonik Seeley added a comment - - edited

Please confirm my understanding: if <uniqueKey> is specified, then for every block the children obtain the value for the _root_ field from the parent's <uniqueKey> field. Hence the _root_ field is always used for the IW.updateDocs call.

          Right, _root_ would always be used when dealing with a block of documents (and always be set to the parent ID by solr... this wouldn't be set by the user). The only thing not covered by this is a transition from block to non-block or vice versa (i.e. adding a block document and then overwriting it with a non-block one). I'm punting on that for now. Also for deletes... unless we somehow keep track or can look it up, we'll need syntax to say that we're deleting a block.

          block_id is a good descriptive name, but I've been going toward surrounding special/internal/solr-added fields with underscores. Like _version_ and _shard_ (which may change to _route_) and _docid_. Using that scheme, _block_id_ is a little less appealing. Perhaps _block_ though.

          On the query side, this field can be used to get from a child of any depth to the root document, and from that perspective _root_ makes more sense. On the update side, _block_ makes a little more sense because all you care about is deleting the block as a whole with it. Hmmm...

          Yonik Seeley added a comment -

          Updated patch merging block functionality into AddUpdateCommand and implementing _root_. This is just a dev snapshot - things haven't been cleaned up and there are no new tests.

          Mikhail Khludnev added a comment - - edited

          delete everything between two parents in the bitset

Indeed! Yonik Seeley we don't need _root_ if we can submit two queries for deletion: ToChild(parentid:foo) and TQ(parentid:foo)!
A second great idea, from Vadim Kirilchuk and Aleksey Aleev, is a specific FullBlockQuery which matches the parent and all its children by the given parentid.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Yonik Seeley added a comment -

Indeed! Yonik Seeley we don't need _root_ if we can submit two queries for deletion: ToChild(parentid:foo) and TQ(parentid:foo)!

          Since solr wouldn't know how to create those queries, it seems like the user would need to provide them (which doesn't seem very friendly).
          Also, IndexWriter currently only allows atomically specifying a term with the document block... deleteByQuery wouldn't be atomic.
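
For reference, the atomicity in question comes from Lucene's IndexWriter.updateDocuments, which applies the delete term and the new block as one operation. A minimal sketch (writer and docs are assumed to be set up elsewhere; the _root_ value is illustrative):

  // deletes the old block and adds the new one atomically
  writer.updateDocuments(new Term("_root_", "1"), docs);

A pair of deleteByQuery calls plus addDocuments would leave a window where a concurrent commit could expose a partially updated block.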

          Yonik Seeley added a comment -

          FYI, I haven't forgotten about this issue, but I'm traveling at the beginning of this next week. I'll pick it up first thing when I get back.

          Mikhail Khludnev added a comment -

Yonik Seeley one of the inconveniences is the necessity to provide a user cache for BJQParser to store the parent bitset.
Do you think it's ok to bother the user with this, and provide suboptimal performance by default?
I have a few solutions in mind; will we discuss it now or later?

          Yonik Seeley added a comment -

Making progress... currently working on randomized testing (using our current join implementation to cross-check this implementation). I've hit some snags and am working through them...

one of the inconveniences is the necessity to provide a user cache for BJQParser

          Yeah, I had some things in mind to handle that as well.

          Yonik Seeley added a comment -

          OK, here's a patch that fixes some bugs and removes the need for the user to configure an extra cache and register the parsers.

          Mikhail Khludnev added a comment -

          Yonik Seeley could you please comment:

• why should PARENT:true ignore deletions?
• don't you think that "diverging" the BJQ code makes further maintenance harder?
• I propose to revise the idea of a rewindable docIdSet iterator (it's discussed in the beginning), manage BJQ to work with it, and avoid the cast to the concrete bitset class; in this case the Lucene BJQ code could work with BitDocSet.getTopFilter(), which is some sort of BitSetSlice already.
• as an alternative hack we can provide getDocSet(wq, LUCENE_CACHING_WRAPPING_FILTER), where wq protects other queries from cache hits; or we can add to ExtendedQuery an ability to calculate the bitset on its own, i.e. the searcher will delegate to ExtendedQuery.getDocSet(searcher), which allows it to yield Lucene's CachingWrapperFilter

          Yonik Seeley added a comment -

why should PARENT:true ignore deletions?

In an earlier iteration, it needed to... but now I think it's just desirable (as opposed to required) because it's more efficient (less backtracking over deleted docs), and more resilient to accidental error conditions (like when someone deletes a parent doc but not its children).

          I propose to revise the idea of rewindable docIDset iterator

          See LUCENE-5092, it looks like something like that has been rejected.

As far as maintenance goes, the current stuff makes some things easier to tweak. I already did so for the child parser to make it fit better with how we put together full queries. Anyway, the important parts are the public interfaces (the XML doc, and the {!parent} and {!child} parsers and semantics). If we're happy with those, I think we should commit at this point - this issue has been open for far too long.

          Mikhail Khludnev added a comment - - edited

          like when someone deletes a parent doc but not it's children

I thought so too. However, there is an argument provided by one of my colleagues and the brightest engineer ever (Nina G): such courtesy works until a merge happens, and after merge/expunge deletes it's a pain. So, besides being inconsistent, I even thought it wouldn't pass the random tests.

          See LUCENE-5092, it looks like something like that has been rejected.

that approach has performance implications, but I propose nothing more than API massaging, without any real implementation changes/extensions: let BJQ work with something which is either a CachingWrapperFilter or BitDocSet.getTopFilter().

          If we're happy with those, I think we should commit at this point -

I got your point. It makes sense. We just need to raise a followup issue - unify the BJQs across Lucene and Solr - and ideally address it before the next release. Otherwise, it's just a way to upset a user: if someone is happy with BJQ in Lucene, it should be clear that with this parser he gets different BJQs. As an alternative intermediate measure, don't you think it's more honest to store CachingWrapperFilter in Solr's filterCache (via an ugly hack, for sure)? Then follow up and address it soon.

          this issue has been open for far too long)

          but who really cares?

          Thanks

          Robert Muir added a comment -

          why take working per-segment code and make it slower/top-level?

          Yonik Seeley added a comment -

Per-segment caches aren't the focus of this issue (although we should add a generic per-segment cache that can be sized/managed in a different issue).

          Robert Muir added a comment -

          The previous patches were per-segment. There is no reason for it to be top-level!

          Michael McCandless added a comment -

I don't understand all the design constraints here, but I really don't like the internal fork (full copy) of the ToParent/ChildBlockJoinQuery sources.

          Why is this necessary? Is it to cutover to the top-level filter cache?

          We should not fork our sources if we can help it ...

          Yonik Seeley added a comment -

          The important parts of this issue are:

          • Serialization formats (XML, javabin, etc)
          • join semantics
          • join syntax... i.e. {!child ...} {!parent ...}
          • common public Solr Java APIs SolrInputDocument, UpdateHandler/UpdateProcessor
          • correctness

          Other things are implementation details that can be improved over time.
          We should be aware of things we don't want to support long term... this
          is why I removed the external/custom cache dependency (in addition to the
          usability implications).

          As far as "per-segment" goes, some of the previous patches had issues
          (such as caching SolrCache in QParser instances), double-caching
          (the filter used by the join would be cached separately from the
          same filter used in all other contexts), the custom caches defined in
          solrconfig.xml, not to mention my general dislike for weak references.
          Since per-segment filter caching is an orthogonal issue (and it would be
          best to be able to specify this on a per-filter basis), I decided it was
          best to leave per-segment filters for a different issue and create queries
that would work well with the way Solr currently does its filter caching
          and request construction.

          Additionally, how to deal with the "going backwards" problem / expecting
          all filters to be FixedBitSet (which Solr doesn't use) is still up in the
          air: LUCENE-5092. There's no reason to wait for that to get hashed out
before giving Solr users block child/parent join functionality. Those details
          of the Java APIs just don't matter to Solr users.

          These query classes in question are package-private classes that Solr
          users do not see - truly an implementation detail. Changing them in
          the future (as long as the behavior is equivalent) would not even
          warrant mention in release notes (unless performance had been improved).

          Can there still be implementation improvements? Absolutely! But I'm
          personally currently out of time on this issue, and I feel comfortable
          with supporting the public APIs we've come up with for some time to come.
          Since no one seems to have issues with any of the important parts like
          the public APIs, I plan on committing this shortly. Additional
          improvements/optimizations can come from follow-on patches.

          Yonik Seeley added a comment -

However, there is an argument provided by one of my colleagues and the brightest engineer ever (Nina G): such courtesy works until a merge happens, and after merge/expunge deletes it's a pain.

          Ah, right you (and Nina G) are! The inconsistency here (working until a merge) is worse than any performance difference. I'll change it.

          Mikhail Khludnev added a comment -

Yonik Seeley give me one last chance to rescue the world Solr.

1. Introduce SolrIndexSearcher.addCache(name), so the QParser can add the user cache itself.
2. Leave it as it was before: BJQParser allows specifying a user cache for storing CachingWrapperFilter; otherwise it performs badly. That seems forgivable for an 'early-adopter' feature.
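
For illustration, the kind of solrconfig.xml user cache being discussed might look like this (a sketch only; the cache name "perSegFilter" and the sizes are assumptions, not part of any patch here):

  <cache name="perSegFilter"
         class="solr.LRUCache"
         size="10"
         initialSize="0"
         autowarmCount="0"/>

The parser would then reference this cache by name to store the parent-filter bitsets.
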
          Robert Muir added a comment -

          Since no one seems to have issues with any of the important parts like
          the public APIs, I plan on committing this shortly.

Uhhh, the code is important to some of us too. There are several objections listed on this issue.

The patches contributed here took Lucene's join module and integrated it into Solr; that's actually the description of the issue: "Lucene has the ability to do block joins, we should add it to Solr."

But then suddenly the code becomes garbage: your patch FORKS CODE OF ENTIRE LUCENE JOIN MODULE to make it slow and top-level. Why even bring in lucene-join.jar here?

So yeah, I don't care whether you think the code is important or not: -1 to this.

          Michael McCandless added a comment -

          I plan on committing this shortly.

          Please don't. Three people have objections to the approach here ... let's iterate to a solution others are happy with.

          Yonik Seeley added a comment -

Ah, the joy of lucene vs solr politics. Anyway, I'm out of time on this issue, which I worked hard to get committable (and I was about to commit). It's too bad we're letting perfect be the enemy of good here, but I feel it's a hell of a lot more political than technical.

          Yonik Seeley added a comment -

          BJQParser allows to specify user cache for storing CachingWrapperFilter

I don't think that should be part of the parser API (or did you mean just as config in solrconfig.xml?)

          Mikhail Khludnev added a comment -

did you mean just as config in solrconfig.xml

          yep.

          Michael McCandless added a comment -

          I decided it was
          best to leave per-segment filters for a different issue and create queries
          that would work well with the way Solr currently does it's filter caching
          and request construction.

          I think that makes sense (decouple the two), but can we do this without forking lucene/join's *BlockJoinQuery?

E.g. maybe instead of requiring a FixedBitSet, Parent/ChildBlockJoinQuery could accept an interface/abstract class that a slice of OpenBitSet (and FixedBitSet) could implement?

Or ... messy, but maybe it could work: the *BlockJoinQuery could do an instanceof check + cast for this OpenBitSet slice ...
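
To make the first idea concrete, here is a minimal sketch of such an abstraction. Everything below is hypothetical (the interface and adapter classes are not part of the Lucene API); it only illustrates that the block-join queries need just membership tests plus prev/next set-bit lookups:

    import org.apache.lucene.util.FixedBitSet;
    import org.apache.lucene.util.OpenBitSet;

    // Hypothetical abstraction: the minimal contract the block-join queries
    // actually rely on, instead of a concrete FixedBitSet.
    interface ParentBits {
      boolean get(int docId);
      int prevSetBit(int docId);  // last parent at or before docId, -1 if none
      int nextSetBit(int docId);  // first parent at or after docId, -1 if none
    }

    // FixedBitSet already has all three operations, so adapting it is trivial.
    final class FixedBitSetParentBits implements ParentBits {
      private final FixedBitSet bits;
      FixedBitSetParentBits(FixedBitSet bits) { this.bits = bits; }
      public boolean get(int docId) { return bits.get(docId); }
      public int prevSetBit(int docId) { return bits.prevSetBit(docId); }
      public int nextSetBit(int docId) { return bits.nextSetBit(docId); }
    }

    // A per-segment "slice" of a top-level OpenBitSet (the form Solr caches)
    // could satisfy the same contract by offsetting with the segment's docBase;
    // bounds checks against the segment's maxDoc are omitted in this sketch.
    final class OpenBitSetSlice implements ParentBits {
      private final OpenBitSet bits; // top-level bit set
      private final int docBase;     // first global docID of this segment
      OpenBitSetSlice(OpenBitSet bits, int docBase) {
        this.bits = bits;
        this.docBase = docBase;
      }
      public boolean get(int docId) { return bits.get((long) docId + docBase); }
      public int prevSetBit(int docId) {
        long bit = bits.prevSetBit((long) docId + docBase);
        return bit < docBase ? -1 : (int) (bit - docBase);
      }
      public int nextSetBit(int docId) {
        long bit = bits.nextSetBit((long) docId + docBase);
        return bit < 0 ? -1 : (int) (bit - docBase);
      }
    }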

          Yonik Seeley added a comment -

          FORKS CODE OF ENTIRE LUCENE JOIN MODULE

I really wanted to stop participating in discussions like this, but I thought it was important to clear up this mischaracterization for casual viewers. The patch introduced custom versions of 2 top-level query classes (ToChildBlockJoinQuery and ToParentBlockJoinQuery). Further, they were package-private implementation details. Focusing on whether the classes are officially part of the "lucene join module" is silly - it should be irrelevant to Solr end users... you could change those class names every day and it just wouldn't matter.

          Uwe Schindler added a comment -

          Hi,
I don't think we should start with

          Ah, the joy of lucene vs solr politics

          again! As far as I understand, the current issue is: Solr has no way to cache filters per segment, so Yonik forked the whole (or major parts) of the Lucene join module into the source code of Solr. This also includes using OpenBitSet instead of FixedBitSet (why?).

          I would suggest to proceed like this:

          • Delay committing for now
          • Open new issue to allow per-segment caches in Solr
• Open new issue to move away from OpenBitSet as a requirement for caching filters in Solr. For filter caching, FixedBitSet is the better alternative, as filters always cover a fixed number of documents. Also reuse CachingWrapperFilter in Solr rather than a separate filter-caching implementation. This would also allow Solr to use the new bitset implementations that were recently added to Lucene. Solr should simply have implementation-agnostic caching for DocIdSets.
          • Once this is committed, adapt the current patch to use the new filter caching.

          From what I have learned in the past: Once we have the forked code in Solr there is no way to move away from it, mainly because:

          • Backwards compatibility complaints of plugin authors
          • "It's already implemented and working - why change?"
          • Some people claim that "Top-level caches are not slow".

So a clear -1 from me on committing the forked code; let's delay it a few more weeks and instead fix the dependent issues noted above before this one.

          Grant Ingersoll added a comment -

          Uwe, sounds good. Can you link the new issues here so that we can track them as part of this?

          Mikhail Khludnev added a comment -

Uwe Schindler, what is the reason for delaying this functionality if we can deliver it now with the following configuration "footprint"?

        <query>
          <cache name="cachingWrapperFilterCache" class="solr.search.LRUCache"/>
        </query>
        <queryParser name="parent" class="join.BlockJoinParentQParserPlugin">

          <!-- if you omit this, BJQ will perform "suboptimally",
               though it remains functionally complete;
               it also warns annoyingly -->

          <str name="parent-filter-cache">cachingWrapperFilterCache</str>
        </queryParser>

Solr being broken on a per-segment basis seems crucial to me too.
I suppose that refactoring will take even more time than this modest piece of functionality. I feel it's really challenging from a design perspective, e.g. can CachingWrapperFilter already flip between dense and sparse sets? Shouldn't it use off-heap memory as in LUCENE-5052 rather than pollute the heap? Is filter behavior defined at index time, or can it be requested ad hoc? Etc.
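
As an illustration of the user-cache approach, here is a minimal sketch of how a query parser could consult such a cache; the cache name matches the snippet above, but the helper class and key scheme are hypothetical, not the committed implementation:

    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.solr.search.SolrCache;
    import org.apache.solr.search.SolrIndexSearcher;

    // Hypothetical helper: look up (or create and cache) the wrapped parent
    // filter in the user cache configured in solrconfig.xml.
    final class ParentFilterCache {
      @SuppressWarnings("unchecked")
      static Filter getParentFilter(SolrIndexSearcher searcher, Query parentQuery) {
        SolrCache<Query, Filter> cache =
            (SolrCache<Query, Filter>) searcher.getCache("cachingWrapperFilterCache");
        if (cache == null) {
          // No cache configured: still functionally complete, just "suboptimal",
          // since the parent bitset is rebuilt on every request.
          return new QueryWrapperFilter(parentQuery);
        }
        Filter filter = cache.get(parentQuery);
        if (filter == null) {
          filter = new CachingWrapperFilter(new QueryWrapperFilter(parentQuery));
          cache.put(parentQuery, filter);
        }
        return filter;
      }
    }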

Once again, what about going ahead with a small configuration footprint, without diverging join queries?

          thanks

          Robert Muir added a comment -

Once again, what about going ahead with a small configuration footprint, without diverging join queries?

I am +1 to that setup, like the original patches on this issue (especially for block joins, it's typically only ONE filter anyway, and one that currently must support some "special" stuff like prevSetBit, compared to general filters). So even long term, maybe a separate cache makes sense, because e.g. you want to use compressed implementations for your 'ordinary' filters.

Solr being broken on a per-segment basis seems crucial to me too.
I suppose that refactoring will take even more time than this modest piece of functionality.

I'm willing to volunteer a significant amount of my time (e.g. starting right now) to make it happen. As it stands now, it's very difficult for users to use Solr's features if their index is changing rapidly: lots of garbage and slow reopens. But right or wrong, I always felt restrained from fixing this, for some of the same reasons Uwe mentions...

          Yonik Seeley added a comment -

I don't think we should start with

          > Ah, the joy of lucene vs solr politics

          Heh. And then you proceed to spout a bunch of political b.s.

          ASF subversion and git services added a comment -

          Commit 1513290 from Yonik Seeley in branch 'dev/trunk'
          [ https://svn.apache.org/r1513290 ]

          SOLR-3076: block join parent and child queries

          Yonik Seeley added a comment -

          Committed w/ Mikhail's user cache approach. To everyone who contributed to this, thank you for your patience.

          Vadim Kirilchuk added a comment - - edited

Thank you Yonik Seeley! We have all been waiting for this for a long time!

Btw, as there are still many things we need to address (for example DIH support): should we create subtasks under this JIRA, or create another JIRA like "Improving block joins support" with new subtasks? Wdyt?

          ASF subversion and git services added a comment -

          Commit 1513577 from Yonik Seeley in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1513577 ]

          SOLR-3076: block join parent and child queries

          Uwe Schindler added a comment -

          Hi Yonik, hi Mikhail,

the committed version seems to be much better than the top-level cache one! Many, many thanks for committing that one! This was my only problem with it. But as you say, we should really work on getting Solr to no longer use top-level caches for filters and facets. Also, filters should not use OpenBitSet anymore, but instead FixedBitSet or one of the new compressed bitsets (maybe off-heap). FixedBitSet is also better supported by internal APIs, as some algorithms can use it directly (e.g. in BooleanFilter); not sure whether this is relevant for Solr.

          Yonik Seeley added a comment -

          Closing...
          I opened SOLR-5142 for additional work.

          Yonik Seeley added a comment -

Filters should not use OpenBitSet anymore, but instead FixedBitSet

          Hey, it isn't my fault that Lucene chose to fork OpenBitSet

          Mikhail Khludnev added a comment -

          Yonik,
          Thanks and Congratulations!

          David Smiley added a comment -

FYI, there is an existing issue, with a patch that Greg Bowyer is/was working on, for per-segment caching in Solr: SOLR-3763.

          Anton added a comment - - edited

          Hi!

I'm trying to use this feature with a Solr nightly build: I added the

_root_

field and indexed data with inner <doc> blocks.

          ?q=*:*

gives me this data:

          <result name="response" numFound="2" start="0">
              <doc>
                  <int name="id">4</int>
                  <bool name="is_parent">false</bool>
                  <int name="_root_">3</int>
              </doc>
              <doc>
                  <int name="id">3</int>
                  <bool name="is_parent">true</bool>
                  <long name="_version_">1447068567577034752</long>
                  <int name="_root_">3</int>
              </doc>
          </result>
          

          But

          ?q={!parent which=is_parent:true}*:*

gives me an error:

          child query must only match non-parent docs, but parent docID=1 matched childScorer

          however

           ?q={!child of=is_parent:true}*:*

gives me the expected result (only one child doc):

          <result name="response" numFound="1" start="0">
              <doc>
                  <int name="id">4</int>
                  <bool name="is_parent">false</bool>
                  <int name="_root_">3</int>
              </doc>
          </result>
          

Please, can you explain what I'm doing wrong?

          Vadim Kirilchuk added a comment -

          Hi Anton,

It's better to ask such questions on the mailing lists, but let me answer anyway:
1) First of all, you don't need to explicitly specify the _root_ field; it is handled internally by Solr.
          2)

          ?q={!parent which=is_parent:true}*:*

gives you an exception because the child query must never match parent docs!

          But

          *:*

          breaks the constraint because it matches both child and parent docs. You can try:

          ?q={!parent which=is_parent:true}is_parent:false

          I hope it helps.
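
To make the fix concrete, here is a minimal end-to-end sketch reusing Anton's field names and ids (an illustration of the intended usage, not output captured from a specific build). At index time the child <doc> is simply nested inside the parent <doc>, and Solr fills in _root_ itself:

    <add>
        <doc>
            <field name="id">3</field>
            <field name="is_parent">true</field>
            <doc>
                <field name="id">4</field>
                <field name="is_parent">false</field>
            </doc>
        </doc>
    </add>

At query time, keep the two sides disjoint:

?q={!parent which=is_parent:true}is_parent:false

returns the parent (id=3) of any matching child, while

?q={!child of=is_parent:true}id:3

returns the children (here id=4) of the matching parent.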

          Hoss Man added a comment -


          NOTE: this issue mistakenly got listed in the 4.4 section of CHANGES.txt, but is first available in Solr 4.5.

          Adrien Grand added a comment -

          4.5 release -> bulk close


People

• Assignee: Unassigned
• Reporter: Grant Ingersoll
• Votes: 34
• Watchers: 47
