Lucene - Core / LUCENE-7304

Doc values based block join implementation

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      At query time the block join relies on a bitset to find the previous parent doc while advancing the doc id iterator. On large indices these bitsets can consume large amounts of JVM heap space. Also, because of how these bitsets are typically populated, the 'FixedBitSet' implementation is usually the one that gets used.

      The idea I had was to replace the bitset usage by a numeric doc values field that stores offsets. Each child doc stores how many docids it is from its parent doc and each parent stores how many docids it is apart from its first child. At query time this information can be used to perform the block join.

      I think another benefit of this approach is that external tools can now easily determine if a doc is part of a block of documents and perhaps this also helps index time sorting?
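
      The offset scheme proposed in this description can be sketched as follows. This is only an illustration, not code from the attached patches: the class name `BlockOffsets` and the method `encode` are hypothetical, a plain `long[]` stands in for the numeric doc values field, and blocks are assumed to be laid out as children followed by their parent. How the patch packs child and parent offsets into a single field is not shown here.

```java
import java.util.Arrays;

/** Illustrative only: per-document offsets for blocks laid out as children followed by their parent. */
final class BlockOffsets {

  /**
   * @param childrenPerBlock number of child docs in each block, in index order (hypothetical input)
   * @return one offset per docid: for a child, the distance forward to its parent;
   *         for a parent, the distance back to its first child (0 if it has no children)
   */
  static long[] encode(int[] childrenPerBlock) {
    int totalDocs = 0;
    for (int children : childrenPerBlock) {
      totalDocs += children + 1; // the children plus the parent itself
    }
    long[] offsets = new long[totalDocs];
    int doc = 0;
    for (int children : childrenPerBlock) {
      int parentDoc = doc + children;
      for (int child = doc; child < parentDoc; child++) {
        offsets[child] = parentDoc - child; // child -> parent distance
      }
      offsets[parentDoc] = children; // parent -> first child distance
      doc = parentDoc + 1;
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Two blocks: one with two children, one with a single child.
    System.out.println(Arrays.toString(encode(new int[] {2, 1})));
  }
}
```

      In a real index the application would write each of these values into a numeric doc values field on the corresponding document when it indexes the block.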

      1. LUCENE-7304.patch
        25 kB
        Martijn van Groningen
      2. LUCENE-7304.patch
        25 kB
        Martijn van Groningen
      3. LUCENE_7304.patch
        22 kB
        Martijn van Groningen
      4. LUCENE-7304-20160606.patch
        110 kB
        Paul Elschot
      5. LUCENE-7304-20160531.patch
        10 kB
        Paul Elschot
      6. LUCENE-5092-20140313.patch
        25 kB
        Paul Elschot
      7. LUCENE_7304.patch
        17 kB
        Martijn van Groningen

        Activity

        martijn.v.groningen Martijn van Groningen added a comment -

        Updated the patch. Added more tests and cleaned up a bit.

        To reiterate what this patch does: this query uses both an indexed field and a doc values field. The doc values field is used when DocIdSetIterator#advance(...) is invoked, to find the first child of a parent and then instruct the child iterator to advance to that first child. The doc values field serves roughly the same purpose as the BitSet does for ToParentBlockJoinQuery. The indexed field is used for normal forward iteration (DocIdSetIterator#nextDoc()).

        I'm still unsure whether this query should also use a doc values field for forward iteration. Each child would then store the offset to the next child. The last child's offset would be zero, meaning the parent is the next document. I think the upside of using only doc values fields is that it becomes easier to validate that the docid block structure is correct.
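
        The advance path described in this comment (an indexed field for moving forward over parents, a doc values field for locating a parent's first child) might look roughly like the sketch below. It is not taken from the patch: `DocValuesJoinSketch`, `advanceParent` and `firstChildOf` are hypothetical names, a sorted `int[]` stands in for the iterator over the indexed parent field, and a `long[]` stands in for the doc values field.

```java
/** Illustrative only: locating the first child of a parent from a stored offset instead of a bitset. */
final class DocValuesJoinSketch {

  /** Same sentinel value that Lucene's DocIdSetIterator uses to signal exhaustion. */
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /**
   * Stand-in for advancing an iterator over the indexed parent field:
   * returns the first parent docid >= target, or NO_MORE_DOCS.
   */
  static int advanceParent(int[] sortedParentDocs, int target) {
    for (int parent : sortedParentDocs) {
      if (parent >= target) {
        return parent;
      }
    }
    return NO_MORE_DOCS;
  }

  /**
   * Stand-in for reading the doc values field on a parent:
   * firstChildOffset[parentDoc] is the distance back to the parent's first child.
   */
  static int firstChildOf(long[] firstChildOffset, int parentDoc) {
    return parentDoc - (int) firstChildOffset[parentDoc];
  }

  public static void main(String[] args) {
    int[] parents = {2, 4};            // docs 0,1 are children of 2; doc 3 is a child of 4
    long[] offsets = {0, 0, 2, 0, 1};  // only the parent entries are meaningful here
    int parent = advanceParent(parents, 3);
    System.out.println("parent=" + parent + " firstChild=" + firstChildOf(offsets, parent));
  }
}
```

        Once the first child is known, the child iterator can be advanced to it directly, which is the step the bitset's previous-parent lookup provides in the existing query.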

        martijn.v.groningen Martijn van Groningen added a comment -

        It has been a while, but I had some time to get back to this. Updated the patch to account for the changes that have landed in master since then (iterator-based doc values API, two-phase query execution and scorer supplier).

        I ran the same performance test as before and, thanks to doc values compression, the offset field now takes 337387 bytes instead of the previous 839592 bytes, which is good!

        I'm still thinking about other ways of encoding the block of documents. Right now the parent documents have a doc values field with the offset to the first child docid. Instead, child documents could have a doc values field with the offset to their parent docid. That way the parent doc can be indexed before the child docs.

        martijn.v.groningen Martijn van Groningen added a comment -

        Changed the block join query to only require that parent docs store how far away their first child doc is (in docids).

        This reduces the amount of information that needs to be stored in the doc values offset field, and these parent offsets compress better than the previous offset values (which were composed of more information).

        I tested this patch on a test data set (https://archive.org/download/stackexchange/english.stackexchange.com.7z). I extracted the questions, answers and comments and indexed each question with its answers and related comments as a hierarchical block of documents. In total 745252 docs were indexed. The size of the doc values offset field was 839592 bytes.

        After that I ran a query that selects all questions that have answers with comments (questions -> answers -> comments) for both the current block join and the doc values block join. The current block join used 186768 bytes of JVM heap for bitsets and the doc values block join used 1132 bytes of JVM heap for references to the offset doc values field.

        So in total the doc values approach used roughly 4.5 times more RAM (assuming the OS caches the offset field), while its JVM memory footprint was roughly 165 times smaller.

        martijn.v.groningen Martijn van Groningen added a comment -

        The last time I tried doc values, I could not use advance(target) on them. Is that still the case?

        That is still the case. But the doc values block join works by storing offsets (how far away the first child doc is, in docids, and how far away the closest parent is), and at query time these are used to advance the child scorer. However, once doc values become iterator based, these offsets can be encoded much more efficiently than is now the case.

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        The last time I tried doc values, I could not use advance(target) on them. Is that still the case?
        If so, that will be a hurdle to overcome for a doc values based block join implementation.

        The other BitSets could also be used for several layers of blocks using the index at the lower level.

        martijn.v.groningen Martijn van Groningen added a comment -

        Paul Elschot, this is a lot of code. I really think this should be moved to a new issue, not just because of the size of the patch, but also because the implementation is different from what was initially proposed here. Also, I think that EliasFanoDocIdSet and friends shouldn't be added to core, but to the join module instead. EliasFano was superseded in core as a general-purpose DocIdSet by other implementations a while ago, and since it will now be used in the context of block join, it makes sense to just add it to the join module.

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        To save some space for multilevel blocks, at a higher level one could use an EliasFanoSequence of the indexes of the lower level.

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Patch of 6 June 2016.
        This is the EliasFano code from LUCENE-5627 put into core.

        This has EliasFanoSequence implemented as EliasFanoBytes and as EliasFanoLongs, and an encoder and a decoder for these.

        The EliasFanoDocIdSet uses an EliasFanoLongs except when it is dense; in that case it uses a FixedBitSet.

        I added a getBitSet() method in this EliasFanoDocIdSet.

        I also added the test cases from LUCENE-5627, but I did not add a test for the getBitSet() method yet. It works as a DocIdSet, so using it as a BitSet should be no problem.

        EliasFanoDocIdSet could also be implemented on EliasFanoBytes, and it should be doable to put that in an index. In LUCENE-5627, EliasFanoBytes is already used as a Payload.

        martijn.v.groningen Martijn van Groningen added a comment -

        I intend to make an EliasFanoDocIdSet that implements BitSet. I think it makes sense to try and use this as a starting point for a sparse doc values implementation, so for now I'm not opening a new issue.

        I don't follow. I thought that this new BitSet would be used for the current block join queries? The idea I had is that the doc values block join wouldn't rely on BitSet and would use a numeric doc values field instead. I'm not sure whether the doc values block join will be a better trade-off than the current block join in certain scenarios, but this issue is here to explore that.

        Is there a typical document block size these days?

        Most blocks are larger than 7 docs, but usually inside these blocks there are several layers of child documents (ranging from one to many). Each additional child level in a block requires a BitSet instance too. It really depends, and there is no typical block size.

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        This will take some time; I'll give it a try, slowly.

        I intend to make an EliasFanoDocIdSet that implements BitSet.
        I think it makes sense to try and use this as a starting point for a sparse doc values implementation, so for now I'm not opening a new issue.
        Unlike normal doc values, this would allow an advance(target) implementation.

        Meanwhile I realized that a doc values implementation will also have to deal with MutableBits.

        (... Typically it tends to be on the dense side).

        Is there a typical document block size these days?
        For block sizes below 7, an EliasFano-based implementation does not really make sense; a FixedBitSet is better there.
        The larger the block size gets beyond that, the more sense EliasFano makes.

        For nested blocks, EliasFano can be used hierarchically: at a higher level, the value in the dictionary can be the index into the dictionary at the lower level.
        Anyway, at any level there is always the possibility to use FixedBitSet or another BitSet implementation.
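
        To give a feel for why block size matters here, the sketch below compares a textbook Elias-Fano space estimate with the roughly one bit per doc that a plain bitset needs. It is an assumption-laden illustration, not the EliasFano code from the patches: `EliasFanoSizeEstimate` and its methods are hypothetical names, and the estimate ignores any index or skip structures as well as decoding cost, so it should not be read as reproducing the threshold of 7 quoted above.

```java
/** Illustrative only: textbook Elias-Fano size estimate versus a plain bitset. */
final class EliasFanoSizeEstimate {

  /**
   * Estimated bits for numValues sorted values below upperBound
   * (for block joins: number of parent docs and maxDoc). numValues must be > 0.
   */
  static long eliasFanoBits(long numValues, long upperBound) {
    long ratio = upperBound / numValues; // roughly the average block size
    // Number of low bits stored verbatim per value: floor(log2(ratio)), or 0 for dense sets.
    int lowBits = ratio <= 1 ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
    long lowerPart = numValues * lowBits;
    long upperPart = numValues + (upperBound >>> lowBits); // unary-coded high bits
    return lowerPart + upperPart;
  }

  /** A plain bitset needs about one bit per docid (word rounding ignored). */
  static long bitSetBits(long upperBound) {
    return upperBound;
  }

  public static void main(String[] args) {
    // 100 parents with an average block size of 7, then of 2.
    System.out.println(eliasFanoBits(100, 700) + " vs " + bitSetBits(700));
    System.out.println(eliasFanoBits(100, 200) + " vs " + bitSetBits(200));
  }
}
```

        Under these assumptions the estimate for small blocks exceeds the bitset size, and the advantage grows as blocks get larger, which matches the direction of the argument in this comment.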

        martijn.v.groningen Martijn van Groningen added a comment -

        There is a dilemma here: either introduce DocBlocksIterator, or not implement MutableBits.

        The block join queries are not using any of the methods that modify the bitset, so I think it is fine to not implement clear() and set() methods. Also it will not be a general purpose bitset, but specialized for the block join.

        paul.elschot@xs4all.nl Paul Elschot added a comment - - edited

        Instead of the patch it might be simpler to try and let EliasFanoDocIdSet extend from BitSet, even though it cannot implement MutableBits.
        There is a dilemma here: either introduce DocBlocksIterator, or not implement MutableBits.

        The question is which one would be preferable in the long term for the block join queries: DocBlocksIterator or BitSet?
        DocBlocksIterator is read only and might involve a little overhead.
        BitSet implements mutability but that is not needed for the block join queries.

        martijn.v.groningen Martijn van Groningen added a comment -

        This is only to show a possible direction, BitSetProducer in the join queries may also need to be replaced by a DocBlocksIteratorProducer.

        Cool. Let's iterate on this approach in a new issue, so that this issue can focus on the doc values based approach?

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Patch of 31 May 2016.
        Adds DocBlocksIterator and uses it in ToChildBlockJoinQuery only.

        This is mostly an update of LUCENE-5092 to today, except that it does not include the ToParentBlockJoinQuery yet.

        To my surprise this compiles, but I did not run the tests in the join module.

        This is only to show a possible direction, BitSetProducer in the join queries may also need to be replaced by a DocBlocksIteratorProducer.

        martijn.v.groningen Martijn van Groningen added a comment -

        Having different block join implementations with different trade-offs around is good. If EliasFanoDocIdSet can extend from `BitSet` then I think it would be a nice addition to the join module, so that `ToParentBlockJoinQuery` and friends can use it as `parentsFilter`. This way the block join that exists today can be improved in certain scenarios (I think that largely depends on how dense this parentsFilter is. Typically it tends to be on the dense side).

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        When heap space is the only problem, one could also leave the index unchanged and create an EliasFanoSequence based on long[]'s because that is a little faster than the one based on a BytesRef.
        One sequence per block level would be needed.

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        To use an EliasFano dictionary in an index, it would be better to start from the EliasFano code from LUCENE-5627, because that one also has an implementation on a BytesRef that is used as a payload there. From the BytesRef it would probably be easier to put it directly in an index.
        The same advanceToJustBefore() method (from DocBlocksIterator) would still need to be added.

        The above patch for LUCENE-5092 also moves block joins from FixedBitSet to DocBlocksIterator.
        For use here, that would allow two different implementations of DocBlocksIterator: the current FixedBitSet and an implementation based on doc values.

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Patch for LUCENE-5092 against trunk of 13 March 2014.
        Among other things, this adds the method advanceToJustBefore() to EliasFanoDocIdSet.

        martijn.v.groningen Martijn van Groningen added a comment -

        I still have an EliasFanoDocIdSet that could be used for block joins, see LUCENE-5092.

        I'm not familiar with EliasFanoDocIdSet, but can that implementation iterate backwards? The link to the pull request mentioned in that issue gives a 404, and from the patch in LUCENE-6484 it doesn't seem that this is supported.

        paul.elschot@xs4all.nl Paul Elschot added a comment -

        ... go backwards ... less than one bit per doc ...

        Maybe it is time to have another look at EliasFanoDocIdSet, see LUCENE-6484.
        I don't think it will really fit doc values; for block joins this needs one set per segment.

        I still have an EliasFanoDocIdSet that could be used for block joins, see LUCENE-5092.
        If there is interest in that, please let me know; the GitHub pull requests from that time did not survive the move to git.

        See also these graphs on performance http://people.apache.org/~jpountz/doc_id_sets.html
        Unfortunately RoaringDocIdSet is not shown in there, I'd expect that to be (easily made) bidirectional, too.

        martijn.v.groningen Martijn van Groningen added a comment -

        Does this approach work out to less than one bit per doc?

        Unfortunately it is more than that. But with the current block join implementation the memory cost increases (it requires extra bit sets) when there are multiple levels of parent-child relations, while with this approach the memory cost remains the same (it just needs one numeric doc values field to encode the multiple layers of document blocks).

        our doc values compression isn't THAT good yet

        Maybe if doc values become iterator based, then with delta encoding we could get closer to 1 bit per doc?

        mikemccand Michael McCandless added a comment -

        This is a neat idea!

        Does this approach work out to less than one bit per doc? I guess it must be more than that (our doc values compression isn't THAT good yet), but by switching to doc values, even though we need more RAM, it moves off-heap, right? So the OS manages keeping those bytes hot instead.

        martijn.v.groningen Martijn van Groningen added a comment -

        If we switched block joins to use numeric doc values, I am wondering if we would ever need to read doc values in reverse order?

        Yes, in this patch, but I think the logic can be changed so that at least doc values don't need to be read in reverse. Currently there is one offset field holding both the offset to the parent for child docs and the offset to the first child for parents. This can be split up into two fields, so that doc values never have to be read in reverse.

        jpountz Adrien Grand added a comment -

        If we switched block joins to use numeric doc values, I am wondering if we would ever need to read doc values in reverse order? The reason I am asking is that there have been some tensions to cut over doc values to an iterator API in order to improve compression and better deal with sparse doc values, see eg. LUCENE-7253:

        martijn.v.groningen Martijn van Groningen added a comment -

        I wonder... instead couldn't we get a DocIdSetIterator of parent docs and kind of intersect it with the child DISI?

        I wondered about that a while ago too, but we can't go backwards with `DocIdSetIterator`, and going backwards is what the advance method of the block join query requires ('parentBits.prevSetBit(parentTarget-1)') to figure out where the first child starts for 'parentTarget'.
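
        The backwards lookup mentioned here can be illustrated with a small sketch. It uses `java.util.BitSet` and its `previousSetBit` method as a stand-in for Lucene's bitset and the `prevSetBit` call quoted above; the class and method names are hypothetical, and this is not the actual block join code.

```java
import java.util.BitSet;

/** Illustrative only: finding where a parent's block starts by looking backwards in a parent bitset. */
final class BitSetJoinSketch {

  /**
   * Returns the docid of the first doc in the block that ends at parentTarget.
   * The previous parent marks the end of the previous block, so this block starts right after it.
   * previousSetBit returns -1 when there is no earlier parent, which yields docid 0.
   */
  static int firstDocOfBlock(BitSet parentBits, int parentTarget) {
    int prevParent = parentBits.previousSetBit(parentTarget - 1);
    return prevParent + 1;
  }

  public static void main(String[] args) {
    BitSet parentBits = new BitSet();
    parentBits.set(2); // docs 0,1 are children of parent 2
    parentBits.set(4); // doc 3 is a child of parent 4
    System.out.println(firstDocOfBlock(parentBits, 4));
    System.out.println(firstDocOfBlock(parentBits, 2));
  }
}
```

        A forward-only iterator has no equivalent of this lookup, which is the limitation the comment points out. If a parent has no children, the returned docid is the parent itself.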

        dsmiley David Smiley added a comment -

        This is interesting. I wonder... instead couldn't we get a DocIdSetIterator of parent docs and kind of intersect it with the child DISI? (no bitset, no potentially fragile encoding of relative doc ID offsets). This is a half-baked idea and I'm not sure if it even makes any sense :-P so take it with a grain of salt!

        martijn.v.groningen Martijn van Groningen added a comment -

        Attached a working version of a doc values based block join query.
        The app storing docs is responsible for adding the numeric doc values field with the right offsets.


          People

          • Assignee:
            Unassigned
            Reporter:
            martijn.v.groningen Martijn van Groningen
          • Votes:
            0
            Watchers:
            8