Lucene - Core

BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from IndexReader.
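
      As a rough illustration of what the summary proposes (a sketch written for this write-up, not code from any of the attached patches), the deleted-docs bits could be exposed through a DocIdSet-style iterator. The 2.4-era DocIdSetIterator contract (doc()/next()/skipTo(int)) is mirrored here by a small local interface so the snippet stands alone; all names are invented.

      // Hedged sketch only: a deletion bit vector exposed through an iterator that
      // walks the set (deleted) bits, mirroring the doc()/next()/skipTo() contract.
      interface DocIdIterator {
        int doc();                  // current deleted doc id
        boolean next();             // advance to the next deleted doc
        boolean skipTo(int target); // advance to the first deleted doc >= target
      }

      class DeletedDocsBits {
        private final byte[] bits;  // one bit per doc; a set bit means "deleted"
        private final int size;     // maxDoc

        DeletedDocsBits(byte[] bits, int size) { this.bits = bits; this.size = size; }

        boolean get(int doc) { return (bits[doc >> 3] & (1 << (doc & 7))) != 0; }

        DocIdIterator iterator() {
          return new DocIdIterator() {
            private int doc = -1;
            public int doc() { return doc; }
            public boolean next() { return skipTo(doc + 1); }
            public boolean skipTo(int target) {
              // naive bit-at-a-time scan; the thread below discusses faster nextSetBit variants
              for (int i = target; i < size; i++) {
                if (get(i)) { doc = i; return true; }
              }
              return false;
            }
          };
        }
      }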

      1. hacked-deliterator.patch
        5 kB
        Michael McCandless
      2. LUCENE-1476.patch
        17 kB
        Michael McCandless
      3. LUCENE-1476.patch
        17 kB
        Jason Rutherglen
      4. LUCENE-1476.patch
        17 kB
        Jason Rutherglen
      5. LUCENE-1476.patch
        14 kB
        Jason Rutherglen
      6. LUCENE-1476.patch
        3 kB
        Jason Rutherglen
      7. quasi_iterator_deletions_r2.diff
        5 kB
        Marvin Humphrey
      8. quasi_iterator_deletions_r3.diff
        6 kB
        Marvin Humphrey
      9. quasi_iterator_deletions.diff
        5 kB
        Marvin Humphrey
      10. searchdeletes.alg
        2 kB
        Jason Rutherglen
      11. sortBench2.py
        4 kB
        Michael McCandless
      12. sortCollate2.py
        1 kB
        Michael McCandless
      13. TestDeletesDocIdSet.java
        3 kB
        Jason Rutherglen

        Activity

        Jason Rutherglen added a comment -

        I won't be working on these, and they're old.

        Michael McCandless added a comment -

        > We need more performance data before going ahead with tombstone deletions. Sounds like the next step.

        Or I think morphing the tombstones idea into the "incremental copy on write block BitVector" idea you mentioned is a good next step to explore (for realtime search).

        Michael McCandless added a comment -

        > This seems like something we can port to Java and get into
        > contrib/benchmark. Particularly automatically creating the indexes.

        That would be great! Maybe it's not so much work... contrib/benchmark can already do the "rounds" to enumerate through the different combinations of query, index, etc.

        Automatically creating the indices is trickier; you'd want to 1) create the first index, 2) copy it and do X%tg deletions, 3) repeat 2 for different values of X.
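
        A minimal sketch of step 2 above (deleting a random X% from a copy of the base index), assuming the 2.4-era IndexReader API (open, maxDoc, isDeleted, deleteDocument, close); the class and method names here are made up, and copying the base index is assumed to have been done already (e.g. with cp -r):

        import java.util.Random;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.store.Directory;

        public class DeletePercent {
          /** Deletes roughly pct (0.0-1.0) of the docs in an already-copied index. */
          public static void deletePercent(Directory copiedIndex, double pct, long seed)
              throws Exception {
            IndexReader reader = IndexReader.open(copiedIndex);
            try {
              Random random = new Random(seed);
              int toDelete = (int) (reader.maxDoc() * pct);
              while (toDelete > 0) {
                int docID = random.nextInt(reader.maxDoc());
                if (!reader.isDeleted(docID)) {   // skip docs that were already deleted
                  reader.deleteDocument(docID);
                  toDelete--;
                }
              }
            } finally {
              reader.close();                     // commits the buffered deletions
            }
          }
        }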

        Also tricky is gathering a series of results and saving it somehow so that you can then make a baseline for other runs.

        Or we could keep using Python (pickling is so easy). It's nice to have a "scripting" layer on top of contrib/benchmark.

        Michael McCandless added a comment -

        > Thanks for running all these benchmarks. (I'm going to have to port the
        > benchmark suite sooner rather than later so that Lucy doesn't have to rely on
        > having algos worked out under Lucene.)

        No problem! And I don't mind working these things out / discussing
        them in Lucene – many of these ideas turn out very well, e.g.
        LUCENE-1483.

        Though... iteration may work better in C than it does in Java, so it's
        hard to know what to conclude here for Lucy.

        > I think it's important to note that the degradation at low deletion
        > percentages is not as bad as the chart seems to imply. The worst performance
        > was on the cheapest query, and the best performance was on the most expensive
        > query.

        True.

        > Furthermore, a 50% deletion percentage is very high. Regardless of how
        > deletions are implemented, the default merge policy ought to trigger
        > absorption of segments with deletion percentages over a certain threshold.
        > The actual threshold percentage is an arbitrary number because the ideal
        > tradeoff between index-time sacrifices and search-time sacrifices varies, but
        > my gut pick would be somewhere between 10% and 30% for bit vector deletions
        > and between 5% and 20% for iterated deletions.

        Right, we should have the default merge policy try to coalesce segments
        with too many deletions (Lucene does not do this today – you have to
        call expungeDeletes yourself, and even that's not perfect since it
        eliminates every single delete).

        > Ha, you're giving up a little early, amigo.

        > I do think it's clear that maintaining an individual deletions iterator for
        > each posting list isn't going to work well, particularly when it's implemented
        > using an opaque DocIdSetIterator and virtual methods. That can't compete with
        > what is essentially an array lookup (since BitVector.get() can be inlined) on
        > a shared bit vector, even at low deletion rates.

        Right.

        > I also think we can conclude that high deletion rates cause an accelerating
        > degradation with iterated deletions, and that if they are implemented, that
        > problem probably needs to be addressed with more aggressive segment merging.

        Agreed.

        > However, I don't think the benchmark data we've seen so far demonstrates that
        > the filtered deletions model won't work. Heck, with that model,
        > deletions get pulled out of TermDocs so we lose the per-iter null-check,
        > possibly yielding a small performance increase in the common case of zero
        > deletions.

        By "filtered deletions model", you mean treating deletions as a
        "normal" filter and moving "up" when they are applied? Right, we have
        not explored that yet, here.

        Except, filters are accessed by iteration in Lucene now (they used to
        be random access before LUCENE-584).

        So... now I'm wondering if we should go back and scrutinize the
        performance cost of switching from random access to iteration for
        filters, especially in cases where under-the-hood the filter needed to
        create a non-sparse repr anyway. It looks like there was a fair
        amount of discussion about performance, but no hard conclusions, in
        LUCENE-584, on first read. Hmmm.

        I think to do a fair test, we need to move deleted docs "up higher",
        but access them with a random access API?

        Except... the boolean query QPS are not that different between the
        no-deletions case and the 1% deletions case. And it's only the non-simple
        queries (boolean, span, phrase, etc.) that would see any difference
        in moving deletions "up" as a filter? I.e., their QPS is already so low
        that the added cost of doing the deletes "at the bottom" appears to be
        quite minor.

        (Conceptually, I completely agree that deletions are really the same
        thing as filters).
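
        To make the comparison concrete, here is a small self-contained sketch of the two deleted-docs access patterns being weighed in this thread; it uses plain arrays and invented names rather than the real TermDocs/DocIdSetIterator classes, and assumes postings and sortedDeletedDocs are sorted ascending.

        public class DeletionAccessPatterns {

          // (a) random access, as trunk effectively does via BitVector.get()
          static int countRandomAccess(int[] postings, boolean[] deleted) {
            int hits = 0;
            for (int i = 0; i < postings.length; i++) {
              if (!deleted[postings[i]]) {
                hits++;
              }
            }
            return hits;
          }

          // (b) iterator-style access, as in the quasi-iterator patches: keep a cursor
          // over a sorted array of deleted doc ids and only advance it past the candidate
          static int countIterated(int[] postings, int[] sortedDeletedDocs) {
            int hits = 0;
            int cursor = 0;
            int nextDeletion =
                sortedDeletedDocs.length == 0 ? Integer.MAX_VALUE : sortedDeletedDocs[0];
            for (int i = 0; i < postings.length; i++) {
              int doc = postings[i];
              while (nextDeletion < doc) {
                cursor++;
                nextDeletion =
                    cursor < sortedDeletedDocs.length ? sortedDeletedDocs[cursor] : Integer.MAX_VALUE;
              }
              if (doc != nextDeletion) {
                hits++;
              }
            }
            return hits;
          }
        }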

        > Maybe we should close this issue with a won't-fix and start a new one for
        > filtered deletions?

        Maybe discuss a bit more here or on java-dev first?

        Jason Rutherglen added a comment -

        > Just run sortBench2.py in contrib/benchmark of trunk & patch
        > areas. Then run sortCollate2.py to make the Jira table (-jira) or
        > print a human readable output (default). You'll have to make your own
        > Wikipedia indices with the pctg deletes, then edit sortBench2.py &
        > sortCollate2.py to fix the paths.

        > All they do is write an alg file, run the test, and parse the output
        > file to gather best of 5.

        This seems like something we can port to Java and get into
        contrib/benchmark. Particularly automatically creating the indexes.

        Jason Rutherglen added a comment -

        > Maybe we should close this issue with a won't-fix and start a new one for filtered deletions?

        We need more performance data before going ahead with tombstone deletions. Sounds like the next step.

        Marvin Humphrey added a comment -

        > Actually I used your entire patch on its own.

        Ah. Thanks for running all these benchmarks. (I'm going to have to port the
        benchmark suite sooner rather than later so that Lucy doesn't have to rely on
        having algos worked out under Lucene.)

        I think it's important to note that the degradation at low deletion
        percentages is not as bad as the chart seems to imply. The worst performance
        was on the cheapest query, and the best performance was on the most expensive
        query.

        Furthermore, a 50% deletion percentage is very high. Regardless of how
        deletions are implemented, the default merge policy ought to trigger
        absorption of segments with deletion percentages over a certain threshold.
        The actual threshold percentage is an arbitrary number because the ideal
        tradeoff between index-time sacrifices and search-time sacrifices varies, but
        my gut pick would be somewhere between 10% and 30% for bit vector deletions
        and between 5% and 20% for iterated deletions.

        > I'm now feeling like we're gonna have to keep random-access to deleted
        > docs....

        Ha, you're giving up a little early, amigo.

        I do think it's clear that maintaining an individual deletions iterator for
        each posting list isn't going to work well, particularly when it's implemented
        using an opaque DocIdSetIterator and virtual methods. That can't compete with
        what is essentially an array lookup (since BitVector.get() can be inlined) on
        a shared bit vector, even at low deletion rates.

        I also think we can conclude that high deletion rates cause an accelerating
        degradation with iterated deletions, and that if they are implemented, that
        problem probably needs to be addressed with more aggressive segment merging.

        However, I don't think the benchmark data we've seen so far demonstrates that
        the filtered deletions model won't work. Heck, with that model,
        deletions get pulled out of TermDocs so we lose the per-iter null-check,
        possibly yielding a small performance increase in the common case of zero
        deletions.

        Maybe we should close this issue with a won't-fix and start a new one for
        filtered deletions?

        Michael McCandless added a comment -

        Alas.... I had a bug in my original test (my SegmentTermDocs was
        incorrectly returning some deleted docs). But even with that bug I
        can't repro my original "it's faster at < 10% deletions". Here are my
        results using a pre-computed array of deleted docIDs:

        ||%tg deletes||query||hits||qps||qpsnew||pctg||
        |0%|147|4984|5560.1|5392.5|-3.0%|
        |0%|text|97191|347.3|334.1|-3.8%|
        |0%|1 AND 2|234634|22.9|22.8|-0.4%|
        |0%|1|386435|88.4|86.0|-2.7%|
        |0%|1 OR 2|535584|20.9|20.8|-0.5%|
        |1%|147|4933|5082.0|3643.5|-28.3%|
        |1%|text|96143|313.9|304.9|-2.9%|
        |1%|1 AND 2|232250|22.1|22.3|0.9%|
        |1%|1|382498|81.0|82.3|1.6%|
        |1%|1 OR 2|530212|20.2|20.2|0.0%|
        |2%|147|4883|5133.6|3299.6|-35.7%|
        |2%|text|95190|315.8|289.7|-8.3%|
        |2%|1 AND 2|229870|22.2|22.1|-0.5%|
        |2%|1|378641|81.2|80.9|-0.4%|
        |2%|1 OR 2|524873|20.3|20.2|-0.5%|
        |5%|147|4729|5073.6|2405.2|-52.6%|
        |5%|text|92293|315.2|259.0|-17.8%|
        |5%|1 AND 2|222859|22.5|22.0|-2.2%|
        |5%|1|367000|81.0|77.6|-4.2%|
        |5%|1 OR 2|508632|20.4|19.7|-3.4%|
        |10%|147|4475|5049.6|1738.8|-65.6%|
        |10%|text|87504|314.8|232.6|-26.1%|
        |10%|1 AND 2|210982|22.9|21.7|-5.2%|
        |10%|1|347664|81.5|74.0|-9.2%|
        |10%|1 OR 2|481792|21.2|20.2|-4.7%|
        |20%|147|4012|5045.0|1117.6|-77.8%|
        |20%|text|77980|317.2|208.9|-34.1%|
        |20%|1 AND 2|187605|23.9|21.4|-10.5%|
        |20%|1|309040|82.0|68.2|-16.8%|
        |20%|1 OR 2|428232|22.3|20.2|-9.4%|
        |50%|147|2463|5283.2|522.3|-90.1%|
        |50%|text|48331|336.9|176.4|-47.6%|
        |50%|1 AND 2|116887|28.4|23.0|-19.0%|
        |50%|1|193154|86.4|63.5|-26.5%|
        |50%|1 OR 2|267525|27.6|22.4|-18.8%|

        I've attached my patch, but note that some tests fail because I don't update the list of deleted docs when deleteDocument is called.

        I'm now feeling like we're gonna have to keep random-access to deleted docs....

        Michael McCandless added a comment -

        > Presumably you spliced the improved nextSetBit into Jason's patch, correct?

        Actually I used your entire patch on its own.

        > Didn't you have a close-to-ideal patch using sorted ints that performed well up to 10% deletions? What did that look like?

        I thought I did – but it was rather hacked up (I "fixed" SegmentReader to always do an up-front conversion into int[] deletedDocs). I'll re-test it to try to repro my initial rough results.

        > I also suspect that when there are many deletions, the sheer number of method calls to perform the deletions iteration is a burden. The iterator has to compete with an inline-able method from a final class (BitVector).

        Right, for a highish %tg deletion it seems likely that random-access will win.

        Marvin Humphrey added a comment -

        > Numbers with Marvin's latest patch:

        Presumably you spliced the improved nextSetBit into Jason's patch, correct? I wonder how my patch on its own would do, since there's less abstraction. Didn't you have a close-to-ideal patch using sorted ints that performed well up to 10% deletions? What did that look like?

        > I think we should try list-of-sorted-ints?

        That should help with the situation where deletes are sparse, particularly when the term is rare (such as those searches for "147"), since it will remove the cost of scanning through a bunch of empty bits.

        I'm also curious what happens if we do without the null-check here:

        +      if (deletedDocsIt != null) {
        +        if (doc > nextDeletion) {
        +          if (deletedDocsIt.skipTo(doc)) 
        +            nextDeletion = deletedDocsIt.doc();
        +        } 
        +        if (doc == nextDeletion)
        +          continue;
               }
        

        When there are no deletions, nextDeletion is set and left at Integer.MAX_VALUE, so we'd get a comparison that's always false for the life of the TermDocs instead of an always-null null check. Possibly we'd slow down the no-deletions case while speeding up all others, but maybe the processor does a good job at predicting the result of the comparison.
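
        A hedged sketch of that variant, adapting the hunk above (same field names; not from any attached patch):

        // nextDeletion is initialized once, when the TermDocs is set up:
        //   nextDeletion = (deletedDocsIt == null) ? Integer.MAX_VALUE : -1;
        // then the per-doc check needs no null test at all:
        if (doc > nextDeletion) {
          nextDeletion = deletedDocsIt.skipTo(doc) ? deletedDocsIt.doc() : Integer.MAX_VALUE;
        }
        if (doc == nextDeletion)
          continue;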

        I also suspect that when there are many deletions, the sheer number of method calls to perform the deletions iteration is a burden. The iterator has to compete with an inline-able method from a final class (BitVector).

        Michael McCandless added a comment -

        > The main problem we're trying to solve is potential allocation of a
        > large del docs BV byte array for the copy on write of a cloned
        > reader.

        Right, as long as normal search performance does not get worse.
        Actually, I was hoping "deletes as iterator" and "deletes higher up as
        filter" might give us some gains in search performance.

        > An option we haven't looked at is a MultiByteArray where
        > multiple byte arrays make up a virtual byte array checked by BV.get.
        > On deleteDocument, only the byte array chunks that are changed are
        > replaced in the new version, while the previously copied chunks are
        > kept. The overhead of the BV.get can be minimal, though in our tests
        > with an int array version the performance can either be equal to or
        > double based on factors we are not aware of.

        I think that'd be a good approach (it amortizes the copy on write
        cost), though it'd be a double deref per lookup with the
        straightforward impl so I think it'll hurt normal search perf too.

        And I don't think we should give up on iterator access just yet... I
        think we should try list-of-sorted-ints?

        Michael McCandless added a comment -

        Numbers with Marvin's latest patch:

        ||%tg deletes||query||hits||qps||qpsnew||pctg||
        |0%|147|4984|5560.1|5507.8|-0.9%|
        |0%|text|97191|347.3|338.8|-2.4%|
        |0%|1 AND 2|234634|22.9|22.8|-0.4%|
        |0%|1|386435|88.4|86.9|-1.7%|
        |0%|1 OR 2|535584|20.9|20.7|-1.0%|
        |1%|147|4933|5082.0|3292.2|-35.2%|
        |1%|text|96143|313.9|260.8|-16.9%|
        |1%|1 AND 2|232250|22.1|21.9|-0.9%|
        |1%|1|382498|81.0|79.6|-1.7%|
        |1%|1 OR 2|530212|20.2|20.3|0.5%|
        |2%|147|4883|5133.6|3092.0|-39.8%|
        |2%|text|95190|315.8|232.8|-26.3%|
        |2%|1 AND 2|229870|22.2|21.2|-4.5%|
        |2%|1|378641|81.2|76.4|-5.9%|
        |2%|1 OR 2|524873|20.3|19.8|-2.5%|
        |5%|147|4729|5073.6|3478.5|-31.4%|
        |5%|text|92293|315.2|219.1|-30.5%|
        |5%|1 AND 2|222859|22.5|20.5|-8.9%|
        |5%|1|367000|81.0|68.5|-15.4%|
        |5%|1 OR 2|508632|20.4|18.9|-7.4%|
        |10%|147|4475|5049.6|3547.8|-29.7%|
        |10%|text|87504|314.8|222.2|-29.4%|
        |10%|1 AND 2|210982|22.9|19.4|-15.3%|
        |10%|1|347664|81.5|61.6|-24.4%|
        |10%|1 OR 2|481792|21.2|18.4|-13.2%|
        |20%|147|4012|5045.0|3741.8|-25.8%|
        |20%|text|77980|317.2|232.9|-26.6%|
        |20%|1 AND 2|187605|23.9|20.5|-14.2%|
        |20%|1|309040|82.0|59.0|-28.0%|
        |20%|1 OR 2|428232|22.3|18.8|-15.7%|
        |50%|147|2463|5283.2|4712.1|-10.8%|
        |50%|text|48331|336.9|296.2|-12.1%|
        |50%|1 AND 2|116887|28.4|26.6|-6.3%|
        |50%|1|193154|86.4|77.3|-10.5%|
        |50%|1 OR 2|267525|27.6|25.7|-6.9%|

        It's definitely better than before but still slower than trunk.

        Marvin Humphrey added a comment -

        > the nextSetBit impl in BitVector (in the patch) is extremely inefficient.

        Here's an improved version that doesn't call get().
        (quasi_iterator_deletions_r3.diff)

        The private method firstBitInNonZeroByte() in this patch could potentially
        be replaced with a 256-byte lookup table.
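
        For illustration, a self-contained sketch of what such a lookup table could look like (names invented here; this is not the code in the r3 patch):

        // Precompute, for every non-zero byte value, the index of its lowest set bit.
        public class FirstBitTable {
          private static final byte[] FIRST_BIT = new byte[256];
          static {
            for (int value = 1; value < 256; value++) {
              int bit = 0;
              while ((value & (1 << bit)) == 0) {
                bit++;
              }
              FIRST_BIT[value] = (byte) bit;
            }
          }

          /** value must be non-zero, as with firstBitInNonZeroByte() in the patch. */
          static int firstBitInNonZeroByte(int value) {
            return FIRST_BIT[value & 0xFF];
          }
        }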

        Jason Rutherglen added a comment -

        bq: shut down all extraneous processes

        It's a desktop machine though, so it's going to have some stuff
        running in the background, most of which I'm not aware of, being a Mac
        newbie.

        bq: Actually I meant a simple sorted list of ints, but even for that
        I'm worried about the skipTo cost (if we use a normal binary search)

        Skipping is slower because it unnecessarily checks bits that are not
        useful to the query. A higher level deletions Filter, implemented
        perhaps in IndexSearcher, requires that deleted docs pass through
        the SegmentTermDocs doc[] cache, which could add unnecessary overhead
        from the vint decoding.

        The main problem we're trying to solve is potential allocation of a
        large del docs BV byte array for the copy on write of a cloned
        reader. An option we haven't looked at is a MultiByteArray where
        multiple byte arrays make up a virtual byte array checked by BV.get.
        On deleteDocument, only the byte array chunks that are changed are
        replaced in the new version, while the previously copied chunks are
        kept. The overhead of the BV.get can be minimal, though in our tests
        with an int array version the performance can either be equal to or
        double based on factors we are not aware of.
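
        A minimal sketch of that chunked copy-on-write idea (every name here is invented; it is not code from any attached patch): get() becomes the double dereference discussed above, and a clone only duplicates the chunks it actually modifies.

        public class ChunkedBits {
          private static final int CHUNK_SHIFT = 12;             // 4 KB chunks of bytes
          private static final int CHUNK_SIZE = 1 << CHUNK_SHIFT;

          private final byte[][] chunks;

          public ChunkedBits(int numBits) {
            chunks = new byte[((numBits >> 3) / CHUNK_SIZE) + 1][CHUNK_SIZE];
          }

          private ChunkedBits(byte[][] chunks) { this.chunks = chunks; }

          public boolean get(int doc) {                          // double dereference per lookup
            int byteIndex = doc >> 3;
            byte b = chunks[byteIndex >> CHUNK_SHIFT][byteIndex & (CHUNK_SIZE - 1)];
            return (b & (1 << (doc & 7))) != 0;
          }

          /** Shares every chunk with this instance; set() below un-shares lazily. */
          public ChunkedBits copyOnWriteClone() {
            return new ChunkedBits((byte[][]) chunks.clone());
          }

          public void set(int doc) {
            int byteIndex = doc >> 3;
            int chunk = byteIndex >> CHUNK_SHIFT;
            byte[] copy = (byte[]) chunks[chunk].clone();        // copy only the touched chunk
            copy[byteIndex & (CHUNK_SIZE - 1)] |= (byte) (1 << (doc & 7));
            chunks[chunk] = copy;
            // a real implementation would remember which chunks are already private
            // so repeated sets in the same chunk don't re-copy it
          }
        }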

        Michael McCandless added a comment -

        Actually I made one mistake running your standalone test – I had
        allowed the "createIndex" to run more than once, and so I think I had
        tested 30K docs with 1875 deletes (6.25%).

        I just removed the index and recreated it, so I have 15K docs and 1875
        deletes (12.5%). On the mac pro I now see the patch at 4.0% slower
        (4672 ms to 4859 ms), and on a Debian Linux box (kernel 2.6.22.1, java
        1.5.0_08-b03) I see it 0.8% slower (7298 ms to 7357 ms).

        > The Mac can be somewhat unreliable for performance results

        I've actually found it to be quite reliable. What I love most about
        it is, as long as you shut down all extraneous processes, it gives
        very repeatable results. I haven't found the same true (or, less so)
        of various Linuxes & Windows.

        > OpenBitSet didn't seem to make much of a difference

        This is very hard to believe – the nextSetBit impl in BitVector (in
        the patch) is extremely inefficient. OpenBitSet's impl ought to be
        much faster.

        > The other option is something like P4Delta which stores the doc
        > ids in a compressed form solely for iterating.

        I think that will be too costly here (but is a good fit for
        postings).

        > Is this what you mean by sparse representation?

        Actually I meant a simple sorted list of ints, but even for that I'm
        worried about the skipTo cost (if we use a normal binary search). I'm
        not sure it can be made fast enough (ie faster than random access
        we have today).
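
        For reference, a self-contained sketch (invented names, not from any attached patch) of that sorted-int-list representation with skipTo() as a plain binary search — the cost being worried about here:

        import java.util.Arrays;

        public class SortedIntsDeletedDocs {
          private final int[] deletedDocs;   // deleted doc ids, ascending, no duplicates
          private int cursor = -1;

          public SortedIntsDeletedDocs(int[] sortedDeletedDocs) {
            this.deletedDocs = sortedDeletedDocs;
          }

          public int doc() { return deletedDocs[cursor]; }

          public boolean next() {
            return ++cursor < deletedDocs.length;
          }

          /** Positions on the first deleted doc >= target. */
          public boolean skipTo(int target) {
            int idx = Arrays.binarySearch(deletedDocs, target);
            if (idx < 0) {
              idx = -idx - 1;                // insertion point
            }
            cursor = idx;                    // a real impl would search only from the old cursor forward
            return cursor < deletedDocs.length;
          }
        }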

        Michael McCandless added a comment -

        I'm attaching the Python code I use to run the tests (it's adapted from LUCENE-1483). You also need the following nocommit diff applied:

        Index: contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
        ===================================================================
        --- contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java	(revision 738896)
        +++ contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java	(working copy)
        @@ -62,6 +62,9 @@
             super(runData);
           }
         
        +  // nocommit
        +  static boolean first = true;
        +
           public int doLogic() throws Exception {
             int res = 0;
             boolean closeReader = false;
        @@ -102,6 +105,12 @@
                 }
                 //System.out.println("q=" + q + ":" + hits.totalHits + " total hits"); 
         
        +        // nocommit
        +        if (first) {
        +          System.out.println("NUMHITS=" + hits.totalHits);
        +          first = false;
        +        }
        +
                 if (withTraverse()) {
                   final ScoreDoc[] scoreDocs = hits.scoreDocs;
                   int traversalSize = Math.min(scoreDocs.length, traversalSize());
        

        Just run sortBench2.py in contrib/benchmark of trunk & patch areas. Then run sortCollate2.py to make the Jira table (-jira) or print a human readable output (default). You'll have to make your own Wikipedia indices with the pctg deletes, then edit sortBench2.py & sortCollate2.py to fix the paths.

        All they do is write an alg file, run the test, and parse the output file to gather best of 5.

        Jason Rutherglen added a comment -

        Interesting results. Mike, can you post your test code? The Mac can
        be somewhat unreliable for performance results as today the
        TestDeletesDocIdSet is about the same speed as trunk.

        OpenBitSet didn't seem to make much of a difference; however, when
        running your tests I can check it out. The other option is something
        like P4Delta, which stores the doc ids in a compressed form solely for
        iterating. Is this what you mean by sparse representation?

        Michael McCandless added a comment -

        Hmm... when I run the TestDeletesDocIdSet, I don't see as much
        improvement: trunk gets 10.507 seconds, patch gets 10.158 (~3.3%
        faster). I'm running on OS X 10.5.6, quad core Intel; java version is
        "1.6.0_07-b06-153" and I run "java -server -Xbatch -Xmx1024M
        -Xms1024M".

        But that test is rather synthetic: you create 15,000 docs, then delete
        1 in 8, then do a search (for "text") that matches all of the docs.

        So I went back to contrib/benchmark... I created a single-segment
        index, with 0%, 1%, 2%, 5%, 10%, 20% and 50% deletions, using first 2M
        docs from Wikipedia, then ran 5 different queries and compared qps w/
        patch vs trunk. Results:

        ||%tg deletes||query||hits||qps||qpsnew||%tg change||
        |0%|147|4984|5560.1|5486.2|-1.3%|
        |0%|text|97191|347.3|339.4|-2.3%|
        |0%|1 AND 2|234634|22.9|22.8|-0.4%|
        |0%|1|386435|88.4|87.2|-1.4%|
        |0%|1 OR 2|535584|20.9|20.9|0.0%|
        |1%|147|4933|5082.0|1419.2|-72.1%|
        |1%|text|96143|313.9|142.0|-54.8%|
        |1%|1 AND 2|232250|22.1|18.6|-15.8%|
        |1%|1|382498|81.0|62.2|-23.2%|
        |1%|1 OR 2|530212|20.2|17.5|-13.4%|
        |2%|147|4883|5133.6|1959.0|-61.8%|
        |2%|text|95190|315.8|149.2|-52.8%|
        |2%|1 AND 2|229870|22.2|18.4|-17.1%|
        |2%|1|378641|81.2|58.9|-27.5%|
        |2%|1 OR 2|524873|20.3|17.1|-15.8%|
        |5%|147|4729|5073.6|2600.8|-48.7%|
        |5%|text|92293|315.2|166.9|-47.0%|
        |5%|1 AND 2|222859|22.5|17.8|-20.9%|
        |5%|1|367000|81.0|56.2|-30.6%|
        |5%|1 OR 2|508632|20.4|16.3|-20.1%|
        |10%|147|4475|5049.6|2953.7|-41.5%|
        |10%|text|87504|314.8|180.9|-42.5%|
        |10%|1 AND 2|210982|22.9|17.8|-22.3%|
        |10%|1|347664|81.5|53.8|-34.0%|
        |10%|1 OR 2|481792|21.2|16.5|-22.2%|
        |20%|147|4012|5045.0|3455.5|-31.5%|
        |20%|text|77980|317.2|204.7|-35.5%|
        |20%|1 AND 2|187605|23.9|19.2|-19.7%|
        |20%|1|309040|82.0|54.7|-33.3%|
        |20%|1 OR 2|428232|22.3|17.5|-21.5%|
        |50%|147|2463|5283.2|4731.4|-10.4%|
        |50%|text|48331|336.9|290.2|-13.9%|
        |50%|1 AND 2|116887|28.4|25.9|-8.8%|
        |50%|1|193154|86.4|74.9|-13.3%|
        |50%|1 OR 2|267525|27.6|24.9|-9.8%|

        I think one major source of slowness is the BitVector.nextSetBit()
        impl: it now checks one bit at a time, but it'd be much better to use
        OpenBitSet's approach.
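
        For illustration, a word-at-a-time nextSetBit in the OpenBitSet style, written here against a plain long[] and using Long.numberOfTrailingZeros (Java 5+) in place of OpenBitSet's own bit-twiddling; this is a sketch, not the BitVector or OpenBitSet source:

        public class WordScan {
          /** Index of the first set bit at or after index, or -1 if there is none. */
          static int nextSetBit(long[] words, int index) {
            int wordIndex = index >> 6;
            if (wordIndex >= words.length) {
              return -1;
            }
            long word = words[wordIndex] & (-1L << (index & 63));  // mask off bits below index
            while (true) {
              if (word != 0) {
                return (wordIndex << 6) + Long.numberOfTrailingZeros(word);
              }
              if (++wordIndex == words.length) {
                return -1;
              }
              word = words[wordIndex];
            }
          }
        }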

        I also think, realistically, this approach won't perform very well
        until we switch to a sparse representation for the bit set, so that
        next() and skipTo() perform well.

        Michael McCandless added a comment -

        I had some trouble w/ the latest patch (I'm trying to reproduce your
        strange problem w/ contrib/benchmark & run perf tests):

        • MultiDocIdSet was missing (I just pulled from earlier patch)
        • Compilation errors because {Filter,Parallel}IndexReader failed to
          mark getDeletedDocs public (I made them public).

        • contrib/instantiated failed to compile, because its IndexReader
          impl failed to implement getDocIdSet... but we can't add abstract
          IndexReader.getDocIdSet (this breaks back compat). I switched to
          non-abstract default method that throws
          UnsupportedOperationException instead, and added nocommit comment
          to remind us to fix all IndexReader subclasses in contrib.
        • I don't think you should change MemoryIndexReader from private to
          public? Why was that done? (I reverted this).
        • Replaced MemoryIndex.NullDocIdSet with the new
          DocIdSet.EMPTY_DOCIDSET
        • Somehow you lost recent fixes to MemoryIndex.java (committed as
          part of LUCENE-1316). This was causing NPE test failure (I
          reverted this, test passes now).

        New patch attached. All tests pass... next I'll try to repro the
        contrib/benchmark oddness.

        Jason Rutherglen added a comment -

        I implemented code that uses skipTo over deleted docs with BitVector
        replaced by OpenBitSet, and BitVector is faster. I'll save
        implementing the int array DocIdSet for the tombstone patch
        LUCENE-1526.

        Jason Rutherglen added a comment -

        Patch that the previously mentioned performance test uses.

        Jason Rutherglen added a comment -

        I created basic test code TestDeletesDocIdSet.java because I thought
        the previous results may have something to do with my
        misunderstanding of the contrib/benchmark code.

        The results show a 17% increase in performance for this patch's
        deleted docs skipTo code. I think now that it's probably the fact
        that ANT_OPTS isn't applying the -server option properly, as
        contrib/benchmark's build.xml forks the process. -server on Mac OS X
        seems to optimize method calls far better than -client.

        Trunk:
        warmup search duration: 6208
        final search duration: 6371

        LUCENE-1476, skipto:
        warmup search duration: 5926
        final search duration: 5450

        Michael McCandless added a comment -

        > Perhaps this should be an option in the benchmark output?

        That's a great idea!

        Something silly must be going on... 99% performance drop can't be right.

        Marvin Humphrey added a comment -

        > The percentage performance decrease in the previous
        > results is 99%.

        That's pretty strange. I look forward to seeing profiling data.

        Jason Rutherglen added a comment -

        The percentage performance decrease in the previous
        results is 99%.

        > Jason can you format those results using a Jira table?

        Perhaps this should be an option in the benchmark output?

        M.M. LUCENE-1516 comment: "I think the larger number of
        [harder-for-cpu-to-predict] if statements may be the cause of the
        slowdown once %tg deletes gets high enough?"

        I have been looking at the performance with YourKit and don't have
        any conclusions yet. The main difference between using skipTo and
        BV.get is the if statements and some added method calls, which even
        if they are inlined I suspect will not make up the difference.

        Next steps:
        1. Deletes as a NOT boolean query, which probably should
        be its own patch.
        2. Pluggable alternative representations such as
        OpenBitSet and int array, as part of this patch?

        Michael McCandless added a comment -

        Jason, can you format those results using a Jira table? It's really hard to read as is (and something tells me we will be looking at a lot of these tables shortly...).

        Also, showing %tg change vs baseline helps.

        Jason Rutherglen added a comment -

        searchdeletes.alg uses reuters, deletes many docs, then performs
        searches. If it's working properly, iteration rather than calling
        BitVector.get has some serious performance drawbacks.

        Compare:
        DocIdSet SrchNewRdr_8: 32.0 rec/s
        DelDoc.get SrchNewRdr_8: 2,959.5 rec/s

        Next step is running JProfiler. Perhaps BitVector needs to be
        replaced by OpenBitSet for iterating, or there's some other issue.

        BitVector.get:
        Operation          round mrg buf runCnt recsPerRun      rec/s elapsedSec avgUsedMem avgTotalMem
        CreateIndex            0  10 100      1          1       17.2       0.06  3,953,984   9,072,640
        CloseIndex             0  10 100      1          1    1,000.0       0.00  3,953,984   9,072,640
        Populate               0  10 100      1     200003    6,539.7      30.58  8,665,528  10,420,224
        Deletions              0  10 100      1       8002  533,466.7       0.01  8,665,528  10,420,224
        OpenReader(false)      0  10 100      1          1    1,000.0       0.00  8,691,040  10,420,224
        Seq_8000               0  10 100      1       8000  800,000.0       0.01  8,833,912  10,420,224
        CloseReader            0  10 100      9          1    2,250.0       0.00  7,672,217  10,420,224
        SrchNewRdr_8           0  10 100      1       4016    2,959.5       1.36  8,232,384  10,420,224
        OpenReader             0  10 100      8          1    1,333.3       0.01  7,579,584  10,420,224
        Seq_500                0  10 100      8        500    2,963.0       1.35  8,591,199  10,420,224

        DocIdSet:
        Operation          round mrg buf runCnt recsPerRun      rec/s elapsedSec avgUsedMem avgTotalMem
        CreateIndex            0  10 100      1          1       17.5       0.06  3,954,376   9,076,736
        CloseIndex             0  10 100      1          1    1,000.0       0.00  3,954,376   9,076,736
        Populate               0  10 100      1     200003    6,503.1      30.75  5,951,816  10,321,920
        Deletions              0  10 100      1       8002  500,125.0       0.02  6,190,816  10,321,920
        OpenReader(false)      0  10 100      1          1    1,000.0       0.00  5,976,960  10,321,920
        Seq_8000               0  10 100      1       8000  727,272.8       0.01  6,122,904  10,321,920
        CloseReader            0  10 100      9          1    3,000.0       0.00  7,727,980  10,321,920
        SrchNewRdr_8           0  10 100      1       4016       32.0     125.67  7,960,824  10,321,920
        OpenReader             0  10 100      8          1    1,333.3       0.01  7,742,855  10,321,920
        Seq_500                0  10 100      8        500       31.8     125.66  8,744,057  10,321,920

        Michael McCandless added a comment -

        For the perf tests, I would use an optimized index with maybe 2M
        wikipedia docs in it.

        Then test with maybe 0, 1, 5, 10, 25, 50, 75 percent deletions, across
        various kinds of queries (single term, OR, AND, phrase/span).
        Baseline w/ trunk, and then test w/ this patch (keeps deletion access
        low (@ SegmentTermDocs) but switches to iterator API). I'd love to
        also see numbers for deletion-applied-as-filter (high) eventually.

        [Actually if ever deletion %tg is > 50%, we should presumably invert
        the bit set and work with that instead. And same with filters.]
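
        To make the "deletion-applied-as-filter (high)" variant concrete, here is a minimal sketch (not part of the attached patches) that builds the complement of the deleted docs once per reader and hands it to the search as an ordinary DocIdSet/Filter. It assumes the IndexReader.getDeletedDocs() API proposed in this issue and the 2.4-era DocIdSetIterator contract (boolean next() / int doc()):

        import java.io.IOException;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.DocIdSet;
        import org.apache.lucene.search.DocIdSetIterator;
        import org.apache.lucene.util.OpenBitSet;

        class LiveDocsFilterUtil {
          static DocIdSet liveDocs(IndexReader reader) throws IOException {
            OpenBitSet bits = new OpenBitSet(reader.maxDoc());
            DocIdSet deleted = reader.getDeletedDocs();   // API proposed in this issue
            if (deleted != null) {
              DocIdSetIterator it = deleted.iterator();
              while (it.next()) {                         // mark each deleted doc
                bits.fastSet(it.doc());
              }
            }
            bits.flip(0, reader.maxDoc());                // invert: keep only live docs
            return bits;                                  // OpenBitSet is itself a DocIdSet
          }
        }

        If deletions ever exceed 50%, the flip() could be skipped and the deleted set used directly as a NOT clause, per the note above.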

        You might want to start with the Python scripts attached to
        LUCENE-1483; with some small mods you could easily fix them to run
        these tests.

        Michael McCandless added a comment -

        We probably need to support both as implementing top level deleted
        docs filtering may have unknown side effects. The user may decide
        based on their queries and other variables such as the number of
        deleted docs.

        I agree... and then, if the performance difference is large enough, it
        seems like we'll need some simple "search policy" for the interesting
        (Boolean) query scorers to pick the best way to execute a query.

        This could include which order to visit the segments in (we broached
        this in LUCENE-1483, since depending on the query different orders may
        perform better). And when (high vs low) & how (iterator vs random
        access) to apply a filter would also be decided by the search policy.

        Deprecating isDeleted might be good.

        I wonder how this method is used [externally] by applications,
        today... I'll go ask on java-user. And, whether all such uses could
        migrate to an iterator API instead without much cost.

        Would we need the read only readers?

        Good question... I'm guessing there would still be a performance
        benefit if the underlying data structures for deletions &
        column-stride fields know they cannot change?

        Jason Rutherglen added a comment -

        Should we deprecate isDeleted? And also deprecate document()
        checking whether a document is deleted (ie switch to a new API that
        returns document w/o checking deletions).

        Deprecating isDeleted might be good. Would we need the read-only
        readers? We can still offer get-method access to the deleted docs, if
        it's a bit set, by creating an abstract class DocIdBitSet that
        BitVector, java.util.BitSet, and OpenBitSet implement. This can
        happen by casting IR.getDeletedDocs to DocIdBitSet.

        Or perhaps move isDeleted to a new API that makes it clear
        that there is a performance cost to random-access of deletions? (And
        the first time it's called, it materializes a full bit vector).

        Seems best to keep deleted docs access to the DocIdSet and DocIdBitSet
        and not have IR.isDeleted.
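
        A rough sketch of the abstract class described above: a DocIdSet that also exposes random-access get(), so callers that really need isDeleted-style checks could cast the result of IR.getDeletedDocs(). The name and shape are illustrative only; note that org.apache.lucene.util.DocIdBitSet already exists (it wraps java.util.BitSet), so a different name would probably be needed in practice.

        import org.apache.lucene.search.DocIdSet;

        abstract class RandomAccessDocIdSet extends DocIdSet {
          /** Random-access membership test: is docId in the set (i.e. deleted)? */
          public abstract boolean get(int docId);
        }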

        Jason Rutherglen added a comment -

        Is there info on the wiki about how to use contrib/benchmark? (I didn't find any.)

        M.M.: "1) Should we switch to iterator only API for accessing deleted docs
        (vs random-access API we use today)?"

        I think we can by default support both. We may want to test (later
        on) the difference in speed of iterating over an OpenBitSet vs
        BitVector for the deleted docs.

        M.M.: "2) Should we take deletes into account at a higher level
        (BooleanScorer/2, as a top-level Filter) vs the lowest level (each
        posting, in SegmentTermDocs that we do today)?"

        We probably need to support both as implementing top level deleted
        docs filtering may have unknown side effects. The user may decide
        based on their queries and other variables such as the number of
        deleted docs.

        Michael McCandless added a comment -

        I am going to run the contrib benchmark tests on this patch vs trunk to see if there is a difference in performance

        I think this is the most important next step before we go too much
        further.

        There are two separate questions we are exploring here:

        1) Should we switch to iterator only API for accessing deleted docs
        (vs random-access API we use today)?

        2) Should we take deletes into account at a higher level
        (BooleanScorer/2, as a top-level Filter) vs the lowest level
        (each posting, in SegmentTermDocs that we do today)?

        We need to understand the performance impact with each combination of
        1) and 2), across different queries. Likely the answer will be "it
        depends on the query", but we should understand just how much. I
        think performance cost/benefit is the driver here.

        Michael McCandless added a comment -

        Indeed, how should we solve isDeleted for the tombstones
        implementation?

        Should we deprecate isDeleted? And also deprecate document() checking whether a document is deleted (ie switch to a new API that returns document w/o checking deletions).

        Or perhaps move isDeleted to a new API that makes it clear that there is a performance cost to random-access of deletions? (And the first time it's called, it materializes a full bit vector).

        Jason Rutherglen added a comment -

        I am going to run the contrib benchmark tests on this patch vs trunk to see if there is a difference in performance. Do the benchmarks have deleted docs?

        Jason Rutherglen added a comment -

        M.M.: "At a higher level, I'd imagine we'd want to store the deleted doc IDs as an
        integer array rather than a BitVector if there aren't very many of them. But
        I think that will mess with the random access required by
        IndexReader.isDeleted()."

        Indeed, how should we solve isDeleted for the tombstones
        implementation? Or do we simply assume it will be slow (requiring a
        linear scan)? Or perhaps borrow from Solr's HashDocSet (a minimal
        primitive-int-based hash) to implement it?
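
        For illustration, a HashDocSet-style structure could be as simple as an open-addressing set of primitive ints. This is only a sketch of the idea (Solr's actual HashDocSet differs in its details), but it shows how isDeleted() could stay roughly O(1) over a sparse tombstone list:

        class IntHashSet {
          private final int[] table;   // open addressing; -1 marks an empty slot
          private final int mask;

          IntHashSet(int[] docIds) {
            int cap = 8;
            while (cap < docIds.length * 2 + 1) {   // power of two, load factor < 0.5
              cap <<= 1;
            }
            table = new int[cap];
            java.util.Arrays.fill(table, -1);
            mask = cap - 1;
            for (int i = 0; i < docIds.length; i++) {
              add(docIds[i]);
            }
          }

          private void add(int doc) {
            int slot = (doc * 0x9E3779B9) & mask;   // cheap multiplicative hash
            while (table[slot] != -1 && table[slot] != doc) {
              slot = (slot + 1) & mask;             // linear probing
            }
            table[slot] = doc;
          }

          boolean contains(int doc) {               // the isDeleted-style check
            int slot = (doc * 0x9E3779B9) & mask;
            while (table[slot] != -1) {
              if (table[slot] == doc) {
                return true;
              }
              slot = (slot + 1) & mask;
            }
            return false;
          }
        }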

        Jason Rutherglen added a comment -

        The previous patch did not have MultiDocIdSet; this one does.

        Marvin Humphrey added a comment -

        > I implemented the SegmentTermsDocs using DocIdSetIterator.skipTo rather than
        > nextSetBit.

        Ah, I see. And you handled the -1 value properly. Well done.

        > Do you think the skipTo can be optimized?

        Provided that the compiler is clever enough to inline the various methods
        within BitVectorDocIdSetIterator (it should be), I don't see a way to improve
        on that part.
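
        For reference, the iterator being discussed is roughly this shape. This is only a sketch, not the attached patch's actual code, and it assumes the nextSetBit() method added by the patch plus the pre-2.9 DocIdSetIterator contract:

        import org.apache.lucene.search.DocIdSetIterator;
        import org.apache.lucene.util.BitVector;

        class BitVectorDocIdSetIterator extends DocIdSetIterator {
          private final BitVector deletedDocs;
          private int doc = -1;

          BitVectorDocIdSetIterator(BitVector deletedDocs) {
            this.deletedDocs = deletedDocs;
          }

          public int doc() {
            return doc;
          }

          public boolean next() {
            return skipTo(doc + 1);
          }

          public boolean skipTo(int target) {
            doc = deletedDocs.nextSetBit(target);   // -1 when no further set bits
            return doc != -1;
          }
        }

        With that shape, skipTo() is a single nextSetBit() call, so the remaining cost lives in nextSetBit() itself and in the extra branches in SegmentTermDocs.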

        It's probably possible to optimize this loop from my patch in
        BitVector.nextSetBit(), by retrieving the byte value and operating directly on
        it rather than calling get(). (The compiler would have to be quite clever to
        figure out that optimization on its own.)

        +        // There's a non-zero bit in this byte. 
        +        final int base = pos * 8;
        +        for (int i = 0; i < 8; i++) {
        +          final int candidate = base + i;
        +          if (candidate < size && candidate >= bit) {
        +            if (get(candidate)) {
        +              return candidate;
        +            }
        +          }
        +        }
        
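        A byte-at-a-time variant might look like the following. This is only a sketch: it assumes BitVector's LSB-first bit layout and its bits/size fields, and the helper name is made up.

        // Hypothetical helper: scan one byte of the BitVector without calling get().
        private int nextSetBitInByte(int pos, int bit) {
          final int base = pos * 8;
          final int start = Math.max(bit, base);           // first candidate bit in this byte
          int b = (bits[pos] & 0xFF) >>> (start - base);   // drop bits below the start position
          for (int i = start; b != 0 && i < size; i++, b >>>= 1) {
            if ((b & 1) != 0) {
              return i;                                    // found a set (deleted) doc
            }
          }
          return -1;                                       // no set bit at or after 'start' in this byte
        }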

        At a higher level, I'd imagine we'd want to store the deleted doc IDs as an
        integer array rather than a BitVector if there aren't very many of them. But
        I think that will mess with the random access required by
        IndexReader.isDeleted().

        Jason Rutherglen added a comment -

        M.H.: "I believe there's an inefficiency in my original patch."

        Marvin, I implemented the SegmentTermsDocs using DocIdSetIterator.skipTo rather than nextSetBit. Do you think the skipTo can be optimized?

        Marvin Humphrey added a comment -

        > For OR like queries, that use Scorer.next(), deletions might be treated as
        > early as possible, since each hit will cause a matching doc and a
        > corresponding score calculation.

        It depends whether the OR-like query spawns a Scorer that pre-fetches.

        At present, deletions are handled in Lucene at the SegmentTermDocs level.
        Filtering deletions that early minimizes overhead from all Scorer.next()
        implementations, including those that pre-fetch like the original
        BooleanScorer. The cost is that we must check every single posting in every
        single SegmentTermDocs to see if the doc has been deleted.

        The Scorer_Collect() routine above, which skips past deletions, consolidates
        the cost of checking every single posting into one check. However, the
        original BooleanScorer can't skip, so it will incur overhead from scoring
        documents which have been deleted. Do the savings and the additional costs
        balance each other out? It's hard to say.

        However, with OR-like Scorers which implement skipping and don't pre-fetch –
        Lucene's DisjunctionSumScorer and its KS analogue ORScorer – you get the best
        of both worlds. Not only do you avoid the per-posting-per-SegTermDocs
        deletions check, you get to skip past deletions and avoid the extra overhead
        in Scorer.next().

        Of course the problem is that because ScorerDocQueue isn't very efficient,
        DisjunctionSumScorer and ORScorer are often slower than the distributed sort
        in the original BooleanScorer.

        > For AND like queries, that use Scorer.skipTo(), deletions can be treated
        > later, for example in skipTo() of the conjunction/and/scorer.

        Yes, and it could be implemented by adding a Filter sub-clause to the
        conjunction/and/scorer.

        Paul Elschot added a comment -

        >>> If we move the deletions filtering up, then we'd increase traffic through
        >>> that cache
        >>
        >> OK, right. So we may have some added cost because of this. I think it's only
        >> TermScorer that uses the bulk API though.

        >The original BooleanScorer also pre-fetches. (That doesn't affect KS because
        > ORScorer, ANDScorer, NOTScorer and RequiredOptionScorer (which have
        > collectively replaced BooleanScorer) all proceed doc-at-a-time and implement
        > skipping.)

        For OR like queries, that use Scorer.next(), deletions might be treated as early
        as possible, since each hit will cause a matching doc and a corresponding score calculation.

        For AND like queries, that use Scorer.skipTo(), deletions can be treated later,
        for example in skipTo() of the conjunction/and/scorer.

        In the same way, prefetching into a larger term documents buffer helps for OR queries,
        but gets in the way for AND queries.

        Marvin Humphrey added a comment -

        Mike McCandless:

        >> If we move the deletions filtering up, then we'd increase traffic through
        >> that cache
        >
        > OK, right. So we may have some added cost because of this. I think it's only
        > TermScorer that uses the bulk API though.

        The original BooleanScorer also pre-fetches. (That doesn't affect KS because
        ORScorer, ANDScorer, NOTScorer and RequiredOptionScorer (which have
        collectively replaced BooleanScorer) all proceed doc-at-a-time and implement
        skipping.)

        And I'm still not certain it's a good idea from an API standpoint: it's
        strange to have the PostingList and Scorer iterators include deleted docs.

        Nevertheless, I've now changed over KS to use the deletions-as-filter approach
        in svn trunk. The tie-breaker was related to the ongoing modularization of
        IndexReader: if PostingList doesn't have to handle deletions, then
        PostingsReader doesn't have to know about DeletionsReader, and if
        PostingsReader doesn't have to know about DeletionsReader, then all
        IndexReader sub-components can be implemented independently.

        The code to implement the deletion skipping turned out to be more verbose and
        fiddly than anticipated. It's easy to make fencepost errors when dealing with
        advancing two iterators in sync, especially when you can only invoke the
        skipping iterator method once for a given doc num.

        void
        Scorer_collect(Scorer *self, HitCollector *collector, DelEnum *deletions,
                       i32_t doc_base)
        {
            i32_t   doc_num        = 0;
            i32_t   next_deletion  = deletions ? 0 : I32_MAX;
        
            /* Execute scoring loop. */
            while (1) {
                if (doc_num > next_deletion) {
                    next_deletion = DelEnum_Advance(deletions, doc_num);
                    if (next_deletion == 0) { next_deletion = I32_MAX; }
                    continue;
                }
                else if (doc_num == next_deletion) {
                    /* Skip past deletions. */
                    while (doc_num == next_deletion) {
                        /* Artificially advance scorer. */
                        while (doc_num == next_deletion) {
                            doc_num++;
                            next_deletion = DelEnum_Advance(deletions, doc_num);
                            if (next_deletion == 0) { next_deletion = I32_MAX; }
                        }
                        /* Verify that the artificial advance actually worked. */
                        doc_num = Scorer_Advance(self, doc_num);
                        if (doc_num > next_deletion) {
                            next_deletion = DelEnum_Advance(deletions, doc_num);
                        }
                    }
                }
                else {
                    doc_num = Scorer_Advance(self, doc_num + 1);
                    if (doc_num >= next_deletion) { 
                        next_deletion = DelEnum_Advance(deletions, doc_num);
                        if (doc_num == next_deletion) { continue; }
                    }
                }
        
                if (doc_num) {
                    HC_Collect(collector, doc_num + doc_base, Scorer_Tally(self));
                }
                else { 
                    break; 
                }
            }
        }
        
        Marvin Humphrey added a comment -

        Jason,

        > Incorporated Marvin's patch into SegmentTermDocs and BitVector.

        I believe there's an inefficiency in my original patch. Code like this occurs in three places:

        +      if (doc >= nextDeletion) {
        +        if (doc > nextDeletion) 
        +          nextDeletion = deletedDocs.nextSetBit(doc);
        +        if (doc == nextDeletion)
        +          continue;
               }
        

        While technically correct, extra work is being done. When nextSetBit() can't
        find any more set bits, it returns -1 (just like java.util.BitSet.nextSetBit() does),
        so doc >= nextDeletion stays true and nextSetBit() is called again on every
        subsequent iteration.

        The solution is to test for -1, as in the new version of the patch (also
        tested but not benchmarked):

        +      if (doc >= nextDeletion) {
        +        if (doc > nextDeletion) {
        +          nextDeletion = deletedDocs.nextSetBit(doc);
        +          if (nextDeletion == -1) {
        +            nextDeletion = Integer.MAX_VALUE;
        +          }
        +        }
        +        if (doc == nextDeletion) {
        +          continue;
        +        }
               }
        

        Theoretically, we could also change the behavior of nextSetBit() so that it
        returns Integer.MAX_VALUE when it can't find any more set bits. However,
        that's a little misleading (since it's a positive number and could thus
        represent a true set bit), and also would break the intended API mimicry by
        BitVector.nextSetBit() of java.util.BitSet.nextSetBit().

        Jason Rutherglen added a comment -

        First cut. All tests pass.

        • Implemented IndexReader.getDeletedDocs in all readers
        • Created MultiDocIdSet that iterates over multiple DocIdSets which is
          used by MultiSegmentReader and MultiReader
        • Incorporated Marvin's patch into SegmentTermDocs and BitVector.
          SegmentTermDocs iterates using deleted docs DocIdSet instead of
          calling BitVector.get.
        Jason Rutherglen added a comment -

        Do we want to implement SegmentTermDocs using DocIdSet in another patch?

        Jason Rutherglen added a comment -

        If we moved to using only iterator API for accessing deleted docs within Lucene then we could explore fixes for the copy-on-write cost w/o changing on-disk representation of deletes. IE tombstones are perhaps overkill for Lucene, since we're not using the filesystem as the intermediary for communicating deletes to a reopened reader. We only need an in-RAM incremental solution.

        +1 Agreed. Good point about not needing to change the on-disk representation, since changing it would make the implementation a bit more complicated. Sounds like we need a tombstones patch as well that plays well with IndexReader.clone.

        Exposing deleted docs as a DocIdSet allows possible future implementations that DO return deleted docs from TermDocs, as discussed (via a flag to IndexReader). The deleted docs DocIdSet can then be used at a higher level as a filter/query.

        Michael McCandless added a comment -

        Why? The returned iterator can traverse the multiple bitvectors.

        Woops, sorry: I missed that it would return a DocIdSet (iterator only) vs underlying (current) BitVector. So then MultiReader could return a DocIdSet.

        If the segment is large, tombstones can solve this.

        Right; I was saying, as things are today (single BitVector holds all deleted docs), one limitation of the realtime approach we are moving towards is the copy-on-write cost of the first delete on a freshly cloned reader for a large segment.

        If we moved to using only iterator API for accessing deleted docs within Lucene then we could explore fixes for the copy-on-write cost w/o changing on-disk representation of deletes. IE tombstones are perhaps overkill for Lucene, since we're not using the filesystem as the intermediary for communicating deletes to a reopened reader. We only need an in-RAM incremental solution.

        Jason Rutherglen added a comment -

        it's fine to expose this API, mark it expert & subject to
        change, and state that it simply returns the current deleted docs
        BitVector and any synchronization issues must be handled by the app.
        Probably it should be package private or protected.

        +1 Sounds good

        Also, I think Multi*Reader would not implement this API? Meaning
        you must get the individual SegmentReader and ask it.

        Why? The returned iterator can traverse the multiple bitvectors.

        I think one weak part of
        the design is the blocking copy-on-write required for the first
        deletion on a newly cloned reader. The time taken is in proportion to
        the size of the segment.

        If the segment is large, tombstones can solve this.

        Michael McCandless added a comment -

        I think it's fine to expose this API, mark it expert & subject to
        change, and state that it simply returns the current deleted docs
        BitVector and any synchronization issues must be handled by the app.
        Probably it should be package private or protected.

        Also, I think Multi*Reader would not implement this API? Meaning
        you must get the individual SegmentReader and ask it.

        Since we are discussing possible changes to deleted docs, eg switching
        to iterator-only access, maybe applying deletions as a filter, maybe
        using "tombstones" under-the-hood, etc., this API could very well
        change.

        For our current thinking on realtime search, I think one weak part of
        the design is the blocking copy-on-write required for the first
        deletion on a newly cloned reader. The time taken is in proportion to
        the size of the segment. Switching to a design that can background
        this copy-on-write will necessarily impact how we represent
        and access deletions.
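
        To make the "incremental copy-on-write" idea concrete, deletions could live in fixed-size blocks that a cloned reader shares with its parent until one side writes to them, so only the touched block is copied rather than the whole BitVector. This is an illustrative sketch only (block size, class and field names are made up), not part of any attached patch:

        class BlockedCopyOnWriteBits {
          private static final int BLOCK_BITS = 4096;            // bits per block
          private static final int WORDS_PER_BLOCK = BLOCK_BITS / 64;

          private final long[][] blocks;                         // shared between clones until written
          private final boolean[] owned;                         // true if this instance owns the block

          BlockedCopyOnWriteBits(int maxDoc) {
            blocks = new long[(maxDoc + BLOCK_BITS - 1) / BLOCK_BITS][];
            owned = new boolean[blocks.length];
          }

          private BlockedCopyOnWriteBits(long[][] shared) {
            blocks = (long[][]) shared.clone();                  // copy block *pointers* only
            owned = new boolean[blocks.length];                  // nothing owned yet
          }

          BlockedCopyOnWriteBits cloneShared() {
            java.util.Arrays.fill(owned, false);                 // parent must also copy before its next write
            return new BlockedCopyOnWriteBits(blocks);
          }

          void set(int doc) {                                    // mark doc as deleted
            int b = doc / BLOCK_BITS;
            if (!owned[b]) {                                     // first write since the last clone
              long[] copy = new long[WORDS_PER_BLOCK];
              if (blocks[b] != null) {
                System.arraycopy(blocks[b], 0, copy, 0, WORDS_PER_BLOCK);
              }
              blocks[b] = copy;
              owned[b] = true;
            }
            int i = doc % BLOCK_BITS;
            blocks[b][i / 64] |= 1L << (i % 64);
          }

          boolean get(int doc) {                                 // random-access isDeleted-style check
            long[] block = blocks[doc / BLOCK_BITS];
            if (block == null) {
              return false;
            }
            int i = doc % BLOCK_BITS;
            return (block[i / 64] & (1L << (i % 64))) != 0;
          }
        }

        The first delete on a cloned reader then copies one 512-byte block instead of the whole vector, which is what makes the copy-on-write incremental.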

        Jason Rutherglen added a comment -

        I think it should be up to the user. If the user concurrently
        modifies then they're responsible for the possibly spurious effects.
        However if we want to be protective, a philosophy I don't think works
        well in LUCENE-1516 either, we can offer IR.getDeletedDocs only from a
        read only index reader. This solves the issues brought up such as
        "undesirable synchronization".

        Yonik Seeley added a comment -

        IndexReader.getDeletedDocs that returns a DocIdSet

        Seems like a good idea, but the difficulty lies in defining the semantics of such a call.
        Are subsequent deletes reflected in the returned DocIdSet? Perhaps make this undefined - may or may not be.
        Can the DocIdSet be used concurrently with IndexReader.deleteDocument()? This could work (w/o extra undesirable synchronization) if we're careful.

        Jason Rutherglen added a comment -

        To simplify the patch, can we agree to add a method,
        IndexReader.getDeletedDocs that returns a DocIdSet? This way
        applications may have access to the deleted docs and not be encumbered by
        the IR.isDeleted method.

        Michael McCandless added a comment -

        > One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM.

        I think that's a good approach. IndexWriter couldn't care less which dir
        each segment resides in.

        For starters let's flush all new segments into a single Directory (and
        swap in RAMDir, single-file-Dir, etc., if/when necessary for better
        performance).

        Thinking out loud.... what if we added IndexWriter.getReader(), which
        returns an IndexReader for the current index?

        It would read segments that are not "visible" to a newly opened
        IndexReader on that directory (ie, the SegmentInfos inside IndexWriter
        was used to do the open, not the latest segments_N in the Directory).

        That IndexReader is free to do deletions / set norms (shares write
        lock under the hood w/ IndexWriter), and when reopen is called it
        grabs the current SegmentInfos from IndexWriter and re-opens that.

        We would un-deprecate flush(). Then the transaction layer would make
        changes, call flush(), call reopen(), and return the reopened reader.
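
        Roughly, the flow being sketched here (IndexWriter.getReader() does not exist in 2.4, so this is only an illustration of the proposal; dir, analyzer and doc are assumed to be defined elsewhere):

        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        IndexReader reader = writer.getReader();      // proposed: opens against IndexWriter's in-memory SegmentInfos

        writer.addDocument(doc);
        writer.flush();                               // make the new segment visible, without sync

        IndexReader newReader = reader.reopen();      // would pick up the flushed segment
        if (newReader != reader) {
          reader.close();
          reader = newReader;
        }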

        Michael McCandless added a comment -

        > Under the current system, I'm not certain that the deletions checks are that
        > excessive.

        I made a simple test that up-front converted the deleted docs
        BitVector into an int[] containing the deleted docIDs, and then did
        the same nextDeletedDoc change to SegmentTermDocs.

        This was only faster if < 10% of docs were deleted, though I didn't do
        very thorough testing.

        I think the problem is that with this change the CPU must predict a
        new branch (docNum >= nextDeletion) vs doing a branch-free lookup of
        the bit.

        Michael McCandless added a comment -

        > We could potentially make that choice with Lucy/KS, thus allowing us to remove
        > the deletions check in the PostingList iterator (as above) and getting a
        > potential speedup. But even then I hesitate to push the deletions API upwards
        > into a space where users of raw Scorer and TermDocs classes have to deal with
        > it - especially since iterator-style deletions aren't very user-friendly.

        Can't you just put sugar on top? Ie, add an API that matches today's
        API and efficiently applies the "deleted docs filter" for you? Then
        you have 100% back compat?

        Michael McCandless added a comment -

        > We can hide the sparse representation and the internal state, having the
        > object lazily build the a non-sparse representation. That's what I had in
        > mind with the code for TombstoneDelEnum.nextDeletion().
        > TombstoneDelEnum.nextInternal() would be a private method used for building up
        > the internal BitVector.

        Got it, though for a low deletion rate presumably you'd want to store
        the int docIDs directly, so iterating through them doesn't require an
        O(N) scan for the next set bit.
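
        A sparse representation along those lines could be as simple as a sorted int[] with a binary-search skipTo(); a sketch under the pre-2.9 DocIdSetIterator contract, with illustrative names:

        import java.util.Arrays;
        import org.apache.lucene.search.DocIdSetIterator;

        class SortedIntDocIdSetIterator extends DocIdSetIterator {
          private final int[] docIds;   // deleted doc IDs, sorted ascending
          private int pos = -1;

          SortedIntDocIdSetIterator(int[] sortedDocIds) {
            this.docIds = sortedDocIds;
          }

          public int doc() {
            return docIds[pos];
          }

          public boolean next() {
            return ++pos < docIds.length;
          }

          public boolean skipTo(int target) {
            int i = Arrays.binarySearch(docIds, target);
            pos = (i >= 0) ? i : -i - 1;    // first deleted doc >= target
            return pos < docIds.length;
          }
        }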

        I think what you'd want to lazily do is merge the N tombstone streams
        for this one segment into a single data structure; whether that data
        structure is sparse or unsparse is a separate decision.

        Michael McCandless added a comment -

        > Mmm. I think I might have given IndexWriter.commit() slightly different
        > semantics. Specifically, I might have given it a boolean "sync" argument
        > which defaults to false.

        It seems like there are various levels of increasing "durability" here:

        • Make available to a reader in the same JVM (eg flush new segment
          to a RAMDir) – not exposed today.
        • Make available to a reader sharing the filesystem right now (flush
          to Directory in "real" filesystem, but don't sync) – exposed
          today (but deprecated as a public API) as flush.
        • Make available to readers even if OS/machine crashes (flush to
          Directory in "real" filesystem, and sync) – exposed today as commit.

        > Two comments. First, if you don't sync, but rather leave it up to the OS when
        > it wants to actually perform the actual disk i/o, how expensive is flushing? Can
        > we make it cheap enough to meet Jason's absolute change rate requirements?

        Right I've been wondering the same thing. I think this should be our
        first approach to realtime indexing, and then we swap in RAMDir if
        performance is not good enough.

        > Second, the multi-index model is very tricky when dealing with "updates". How
        > do you guarantee that you always see the "current" version of a given
        > document, and only that version? When do you expose new deletes in the
        > RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
        > manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
        > new adds to the RAMDirectory that take place during slow merges...
        >
        > Building a single-index, two-writer model that could handle fast updates while
        > performing background merging was one of the main drivers behind the tombstone
        > design.

        I'm not proposing multi-index model (at least I think I'm not!). A
        single IW could flush new tiny segments into a RAMDir and later merge
        them into a real Dir. But I agree: let's start w/ a single Dir and
        move to RAMDir only if necessary.

        > Building a single-index, two-writer model that could handle fast updates while
        > performing background merging was one of the main drivers behind the tombstone
        > design.

        I think carrying the deletions in RAM (reopening the reader) is
        probably fastest for Lucene. Lucene with the "reopened stream of
        readers" approach can do this, but Lucy/KS (with mmap) must use
        filesystem as the intermediary.

        Doug Cutting added a comment -

        To really tighten this loop, you have to [ ... ] remove all function/method call overhead [and] operate directly on the memory mapped postings file.

        That sounds familiar...

        http://svn.apache.org/viewvc/lucene/java/trunk/src/gcj/org/apache/lucene/index/GCJTermDocs.cc?view=markup

        Marvin Humphrey added a comment -

        Here's a patch implementing BitVector.nextSetBit() and converting
        SegmentTermDocs over to use the quasi-iterator style. Tested but
        not benchmarked.

        Jason Rutherglen added a comment -

        M.M.:" I think the transactions layer would also sit on top of this
        "realtime" layer? EG this "realtime" layer would expose a commit()
        method, and the transaction layer above it would maintain the
        transaction log, periodically calling commit() and truncating the
        transaction log?"

        One approach that may be optimal is to expose from IndexWriter a createTransaction method that accepts new documents and deletes. All documents have an associated UID. The new documents could feasibly be encoded into a single segment that represents the added documents for that transaction. The deletes would be represented as long document UIDs rather than int doc IDs. The commit method would then be called on the transaction object, which returns a reader representing the latest version of the index plus the changes created by the transaction. This system would be part of IndexWriter and would not rely on a transaction log. IndexWriter.commit would flush the in-RAM realtime indexes to disk. The realtime merge policy would flush based on RAM usage or the number of docs.

        IndexWriter iw = new IndexWriter();
        Transaction tr = iw.createTransaction();
        tr.addDocument(new Document());
        tr.addDocument(new Document());
        tr.deleteDocument(1200L);
        IndexReader ir = tr.flush(); // flushes transaction to the index (probably to a ram index)
        IndexReader latestReader = iw.getReader(); // same as ir
        iw.commit(doWait); // doWait is a boolean; commits the in-RAM realtime index to disk
        

        When commit is called, the disk segment reader flush their deletes to disk which is fast. The in ram realtime index is merged to disk. The process is described in more detail further down.

        M.H.: "how about writing a single-file Directory implementation?"

        I'm not sure we need this because an appending, rolling transaction log should work. Segments don't change; only things like norms and deletes do, and those can be appended to a rolling transaction log file system. If we had a generic transaction logging system, the future column stride fields, deletes, norms, and future realtime features could use it and be realtime.

        M.H.: "How do you guarantee that you always see the "current" version of a given document, and only that version?

        Each transaction returns an IndexReader. Each "row" or "object" could use a unique id in the transaction log model which would allow documents that were merged into other segments to be deleted during a transaction log replay.

        M.H.: "When do you expose new deletes in the RAMDirectory, when do you expose new deletes in the FSDirectory"

        When do you expose new deletes in the RAMDir, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDir to the FSDirectory, how do you manage new adds to the RAMDir that take place during slow merges..."

        Queue deletes to the RAMDir while copying the RAMDir to the FSDir in the background, perform the deletes after the copy is completed, then instantiate a new reader over the newly merged FSDirectory and a new RAMDir. Writes occurring during this process would go to another new RAMDir.
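
        A rough sketch of that handoff (illustrative only: none of these classes or methods exist in Lucene; the names are placeholders for the steps described above):

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.RAMDirectory;

        // Illustrative sketch only, not a Lucene API.  Deletes are queued while a
        // frozen RAMDir is copied to the FSDir in the background; the queued deletes
        // are applied after the copy, and a new reader is opened over the merged
        // FSDir plus a fresh RAMDir that received the writes made during the copy.
        abstract class RamToDiskHandoff {
          private final Directory fsDir;
          private RAMDirectory activeRam = new RAMDirectory();            // receives new writes
          private final List<Long> queuedDeletes = new ArrayList<Long>(); // UIDs queued during the copy

          RamToDiskHandoff(Directory fsDir) {
            this.fsDir = fsDir;
          }

          synchronized void queueDelete(long uid) {
            queuedDeletes.add(uid);
          }

          void mergeToDisk() throws IOException {
            RAMDirectory frozen;
            synchronized (this) {
              frozen = activeRam;
              activeRam = new RAMDirectory();   // writes during the merge go to a fresh RAMDir
            }
            copySegments(frozen, fsDir);        // slow copy, run from a background thread
            List<Long> deletes;
            synchronized (this) {
              deletes = new ArrayList<Long>(queuedDeletes);
              queuedDeletes.clear();
            }
            applyDeletes(fsDir, deletes);       // perform the queued deletes after the copy
            reopenReader(fsDir, activeRam);     // new reader over the merged FSDir + new RAMDir
          }

          // The index-specific heavy lifting is deliberately left abstract in this sketch.
          protected abstract void copySegments(Directory from, Directory to) throws IOException;
          protected abstract void applyDeletes(Directory dir, List<Long> uids) throws IOException;
          protected abstract void reopenReader(Directory fsDir, Directory ramDir) throws IOException;
        }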

        One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM. Each transaction is an instance of some segments and their deletes (and we're not worried about whether the deletes have been flushed, so assume they exist as BitVectors). The system should expose an API to checkpoint/flush at a given transaction level (usually the current one) and should not stop new updates from happening.

        When I wrote this type of system, I managed individual segments outside of IndexWriter's merge policy and performed the merging manually by placing each segment in its own FSDirectory (the segment size was 64MB), which minimized the number of directories. I do not know the best approach for this when performed within IndexWriter.

        M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS when
        it wants to actually perform the actual disk i/o, how expensive is flushing? Can
        we make it cheap enough to meet Jason's absolute change rate requirements?"

        When I tried out the transaction log, a write usually mapped pretty quickly to a hard disk write. I don't think it's safe to leave writes up to the OS.

        M.M.: "maintain & updated deleted docs even though IndexWriter has the write lock"

        In my previous realtime search implementation I got around this by having each segment in its own directory. Assuming this is non-optimal, we will need to expose an IndexReader that has the write lock of the IndexWriter.

        Marvin Humphrey added a comment -

        Mike McCandless:

        > So, net/net it seems like "deletes-as-a-filter" approach is compelling?

        In terms of CPU-cycles, maybe.

        My gut tells me that it's all but mandatory if we use merged-on-the-fly
        tombstone streams, but if Lucene goes that route it should cache a BitVector
        and use a shared pseudo-iterator – in which case the costs will no longer be
        significantly more than the current system.

        Under the current system, I'm not certain that the deletions checks are that
        excessive. Consider this loop from TermDocs.read():

            while (i < length && count < df) {
              // manually inlined call to next() for speed
              final int docCode = freqStream.readVInt();
              doc += docCode >>> 1;       // shift off low bit
              if ((docCode & 1) != 0)       // if low bit is set
                freq = 1;         // freq is one
              else
                freq = freqStream.readVInt();     // else read freq
              count++;
        
              if (deletedDocs == null || !deletedDocs.get(doc)) {
                docs[i] = doc;
                freqs[i] = freq;
                ++i;
              }
            }
        

        The CPU probably does a good job of predicting the result of the null check on
        deletedDocs. The readVInt() method call is already a pipeline killer.

        Here's how that loop looks after I patch the deletions check for pseudo-iteration.

              while (i < length && count < df) {
                // manually inlined call to next() for speed
                final int docCode = freqStream.readVInt();
                doc += docCode >>> 1;       // shift off low bit
                if ((docCode & 1) != 0)       // if low bit is set
                  freq = 1;         // freq is one
                else
                  freq = freqStream.readVInt();     // else read freq
                count++;
        
                if (doc >= nextDeletion) {
                    if (doc > nextDeletion) {
                      nextDeletion = deletedDocs.nextSetBit(doc);
                    }
                    if (doc == nextDeletion) {
                      continue;
                    }
                }
        
                docs[i] = doc;
                freqs[i] = freq;
                ++i;
              }
              return i;
            }
        

        Again, the CPU is probably going to do a pretty good job of predicting the
        results of the deletion check. And even then, we're accessing the same shared
        BitVector across all TermDocs, and its bits are hopefully a cache hit.

        To really tighten this loop, you have to do what Nate and I want with Lucy/KS:

        • Remove all function/method call overhead.
        • Operate directly on the memory mapped postings file.
        SegPList_bulk_read(SegPostingList *self, i32_t *doc_nums, i32_t *freqs,
                           u32_t request)
        {
            i32_t       doc_num   = self->doc_num;
            const u32_t remaining = self->doc_freq - self->count;
            const u32_t num_got   = request < remaining ? request : remaining;
            char       *buf       = InStream_Buf(instream, C32_MAX_BYTES * num_got);
            u32_t       i;
        
            for (i = 0; i < num_got; i++) {
                u32_t doc_code = Math_decode_c32(&buf); /* static inline function */
                u32_t freq     = (doc_code & 1) ? 1 : Math_decode_c32(&buf);
                doc_num        += doc_code >> 1; 
                doc_nums[i]    = doc_num;
                freqs[i]       = freq;
            }
        
            InStream_Advance_Buf(instream, buf);
            self->doc_num = doc_num;
            self->count += num_got;
        
            return num_got;
        }
        

        (That loop would be even better using PFOR instead of vbyte.)

        In terms of public API, I don't think it's reasonable to change Lucene's
        Scorer and TermDocs classes so that their iterators start returning deleted
        docs.

        We could potentially make that choice with Lucy/KS, thus allowing us to remove
        the deletions check in the PostingList iterator (as above) and getting a
        potential speedup. But even then I hesitate to push the deletions API upwards
        into a space where users of raw Scorer and TermDocs classes have to deal with
        it – especially since iterator-style deletions aren't very user-friendly.

        Marvin Humphrey added a comment -

        Mike McCandless:

        > if it's sparse, you need an iterator (state) to remember where you are.

        We can hide the sparse representation and the internal state, having the
        object lazily build a non-sparse representation. That's what I had in
        mind with the code for TombstoneDelEnum.nextDeletion().
        TombstoneDelEnum.nextInternal() would be a private method used for building up
        the internal BitVector.
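
        As a rough illustration of that lazy build-up (hypothetical: the merged tombstone stream is modeled here as an Iterator<Integer> of increasing doc nums, java.util.BitSet stands in for the internal bit vector, and the nextDeletion() shape mirrors the sketch that appears elsewhere in these comments):

        import java.util.BitSet;
        import java.util.Iterator;

        // Hypothetical sketch, not Lucy/KS or Lucene code.  nextInternal() consumes
        // the merged tombstone stream one entry at a time, recording each deletion
        // in a non-sparse bit set; callers only ever see nextDeletion().
        class TombstoneDelEnumSketch {
          private final Iterator<Integer> mergedTombstones; // assumed merge-sorted, increasing
          private final BitSet bits = new BitSet();
          private int currentMax = -1;                      // bits are complete up to here

          TombstoneDelEnumSketch(Iterator<Integer> mergedTombstones) {
            this.mergedTombstones = mergedTombstones;
          }

          int nextDeletion(int docNum) {
            while (currentMax < docNum) {
              nextInternal();                               // lazily extend the bit set
            }
            return bits.nextSetBit(docNum);                 // -1 if nothing recorded at or after docNum
          }

          private void nextInternal() {
            if (!mergedTombstones.hasNext()) {
              currentMax = Integer.MAX_VALUE;               // every tombstone has been consumed
              return;
            }
            int deletedDoc = mergedTombstones.next();
            bits.set(deletedDoc);
            currentMax = deletedDoc;                        // complete up to this deleted doc
          }
        }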

        Marvin Humphrey added a comment -

        Mike McCandless:

        > Commit is for crash recovery, and for knowing when it's OK to delete
        > prior commits. Simply writing the files (and not syncing them), and
        > perhaps giving IndexReader.open the SegmentInfos to use directly (and
        > not writing a segments_N via the filesystem) would allow us to search
        > added docs without paying the cost of sync'ing all the files.

        Mmm. I think I might have given IndexWriter.commit() slightly different
        semantics. Specifically, I might have given it a boolean "sync" argument
        which defaults to false.

        > Also: brand new, tiny segments should be written into a RAMDirectory
        > and then merged over time into the real Directory.

        Two comments. First, if you don't sync, but rather leave it up to the OS when
        it wants to actually perform the actual disk i/o, how expensive is flushing? Can
        we make it cheap enough to meet Jason's absolute change rate requirements?

        Second, the multi-index model is very tricky when dealing with "updates". How
        do you guarantee that you always see the "current" version of a given
        document, and only that version? When do you expose new deletes in the
        RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
        manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
        new adds to the RAMDirectory that take place during slow merges...

        Building a single-index, two-writer model that could handle fast updates while
        performing background merging was one of the main drivers behind the tombstone
        design.

        Michael McCandless added a comment -

        > How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?

        That works if under-the-hood it's a non-sparse representation. But if it's sparse, you need an iterator (state) to remember where you are.

        Michael McCandless added a comment -

        > If we move the deletions filtering up, then we'd increase traffic through that cache

        OK, right. So we may have some added cost because of this. I think
        it's only TermScorer that uses the bulk API though.

        > If you were applying deletions filtering after Scorer.next(), then it seems
        > likely that costs would go up because of extra hit processing. However, if
        > you use Scorer.skipTo() to jump past deletions, as in the loop I provided
        > above, then PhraseScorer etc. shouldn't incur any more costs themselves.

        Ahhh, now I got it! Good, you're right.

        > Under the skipTo() loop, I think the filter effectively does get applied
        > earlier in the chain. Does that make sense?

        Right. This is how Lucene works today. Excellent.

        So, net/net it seems like "deletes-as-a-filter" approach is compelling?

        Marvin Humphrey added a comment -

        Mike McCandless:

        > I'm also curious what cost you see of doing the merge sort for every
        > search; I think it could be uncomfortably high since it's so
        > hard-for-cpu-to-predict-branch-intensive.

        Probably true. You're going to get accelerating degradation as the number of
        deletions increases. In a large index, you could end up merging 20, 30
        streams. Based on how the priority queue in ORScorer tends to take up space
        in profiling data, that might not be good.

        It'd be manageable if you can keep your index reasonably in good shape, but you'll
        be suckin' pondwater if it gets flabby.

        > We could take the first search that doesn't use skipTo and save the result
        > of the merge sort, essentially doing an in-RAM-only "merge" of those
        > deletes, and let subsequent searches use that single merged stream.

        That was what I had in mind when proposing the pseudo-iterator model.

        class TombStoneDelEnum extends DelEnum {
          int nextDeletion(int docNum) {
            while (currentMax < docNum) { nextInternal(); }
            return bits.nextSetBit(docNum);
          }
          // ...
        }
        

        > (This is not MMAP friendly, though).

        Yeah. Ironically, that use of tombstones is more compatible with the Lucene
        model.

        I'd be reluctant to have Lucy/KS realize those large BitVectors in per-object
        process RAM. That'd spoil the "cheap wrapper around system i/o cache"
        IndexReader plan.

        I can't see an answer yet. But the one thing I do know is that Lucy/KS needs
        a pluggable deletions mechanism to make experimentation easier – so that's
        what I'm working on today.

        Michael McCandless added a comment -

        > It would be exposed as a combination reader writer that manages the transaction status of each update.

        I think the transactions layer would also sit on top of this
        "realtime" layer? EG this "realtime" layer would expose a commit()
        method, and the transaction layer above it would maintain the
        transaction log, periodically calling commit() and truncating the
        transaction log?

        This "realtime" layer, then, would internally maintain a single
        IndexWriter and the readers. IndexWriter would flush (not commit) new
        segments into a RAMDir and yield its in-RAM SegmentInfos to
        IndexReader.reopen. MergePolicy periodically gets those into the real
        Directory. When reopening a reader we have the freedom to use old
        (already merged away) segments if the newly merged segment isn't warm
        yet.

        We "just" need to open some things up in IndexWriter:

        • IndexReader.reopen with the in-RAM SegmentInfos
        • Willingness to allow an IndexReader to maintain & update deleted
          docs even though IndexWriter has the write lock
        • Access to segments that were already merged away (I think we could
          make a DeletionPolicy that pays attention to when the newly merged
          segment is not yet warmed and keeps the prior segments around).
          I think this'd require allowing DeletionPolicy to see "flush
          points" in addition to commit points (it doesn't today).

        But I'm still hazy on the details on exactly how to open up
        IndexWriter.

        Michael McCandless added a comment -

        > If Lucene crashed for some reason the transaction log would be replayed.

        I think the transaction log is useful for some applications, but could
        (should) be built as a separate (optional) layer entirely on top of
        Lucene's core. Ie, neither IndexWriter nor IndexReader needs to be
        aware of the transaction log, which update belongs to which
        transaction, etc?

        Michael McCandless added a comment -

        > There's going to be a change rate that overwhelms the multi-file
        > commit system, and it seems that you've determined you're up against
        > it.

        Well... IndexWriter need not "commit" in order to allow a reader to
        see the files?

        Commit is for crash recovery, and for knowing when it's OK to delete
        prior commits. Simply writing the files (and not syncing them), and
        perhaps giving IndexReader.open the SegmentInfos to use directly (and
        not writing a segments_N via the filesystem) would allow us to search
        added docs without paying the cost of sync'ing all the files.

        Also: brand new, tiny segments should be written into a RAMDirectory
        and then merged over time into the real Directory.

        Marvin Humphrey added a comment -

        Jason Rutherglen:

        > I found in making the realtime search write speed fast enough that writing
        > to individual files per segment can become too costly (they accumulate fast,
        > appending to a single file is faster than creating new files, deleting the
        > files becomes costly).

        I saw you mentioning i/o overhead on Windows in particular. I can't see a way
        to mod Lucene so that it doesn't generate a bunch of files for each commit,
        and FWIW Lucy/KS is going to generate even more files than Lucene.

        Half-seriously... how about writing a single-file Directory implementation?

        > For example, writing to small individual files per commit, if the number of
        > segments is large and the delete spans multiple segments will generate many
        > files.

        There would be a maximum of two files per segment to hold the tombstones: one
        to hold the tombstone rows, and one to map segment identifiers to tombstone
        rows. (In Lucy/KS, the mappings would probably be stored in the JSON-encoded
        "segmeta" file, which stores human-readable metadata on behalf of multiple
        components.)

        Segments containing tombstones would be merged according to whatever merge
        policy was in place. So there won't ever be an obscene number of tombstone
        files unless you allow an obscene number of segments to accumulate.

        > Many users may not want a transaction log as they may be storing the updates
        > in a separate SQL database instance (this is the case where I work) and so a
        > transaction log is redundant and should be optional.

        I can see how this would be quite useful at the application level. However, I
        think it might be challenging to generalize the transaction log concept at the
        library level:

        CustomAnalyzer analyzer = new CustomAnalyzer();
        IndexWriter indexWriter = new IndexWriter(analyzer, "/path/to/index");
        indexWriter.addDocument(nextDoc());
        analyzer.setFoo(2); // change of state not recorded by transaction log
        indexWriter.addDocument(nextDoc());
        

        MySQL is more of a closed system than Lucene, which I think makes options
        available that aren't available to us.

        > The reader stack is drained based on whether a reader is too old to be
        > useful anymore (i.e. no references to it, or it has N readers
        > ahead of it).

        Right, this is the kind of thing that Lucene has to do because of the
        single-reader model, and that we're trying to get away from in Lucy/KS by
        exploiting mmap and making IndexReaders cheap wrappers around the system i/o
        cache.

        I don't think I can offer any alternative design suggestions that meet your
        needs. There's going to be a change rate that overwhelms the multi-file
        commit system, and it seems that you've determined you're up against it.

        What's killing us is something different: not absolute change rate, but poor
        worst-case performance.

        FWIW, we contemplated a multi-index system with an index on a RAM disk for
        fast changes and a primary index on the main file system. It would have
        worked fine for pure adds, but it was very tricky to manage state for
        documents which were being "updated", i.e. deleted and re-added. How are you
        handling all these small adds with your combo reader/writer? Do you not have
        that problem?

        Jason Rutherglen added a comment -

        Marvin: "The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed."

        It does improve the write speed. I found in making the realtime search write speed fast enough that writing to individual files per segment can become too costly (they accumulate fast, appending to a single file is faster than creating new files, deleting the files becomes costly). For example, writing to small individual files per commit, if the number of segments is large and the delete spans multiple segments will generate many files. This is variable based on how often the updates are expected to occur. I modeled this after the extreme case of the frequency of updates of a MySQL instance backing data for a web application.

        The MySQL design, translated to Lucene, is a transaction log per index, where updates consisting of documents and deletes are written to the transaction log file. If Lucene crashed for some reason, the transaction log would be replayed. The in-memory indexes and newly deleted document BitVectors would be held in RAM (LUCENE-1314) until flushed, either manually or based on memory usage. Many users may not want a transaction log, as they may be storing the updates in a separate SQL database instance (this is the case where I work), so a transaction log is redundant and should be optional. The first implementation of this will not have a transaction log.

        Marvin: "I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?"

        It would be exposed as a combination reader writer that manages the transaction status of each update. The internal architecture is such that after each update a new reader representing the new documents and deletes for the transaction is generated and put onto a stack. The reader stack is drained based on whether a reader is too old to be useful anymore (i.e. no references to it, or it has N readers ahead of it).
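
        A minimal sketch of such a reader stack (illustrative only; nothing here is a Lucene API, and the reference counting is modeled explicitly rather than using IndexReader's own):

        import java.util.ArrayDeque;
        import java.util.Deque;
        import java.util.Iterator;
        import java.util.concurrent.atomic.AtomicInteger;

        // Illustrative sketch only.  Each transaction pushes its reader; the stack is
        // drained when a reader is unreferenced or has N newer readers ahead of it.
        class ReaderStack<R> {
          private static final int MAX_READERS_BEHIND = 8;  // the "N" above, arbitrary here

          private static final class Entry<R> {
            final R reader;
            final AtomicInteger refCount = new AtomicInteger(0);
            Entry(R reader) { this.reader = reader; }
          }

          private final Deque<Entry<R>> stack = new ArrayDeque<Entry<R>>();

          synchronized void push(R newest) {
            stack.addFirst(new Entry<R>(newest));
            drain();
          }

          synchronized R acquireLatest() {
            Entry<R> top = stack.peekFirst();
            if (top == null) {
              return null;
            }
            top.refCount.incrementAndGet();
            return top.reader;
          }

          synchronized void release(R reader) {
            for (Entry<R> e : stack) {
              if (e.reader == reader) { e.refCount.decrementAndGet(); break; }
            }
            drain();
          }

          private void drain() {
            int position = 0;
            for (Iterator<Entry<R>> it = stack.iterator(); it.hasNext(); position++) {
              Entry<R> e = it.next();
              boolean unreferenced = e.refCount.get() == 0;
              boolean tooOld = position > MAX_READERS_BEHIND;
              if (position > 0 && (unreferenced || tooOld)) {
                it.remove();  // a real implementation would also close the reader here
              }
            }
          }
        }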

        Mark Miller added a comment -

        I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?

        That's a fairly old version I think (based on using MultiSearcher as a hack). Now we are using one queue and running it through each subreader of the MultiReader.

        Marvin Humphrey added a comment -

        Mike McCandless:

        > For a TermQuery (one term) the cost of the two approaches should be
        > the same.

        It'll be close, but I don't think that's quite true. TermScorer pre-fetches
        document numbers in batches from the TermDocs object. At present, only
        non-deleted doc nums get cached. If we move the deletions filtering up, then
        we'd increase traffic through that cache. However, filling it would be
        slightly cheaper, because we wouldn't be performing the deletions check.

        In theory. I'm not sure there's a way to streamline away that deletions check
        in TermDocs and maintain backwards compatibility. And while this is a fun
        brainstorm, I'm still far from convinced that having TermDocs.next() and
        Scorer.next() return deleted docs by default is a good idea.

        > For AND (and other) queries I'm not sure. In theory, having to
        > process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
        > may see much higher net cost.

        If you were applying deletions filtering after Scorer.next(), then it seems
        likely that costs would go up because of extra hit processing. However, if
        you use Scorer.skipTo() to jump past deletions, as in the loop I provided
        above, then PhraseScorer etc. shouldn't incur any more costs themselves.
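
        For reference, a driver loop of the general shape being discussed might look like the following (a sketch, not a reproduction of the exact loop posted earlier in the thread; BitVector.nextSetBit() is the quasi-iterator proposed in this issue, and the rest uses the Lucene 2.4-era Scorer/HitCollector API):

        import java.io.IOException;

        import org.apache.lucene.search.HitCollector;
        import org.apache.lucene.search.Scorer;
        import org.apache.lucene.util.BitVector;

        // Sketch of deletions applied as a filter outside the Scorer: when the scorer
        // lands on a deleted doc, skipTo() jumps past the whole run of deletions
        // instead of next()'ing through each one.
        class DeletionsAsFilter {
          static void search(Scorer scorer, BitVector deletedDocs, HitCollector collector)
              throws IOException {
            boolean more = scorer.next();
            while (more) {
              int doc = scorer.doc();
              if (deletedDocs == null || !deletedDocs.get(doc)) {
                collector.collect(doc, scorer.score());
                more = scorer.next();
              } else {
                // Walk to the end of this run of deleted docs, then skip past it.
                int target = doc + 1;
                while (deletedDocs.nextSetBit(target) == target) {
                  target++;
                }
                more = scorer.skipTo(target);
              }
            }
          }
        }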

        > a costly per-docID search
        > with a very restrictive filter could be far more efficient if you
        > applied the Filter earlier in the chain.

        Under the skipTo() loop, I think the filter effectively does get applied
        earlier in the chain. Does that make sense?

        I think the potential performance downside comes down to prefetching in
        TermScorer, unless there are other classes that do similar prefetching.

        Marvin Humphrey added a comment -

        How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?

        You can do that now by implementing BitVector.nextSetBit(int tick) and using that in TermDocs to set a nextDeletion member var instead of checking every doc num with BitVector.get().

        That way, the object that provides deletions can still be shared.
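
        One possible shape for that method, as an addition to BitVector (this assumes BitVector's byte[] bits / int size layout with bits stored LSB-first within each byte, as get() and set() do; it is not the code from the attached patch):

        // Sketch of BitVector.nextSetBit(): returns the first set bit at or after
        // fromIndex, or -1 if there is none.  Assumes the byte[] bits / int size
        // fields of org.apache.lucene.util.BitVector.
        public int nextSetBit(int fromIndex) {
          if (fromIndex >= size) {
            return -1;
          }
          int byteIndex = fromIndex >> 3;
          // Mask off the bits below fromIndex within the first byte.
          int word = (bits[byteIndex] & 0xFF) & (0xFF << (fromIndex & 7));
          while (true) {
            if (word != 0) {
              int result = (byteIndex << 3) + Integer.numberOfTrailingZeros(word);
              return result < size ? result : -1;
            }
            if (++byteIndex == bits.length) {
              return -1;
            }
            word = bits[byteIndex] & 0xFF;
          }
        }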

        Marvin Humphrey added a comment -

        Paul Elschot:

        > How about a SegmentSearcher?

        I like the idea of a SegmentSearcher in general. A little while back, I wondered whether exposing SegmentReaders was really the best way to handle segment-centric search. Upon reflection, I think it is. Segments are a good unit. They're pure inverted indexes (notwithstanding doc stores and tombstones); the larger composite only masquerades as one.

        I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?

        And then, to respond to the original suggestion, the SegmentSearcher level seems like a good place to handle application of a deletions quasi-filter. I think we could avoid having to deal with segment-start offsets that way.
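
        A sketch of what that could look like (SegmentSearcher is not a Lucene class, and nextSetBit() is the method proposed in this issue; the point is just that the scorer and the deletions quasi-filter both work in segment-local doc space, so the segment-start offset only appears when handing hits up to the composite level):

        import java.io.IOException;

        import org.apache.lucene.search.HitCollector;
        import org.apache.lucene.search.Scorer;
        import org.apache.lucene.util.BitVector;

        // Sketch only, not a Lucene class.
        class SegmentSearcherSketch {
          private final BitVector deletedDocs; // this segment's deletions; may be null
          private final int docBase;           // segment start offset in the composite reader

          SegmentSearcherSketch(BitVector deletedDocs, int docBase) {
            this.deletedDocs = deletedDocs;
            this.docBase = docBase;
          }

          void search(Scorer scorer, HitCollector collector) throws IOException {
            int nextDeletion = -1;
            while (scorer.next()) {
              int doc = scorer.doc();                       // segment-local doc num
              if (deletedDocs != null && doc >= nextDeletion) {
                nextDeletion = deletedDocs.nextSetBit(doc); // quasi-iterator deletions check
                if (nextDeletion == -1) {
                  nextDeletion = Integer.MAX_VALUE;         // no more deletions in this segment
                } else if (doc == nextDeletion) {
                  continue;                                 // deleted: skip this hit
                }
              }
              collector.collect(docBase + doc, scorer.score()); // offset applied only here
            }
          }
        }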

        Paul Elschot added a comment -

        To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level.

        How about a SegmentSearcher?

        Michael McCandless added a comment -

        > PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

        This is a neat idea! Deletions are then applied just like a Filter.

        For a TermQuery (one term) the cost of the two approaches should be
        the same.

        For OR'd Term queries, it actually seems like your proposed approach
        may be lower cost? Ie rather than each TermEnum doing the "AND NOT
        deleted" intersection, you only do it once at the top. There is added
        cost in that each TermEnum is now returning more docIDs than before,
        but the deleted ones are eliminated before scoring.

        For AND (and other) queries I'm not sure. In theory, having to
        process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
        may see much higher net cost. We should test.

        Conceivably, a future "search optimization phase" could pick & choose
        the best point to inject the "AND NOT deleted" filter. In fact, it
        could also pick when to inject a Filter... a costly per-docID search
        with a very restrictive filter could be far more efficient if you
        applied the Filter earlier in the chain.

        I'm also curious what cost you see of doing the merge sort for every
        search; I think it could be uncomfortably high since it's so
        hard-for-cpu-to-predict-branch-intensive. We could take the first
        search that doesn't use skipTo and save the result of the merge sort,
        essentially doing an in-RAM-only "merge" of those deletes, and let
        subsequent searches use that single merged stream. (This is not MMAP
        friendly, though).

        In my initial rough testing, I switched to an iterator API for
        SegmentTermEnum and found that if the percentage of deletes is < 10% the search
        was a bit faster using an iterator vs random access, but above that it was slower.
        This was with an already "merged" list of in-order docIDs.

        Switching to an iterator API for accessing field values for many docs
        (LUCENE-831 – new FieldCache API, LUCENE-1231 – column stride
        fields) shouldn't have this same problem since it's the "top level"
        that's accessing the values (ie, one iterator per field X query).

        Paul Elschot added a comment -

        Jason,

        This issue LUCENE-584 had an implementation of a "matcher" class that the scorers implemented which I do not think made it into the committed patch.

        The only functional difference between the DocIdSetIterator as committed and the Matcher class at 584 is the Matcher.explain() method, which did not make it into DocIdSetIterator.

        Marvin Humphrey added a comment -

        Jason Rutherglen:

        > For realtime search where a new transaction may only have a handful of
        > deletes the tombstones may not be optimal

        The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed.

        When you make deletes with the BitSet model, you have to rewrite files that scale with segment size, regardless of how few deletions you make. Deletion of a single document in a large segment may necessitate writing out a substantial bit vector file.

        In contrast, i/o throughput for writing out a tombstone file scales with the number of tombstones.

        > because too many tombstones would accumulate (I believe).

        Say that you make a string of commits that are nothing but deleting a single document – thus adding a new segment each time that contains nothing but a single tombstone. Those are going to be cheap to merge, so it seems unlikely that we'll end up with an unwieldy number of tombstone streams to interleave at search-time.

        The more likely problem is the one McCandless articulated regarding a large segment accumulating a lot of tombstone streams against it. But I agree with him that it only gets truly serious if your merge policy neglects such segments and allows them to deteriorate for too long.

        > For this scenario rolling bitsets may be better. Meaning pool bit sets and
        > throw away unused readers.

        I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?

        Jason Rutherglen added a comment -

        I like this idea of tombstones and we should figure out a way to support it. This issue https://issues.apache.org/jira/browse/LUCENE-584 had an implementation of a "matcher" class that the scorers implemented which I do not think made it into the committed patch.

        > I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right?

        Yes that is true.

        For realtime search where a new transaction may only have a handful of deletes the tombstones may not be optimal because too many tombstones would accumulate (I believe). For this scenario rolling bitsets may be better. Meaning pool bit sets and throw away unused readers.

        Marvin Humphrey added a comment -

        While converting over PostingList (the Lucy/KS analogue to TermDocs) to use a deletions iterator, it occurred to me that because the iterator has to keep state, a unique DelEnum object has to be created for every PostingList. In contrast, a BitVector object, which is accessed only via get(), can be shared.

        It bugs me that each PostingList will have its own DelEnum performing interleaving of tombstones. With very large queries against indexes with large numbers of deletions, it seems like a lot of duplicated work.

        To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level. PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

        i32_t
        MatchAllScorer_next(MatchAllScorer* self) 
        {
            if (++self->doc_num > self->max_docs) {
                self->doc_num--;
                return 0;
            }
            return self->doc_num;
        }
        
        void
        Scorer_collect(Scorer *self, HitCollector *collector, DelEnum *del_enum)
        {
            i32_t next_deletion = del_enum ? 0 : I32_MAX;
            i32_t doc_num = 1;
            while (1) {
                /* advance past any deletions at or before the current doc */
                while (doc_num >= next_deletion) {
                    next_deletion = DelEnum_Skip_To(del_enum, doc_num);
                    while (doc_num == next_deletion) {
                        doc_num++;
                        next_deletion = DelEnum_Skip_To(del_enum, doc_num);
                    }
                }
                doc_num = Scorer_Skip_To(self, doc_num);
                if (doc_num) {
                    HC_Collect(collector, doc_num, Scorer_Tally(self));
                }
                else {
                    break;
                }
            }
        }
        

        The problem is that PostingLists spun off from indexReader.postingList(field, term) would include deleted documents, as would Scorers.

        I suppose you could band-aid indexReader.postingList() by adding a boolean suppressDeletions argument which would default to true, but that seems messy.

        Jason, I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right? As soon as we do away with random access, we have to keep state. Dunno if it's going to be noticeable, but it's conceptually annoying.

        Jason Rutherglen added a comment -

        Wouldn't it be good to remove BitVector and replace it with OpenBitSet? OBS is faster and already has a DocIdSetIterator. It just needs compressed writing of the bitset to disk (dgaps?). This would be a big win for almost all searches. We could also create an interface so that any bitset implementation could be used.

        Such as:

        public interface WriteableBitSet {
         public void write(IndexOutput output) throws IOException;
        }
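
        For illustration only, here is one way such a WriteableBitSet might be
        implemented on top of OpenBitSet with d-gap compression (the class is
        hypothetical; the only Lucene APIs assumed are OpenBitSet.nextSetBit,
        OpenBitSet.cardinality and IndexOutput.writeVInt): write the count of
        set bits, then the gaps between consecutive set bits as vints.

        import java.io.IOException;

        import org.apache.lucene.store.IndexOutput;
        import org.apache.lucene.util.OpenBitSet;

        class DGapOpenBitSet extends OpenBitSet implements WriteableBitSet {
          DGapOpenBitSet(long numBits) {
            super(numBits);
          }

          public void write(IndexOutput output) throws IOException {
            output.writeVInt((int) cardinality()); // how many set bits follow
            int prev = -1;
            int doc = nextSetBit(0);
            while (doc != -1) {
              output.writeVInt(doc - prev);        // d-gaps stay small, so vints stay short
              prev = doc;
              doc = nextSetBit(doc + 1);
            }
          }
        }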
        
        Jason Rutherglen added a comment -

        Marvin:
        "I'm also bothered by the proliferation of small deletions files. Probably
        you'd want automatic consolidation of files under 4k, but you still could end
        up with a lot of files in a big index."

        A transaction log might be better here if we want to go to 0ish-millisecond realtime.
        On Windows, at least, rapidly creating and deleting files incurs significant IO overhead.
        UNIX is probably faster, but I do not know.

        Michael McCandless added a comment -

        I like this approach!!

        It's also incremental in cost (cost of flush/commit is in proportion
        to how many deletes were done), but you are storing the "packet" of
        incremental deletes with the segment you just flushed and not against
        the N segments that had deletes. And you write only one file to hold
        all the tombstones, which for commit() (file sync) is much less cost.

        And it's great that we don't need a new merge policy to handle all the
        delete files.

        Though one possible downside is, for a very large segment in a very
        large index you will likely be merging (at search time) quite a few
        delete packets. But, with the cutover to
        deletes-accessed-only-by-iterator, this cost is probably not high
        until a large pctg of the segment's docs are deleted, at which point
        you should really expungeDeletes() or optimize() or optimize(int)
        anyway.

        If only we could write code as quickly as we can dream...

        Marvin Humphrey added a comment -

        > One approach would be to use a "segmented" model.

        That would improve the average performance of deleting a document, at the cost
        of some added complexity. Worst-case performance – which you'd hit when you
        consolidated those sub-segment deletions files – would actually degrade a
        bit.

        To manage consolidation, you'd need a deletions merge policy that operated
        independently from the primary merge policy. Aside from the complexity penalty,
        having two un-coordinated merge policies would be bad for real-time search,
        because you want to be able to control exactly when you pay for a big merge.

        I'm also bothered by the proliferation of small deletions files. Probably
        you'd want automatic consolidation of files under 4k, but you still could end
        up with a lot of files in a big index.

        So... what if we wrote, merged, and removed deletions files on the same
        schedule as ordinary segment files? Instead of going back and quasi-modifying
        an existing segment by associating a next-generation .del file with it, we write
        deletions to a NEW segment and have them reference older segments.

        In other words, we add "tombstones" rather than "delete" documents.

        Logically speaking, each tombstone segment file would consist of an array of
        segment identifiers, each of which would point to a "tombstone row" array of
        vbyte-encoded doc nums:

        // _6.tombstone
           _2: [3, 4, 25]
           _3: [13]
        
        // _7.tombstone
           _2: [5]
        
        // _8.tombstone
           _1: [94]
           _2: [7, 8]
           _5: [54, 55]
        

        The thing that makes this possible is that the dead docs marked by tombstones
        never get their doc nums shuffled during segment merging – they just
        disappear. If deleted docs lived to be consolidated into new segments and
        acquire new doc nums, tombstones wouldn't work. However, we can associate
        tombstone rows with segment names and they only need remain valid as long
        as the segments they reference survive.

        Some tombstone rows will become obsolete once the segments they reference go
        away, but we never arrive at a scenario where we are forced to discard valid
        tombstones. Merging tombstone files simply involves dropping obsolete
        tombstone rows and collating valid ones.

        At search time, we'd use an iterator with an internal priority queue to
        collate tombstone rows into a stream – so there's still no need to slurp the
        files at IndexReader startup.
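
        A minimal Java sketch of that collation step, assuming each tombstone
        row for the target segment has already been decoded into a sorted
        int[] of doc nums (the class and method names are invented for
        illustration):

        import java.util.Comparator;
        import java.util.PriorityQueue;

        class TombstoneCollator {
          private static final class Row {
            final int[] docs;   // ascending doc nums from one tombstone file
            int pos;
            Row(int[] docs) { this.docs = docs; }
            int head() { return docs[pos]; }
          }

          private final PriorityQueue<Row> queue =
              new PriorityQueue<Row>(11, new Comparator<Row>() {
                public int compare(Row a, Row b) { return a.head() - b.head(); }
              });
          private int lastReturned = -1;

          TombstoneCollator(int[][] rows) {
            for (int[] row : rows) {
              if (row.length > 0) queue.add(new Row(row));
            }
          }

          /** Next deleted doc num in ascending order, or -1 when exhausted. */
          int next() {
            while (!queue.isEmpty()) {
              Row top = queue.poll();
              int doc = top.head();
              if (++top.pos < top.docs.length) {
                queue.add(top);               // reinsert with its new head
              }
              if (doc != lastReturned) {      // the same doc may appear in two files
                lastReturned = doc;
                return doc;
              }
            }
            return -1;
          }
        }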

        Michael McCandless added a comment -

        Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does - just leave that out.

        Agreed. I'll try to do something along these lines under LUCENE-1458.

        Marvin Humphrey added a comment -

        > It would be great if instead of relying on Lucene to manage the
        > deletedDocs file, the API would be pluggable

        In LUCENE-1478, "IndexComponent" was proposed, with potential subclasses including PostingsComponent, LexiconComponent/TermDictComponent, TermVectorsComponent, and so on. Since then, it has become apparent that SnapshotComponent and DeletionsComponent also belong at the top level.

        In Lucy/KS, these would all be specified within a Schema:

        class MySchema extends Schema {
          DeletionsComponent deletionsComponent() { 
            return new DocIdBitSetDeletionsComponent();
          }
        
          void initFields() {
            addField("title", "text");
            addField("content", "text");
          }
        
          Analyzer analyzer() {
            return new PolyAnalyzer("en");
          }
        }
        

        Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does – just leave that out.

        robert engels added a comment -

        That's my point: in complex multi-threaded software with multiple readers, etc., it is a good backstop against errors.

        Marvin Humphrey added a comment -

        > Does it really need to throw an exception?

        Aside from back compat, I don't see why it would need to. I think the only rationale is to serve as a backstop protecting against invalid reads.

        Jason Rutherglen added a comment -

        It would be great if, instead of relying on Lucene to manage the deletedDocs file, the API were pluggable enough that a DocIdBitSet (a DocIdSet with random access) could be set on a SegmentReader, and the file access (reading and writing) could be managed from outside. Of course this is difficult to do while remaining backwards compatible; however, for 3.0 I would really like to be a part of efforts to create a completely generic and pluggable API that is cleanly separated from the underlying index format and files. This would mean the analyzing, querying, and scoring portions of Lucene could access an IndexReader-like pluggable class, with the underlying index files, and when and how they are written to disk, completely separated.

        One motivation for this patch is to allow custom queries access to the deletedDocs in a clean way (meaning they do not need to be part of the o.a.l.s. package).

        I am wondering if it is good to try to get IndexReader.clone working again, or if there is some other better way related to this patch to externally manage the deletedDocs.

        Improving the performance of deletedDocs would help for every query so it's worth looking at.

        Michael McCandless added a comment -

        but IndexReader.document throws an exception if the document is deleted... so you still need random access

        Does it really need to throw an exception? (Of course for back compat it does, but we could move away from that to a new method that doesn't check).

        robert engels added a comment -

        but IndexReader.document throws an exception if the document is deleted... so you still need random access

        Michael McCandless added a comment -

        In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.

        By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).

        I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.

        But isn't that an uncommon use case? It's dangerous to wait a long
        time after getting a docID from a reader, before looking up the
        document. Most apps pull the doc right away, send it to the user, and
        the docID isn't kept (I think?).

        But still I agree: we can't eliminate random access to isDeleted
        entirely. We'd still have to offer it for such external cases.

        I'm just saying the internal uses of isDeleted could all be switched
        to iteration instead, and, we might get some performance gains from
        it especially when the number of deletes on a segment is relatively low.

        Michael McCandless added a comment -

        It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

        Nice! This is what I had in mind.

        I think we could [almost] do this across the board for Lucene.
        SegmentTermDocs would similarly store nextDeleted and apply the same
        "AND NOT" logic.

        that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted

        Hmm – that is very nicely enabling.

        I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck

        One approach would be to use a "segmented" model. IE, if a few
        deletions are added, write that to a new "deletes segment", ie a
        single "normal segment" would then have multiple deletion files
        associated with it. These would have to be merged (iterator) when
        used during searching, and, periodically coalesced.

        if we only need iterator access, we can use vbyte encoding instead

        Right: if there are relatively few deletes against a segment, encoding
        the "on bits" directly (or deltas) should be a decent win since
        iteration is much faster.
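
        As a rough illustration of that (the file layout and class name here
        are hypothetical, not an existing Lucene format): if the deleted doc
        nums were written as a vint count followed by vint deltas, an iterator
        can decode them lazily with IndexInput.readVInt, keeping only the
        current doc num in memory:

        import java.io.IOException;

        import org.apache.lucene.store.IndexInput;

        class VIntDeletesIterator {
          private final IndexInput in;
          private final int count;   // total number of deleted docs in the file
          private int read;          // how many have been consumed so far
          private int doc = -1;      // current deleted doc num

          VIntDeletesIterator(IndexInput in) throws IOException {
            this.in = in;
            this.count = in.readVInt();
          }

          /** Advance to the next deleted doc; returns false when exhausted. */
          boolean next() throws IOException {
            if (read == count) {
              return false;
            }
            doc += in.readVInt();    // deltas are >= 1, so doc nums are ascending
            read++;
            return true;
          }

          int doc() {
            return doc;
          }
        }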

        Marvin Humphrey added a comment -

        > Marvin, in KS/Lucy are you using random-access or iterator to access
        > deletedDocs & norms?

        Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, but it's implemented using BitVectors which get the next deleted doc num by calling nextSetBit() internally.

        I happened to be coding up those classes this spring when there was the big brouhaha about IndexReader.isDeleted(). It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

        i32_t
        MatchAllScorer_next(MatchAllScorer* self) 
        {
            do {
                if (++self->doc_num > self->max_docs) {
                    self->doc_num--;
                    return 0;
                }
                if (self->doc_num > self->next_deletion) {
                    self->next_deletion 
                        = DelEnum_Skip_To(self->del_enum, self->doc_num);
                }
            } while (self->doc_num == self->next_deletion);
            return self->doc_num;
        }
        

        (Note: Scorer.next() in KS returns the document number; doc nums start at 1, and 0 is the sentinel signaling iterator termination. I expect that Lucy will be the same.)

        Perhaps we could get away without needing the random access, but that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted – unlike Lucene which throws an exception. Also, you can't delete documents against an IndexReader, so Robert's objection doesn't apply to us.

        I had always assumed we were going to have to expose isDeleted() eventually, but maybe we can get away with zapping it. Interesting!

        I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck, now that we've theoretically solved the sort cache warming issue. I figured we would continue to need bit-vector files because they're straightforward to mmap, but if we only need iterator access, we can use vbyte encoding instead... Hmm, we still face the problem of outsized write cost when a segment has a large number of deletions and you add one more...

        robert engels added a comment -

        I don't think you can change this...

        In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.

        By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).

        I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.

        Michael McCandless added a comment -

        But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?

        Although IndexReader.isDeleted exposes a random-access API to deleted docs, I think it may be overkill.

        Ie, in most (all?) uses of deleted docs throughout Lucene core/contrib, a simple iterator (DocIdSet) would in fact suffice.

        EG in SegmentTermDocs iteration we are always checking deletedDocs by ascending docID. It might be a performance gain (pure speculation) if we used an iterator API, because we could hold "nextDelDocID" and only advance that (skipTo) when the term's docID has moved past it. It's just like an "AND NOT X" clause.
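
        Purely as an illustration of that "AND NOT X" pattern (this is not the
        actual SegmentTermDocs code; postings and sortedDeletes are stand-ins
        for the term's docID stream and the segment's deleted docIDs):

        class AndNotDeletedExample {
          static void collectLiveDocs(int[] postings, int[] sortedDeletes) {
            int delIdx = 0;
            int nextDelDoc = sortedDeletes.length > 0
                ? sortedDeletes[0] : Integer.MAX_VALUE;
            for (int doc : postings) {
              // only advance the deletions "iterator" once the term's docID
              // has moved past the deletion we are holding
              while (nextDelDoc < doc) {
                delIdx++;
                nextDelDoc = delIdx < sortedDeletes.length
                    ? sortedDeletes[delIdx] : Integer.MAX_VALUE;
              }
              if (doc != nextDelDoc) {
                System.out.println("live doc " + doc);  // would score/collect here
              }
            }
          }
        }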

        Similarly, norms, which also now expose a random-access API, should be fine with an iterator type API as well.

        This may also imply better VM behavior, since we don't actually require norms/deletions to be fully memory resident.

        This would be a biggish change, and it's not clear whether/when we should explore it, but I wanted to get the idea out there.

        Marvin, in KS/Lucy are you using random-access or iterator to access deletedDocs & norms?

        Jason Rutherglen added a comment -

        BitVector does not implement the methods of java.util.BitSet. RABitSet could be implemented by OpenBitSet and BitVector. This way an OpenBitSet or another filter such as P4Delta could be used in place of BitVector in SegmentReader.

        The IndexReader.flush type of methods would need to either automatically not save or throw an exception, and there needs to be a setting to choose between the two behaviors. This helps with the synchronization issue in SegmentReader.isDeleted by allowing access to it.
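
        A sketch of what such an RABitSet contract might look like (the shape
        is hypothetical; only DocIdSetIterator is an existing Lucene type):
        random access for SegmentReader.isDeleted-style checks plus an
        iterator view for scorers, so OpenBitSet, BitVector, or a compressed
        implementation could all plug in.

        import org.apache.lucene.search.DocIdSetIterator;

        public interface RABitSet {
          boolean get(int docId);        // random access: is this doc marked deleted?
          void set(int docId);           // mark a doc as deleted
          int size();                    // number of addressable docs
          DocIdSetIterator iterator();   // iterator view, e.g. for "AND NOT deleted"
        }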

        robert engels added a comment -

        BitSet is already random access, DocIdSet is not.

        Jason Rutherglen added a comment -

        Looks like we need a new abstract class. RABitSet?

        Michael McCandless added a comment -

        But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?

        Jason Rutherglen added a comment -

        LUCENE-1476.patch

        BitVector extends DocIdSet.

        TestBitVector implements a testDocIdSet method that is based on the TestSortedVIntList tests.


          People

          • Assignee: Unassigned
          • Reporter: Jason Rutherglen
          • Votes: 0
          • Watchers: 0

          Time Tracking

          • Original Estimate: 12h
          • Remaining Estimate: 12h
          • Time Spent: Not Specified