Lucene - Core
LUCENE-1526

For near real-time search, use paged copy-on-write BitVector impl

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      SegmentReader currently uses a BitVector to represent deleted docs.
      When performing rapid clone (see LUCENE-1314) and delete operations,
      performing a copy on write of the BitVector can become costly because
      the entire underlying byte array must be created and copied. A way to
      make this clone delete process faster is to implement tombstones, a
      term coined by Marvin Humphrey. Tombstones represent new deletions
      plus the incremental deletions from previously reopened readers in
      the current reader.

      The proposed implementation of tombstones is to accumulate deletions
      into an int array represented as a DocIdSet. With LUCENE-1476,
      SegmentTermDocs iterates over deleted docs using a DocIdSet rather
      than accessing the BitVector by calling get. This allows a BitVector
      and a set of tombstones to be ANDed together as the current reader's
      delete docs.
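      As a rough illustration (plain Java, not the actual BitVector/DocIdSet
      classes), the combined check might look like this, where a doc counts as
      deleted if it is marked in either structure:

      class DeletedDocsView {
        private final long[] baseBits;   // deletions carried over from the shared BitVector
        private final int[] tombstones;  // sorted docIDs deleted since the last clone/reopen

        DeletedDocsView(long[] baseBits, int[] tombstones) {
          this.baseBits = baseBits;
          this.tombstones = tombstones;
        }

        boolean isDeleted(int docID) {
          boolean inBase = (baseBits[docID >> 6] & (1L << (docID & 63))) != 0;
          boolean inTombstones = java.util.Arrays.binarySearch(tombstones, docID) >= 0;
          return inBase || inTombstones;  // the two "live" checks ANDed == deleted sets ORed
        }
      }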

      A tombstone merge policy needs to be defined to determine when to
      merge tombstone DocIdSets into a new deleted docs BitVector as too
      many tombstones would eventually be detrimental to performance. A
      probable implementation will merge tombstones based on the number of
      tombstones and the total number of documents in the tombstones. The
      merge policy may be set in the clone/reopen methods or on the
      IndexReader.
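      For illustration only (the thresholds and their names below are made up,
      not part of the proposal), such a merge policy could be as simple as:

      class TombstoneMergePolicy {
        private final int maxTombstones;         // hypothetical absolute cap, e.g. 10000
        private final double maxTombstoneRatio;  // hypothetical fraction of maxDoc, e.g. 0.05

        TombstoneMergePolicy(int maxTombstones, double maxTombstoneRatio) {
          this.maxTombstones = maxTombstones;
          this.maxTombstoneRatio = maxTombstoneRatio;
        }

        // Fold tombstones back into a fresh deleted-docs BitVector once they grow too large.
        boolean shouldMerge(int tombstoneCount, int maxDoc) {
          return tombstoneCount > maxTombstones
              || tombstoneCount > maxDoc * maxTombstoneRatio;
        }
      }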

      1. LUCENE-1526.patch
        18 kB
        Jason Rutherglen
      2. LUCENE-1526.patch
        28 kB
        Jason Rutherglen

        Activity

        Jason Rutherglen added a comment -

        Won't be working on these and they're old

        Michael McCandless added a comment -

        John, what about memory exhaustion? Are you still hitting that as well?

        Michael McCandless added a comment -

        OK, I spun off LUCENE-2120.

        Michael McCandless added a comment -

        Yes, we still see the issue

        OK I'll open a separate issue to try to get to the bottom of this...

        John Wang added a comment -

        Yes, we still see the issue. After 20+ minutes of running the performance/stress test, latency spiked from 5ms to 550ms, and file handle leakage was severe enough that the test crashed. This is the code:

        http://code.google.com/p/zoie/source/browse/branches/BR_DELETE_OPT/java/proj/zoie/impl/indexing/luceneNRT/ThrottledLuceneNRTDataConsumer.java

        Our logging indicates there are at most 3 IndexReader instances in the open state, yet the file handle count is very high.

        Michael McCandless added a comment -

        Jake, have you guys had a chance to re-run your tests across varying
        reopen rates? Are you still hitting OOM / file handle leaks with
        straight Lucene NRT? I've been unable to reproduce these issues in
        my stress testing... so I'd like to home in on what's different in our
        testing.

        Jason Rutherglen added a comment -

        Inlined into SegmentTermDocs. If there's an issue with the del
        docs null check, we could go extreme and instantiate specialized
        instances of SegTD that don't perform the check. I'm not sure
        what would slow this down, but profiling will let us know
        what's up.

        TestIndexReaderReopen and TestIndexWriterReader pass, so I
        figure we're ready for benchmarking.

        Jason Rutherglen added a comment -

        Here's a working version of this. The page size is statically
        configurable by adjusting CONST in PagedBitVector. I set it on
        the high side because the next thing is to inline the page and
        doc checking into SegmentTermDocs for benchmarking.

        The test is fairly randomized, though I think there's more that
        can be added.

        The pages are saved one by one, either as dgaps or bytes, which
        means the .del file format has changed. We can probably read the
        old format, write the new format if this is deployed.
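        Not the patch itself, but a rough sketch of the paged copy-on-write
        idea (hypothetical names; the real PagedBitVector differs): clone()
        shares all pages, and the first write to a page after a clone copies
        just that page rather than the whole underlying byte array.

        class PagedBits implements Cloneable {
          static final int PAGE_SIZE = 4096;  // bytes per page, analogous to the CONST mentioned above
          private byte[][] pages;
          private boolean[] owned;            // may this instance mutate the page in place?

          PagedBits(int numBits) {
            int numBytes = (numBits + 7) / 8;
            int numPages = Math.max(1, (numBytes + PAGE_SIZE - 1) / PAGE_SIZE);
            pages = new byte[numPages][PAGE_SIZE];
            owned = new boolean[numPages];
            java.util.Arrays.fill(owned, true);
          }

          void set(int bit) {
            int page = (bit >> 3) / PAGE_SIZE, offset = (bit >> 3) % PAGE_SIZE;
            if (!owned[page]) {               // copy-on-write: duplicate only this page
              pages[page] = pages[page].clone();
              owned[page] = true;
            }
            pages[page][offset] |= (byte) (1 << (bit & 7));
          }

          boolean get(int bit) {
            return (pages[(bit >> 3) / PAGE_SIZE][(bit >> 3) % PAGE_SIZE] & (1 << (bit & 7))) != 0;
          }

          public PagedBits clone() {
            try {
              PagedBits copy = (PagedBits) super.clone();
              copy.pages = pages.clone();             // the pages themselves are shared, not copied
              copy.owned = new boolean[pages.length]; // neither side owns a shared page anymore
              owned = new boolean[pages.length];
              return copy;
            } catch (CloneNotSupportedException e) {
              throw new AssertionError(e);
            }
          }
        }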

        Michael McCandless added a comment -

        We should test the performance tradeoffs incurred by switching to a
        transactional data structure (like the proposed paged bit vector),
        but... my inclination at this point would be that it's not a good tradeoff
        for Lucene NRT to make.

        Ie, it'd be making the same tradeoff Zoie now makes – faster reopen
        time for slower searching, which I don't think makes sense for most
        apps.

        Michael McCandless added a comment -

        One of the nice things that we can do in Zoie by using this kind of index-latency backoff, is that because we have an in-memory two-way mapping of zoie-specific UID to docId, if we actually have time (in the background, since we're caching these readers now) to zip through and update the real delete BitVectors on the segments, and lose the extra check at query time, only using that if you have the index-latency time set below some threshold (determined by how long it takes the system to do this resolution - mapping docId to UID is an array lookup, the reverse is a little slower).

        Right – I think such a hybrid approach would have the best tradeoffs
        of all. You'd get insanely fast reopen, and then searching would only
        take the performance hit until the BG resolution of deleted UID ->
        Lucene docID completed. Similar to the JRE's BG hotspot compiler.

        Right, Zoie is making determined tradeoffs. I would expect that most apps are fine with controlled reopen frequency, ie, they would choose to not lose indexing and searching performance if it means they can "only" reopen, eg, 2X per second.

        In theory Zoie is making tradeoffs - in practice, at least against what is on trunk, Zoie's just going way faster in both indexing and querying in the redline perf test. I agree that in principle, once LUCENE-1313 and other improvements and bugs have been worked out of NRT, that query performance should be faster, and if zoie's default BalancedMergePolicy (nee ZoieMergePolicy) is in use for NRT, the indexing performance should be faster too - it's just not quite there yet at this point.

        Well.. unfortunately, we can't conclude much from the current test,
        besides that Zoie's reopen time is much faster than Lucene's (until/if
        we add the "reopen frequency" as a dimension, and see those results).

        Also the test is rather synthetic, in that most apps don't really need
        to reopen 100s of times per second. We really should try to test more
        realistic cases.

        One question: where is CPU utilization when you run the Lucene test?
        Presumably, if you block an incoming query until the reopen completes,
        and because only one reopen can happen at once, it seems like CPU must
        not be saturated?

        But, I agree, there are a lot of moving parts here still – Zoie has
        far faster add-only throughput than Lucene (could simply be due to
        lack of LUCENE-1313), Lucene may have a correctness issue (still can't
        repro), Lucene has some pending optimizations (LUCENE-2047), etc.

        In LUCENE-2061 I'm working on a standard benchmark we can use to test
        improvements to Lucene's NRT; it'll let us assess potential
        improvements and spot weird problems.

        One thing that Zoie benefited from, from an API standpoint, which might be nice in Lucene, now that 1.5 is in place, is that the IndexReaderWarmer could replace the raw SegmentReader with a warmed user-specified subclass of SegmentReader:

         
        public abstract class IndexReaderWarmer<R extends IndexReader> {
          public abstract R warm(IndexReader rawReader);
        }
        

        Which could replace the reader in the readerPool with the possibly-user-overridden subclass of SegmentReader (now that SegmentReader is as public as IndexReader itself is) which has now been warmed. For users who like to decorate their readers to keep additional state, instead of use them as keys into WeakHashMaps kept separate, this could be extremely useful (I know that the people I talked to at Apple's iTunes store do this, as well as in bobo, and zoie, to name a few examples off the top of my head).

        This is a good idea, and it's been suggested several times now,
        including eg notification when segment merging starts/commits, but I
        think we should take it up in the larger context of how to centralize
        reader pooling? This pool is just the pool used by IndexWriter, when
        it's in NRT mode; I think IndexReader.open should somehow share the
        same infrastructure. And maybe LUCENE-2026 (refactoring IW) is the
        vehicle for "centralizing" this? Can you go carry over this
        suggestion there?

        I think Lucene could handle this well, if we made an IndexReader impl that directly searches DocumentWriter's RAM buffer. But that's somewhat challenging

        Jason mentioned this approach in his talk at ApacheCon, but I'm not at all convinced it's necessary - if a single box can handle indexing a couple hundred smallish documents a second (into a RAMDirectory), and could be sped up by using multiple IndexWriters (writing into multiple RAMDirectories in parallel, if you were willing to give up some CPU cores to indexing), and you can search against them without having to do any deduplication / bloomfilter check against the disk, then I'd be surprised if searching the pre-indexed RAM buffer would really be much of a speedup in comparison to just doing it the simple way. But I could be wrong, as I'm not sure how much faster such a search could be.

        Right, we should clearly only take such a big step if performance
        shows it's justified. From the initial results I just posted in
        LUCENE-2061, it looks like Lucene does in fact handle the add-only
        case very well (ie degradation to QPS is fairly contained), even into
        an FSDir. I need to retest with LUCENE-1313.

        Jake Mannix added a comment -

        OK. It's clear Zoie's design is optimized for insanely fast reopen.

        That, and maxing out QPS and indexing rate while keeping query latency degradation to a minimum. From trying to turn off the extra deleted check, the latency overhead on a 5M doc index is a difference of queries taking 12-13ms with the extra check turned on, and 10ms without it, and you only really start to notice on the extreme edges (the queries hitting all 5 million docs by way of an actual query (not MatchAllDocs)), when your performance goes from maybe 100ms to 140-150ms.

        EG what I'd love to see is, as a function of reopen rate, the "curve" of QPS vs docs per sec. Ie, if you reopen 1X per second, that consumes some of your machine's resources. What's left can be spent indexing or searching or both, so, it's a curve/line. So we should set up fixed rate indexing, and then redline the QPS to see what's possible, and do this for multiple indexing rates, and for multiple reopen rates.

        Yes, that curve would be a very useful benchmark. Now that I think of it, it wouldn't be too hard to just sneak some reader caching into the ZoieSystem with a tunable parameter for how long you hang onto it, so that we could see how much that can help. One of the nice things that we can do in Zoie by using this kind of index-latency backoff, is that because we have an in-memory two-way mapping of zoie-specific UID to docId, if we actually have time (in the background, since we're caching these readers now) to zip through and update the real delete BitVectors on the segments, and lose the extra check at query time, only using that if you have the index-latency time set below some threshold (determined by how long it takes the system to do this resolution - mapping docId to UID is an array lookup, the reverse is a little slower).

        Right, Zoie is making determined tradeoffs. I would expect that most apps are fine with controlled reopen frequency, ie, they would choose to not lose indexing and searching performance if it means they can "only" reopen, eg, 2X per second.

        In theory Zoie is making tradeoffs - in practice, at least against what is on trunk, Zoie's just going way faster in both indexing and querying in the redline perf test. I agree that in principle, once LUCENE-1313 and other improvements and bugs have been worked out of NRT, that query performance should be faster, and if zoie's default BalancedMergePolicy (nee ZoieMergePolicy) is in use for NRT, the indexing performance should be faster too - it's just not quite there yet at this point.

        I agree - having such well defined API semantics ("once updateDoc returns, searches can see it") is wonderful. But I think they can be cleanly built on top of Lucene NRT as it is today, with a pre-determined (reopen) latency.

        Of course! These api semantics are already built up on top of plain-old Lucene - even without NRT, so I can't imagine how NRT would remove this ability!

        I think the "large merge just finished" case is the most costly for such apps (which the "merged segment warmer" on IW should take care of)? (Because otherwise the segments are tiny, assuming everything is cutover to "per segment").

        Definitely. One thing that Zoie benefited from, from an API standpoint, which might be nice in Lucene, now that 1.5 is in place, is that the IndexReaderWarmer could replace the raw SegmentReader with a warmed user-specified subclass of SegmentReader:

        public abstract class IndexReaderWarmer<R extends IndexReader> {
          public abstract R warm(IndexReader rawReader);
        }
        

        Which could replace the reader in the readerPool with the possibly-user-overridden subclass of SegmentReader (now that SegmentReader is as public as IndexReader itself is) which has now been warmed. For users who like to decorate their readers to keep additional state, instead of use them as keys into WeakHashMaps kept separate, this could be extremely useful (I know that the people I talked to at Apple's iTunes store do this, as well as in bobo, and zoie, to name a few examples off the top of my head).
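        Purely to illustrate the shape of that proposal (stand-in types, not Lucene's real classes, and a made-up FacetCountingReader/FacetWarmer pair): the warmer returns a user subclass that already carries its per-reader state, so the pool hands back decorated readers directly.

        // Stand-ins so the sketch is self-contained; not the real Lucene classes.
        class IndexReader { }
        class SegmentReader extends IndexReader { }

        abstract class IndexReaderWarmer<R extends IndexReader> {
          public abstract R warm(IndexReader rawReader);
        }

        // Hypothetical user subclass that keeps extra state directly on the reader.
        class FacetCountingReader extends SegmentReader {
          final int[] facetOrdinals;
          FacetCountingReader(IndexReader raw, int[] ordinals) { this.facetOrdinals = ordinals; }
        }

        class FacetWarmer extends IndexReaderWarmer<FacetCountingReader> {
          public FacetCountingReader warm(IndexReader rawReader) {
            int[] ordinals = loadFacetData(rawReader);  // the expensive warm-up work
            return new FacetCountingReader(rawReader, ordinals);
          }
          private int[] loadFacetData(IndexReader r) { return new int[0]; }  // placeholder
        }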

        I think Lucene could handle this well, if we made an IndexReader impl that directly searches DocumentWriter's RAM buffer. But that's somewhat challenging

        Jason mentioned this approach in his talk at ApacheCon, but I'm not at all convinced it's necessary - if a single box can handle indexing a couple hundred smallish documents a second (into a RAMDirectory), and could be sped up by using multiple IndexWriters (writing into multiple RAMDirectories in parallel, if you were willing to give up some CPU cores to indexing), and you can search against them without having to do any deduplication / bloomfilter check against the disk, then I'd be surprised if searching the pre-indexed RAM buffer would really be much of a speedup in comparison to just doing it the simple way. But I could be wrong, as I'm not sure how much faster such a search could be.

        Michael McCandless added a comment -

        Due to the bloomfilter living on top of the hashSet, at least at the scales we're dealing with, we didn't see any change in cost due to the number of deletions (zoie by default keeps no more than 10k modifications in memory before flushing to disk, so the biggest the delSet is going to be is that, and we don't see the more-than-constant scaling yet at that size).

        Bloom filters are nice.

        But your test is missing a dimension: frequency of reopen. If you reopen once per second, how do Zoie/Lucene compare? Twice per second? Once every 5 seconds? Etc.

        Yep, this is true. It's a little more invasive to put this into Zoie, because the reopen time is so fast that there's no pooling, so it would need to be kinda hacked in, or tacked on to the outside. Not rocket science, but not just the change of a parameter.

        OK. It's clear Zoie's design is optimized for insanely fast reopen.

        LUCENE-2050 should make it easy to test this for pure Lucene NRT.

        LinkedIn doesn't have any hard requirements of having to reopen hundreds of times per second, we're just stressing the system, to see what's going on.

        Redline tests are very important, to understand how the system will
        behave at extremes.

        But I think it'd be useful to control which dimension to redline.

        EG what I'd love to see is, as a function of reopen rate, the "curve"
        of QPS vs docs per sec. Ie, if you reopen 1X per second, that
        consumes some of your machine's resources. What's left can be spent
        indexing or searching or both, so, it's a curve/line. So we should
        set up fixed rate indexing, and then redline the QPS to see what's
        possible, and do this for multiple indexing rates, and for multiple
        reopen rates.

        Then this all becomes a capacity question for apps.

        As you can see, nobody's filing a bug here that Lucene NRT is "broken" because it can't handle zero-latency updates.

        Right, Zoie is making determined tradeoffs. I would expect that most
        apps are fine with controlled reopen frequency, ie, they would choose
        to not lose indexing and searching performance if it means they can
        "only" reopen, eg, 2X per second.

        (Of course we will need to test, with LUCENE-2050, at what reopen
        frequency you really eat into your indexing/searching performance,
        given fixed hardware).

        What we did try to make sure was in the system was determinism: not knowing whether an update will be seen because there is some background process doing addIndexes from another thread which hasn't completed, or not knowing how fresh the pooled reader is, that kind of thing.

        This kind of determinism can certainly be gotten with NRT, by locking down the IndexWriter wrapped up in another class to keep it from being monkeyed with by other threads, and then tuning exactly how often the reader is reopened, and then dictate to clients that the freshness is exactly at or better than this freshness timeout, sure. This kind of user-friendliness is one of Zoie's main points - it provides an indexing system which manages all this, and certainly for some clients, we should add in the ability to pool the readers for less real-timeness, that's a good idea.

        I agree – having such well defined API semantics ("once updateDoc
        returns, searches can see it") is wonderful. But I think they can be
        cleanly built on top of Lucene NRT as it is today, with a
        pre-determined (reopen) latency.

        Of course, if your reopen() time is pretty heavy (lots of FieldCache data / other custom faceting data needs to be loaded for a bunch of fields), then at least for us, even not needing zero-latency updates means that the more realistically 5-10% degradation in query performance for normal queries is negligible, and we get deterministic zero-latency updates as a consequence.

        I think the "large merge just finished" case is the most costly for
        such apps (which the "merged segment warmer" on IW should take care
        of)? (Because otherwise the segments are tiny, assuming everything is
        cutover to "per segment").

        This whole discussion reminded me that there's another realtime update case, which neither Zoie nor NRT is properly optimized for: the absolutely zero deletes case with very fast indexing load and the desire for minimal latency of updates (imagine that you're indexing twitter - no changes, just adds), and you want to be able to provide a totally stream-oriented view on things as they're being added (matching some query, naturally) with sub-second turnaround. A subclass of SegmentReader which is constructed which doesn't even have a deletedSet could be instantiated, and the deleted check could be removed entirely, speeding things up even further.

        I think Lucene could handle this well, if we made an IndexReader impl
        that directly searches DocumentWriter's RAM buffer. But that's
        somewhat challenging

        Michael McCandless added a comment -

        The fact that Zoie on the pure indexing case (ie no deletions) was 10X faster than Lucene is very weird - that means something else is up, besides how deletions are carried in RAM. It's entirely possible it's the fact that Lucene doesn't flush the tiny segments to a RAMDir (which LUCENE-1313 addresses).

        Yeah, if you call getReader() a bunch of times per second, each one does a flush(true,true,true), right? Without having LUCENE-1313, this kills the indexing performance if querying is going on. If no getReader() is being called at all, Zoie is about 10% slower than pure Lucene IndexWriter.add() (that's the cost of doing it in two steps - index into two RAMDirs [so they are hot-swappable] and then writing segments to disk with addIndexesNoOptimize() periodically).

        It'll be great if LUCENE-1313 nets us a 10X improvement in indexing
        rate. With the improvements to benchmark (LUCENE-2050), I'm hoping
        this'll be easy to confirm...

        Ahh I see, so with very rare reopens, Zoie's indexing rate is also
        slower than Lucene's (because of the double buffering).

        So the big picture tradeoff here is Zoie has wicked fast reopen times,
        compared to Lucene, but has slightly slower (10%) indexing rate, and
        slower searches (22-28% in the worst case), as the tradeoff.

        It seems like we need to find the "break even" point. Ie, if you
        never reopen, then on fixed hardware, Lucene is faster at indexing and
        searching than Zoie. If you reopen at an insane rate (100s per sec),
        Zoie is much faster than Lucene on both indexing and searching. But
        what if you reopen 2x, 1x per second? Once every 2 seconds, etc. At
        some point the crossover will happen.

        Michael McCandless added a comment -

        So we re-ran some of our tests last night, commenting out our deleted check to measure its cost in the most extreme case possible: a dead easy query (in that it's only one term), but one which yes, hits the entire index (doing a MatchAllDocs query is actually special-cased in our code, and is perfectly fast, so not a good worst case to check), and as the index grows up above a million documents, zoie could shave somewhere from 22-28% of its time off by not doing the extra check.

        OK, thanks for running that test...

        So in the worst case (dead easy query, matching many many docs) Zoie's
        search slowdown is 22-28%. It's presumably quite a bit less
        (approaching zero) for hard queries that match few docs. So the
        search slowdown is app dependent.

        I think it'd be possible (though, complex!) to do a hybrid approach.
        Meaning you use Zoie to get the insanely fast reopen, but, to avoid
        the search slowdown, in the background you convert the buffered UIDs
        to the docID bit vector, such that once all conversion is done, you
        stop checking the int set.

        I guess you'd have to throttle the conversion so that in the unnatural
        (100s per sec) reopen test, with many queries in flight at once, you
        don't exhaust the heap.
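        A hedged sketch of that hybrid (hypothetical names, synchronization
        simplified; not Zoie's or Lucene's actual code): searches consult the
        extra UID set only while a background pass is still folding those UIDs
        into the per-segment bit vector.

        class HybridDeletes {
          interface UidToDocId { int docId(long uid); }  // app-specific UID -> docID mapping

          private final long[] deletedBits;              // per-segment deleted docIDs
          private final java.util.Set<Long> pendingUids = new java.util.HashSet<Long>();
          private volatile boolean resolved = true;

          HybridDeletes(int maxDoc) { deletedBits = new long[(maxDoc + 63) / 64]; }

          synchronized void delete(long uid) { pendingUids.add(uid); resolved = false; }

          boolean isDeleted(int docID, long uidOfDoc) {
            boolean inBits = (deletedBits[docID >> 6] & (1L << (docID & 63))) != 0;
            return resolved ? inBits : (inBits || pendingUids.contains(uidOfDoc));
          }

          // Run in the background (and throttled) so reopen never pays this cost up front.
          synchronized void resolve(UidToDocId mapping) {
            for (long uid : pendingUids) {
              int docID = mapping.docId(uid);
              if (docID >= 0) deletedBits[docID >> 6] |= 1L << (docID & 63);
            }
            pendingUids.clear();
            resolved = true;
          }
        }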

        Michael McCandless added a comment -

        Lucene NRT makes a clone of the BitVector for every reader that has new deletions. Once this is done, searching is "normal" - it's as if the reader were a disk reader. There's no extra checking of deleted docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.

        Ok, so if this is copy-on-write, it's done every time there is a new delete for that segment? If the disk index is optimized that means it would happen on every update, a clone of the full numDocs sized BitVector? I'm still a little unsure of how this happens.

        Right. Actually is the index optimized in your tests? My current
        correctness testing (for the "lost deletes") isn't optimized... I'll
        try optimizing it.

        • somebody calls getReader() - they've got all the SegmentReaders for the disk segments, and each of them have BitVectors for deletions.
        • IW.update() gets called - the BitVector for the segment which now has a deletion is cloned, and set on a new pooled SegmentReader as its deletedSet

        Actually, the IW.updateDocument call merely buffers the Term to be
        deleted. It does not resolve that term to the corresponding docID
        until the getReader (same as reopen) is called again. But it would be
        better if Lucene did the resolution in the FG (during the
        updateDocument) call; this is what LUCENE-2047 will fix. This
        backgrounds the resolution, ie, reopen is no longer resolving all
        deletes in the FG.

        But, yes, the clone happens on the first delete to arrive against a
        SegmentReader after it had been cloned in the NRT reader.

        • maybe IW.update() gets called a bunch more - do these modify the pooled but as-yet-unused SegmentReader? New readers in the pool? What?

        Just more buffering right now, but after LUCENE-2047, it will mark
        further bits in the already cloned vector. Ie, the clone happens only
        after getReader has returned a reader using that SegmentReader.

        • another call to getReader() comes in, and gets an IndexReader wrapping the pooled SegmentReaders.

        Each SegmentReader is cloned, and referenced by the reader returned by
        getReader. And then the next delete to arrive to these segments will
        force the bit vector to clone.
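        A stripped-down sketch of that copy-on-write rule (illustrative only,
        not SegmentReader's real code):

        class CowDeletedDocs {
          private long[] bits;
          private boolean sharedWithReader = false;  // set once getReader() exposes these bits

          CowDeletedDocs(int maxDoc) { bits = new long[(maxDoc + 63) / 64]; }

          long[] exposeToReader() {                  // called when getReader()/reopen hands out a reader
            sharedWithReader = true;
            return bits;
          }

          void delete(int docID) {
            if (sharedWithReader) {                  // first delete after exposure: clone the whole vector
              bits = bits.clone();
              sharedWithReader = false;
            }
            bits[docID >> 6] |= 1L << (docID & 63);
          }
        }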

        Michael McCandless added a comment -

        if you're indexing 300 documents a second (possibly all of which are delete+re-add), and querying at a thousand queries a second, how many of these BitVectors are you going to end up making?

        Hopefully not much more than a few per second?

        We should be careful what we measure to ensure that we're targeting the right use cases.

        Requirements calling for zero latency updates (all index changes are always visible) are often in error (i.e. it's usually not a true requirement).

        Right, I think testing reopening 100s of times per second isn't all
        that useful (most apps don't really need to do this).

        I think seeing results broken out according to reopen frequency is
        more helpful.

        Seems like almost all apps should be well served by second reopen resolution on average (with the ability to immediately reopen on demand). The only thing that would seem to need lower latency is when an automated client does an add, and then immediately does a query and needs to see it. In that case, that client could specify that they need an immediate reopen (either during the add or the query).

        To prevent against accidental or intentional denial-of-service for
        clients that do the add + immediate query, one could also sync such
        clients up to the reopen frequency.

        This would also provide for the clean semantics (like GData) of "once
        the 'update document' request returns, it's in the index", which I
        agree is a very convenient API semantics.

        Ie, if your reopen rate is 4x per second (once every 250 msec), then
        you could hold all add requests coming in until the reopen completes,
        then return those requests.

        So the API can still build the well defined semantics on top of
        Lucene, even if the reopen is rate limited under the hood.
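        A rough sketch of that "hold the add until the next reopen completes"
        idea (a hypothetical wrapper, not a Lucene API): updateDocument()
        buffers the change and then blocks on a latch that the single reopen
        thread trips every cycle, so by the time the call returns the change is
        searchable.

        class RateLimitedNrtIndex {
          private volatile java.util.concurrent.CountDownLatch nextReopen =
              new java.util.concurrent.CountDownLatch(1);

          // Called by the single reopen thread, e.g. once every 250 msec.
          void reopenCycle() {
            java.util.concurrent.CountDownLatch toRelease = nextReopen;
            nextReopen = new java.util.concurrent.CountDownLatch(1);  // later adds wait for the next cycle
            // ... reopen the NRT reader here; it sees everything indexed before the swap ...
            toRelease.countDown();  // release every add that was buffered before this reopen
          }

          // Called by client threads; returns only once the change is visible to searches.
          void updateDocumentAndWait(Runnable bufferTheChange) throws InterruptedException {
            bufferTheChange.run();  // e.g. buffer the add/delete in the writer
            nextReopen.await();     // wait for a reopen that started after this point
          }
        }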

        Michael McCandless added a comment -

        2) cpu and memory starvation - monitoring cpu and memory usage, the machine seems very starved, and I think that leads to performance differences more than the extra array lookup.

        CPU starvation is fully expected (this is a redline test).

        Memory starvation is interesting, because the bit vectors should all
        be transient, and should "die young" from the GC's standpoint. Plus
        these are all 1/8th the number of docs in RAM usage, and it's only
        those segments that have deletions whose bit vector is cloned. Are
        you starting from an optimized index?

        Oh, here's one idea: how many searches does your test allow to be
        in-flight at once? (Or: how large a thread pool are you using on the
        server?). Since you effectively reopen per search, each search will
        have dup'd the deleted docs. If you allow many searches in flight,
        that could explain it.

        Michael McCandless added a comment -

        Correction: We are NOT using BalancedMergePolicy for NRT test, only default is used. BalancedMergePolicy is only used inside zoie.

        OK.

        I've also tried doing separate delete then add in the indexer threads, and still I don't see any deletions getting lost... I can't repro this correctness issue.

        Jason Rutherglen added a comment -

        As far as the implementation of this patch goes... I'm thinking
        we can increase the page size, and do a simulated setNextPage
        type of deal in SegmentTermDocs. We'll start at the first page,
        if we hit an ArrayIndexOutOfBoundsException, we figure out what
        page they're trying to access and return true|false for the bit.
        We can continue this process of accessing iteratively on a page,
        until we hit the AIOOBE, then figure out again. I think this is
        a good approach because Java arrays already perform the access
        bounds check, hitting an exception hopefully shouldn't be too
        costly if it only happens a handful of times per posting
        iteration, and then we're avoiding the small but extra array
        access lookup for each bit. We'll be executing the same access pattern as
        today (i.e. random access on the byte array with about the same
        overhead, with the AIOOBE occurring when a page is completed).
        We can benchmark and see the performance difference.

        I think in general, we'll simply benchmark the options we've
        discussed 1) NRT 2) 1313 3) 1313 + pooling 4) 1313 + 1526 5)
        1313 + 1526 w/iterative sequential page access.
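        A hedged sketch of that access pattern (hypothetical, not the actual
        patch): read bits straight off the current page with no per-call page
        bookkeeping, and only when the ArrayIndexOutOfBoundsException fires
        recompute which page the doc lives on.

        class PagedGet {
          private final byte[][] pages;     // fixed-size pages of packed bits
          private final int pageSizeBytes;
          private byte[] currentPage;
          private int currentPageStartBit;  // first bit covered by currentPage

          PagedGet(byte[][] pages, int pageSizeBytes) {
            this.pages = pages;
            this.pageSizeBytes = pageSizeBytes;
            this.currentPage = pages[0];
          }

          boolean get(int bit) {
            try {
              int byteInPage = (bit - currentPageStartBit) >> 3;
              return (currentPage[byteInPage] & (1 << (bit & 7))) != 0;
            } catch (ArrayIndexOutOfBoundsException e) {
              // Rare path: the doc fell off the current page; switch pages and retry.
              int page = (bit >> 3) / pageSizeBytes;
              currentPage = pages[page];
              currentPageStartBit = page * pageSizeBytes * 8;
              return get(bit);
            }
          }
        }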

        John Wang added a comment -

        Correction: We are NOT using BalancedMergePolicy for NRT test, only default is used. BalancedMergePolicy is only used inside zoie.

        Michael McCandless added a comment -

        These sound serious - if you can provide any details, that'd help. I'll do some stress testing too. Thanks for testing and reporting

        Out of these, the specific issue of incorrectness of applied deletes is easiest to see - we saw it by indexing up to a million docs, then keep adding docs but only after doing a delete on the UID where UID instead of increasing, is looped around mod 1million. Calling numDocs (not maxDoc) on the reader with Zoie always returns 1M after looping around, but with NRT, it starts slowly growing above 1M.

        So far I've had no luck repro'ing this. I have a 5M doc wikipedia
        index. Then I created an alg with 2 indexing threads (each replacing
        docs at 100 docs/sec), and reopening ~ 60 times per second. Another
        thread then verifies that the docCount is always 5M. It's run fine
        for quite a while now...

        Hmm maybe I need to try the balanced merge policy? That would be
        spooky if it caused the issue...

        Jake Mannix added a comment -

        These sound serious - if you can provide any details, that'd help. I'll do some stress testing too. Thanks for testing and reporting

        Out of these, the specific issue of incorrectness of applied deletes is easiest to see - we saw it by indexing up to a million docs, then keep adding docs but only after doing a delete on the UID where UID instead of increasing, is looped around mod 1million. Calling numDocs (not maxDoc) on the reader with Zoie always returns 1M after looping around, but with NRT, it starts slowly growing above 1M.

        The CPU and memory usage is undoubtedly due to the constant reopening of these readers, and yes, could be alleviated by not doing this - we're just comparing to the zoie case, where we do reopen (the RAMDir) on every request (and copy the delSet) if there have been modifications since the last update.

        Lucene NRT makes a clone of the BitVector for every reader that has new deletions. Once this is done, searching is "normal" - it's as if the reader were a disk reader. There's no extra checking of deleted docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.

        Ok, so if this is copy-on-write, it's done every time there is a new delete for that segment? If the disk index is optimized that means it would happen on every update, a clone of the full numDocs sized BitVector? I'm still a little unsure of how this happens.

        • somebody calls getReader() - they've got all the SegmentReaders for the disk segments, and each of them have BitVectors for deletions.
        • IW.update() gets called - the BitVector for the segment which now has a deletion is cloned, and set on a new pooled SegmentReader as its deletedSet
        • maybe IW.update() gets called a bunch more - do these modify the pooled but as-yet-unused SegmentReader? New readers in the pool? What?
        • another call to getReader() comes in, and gets an IndexReader wrapping the pooled SegmentReaders.

        Yes, this makes Lucene's reopen more costly. But, then there's no double checking for deletions. That's the tradeoff, and this is why the 64 msec is added to Zoie's search time. Zoie's searches are slower.

        So we re-ran some of our tests last night, commenting out our deleted check to measure its cost in the most extreme case possible: a dead easy query (in that it's only one term), but one which yes, hits the entire index (doing a MatchAllDocs query is actually special-cased in our code, and is perfectly fast, so not a good worst case to check), and as the index grows up above a million documents, zoie could shave somewhere from 22-28% of its time off by not doing the extra check.

        We haven't re-run the test to see what happens as the index grows to 5M or 10M yet, but I can probably run that later today.

        The fact that Zoie on the pure indexing case (ie no deletions) was 10X faster than Lucene is very weird - that means something else is up, besides how deletions are carried in RAM. It's entirely possible it's the fact that Lucene doesn't flush the tiny segments to a RAMDir (which LUCENE-1313 addresses).

        Yeah, if you call getReader() a bunch of times per second, each one does a flush(true,true,true), right? Without having LUCENE-1313, this kills the indexing performance if querying is going on. If no getReader() is being called at all, Zoie is about 10% slower than pure Lucene IndexWriter.add() (that's the cost of doing it in two steps - index into two RAMDirs [so they are hot-swappable] and then writing segments to disk with addIndexesNoOptimize() periodically).

        I don't think there's any difference in the MergePolicy - I think they're both using the BalancedMergePolicy (since that's the one which is optimized for the realtime case).

        Actually I thought of a simple way to run the "search only" (not reopen) test - I'll just augment TopScoreDocCollector to optionally check the IntSetAccelerator, and measure the cost in practice, for different numbers of docs added to the IntSet.

        Due to the bloomfilter living on top of the hashSet, at least at the scales we're dealing with, we didn't see any change in cost due to the number of deletions (zoie by default keeps no more than 10k modifications in memory before flushing to disk, so the biggest the delSet is going to be is that, and we don't see the more-than-constant scaling yet at that size).

        But your test is missing a dimension: frequency of reopen. If you reopen once per second, how do Zoie/Lucene compare? Twice per second? Once every 5 seconds? Etc.

        Yep, this is true. It's a little more invasive to put this into Zoie, because the reopen time is so fast that there's no pooling, so it would need to be kinda hacked in, or tacked on to the outside. Not rocket science, but not just the change of a parameter.

        LinkedIn doesn't have any hard requirements of having to reopen hundreds of times per second, we're just stressing the system, to see what's going on. As you can see, nobody's filing a bug here that Lucene NRT is "broken" because it can't handle zero-latency updates. What we did try to make sure was in the system was determinism: not knowing whether an update will be seen because there is some background process doing addIndexes from another thread which hasn't completed, or not knowing how fresh the pooled reader is, that kind of thing.

        This kind of determinism can certainly be gotten with NRT, by locking down the IndexWriter wrapped up in another class to keep it from being monkeyed with by other threads, and then tuning exactly how often the reader is reopened, and then dictate to clients that the freshness is exactly at or better than this freshness timeout, sure. This kind of user-friendliness is one of Zoie's main points - it provides an indexing system which manages all this, and certainly for some clients, we should add in the ability to pool the readers for less real-timeness, that's a good idea.

        Of course, if your reopen() time is pretty heavy (lots of FieldCache data / other custom faceting data needs to be loaded for a bunch of fields), then at least for us, even not needing zero-latency updates means that the more realistically 5-10% degradation in query performance for normal queries is negligible, and we get deterministic zero-latency updates as a consequence.

        This whole discussion reminded me that there's another realtime update case, which neither Zoie nor NRT is properly optimized for: the absolutely zero deletes case with very fast indexing load and the desire for minimal latency of updates (imagine that you're indexing twitter - no changes, just adds), and you want to be able to provide a totally stream-oriented view on things as they're being added (matching some query, naturally) with sub-second turnaround. A subclass of SegmentReader which is constructed which doesn't even have a deletedSet could be instantiated, and the deleted check could be removed entirely, speeding things up even further.

        Yonik Seeley added a comment -

        if you're indexing 300 documents a second (possibly all of which are delete+re-add), and querying at a thousand queries a second, how many of these BitVectors are you going to end up making?

        Hopefully not much more than a few per second?

        We should be careful what we measure to ensure that we're targeting the right use cases. Seems like almost all apps should be well served by second reopen resolution on average (with the ability to immediately reopen on demand). The only thing that would seem to need lower latency is when an automated client does an add, and then immediately does a query and needs to see it. In that case, that client could specify that they need an immediate reopen (either during the add or the query).

        Requirements calling for zero latency updates (all index changes are always visible) are often in error (i.e. it's usually not a true requirement).

        Michael McCandless added a comment -

        1) file handle leak - Our prod-quality machine fell over after 1 hr of running using NRT due to file handle leaking.
        2) cpu and memory starvation - monitoring cpu and memory usage, the machine seems very starved, and I think that leads to performance differences more than the extra array lookup.
        3) I am seeing also correctness issues as well, e.g. deletes don't get applied correctly. I am not sure about the unit test coverage for NRT to comment specifically.

        These sound serious – if you can provide any details, that'd help.
        I'll do some stress testing too. Thanks for testing and reporting

        Yes, so how does Lucene NRT deal with new deletes?

        Lucene NRT makes a clone of the BitVector for every reader that has
        new deletions. Once this is done, searching is "normal" – it's as if
        the reader were a disk reader. There's no extra checking of deleted
        docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.

        Yes, this makes Lucene's reopen more costly. But, then there's no
        double checking for deletions. That's the tradeoff, and this is why
        the 64 msec is added to Zoie's search time. Zoie's searches are
        slower.
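
        A rough sketch of that copy-on-write pattern follows, with java.util.BitSet standing in for the deleted-docs BitVector; this illustrates the tradeoff being described, not the actual SegmentReader code:

            import java.util.BitSet;

            // Copy-on-write deleted docs for one segment: cloning happens once per
            // reader that sees new deletions; searching against the snapshot is then
            // a plain bit check with no extra work per hit.
            class CowDeletedDocs {
              private BitSet bits;            // current deleted docs for this segment
              private boolean shared = true;  // true once a snapshot has been handed to a reader

              CowDeletedDocs(int maxDoc) {
                bits = new BitSet(maxDoc);
              }

              // Called when a doc in this segment is deleted.
              synchronized void delete(int docID) {
                if (shared) {
                  bits = (BitSet) bits.clone();   // the full copy whose cost is being debated
                  shared = false;
                }
                bits.set(docID);
              }

              // Called when a near real-time reader is opened over this segment.
              synchronized BitSet snapshotForReader() {
                shared = true;
                return bits;
              }
            }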

        The fact that Zoie on the pure indexing case (ie no deletions) was 10X
        faster than Lucene is very weird – that means something else is up,
        besides how deletions are carried in RAM. It's entirely possible it's
        the fact that Lucene doesn't flush the tiny segments to a RAMDir
        (which LUCENE-1313 addresses). Or, maybe there's another difference
        in that test (eg, MergePolicy?). Jake or John, if you could shed some
        light on any other specific differences in that test, that would help.

        This is simply a question of trade-offs.

        Precisely: Zoie has faster reopen time, but slower search time. But
        we haven't yet measured how much slower Zoie's searches are.

        Actually I thought of a simple way to run the "search only" (not
        reopen) test – I'll just augment TopScoreDocCollector to optionally
        check the IntSetAccelerator, and measure the cost in practice, for
        different numbers of docs added to the IntSet.
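
        Such an augmented collector might look roughly like this. It is a sketch only: the int[] UID array and the java.util.Set stand in for Zoie's UID FieldCache array and IntSetAccelerator, and the Collector methods are the 2.9-era API:

            import java.io.IOException;
            import java.util.Set;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.search.Collector;
            import org.apache.lucene.search.Scorer;

            // Wraps any Collector (e.g. a TopScoreDocCollector) and skips hits whose UID
            // is in the in-memory delete set - the Zoie-style extra cost per candidate hit
            // is one array lookup plus one set lookup.
            class DeleteCheckingCollector extends Collector {
              private final Collector delegate;
              private final int[] uidByDoc;          // global docID -> UID
              private final Set<Integer> deletedUids;
              private int docBase;

              DeleteCheckingCollector(Collector delegate, int[] uidByDoc, Set<Integer> deletedUids) {
                this.delegate = delegate;
                this.uidByDoc = uidByDoc;
                this.deletedUids = deletedUids;
              }

              public void setScorer(Scorer scorer) throws IOException {
                delegate.setScorer(scorer);
              }

              public void collect(int doc) throws IOException {
                if (!deletedUids.contains(uidByDoc[docBase + doc])) {
                  delegate.collect(doc);
                }
              }

              public void setNextReader(IndexReader reader, int docBase) throws IOException {
                this.docBase = docBase;
                delegate.setNextReader(reader, docBase);
              }

              public boolean acceptsDocsOutOfOrder() {
                return delegate.acceptsDocsOutOfOrder();
              }
            }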

        BTW, is there a performance benchmark/setup for lucene NRT?

        In progress – see LUCENE-2050.

        Aiming for maxing out indexing speed and query throughput at the same time is what we're testing here, and this is a reasonable extreme limit to aim for when stress-testing real-time search.

        But your test is missing a dimension: frequency of reopen. If you
        reopen once per second, how do Zoie/Lucene compare? Twice per second?
        Once every 5 seconds? Etc.

        It sounds like LinkedIn has a hard requirement that the reopen must
        happen hundreds of times per second, which is perfectly fine. That's
        what LinkedIn needs. But other apps have different requirements, and
        so to make an informed decision they need to see the full picture.

        Jake Mannix added a comment -

        Whoa, pretty insane volume.

        Aiming for maxing out indexing speed and query throughput at the same time is what we're testing here, and this is a reasonable extreme limit to aim for when stress-testing real-time search.

        A handful by pooling the BitVector fixed size bytes arrays (see LUCENE-1574).

        Pooling, you say? But what if updates come in too fast to reuse your pool? If you're indexing at the speeds I'm describing, won't you run out of BitVectors in the pool?

        I really need a solution that will absolutely not affect query performance from what it is today

        "You" really need this? Why is the core case for real-time search a scenario where taking a hit of a huge reduction in throughput worth a possible gain in query latency? If the cost was 20% query latency drop in exchange for 7x throughput cost when doing heavy indexing, is that worth it? What about 10% latency cost vs 2x throughput loss? These questions aren't easily answered by saying real-time search with Lucene needs to absolutely not affect query performance from what it is today. These kinds of absolute statements should be backed up by comparisons with real performance and load testing.

        There are many axes of performance to optimize for:

        • query latency
        • query throughput
        • indexing throughput
        • index freshness (how fast before documents are visible)

        Saying that one of these is absolutely of more importance than the others without real metrics showing which ones are affected in which ways by different implementation choices is doing a disservice to the community, and is not by any means "conservative".

        Jason Rutherglen added a comment -

        300 documents a second

        Whoa, pretty insane volume.

        how many of these BitVectors are you going to end up making?

        A handful by pooling the BitVector fixed size bytes arrays (see
        LUCENE-1574). I'm not sure if the synchronization on the pool
        will matter. If it does, we can use ConcurrentHashMap like
        Solr's LRUCache. Granted, JVMs are supposed to be able to handle
        rapid allocation efficiently, however I can't see the overhead
        of pooling being too significant. If it is, there's always the
        default of allocating new BVs.
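
        A minimal sketch of that pooling idea follows (ByteArrayPool is a made-up name, not a Lucene or Solr class; it assumes the caller knows when a cloned vector is no longer referenced by any open reader):

            import java.util.Queue;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentLinkedQueue;
            import java.util.concurrent.ConcurrentMap;

            // Reuses fixed-size byte[] backing arrays for cloned deleted-docs vectors,
            // keyed by length since different segments have different maxDoc.
            class ByteArrayPool {
              private final ConcurrentMap<Integer, Queue<byte[]>> free =
                  new ConcurrentHashMap<Integer, Queue<byte[]>>();

              byte[] takeCopyOf(byte[] source) {
                Queue<byte[]> q = free.get(source.length);
                byte[] dest = (q == null) ? null : q.poll();
                if (dest == null) {
                  dest = new byte[source.length];   // pool empty: fall back to plain allocation
                }
                System.arraycopy(source, 0, dest, 0, source.length);
                return dest;
              }

              // Called once a cloned vector is no longer referenced by any reader.
              void release(byte[] bits) {
                Queue<byte[]> q = free.get(bits.length);
                if (q == null) {
                  q = new ConcurrentLinkedQueue<byte[]>();
                  Queue<byte[]> prev = free.putIfAbsent(bits.length, q);
                  if (prev != null) {
                    q = prev;
                  }
                }
                q.offer(bits);
              }
            }

        One open question, raised elsewhere in this thread, is the release() side: the pool only helps if clones are returned faster than new readers demand them.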

        I really need a solution that will absolutely not affect query
        performance from what it is today. Personally, pooling is the
        safest route for me to use in production as then, there are no
        worries about slowing down queries with alternative deleted docs
        mechanisms. And the memory allocation is kept within scope. The
        overhead is System.arraycopy, which no doubt will be
        insignificant for my use case.

        http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.5
        http://www.javapractices.com/topic/TopicAction.do?Id=3

        I suppose if one has fairly simple queries and is willing to
        sacrifice query performance for update rate, then other deleted
        docs mechanisms may be a desired solution. I need to implement a more
        conservative approach.

        Jake Mannix added a comment -

        Zoie must do the IntSet check plus the BitVector check (done by Lucene), right?

        Yes, so how does Lucene NRT deal with new deletes? The disk-backed IndexReader still does its internal check for deletions, right? I haven't played with the latest patches on LUCENE-1313, so I'm not sure what has changed, but if you're leaving the disk index alone (to preserve point-in-time status of the index without writing to disk all the time), you've got your in-memory BitVector of newly uncommitted deletes, and then the SegmentReaders from the disk have their own internal deletedDocs BitVector. Are these two OR'ed with each other somewhere? What is done in NRT to minimize the time of checking both of these without modifying the read-only SegmentReader? In the current 2.9.0 code, the segment is reloaded completely on getReader() if there are new add/deletes, right?

        Ie comparing IntSet lookup vs BitVector lookup isn't the comparison you want to do. You should compare the IntSet lookup (Zoie's added cost) to 0.

        If you've got that technique for resolving new deletes against the disk-based ones while maintaining point-in-time nature and can completely amortize the reopen cost so that it doesn't affect performance, then yeah, that would be the comparison. I'm not sure I understand how the NRT implementation is doing this currently - I tried to step through the debugger while running the TestIndexWriterReader test, but I'm still not quite sure what is going on during the reopen.

        So, for a query that hits 5M docs, Zoie will take 64 msec longer than Lucene, due to the extra check. What I'd like to know is what pctg. slowdown that works out to be, eg for a simple TermQuery that hits those 5M results - that's Zoie's worst case search slowdown.

        Yes, this is a good check to see, though it is still a micro-benchmark, really, since it would be done in isolation, with no other production tasks going on (like rapid indexing and the consequent flushes to disk and reader reopening), but it would be useful to see.

        What would be even better, however, would be to have a running system whereby there is continual updating of the index, and many concurrent requests are coming in which hit all 5M documents, and measure the mean latency for zoie in this case, in both comparison to NRT, and in comparison to lucene when you don't reopen the index (ie. you do things the pre-lucene2.9 way, where the CPU is still being consumed by indexing, but the reader is out of date until the next time it's scheduled by the application to reopen). This would measure the effective latency and throughput costs of zoie and NRT vs non-NRT lucene. I'm not really sure it's terribly helpful to see "what is zoie's latency when you're not indexing at all" - why on earth would you use either NRT or zoie if you're not doing lots of indexing?

        John Wang added a comment -

        Wrote a little program on my Mac Pro (8 core, 16GB mem, beefy machine).
        100 threads mimic real load; each thread loops 100 times doing just a new on a BitVector for an index of numDocs (configurable), and we take the average time.

        5M docs, 16 - 18 ms overhead
        10M docs, 35 - 40 ms overhead.

        That is not insignificant.
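
        The benchmark code itself wasn't posted; a hedged reconstruction of what it describes might look like the following, with a plain byte[] standing in for BitVector's backing array, so absolute numbers will differ:

            import java.util.concurrent.CountDownLatch;
            import java.util.concurrent.atomic.AtomicLong;

            // 100 threads, each allocating a deleted-docs-sized bit array 100 times,
            // reporting the average per-thread time.
            public class AllocBench {
              public static void main(String[] args) throws InterruptedException {
                final int numDocs = args.length > 0 ? Integer.parseInt(args[0]) : 5000000;
                final int loops = 100;
                final int threads = 100;
                final AtomicLong totalNanos = new AtomicLong();
                final CountDownLatch done = new CountDownLatch(threads);
                for (int t = 0; t < threads; t++) {
                  new Thread() {
                    public void run() {
                      long start = System.nanoTime();
                      long touched = 0;
                      for (int i = 0; i < loops; i++) {
                        byte[] bits = new byte[(numDocs >> 3) + 1];  // what a BitVector clone must allocate
                        touched += bits.length;                      // keep the allocation observable
                      }
                      totalNanos.addAndGet(System.nanoTime() - start);
                      if (touched < 0) System.out.println("unreachable");
                      done.countDown();
                    }
                  }.start();
                }
                done.await();
                System.out.println("avg per-thread ms: " + (totalNanos.get() / 1e6 / threads));
              }
            }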

        Jake Mannix added a comment -

        But Jason, if you're indexing 300 documents a second (possibly all of which are delete+re-add), and querying at a thousand queries a second, how many of these BitVectors are you going to end up making?

        Jason Rutherglen added a comment -

        The BitVector memory consumption is 125,000 bytes/122K for 1 million documents. I'd be surprised if copying a byte array has a noticeable performance impact.
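
        (For scale, the arithmetic behind that figure: 1,000,000 docs / 8 bits per byte = 125,000 bytes, about 122 KB; the same calculation gives roughly 610 KB per clone at 5M docs and 1.2 MB at 10M docs, the index sizes measured elsewhere in this thread.)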

        John Wang added a comment -

        Zoie will take 64 msec longer than Lucene, due to the extra check.

        That is not true. If you look at the report closely, it is a 20ms difference; 64ms is the total time. (After I turned on -server, the diff is about 10ms.) This is running on my laptop, hardly a production server.

        This is also assuming the entire corpus is returned, where we should really take an average of the result set from the query log.

        However, to save this "overhead", using BitVector is wasting a lot of memory, which is expensive to clone, allocate and gc. In a running system, much of that cost is hard to measure. This is simply a question of trade-offs.

        Again, I would suggest to run the tests yourself; after all, it is open source, so make decisions for yourself. This way, we can get a better understanding from concrete numbers and scenarios.

        BTW, is there a performance benchmark/setup for lucene NRT?

        The tests so far are really testing Zoie's reopen time vs Lucene's

        That is not true either. This test is simply testing searching with indexing turned on. Not specific to re-open. I don't think the statement that the performance difference is solely due to reopen is substantiated. I am seeing the following with NRT:

        e.g.
        1) file handle leak - Our prod-quality machine fell over after 1 hr of running using NRT due to file handle leaking.
        2) cpu and memory starvation - monitoring cpu and memory usage, the machine seems very starved, and I think that leads to performance differences more than the extra array lookup.
        3) I am seeing also correctness issues as well, e.g. deletes don't get applied correctly. I am not sure about the unit test coverage for NRT to comment specifically.

        Again, this can all be specific to my usage of NRT or the test setup. That is why I urge you guys to run our tests yourself and correct us if you see areas we are missing to make a fair comparison.

        Michael McCandless added a comment -

        we need to see it in the real-world context of running actual worst case queries.

        Isn't checking every document in the corpus for deletes the worst case? e.g. first test?

        Zoie must do the IntSet check plus the BitVector check (done by
        Lucene), right?

        Ie comparing IntSet lookup vs BitVector lookup isn't the comparison
        you want to do. You should compare the IntSet lookup (Zoie's added
        cost) to 0.

        So, for a query that hits 5M docs, Zoie will take 64 msec longer than
        Lucene, due to the extra check. What I'd like to know is what
        pctg. slowdown that works out to be, eg for a simple TermQuery that
        hits those 5M results – that's Zoie's worst case search slowdown.

        at the expense of slower query time

        According to the test, Zoie's query time is faster.

        The tests so far are really testing Zoie's reopen time vs Lucene's.

        To test the query time, you should set up Zoie w/ some pending
        deletions, then turn off all indexing, then run the query test.

        That would give us both extreme datapoints – how much slower Lucene
        is at reopening (which the current tests show), and how much slower
        Zoie is during searching.

        it must double-check the deletions.

        True, this double-check is only done for a candidate for a hit from the underlying query.

        Right, Zoie's search slowdown is in proportion to the size of the
        result set. So eg for hard queries that produce very few results, the
        impact will be negligible. For simple queries that produce lots of
        results, the relative cost is highest (but we don't yet know what it
        actually is in practice).

        It could still be negligible, eg since the "if" rarely triggers, the CPU
        should be able to predict it just fine.

        Net/net, Zoie has faster reopen time than Lucene, but then pays a
        higher price (double check for deletion) for every result of every
        search. Users need to know what that price really is, in order to
        make informed decision about which approach is best for their
        situation.

        Normally result set is much smaller than the corpus, the overhead is not large.

        Well, this is generally app dependent, and it's the net/net worst case
        queries that apps need to worry about. Lucene can't [yet] take
        advantage of concurrency within a single query, so you're forced to
        shard (= big step up in deployment complexity) once your worst case
        queries get too slow.

        Can you describe the setup of the "indexing only" test?

        starting off with an empty index and keep on adding documents, at the same time, for each search request, return a reader for the current state of the indexing. Our test assumes 10 concurrent threads making search calls.

        Oh I see: that test is just adding documents, vs the 2nd test which is
        doing updateDocument. Got it.

        So, that's interesting... because, with no deletions, thus no
        resolving of Term -> docID, and no cloning of the BitVector, Lucene's
        reopen is still quite a bit slower.

        What differences remain at this point? Just the fact that the RAM dir
        is being used to flush the new [tiny] segments? Hmm what about the
        merge policy?

        John Wang added a comment -

        we need to see it in the real-world context of running actual worst case queries.

        Isn't checking every document in the corpus for deletes the worst case? e.g. first test?

        at the expense of slower query time

        According to the test, Zoie's query time is faster.

        it must double-check the deletions.

        True, this double-check is only done for a candidate for a hit from the underlying query. Normally result set is much smaller than the corpus, the overhead is not large. The overhead is 1 array lookup + a delset look up vs. 1 bitvector lookup.

        Can you describe the setup of the "indexing only" test?

        starting off with an empty index and keep on adding documents, at the same time, for each search request, return a reader for the current state of the indexing. Our test assumes 10 concurrent threads making search calls.

        Michael McCandless added a comment -

        One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

        That sounds interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal, where would this mark live?

        Whenever a query happens to "discover" a pending deletion, you could
        record the UID -> docID mapping somewhere, and then when it's time to
        flush the deletes you don't need to re-resolve the UIDs that the query
        had resolved for you. Likely in practice this would be a tiny
        speedup, though, so the added complexity is probably not worth it.
        Especially since this resolution is done in the BG for Zoie...

        Michael McCandless added a comment -

        Thanks for running these tests John.

        The micro-benchmark of BitVector vs IntAccelerator is nice, but we
        need to see it in the real-world context of running actual worst case
        queries.

        Zoie aims for super fast reopen time, at the expense of slower query
        time since it must double-check the deletions.

        Lucene NRT makes the opposite tradeoff.

        The tests so far make it clear that Zoie's reopen time is much faster
        than Lucene's, but they don't yet measure (as far as I can see) what
        cost the double-check for deletions is adding to Zoie for the
        worst-case queries.

        So if you really need to reopen 100s of times per second, and can
        accept that your worst case queries will run slower (we're still not
        sure just how much slower), the Zoie approach is best.

        If you want full speed query performance, and can instead reopen once
        per second or once every 2 seconds, Lucene's approach will be better
        (though we still have important fixes to make – LUCENE-2047,
        LUCENE-1313).

        Can you describe the setup of the "indexing only" test? Are you doing
        any reopening at all?

        John Wang added a comment -

        I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

        I wrote a small program on my laptop: 100 docs in the set, iterating through 5M numbers and calling contains().
        I see 44 ms with BitVector and 64ms with IntAccelerator backed by IntOpenHashSet (from fastUtil)

        This is however an extreme case, so test 2, I chose 5000 docs from the set, e.g. mod 1000 to be a candidate for check. And both sets performed equally, around 45ms.
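
        A hedged sketch of that micro-benchmark is below; java.util.BitSet and HashSet stand in here for Lucene's BitVector and the fastUtil IntOpenHashSet, so the absolute times will not match the ones quoted:

            import java.util.BitSet;
            import java.util.HashSet;
            import java.util.Random;
            import java.util.Set;

            // Iterate over every docID in a 5M-doc "corpus" and ask each structure
            // whether the doc is deleted, timing the two kinds of lookup.
            public class ContainsBench {
              public static void main(String[] args) {
                int corpus = 5000000;
                int deleted = 100;            // the second run described above used 5000
                BitSet bits = new BitSet(corpus);
                Set<Integer> delSet = new HashSet<Integer>();
                Random r = new Random(42);
                for (int i = 0; i < deleted; i++) {
                  int doc = r.nextInt(corpus);
                  bits.set(doc);
                  delSet.add(doc);
                }

                long t0 = System.currentTimeMillis();
                int hits = 0;
                for (int doc = 0; doc < corpus; doc++) {
                  if (bits.get(doc)) hits++;
                }
                long bitMs = System.currentTimeMillis() - t0;

                t0 = System.currentTimeMillis();
                for (int doc = 0; doc < corpus; doc++) {
                  if (delSet.contains(doc)) hits++;
                }
                long setMs = System.currentTimeMillis() - t0;

                System.out.println("bit vector: " + bitMs + " ms, hash set: " + setMs + " ms (deleted hits: " + hits + ")");
              }
            }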

        So with the memory cost, and the allocations and clones of the BitVector, I think for us at least, using the IntSetAccelerator works well.

        why does each thread make a full clone of the AcceleratedBitSet?

        These are for updates, e.g. you updated doc x, it is updated to the ramdir, but it is already on the disk dir. So at query time, you need this set for dup removal.

        I'd love to see this too.

        Some more details on the test we ran:

        NRT - indexing only
        ***********************************************************
        SUMMARY:
        ***********************************************************
        TOTAL TRANSACTIONS: 622201
        TOTAL EXECUTIONS: 622201
        TOTAL SUCCESSFUL EXECUTIONS: 622201
        TOTAL FAILED EXECUTIONS: 0
        TOTAL RUNTIME IN MINS: 30.07
        INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
        ***********************************************************

        zoie - indexing only
        SUMMARY:
        ***********************************************************
        TOTAL TRANSACTIONS: 6265384
        TOTAL EXECUTIONS: 6265384
        TOTAL SUCCESSFUL EXECUTIONS: 6265384
        TOTAL FAILED EXECUTIONS: 0
        TOTAL RUNTIME IN MINS: 30.07
        INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
        ***********************************************************

        zoie - update
        SUMMARY:
        ***********************************************************
        TOTAL TRANSACTIONS: 1923592
        TOTAL EXECUTIONS: 1923592
        TOTAL SUCCESSFUL EXECUTIONS: 1923592
        TOTAL FAILED EXECUTIONS: 0
        TOTAL RUNTIME IN MINS: 30.07
        INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
        ***********************************************************

        nrt - update

        SUMMARY:
        ***********************************************************
        TOTAL TRANSACTIONS: 399893
        TOTAL EXECUTIONS: 399893
        TOTAL SUCCESSFUL EXECUTIONS: 399893
        TOTAL FAILED EXECUTIONS: 0
        TOTAL RUNTIME IN MINS: 30.07
        INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
        ***********************************************************

        Latencies:

        Zoie - insert test: linear growth from 1 ms to 5 ms as the index grew over the duration of the test from 0 docs to 660k docs.
        Zoie - update test: averaged 9 ms, with continuous updates and the index staying at 1M docs.
        NRT - insert test: fluctuated between 17 ms and 50 ms as the index grew over the duration of the test from 0 docs to 220 docs.
        NRT - update test: a big peak when querying started; latency spiked up to 550 ms, then dropped and stayed steady at 50 ms, with continuous updates keeping the index at 1M docs.

        One observation from the NRT update test: I am seeing some delete issues, e.g. realtime deletes do not seem to be reflected, and indexing speed dropped sharply.

        It's quite possible that I am not using NRT in the most optimal way in my setup. Feel free to run the tests yourself; I'd be happy to help with the setup.
        One thing with Zoie is that it is a full stream-indexing system with a pluggable realtime engine, so you can actually use Zoie for perf testing NRT.

        One thing about the test to stress: we are testing realtime updates, so buffering indexing events up and flushing once in a while is not realtime; Katta has already achieved good results with batch indexing with just minutes of delay, without making any internal changes to Lucene.

        Jake Mannix added a comment -

        But how many msec does this clone add in practice? Note that it's only done if there is a new deletion against that segment. I do agree it's silly wasteful, but searching should then be faster than using AccelerateIntSet or MultiBitSet. It's a tradeoff of the turnaround time for search perf.

        I actually don't know for sure if this is the majority of the time, as I haven't actually run either the AcceleratedIntSet or 2.9 NRT through a profiler, but if you're indexing at high speed (which is what is done in our load/perf tests), you're going to be cloning these things hundreds of times per second (look at the indexing throughput we're forcing the system to go through), and even if it's fast, that's costly.

        I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

        It's pretty easy to change the index and query files in our test to do that, that's a good idea. You can feel free to check out our load testing framework too - it will let you monkey with various parameters, monitor the whole thing via JMX, and so forth, both for the full zoie-based stuff, and where the zoie api is wrapped purely around Lucene 2.9 NRT. The instructions for how to set it up are on the zoie wiki.

        When a doc needs to be updated, you index it immediately into the RAMDir, and reopen the RAMDir's IndexReader. You add its UID to the AcceleratedIntSet, and all searches are "and NOT"'d against that set. You don't tell Lucene to delete the old doc, yet.

        Yep, basically. The IntSetAccelerator (of UIDs) is set on the (long-lived) IndexReader for the disk index - this is why it's done as a ThreadLocal - everybody is sharing that IndexReader, but different threads have different point-in-time views of how much of it has been deleted.
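
        A minimal sketch of that copy-on-read ThreadLocal pattern (illustrative only - the names are made up and this is not Zoie's actual IntSetAccelerator code): each searcher thread takes a private snapshot of the shared UID set the first time it reads after the set has changed, so its view stays stable for the duration of its search.

import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

// Hypothetical holder for the "updated since last flush" UIDs, shared by all
// searcher threads via the long-lived disk IndexReader.
public class ModSetHolder {
    private volatile IntOpenHashSet master = new IntOpenHashSet();
    private volatile long version = 0;

    // per-thread snapshot plus the version it was copied at
    private final ThreadLocal<long[]> localVersion = new ThreadLocal<long[]>() {
        @Override protected long[] initialValue() { return new long[] { -1 }; }
    };
    private final ThreadLocal<IntOpenHashSet> localCopy = new ThreadLocal<IntOpenHashSet>();

    // indexing thread: record another updated UID (publish a new master copy)
    public synchronized void add(int uid) {
        IntOpenHashSet next = new IntOpenHashSet(master);
        next.add(uid);
        master = next;
        version++;
    }

    // search thread: get a point-in-time view (copy-on-read)
    public IntOpenHashSet snapshot() {
        long[] v = localVersion.get();
        if (v[0] != version) {
            localCopy.set(new IntOpenHashSet(master));  // clone only when something changed
            v[0] = version;
        }
        return localCopy.get();
    }
}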

        These are great results! If I'm reading them right, it looks like generally you get faster query throughput, and roughly equal indexing throughput, on upgrading from 2.4 to 2.9?

        That's about right. Of course, the comparison between zoie with either 2.4 or 2.9 against lucene 2.9 NRT is an important one to look at: zoie is pushing about 7-9x better throughput for both queries and indexing than NRT.

        I'm sure the performance numbers would change if we relaxed the realtimeness, yes; that's one of the many dimensions to consider here (along with the percentage of indexing events which are deletes, how many of those are from really old segments vs. newer ones, how big the queries are, etc.).

        One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

        That sounds interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal - where would this mark live?

        Michael McCandless added a comment -

        OK I opened LUCENE-2047, to resolve deletedDoc(s) to their docID(s) in the foreground.

        Michael McCandless added a comment -

        What's missing as it pertains to this jira issue is the raw
        query speed (meaning the time a query takes to execute, not QPS,
        over various percentages of deleted docs), without concurrent
        indexing.

        I'd love to see this too. Ie, this would measure the query performance when deletes are checked against AcceleratedIntSet, and wouldn't measure the reopen cost (which Lucene NRT will be slower at) at all. Specifically I'd love to see the worst case queries. It's those queries that drive the cutover to a sharded architecture.

        Ie, we need to separately measure query performance vs reopen performance.

        Jake Mannix added a comment -

        I'll try to get those numbers for you, they should be in our logs, and if not, it's easy enough to put them. My guess is that if zoie is doing 7x the QPS, the latency is significantly less than the NRT latency, not more, but I could be wrong.

        Note that without concurrent indexing, the query speed will be the same, as all docs will be flushed to disk and the isDeleted check reduces to exactly the raw lucene case.

        Jason Rutherglen added a comment -

        check out the zoie perf pages:
        http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

        What's missing as it pertains to this jira issue is the raw
        query speed (meaning the time a query takes to execute, not QPS,
        over various percentages of deleted docs), without concurrent
        indexing.

        If query speed goes down, users need to know by how much.

        Michael McCandless added a comment -

        But, I agree it's wasteful of space when deletes are so sparse... though it is fast.

        It's fast for random access, but it's really slow if you need to make a lot of these (either during heavy indexing if copy-on-write, or during heavy query load if copy-on-reopen).

        But how many msec does this clone add in practice?

        Note that it's only done if there is a new deletion against that
        segment.

        I do agree it's silly wasteful, but searching should then be faster
        than using AccelerateIntSet or MultiBitSet. It's a tradeoff of the
        turnaround time for search perf.

        I'd love to see how the worst-case queries (matching millions of hits)
        perform with each of these three options.

        However, I suspect it's not the clone time that's very costly... I bet
        it's the fact that Lucene has to resolve the deletions to docIDs, in
        the foreground of reopen, that dominates. And probably also that
        Lucene doesn't yet use RAMDir (but LUCENE-1313 is working towards
        fixing that).

        Ie, Zoie is "late binding" (filters out UIDs as it encounters them
        during searching), while Lucene is "early binding" (immediately
        resolves UIDs -> docIDs during reopen). And because Zoie does the
        "resolve deletions to docIDs" in the BG, it's not on any query's
        execution path.

        How does Zoie resolve UID -> docID, now? I remember a thread about
        this a while back...

        Actually one simple fix we could make to Lucene is to resolve
        deletions in the foreground, when the deleteDocuments is called.
        This'd mean it's the thread that does the updateDocument that pays the
        price, rather than a future reopen. Net/net it's a zero sum game
        (just distributed the cost from the reopen to the indexing), but it'd
        mean the reopen time is minimized, which is clearly the direction we
        want to go in. I'll open a new issue.

        So are you using this, only, as your deleted docs? Ie you don't store the deletions with Lucene? I'm getting confused if this is only for the NRT case, or, in general.

        These are only to augment the deleted docs of the disk reader - the disk reader isn't reopened at all except infrequently - once a batch (a big enough RAMDirectory is filled, or enough time goes by, depending on configuration) is ready to be flushed to disk, diskReader.addIndexes is called and when the diskReader is reopened, the deletes live in the normal diskReader's delete set. Before this time is ready, when there is a batch in ram that hasn't been flushed, the IntSetAccelerator is applied to the not-reopened diskReader.

        I think I now understand it (plus John's comment that these are Zoie's
        UIDs not Lucene's docIDs, helped)...

        When a doc needs to be updated, you index it immediately into the
        RAMDir, and reopen the RAMDir's IndexReader. You add its UID to the
        AcceleratedIntSet, and all searches are "and NOT"'d against that set. You
        don't tell Lucene to delete the old doc, yet.

        Periodically, in the BG, you use addIndexes to push the RAMDir to
        disk, and, on a perhaps separate schedule, you resolve the deleted
        UIDs to docIDs and flush them to disk.
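
        A rough sketch of that update path, assuming Lucene 2.x-era calls and a hypothetical modSet of UIDs (this is not Zoie's actual code, just the shape of the idea):

import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class RamDirUpdatePath {
    private final RAMDirectory ramDir = new RAMDirectory();
    private final IndexWriter ramWriter;
    private final IntOpenHashSet modSet = new IntOpenHashSet(); // UIDs updated since last flush
    private volatile IndexReader ramReader;

    public RamDirUpdatePath() throws Exception {
        ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(),
                                    IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // "update" = add the new version to the RAM segment and remember its UID,
    // so searches can "and NOT" it out of the (not yet reopened) disk reader
    public synchronized void update(int uid, Document doc) throws Exception {
        doc.add(new Field("uid", Integer.toString(uid),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        ramWriter.addDocument(doc);
        ramWriter.commit();
        modSet.add(uid);
        // cheap reopen: the RAM segment is kept tiny; a real impl would also
        // close the previous reader and periodically flush RAMDir + deletes to disk
        ramReader = IndexReader.open(ramDir);
    }
}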

        One question: does Zoie preserve Lucene's "point in time" searching?
        Is a new deletion immediately visible to all past reopened readers? I
        think for Lucene we need to preserve this, so we need a data structure
        that can be "efficiently" transactional. I guess we could consider
        allowing an NRT to optionally violate this, in which case we wouldn't
        need to do any cloning of the deleted docs.

        It's a copy-on-read ThreadLocal.

        Hmm – why does each thread make a full clone of the
        AcceleratedBitSet? Just for thread safety against additions to the
        set? Or is this somehow preserving "point in time"? And it fully
        re-clones whenever new updates have been committed to the RAMDir?

        It's actually pretty fantastic performance - check out the zoie perf pages: http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

        These are great results! If I'm reading them right, it looks like
        generally you get faster query throughput, and roughly equal indexing
        throughput, on upgrading from 2.4 to 2.9?

        Zoie also gets much better performance than raw Lucene NRT, but this
        test focuses on reopen performance, I think? Ie, a query reopens the
        reader if any new docs were indexed? If you change that to, say,
        reopen once per N seconds, I wonder how the results would compare.

        One optimization you could make with Zoie is, if a real-time deletion
        (from the AcceleratedIntSet) is in fact hit, it could mark the
        corresponding docID, to make subsequent searches a bit faster (and
        save the bg CPU when flushing the deletes to Lucene).

        John Wang added a comment -

        Michael:

        I think I confused you by not giving you enough background information.

        IntSetAccelerator holds UIDs, not docIDs. For each segment, the TermDocs iterates its docIDs; we map each docID to its corresponding UID and then check it in the set.
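
        A minimal illustration of that per-segment check (hypothetical names; the docID-to-UID array would typically be loaded once per segment from a stored "uid" field or FieldCache):

import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

// Hypothetical per-segment filter: docToUid maps a segment-local docID to the
// application UID; deletedUids holds UIDs updated/deleted since the disk
// reader was opened.
public class UidDeleteCheck {
    private final int[] docToUid;
    private final IntOpenHashSet deletedUids;

    public UidDeleteCheck(int[] docToUid, IntOpenHashSet deletedUids) {
        this.docToUid = docToUid;
        this.deletedUids = deletedUids;
    }

    // called for each docID produced by the segment's TermDocs
    public boolean isDeleted(int docID) {
        return deletedUids.contains(docToUid[docID]);
    }
}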

        Hopefully this clears it up.

        -John

        Jake Mannix added a comment -

        But, I agree it's wasteful of space when deletes are so sparse... though it is fast.

        It's fast for random access, but it's really slow if you need to make a lot of these (either during heavy indexing if copy-on-write, or during heavy query load if copy-on-reopen).

        So are you using this, only, as your deleted docs? Ie you don't store the deletions with Lucene? I'm getting confused if this is only for the NRT case, or, in general.

        These are only to augment the deleted docs of the disk reader - the disk reader isn't reopened at all except infrequently - once a batch (a big enough RAMDirectory is filled, or enough time goes by, depending on configuration) is ready to be flushed to disk, diskReader.addIndexes is called and when the diskReader is reopened, the deletes live in the normal diskReader's delete set. Before this time is ready, when there is a batch in ram that hasn't been flushed, the IntSetAccelerator is applied to the not-reopened diskReader. It's a copy-on-read ThreadLocal.

        So I'm not sure if that described it correctly: only the deletes which should have been applied to the diskReader are treated separately - those are basically batched: for T amount of time or D amount of docs (configurable) whichever comes first, they are applied to the diskReader, which knows about Lucene's regular deletions and now these new ones as well. Once the memory is flushed to disk, the in-memory delSet is emptied, and applied to the diskReader using regular apis before reopening.

        OK, I think I'm catching up here... so you only open a new reader at the batch boundary right? Ie, a batch update (all its adds & deletes) is atomic from the readers standpoint?

        Yes - disk reader, you mean, right? This is only reopened at batch boundary.

        OK so a batch is quickly reopened, using bloom filter + int set for fast "contains" check for the deletions that occurred during that batch (and, custom TermDocs that does the "and not deleted"). This gets you your fast turnaround and decent search performance.

        The reopening isn't that quick, but it's in the background, or are you talking about the RAMDirectory? Yeah, that is reopened per query (if necessary - if there are no changes, of course no reopen), but it is kept very small (10k docs or less, for example). It's actually pretty fantastic performance - check out the zoie perf pages: http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

        Michael McCandless added a comment -

        Alas, I'm confused again. If your reader searching a given batch is
        holding deletions against past segments, how can you make a TermDocs
        that filters them? (The TermDocs only enumerates the current batch's
        docs).

        Or: do you make the IntSetAccelerator for each past segment that
        received deletions in the current batch?

        Michael McCandless added a comment -

        We do not hold the deleted set for a long period of time. I agree the memory cost is not a "killer" but it is tremendously wasteful, e.g. in a 10 M doc index where you have, say, 2 docs deleted, 0 and 9999999, you are representing them with 5M of memory (where you could have represented them with 2 ints, 8 bytes). Sure it is an extreme case; if you look at the avg number of deleted docs vs index size, it is usually sparse, hence we avoided this approach.

        Actually a 10 M doc index would be 1.25 MB BitVector right?
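
        (One bit per document: 10,000,000 / 8 = 1,250,000 bytes, i.e. about 1.25 MB - versus 8 bytes for the two ints.)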

        But, I agree it's wasteful of space when deletes are so
        sparse... though it is fast.

        So are you using this, only, as your deleted docs? Ie you don't store
        the deletions with Lucene? I'm getting confused if this is only for
        the NRT case, or, in general.

        Have you compared performance of this vs straight lookup in BitVector?

        We do not accumulate deletes. Deletes/updates are tracked per batch, and special TermDocs are returned to skip over the deleted/mod set for the given reader.

        OK, I think I'm catching up here... so you only open a new reader at
        the batch boundary right? Ie, a batch update (all its adds & deletes)
        is atomic from the readers standpoint?

        We simply call delete on the diskreader for each batch and internal readers are refreshed with deletes loaded in background.

        OK so a batch is quickly reopened, using bloom filter + int set for
        fast "contains" check for the deletions that occurred during that
        batch (and, custom TermDocs that does the "and not deleted"). This
        gets you your fast turnaround and decent search performance.

        In the BG the deletes are applied to Lucene "for real".

        This is similar to Marvin's original tombstone deletions, in that the
        deletions against old segments are stored with the new segment, rather
        than aggressively pushed to the old segment's bit vectors.

        John Wang added a comment -

        We do not hold the deleted set for a long period of time. I agree the memory cost is not a "killer" but it is tremendously wasteful, e.g. in a 10 M doc index where you have, say, 2 docs deleted, 0 and 9999999, you are representing them with 5M of memory (where you could have represented them with 2 ints, 8 bytes). Sure it is an extreme case; if you look at the avg number of deleted docs vs index size, it is usually sparse, hence we avoided this approach.

        We looked at the trade-offs between this and our approach; for us, it was a worthwhile trade-off.

        We do not accumulate deletes. Deletes/updates are tracked per batch, and special TermDocs are returned to skip over the deleted/mod set for the given reader.

        We simply call delete on the diskreader for each batch and internal readers are refreshed with deletes loaded in background.

        Michael McCandless added a comment -

        The issue of not using a BitSet/BitVector is not simply performance, but also memory cost.

        But don't you cutover all future searches to the newest NRT reader
        right away? Ie, those large bit vectors are all transient, so net mem
        cost at any given time should be well contained?

        Or... do you keep many readers (against different commit points)
        around for a longish time (longer than just allowing for the in-flight
        searches to complete)?

        In the case where deletes are sparse (expectedly so) and the index is large, BitSet/BitVector is not a good representation of a DocSet.

        I agree... it's done so "contains" is fast. But we had looked into
        "being sparse" (and using an iterator to "and not" the deleted docs,
        at the TermDocs level) and the performance was quite a bit worse...

        For deleted checks, it is usually of this pattern:
        Iterate thru my docs, for each doc i
        isDeletedCheck(doc i)

        So the cost is not really iterating the deleted set per se, it is the check.

        Right.

        We use bloomfilter to filter out negatives (mostly the case) and back it up with the underlying docset (normally an instance of openhashset) for positives.

        This makes sense, and it's a nice solution.

        But how do you make this transactional? Do you just make a full clone
        of IntSetAccelerator (& the underlying int set) on every reopen?

        Also, what policy/approach do you use to periodically fold the deletes
        back into the SegmentReader? As you accumulate more and more deletes
        in the IntSet, cloning becomes more costly.

        Code is at:

        http://code.google.com/p/zoie/source/browse/trunk/java/proj/zoie/api/impl/util/IntSetAccelerator.java

        Feel free to play with it and run some perf numbers on your data.

        Thanks!

        John Wang added a comment -

        The issue of not using a BitSet/BitVector is not simply performance, but also memory cost.
        In the case where deletes are sparse (expectedly so) and the index is large, BitSet/BitVector is not a good representation of a DocSet.
        For deleted checks, it is usually of this pattern:

        Iterate thru my docs, for each doc i
        isDeletedCheck(doc i)

        So the cost is not really iterating the deleted set per se, it is the check.

        We use bloomfilter to filter out negatives (mostly the case) and back it up with the underlying docset (normally an instance of openhashset) for positives.
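
        A minimal sketch of that bloom-filter-in-front-of-a-hash-set pattern (illustrative only, not the actual IntSetAccelerator code; the filter sizing and hash mixing here are arbitrary):

import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

// Membership test that answers most negatives from a small bit array and only
// falls through to the exact set when the bloom filter says "maybe".
public class BloomFilteredIntSet {
    private final long[] bloom;   // bloom filter bits
    private final int mask;       // bloom size in bits, minus one (power of two)
    private final IntOpenHashSet exact = new IntOpenHashSet();

    public BloomFilteredIntSet(int bloomBits) {
        // round up to a power-of-two number of bits (at least 128)
        int bits = Integer.highestOneBit(Math.max(64, bloomBits - 1)) << 1;
        bloom = new long[bits >>> 6];
        mask = bits - 1;
    }

    private int hash1(int v) { return (v * 0x9E3779B1) & mask; }
    private int hash2(int v) { return ((v ^ (v >>> 16)) * 0x85EBCA6B) & mask; }

    public void add(int v) {
        exact.add(v);
        int h1 = hash1(v), h2 = hash2(v);
        bloom[h1 >>> 6] |= 1L << (h1 & 63);
        bloom[h2 >>> 6] |= 1L << (h2 & 63);
    }

    public boolean contains(int v) {
        int h1 = hash1(v), h2 = hash2(v);
        if ((bloom[h1 >>> 6] & (1L << (h1 & 63))) == 0) return false;  // definite miss
        if ((bloom[h2 >>> 6] & (1L << (h2 & 63))) == 0) return false;  // definite miss
        return exact.contains(v);  // "maybe": confirm against the exact set
    }
}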

        Code is at:

        http://code.google.com/p/zoie/source/browse/trunk/java/proj/zoie/api/impl/util/IntSetAccelerator.java

        Feel free to play with it and run some perf numbers on your data.

        Michael McCandless added a comment -

        We did this in Zoie for a while, and it turned out to be a bottleneck - not as much of a bottleneck as continually cloning a bitvector (that was even worse), but still not good. We currently use a bloomfilter on top of an openintset, which performs pretty fantastically: constant-time adds and even-faster constant-time contains() checks, with small size (necessary for the new Reader per query scenario since this requires lots of deep-cloning of this structure).

        Good, real-world feedback – thanks! This sounds like a compelling
        approach.

        So the SegmentReader still had its full BitVector, but your OpenIntSet
        (what exactly is that?) + the bloom filter is then also checked when
        you enum the TermDocs? It's impressive this is fast enough... do you
        expect this approach to be faster than the paged "copy on write" bit
        vector approach?

        It also helped to not produce a docIdset iterator using these bits, but instead override TermDocs to be returned on the disk reader, and keep track of it directly there.

        The flex API should make this possible, without overriding TermDocs
        (just expose the Bits interface).

        Jake Mannix added a comment -

        Another approach we might take here is to track new deletions using a sorted tree. New deletions insert in O(log(N)) time, and then we can efficiently expose a DocIdSetIterator from that.

        We did this in Zoie for a while, and it turned out to be a bottleneck - not as much of a bottleneck as continually cloning a bitvector (that was even worse), but still not good. We currently use a bloomfilter on top of an openintset, which performs pretty fantastically: constant-time adds and even-faster constant-time contains() checks, with small size (necessary for the new Reader per query scenario since this requires lots of deep-cloning of this structure).

        Just a note from our experience over in zoie-land. It also helped to not produce a docIdset iterator using these bits, but instead override TermDocs to be returned on the disk reader, and keep track of it directly there.

        Michael McCandless added a comment -

        Another approach we might take here is to track new deletions using a
        sorted tree. New deletions insert in O(log(N)) time, and then we can
        efficiently expose a DocIdSetIterator from that.

        Then, every search would have this (if present) folded in as a "not"
        clause/filter.
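
        A rough sketch of that idea (names made up; the iterator mirrors the DocIdSetIterator contract of nextDoc/advance rather than extending the Lucene class, to keep the example self-contained):

import java.util.TreeSet;

// New deletions accumulate in a sorted tree (O(log N) insert) and are exposed
// through an iterator so a search could fold them in as a "not" clause/filter.
public class TreeDeletes {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final TreeSet<Integer> deleted = new TreeSet<Integer>();

    public synchronized void delete(int docID) {
        deleted.add(docID);
    }

    /** Snapshot the current deletions as a sorted iterator. */
    public synchronized DeletesIterator snapshotIterator() {
        final int[] docs = new int[deleted.size()];
        int i = 0;
        for (int d : deleted) docs[i++] = d;
        return new DeletesIterator(docs);
    }

    public static final class DeletesIterator {
        private final int[] docs;
        private int upto = -1;

        DeletesIterator(int[] docs) { this.docs = docs; }

        public int nextDoc() {
            upto++;
            return upto < docs.length ? docs[upto] : NO_MORE_DOCS;
        }

        public int advance(int target) {
            int doc;
            do { doc = nextDoc(); } while (doc < target);  // NO_MORE_DOCS terminates the loop
            return doc;
        }
    }
}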

        Only near real-time readers would have this.

        And then, somehow, periodically that tree would have to be merged into
        the real deleted docs if it ever got too big, or it was time to
        commit.

        This is a biggish change because suddenly IndexSearcher must be aware
        that a given SegmentReader is carrying near-real-time deletions, and
        build a BooleanQuery each time. Whereas MultiBitVector nicely keeps
        all this under-the-hood.

        Michael McCandless added a comment -

        Changing summary to match evolution of this issue...

        Michael McCandless added a comment -

        This looks like good progress! So it's a copy-on-write, by fixed page
        size (8 kbits, by default), BitVector.

        One danger of MultiBitVector is if one does a deletion against an
        already cloned instance, right? Because each instance only tracks
        refs back to the instance it was cloned from, not forward refs of
        other instances that have cloned it? So a clone of myself would
        incorrectly see changes that I make.

        That said, Lucene's internal use shouldn't ever do that – when we
        clone the deleted docs, the previous instance should never be touched
        again (the "write lock" moves to the new clone, just like when
        reopening a reader that's holding the write lock). Can you add
        assertions into MultiBitVector to verify this? And explain in
        javadocs that once you clone it, it's frozen.

        Also, I don't think we should force SegmentReader to always use MBV,
        unless we're sure the perf hit is negligible? Can we somehow
        conditionalize that?

        What remains here? Ie what tests fail & why? (Or, why isn't it
        committable?). If you can get it to a committable state, I can run
        some perf tests...

        In LUCENE-1458, the new flex API uses a simple interface (called
        "Bits") to represent docs that should be skipped, and when you ask for
        the DocsEnum, you pass in your "Bits skipDocs". This will be
        important for LUCENE-1536, but also important for this issue because
        it'll make swapping in different Bits impls easy.
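
        For context, a skip-docs interface of that shape is essentially just a random-access bit test - sketched here from the description above, not copied from the flex branch:

// Minimal "Bits"-style contract as described above: a random-access test of
// whether a docID should be skipped. Any deleted-docs representation
// (BitVector, paged bit vector, UID set, ...) can sit behind it.
public interface Bits {
    boolean get(int index);
}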

        Jason Rutherglen added a comment -

        This moves us toward more efficient memory usage in heavy
        near realtime search apps, because we don't have to reallocate
        an entire byte[] equal to the maxDoc of the segment(s) that
        have new deletes.

        • A 2-dimensional byte array is used where each actual array is
          1024 bytes in length (see the sketch after this list).
        • A refs boolean array keeps track of which arrays need to
          be copied when a bit is set (copy on ref).
        • The code can be benchmarked against the existing BV
          implementation for any possible speed slowdown due to the extra
          array lookup (probably minimal to nothing)
        • Need to implement the dgaps encoding
        • Code isn't committable
        • Still need to run all tests, however TestMultiBitVector passes
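
        A rough sketch of the copy-on-write paging described in the list above (assuming 1024-byte pages, i.e. 8 kbits; illustrative only, not the patch itself):

// Paged, copy-on-write bit vector: the bits live in fixed 1024-byte pages.
// clone() shares all pages; the first set() on a shared page copies just that
// page, so cloning a reader's deleted docs no longer copies the whole byte[].
public class PagedBitVector implements Cloneable {
    private static final int PAGE_BYTES = 1024;       // 8192 bits per page
    private static final int PAGE_BITS = PAGE_BYTES * 8;

    private byte[][] pages;
    private boolean[] owned;   // true if this instance may write the page in place

    public PagedBitVector(int numBits) {
        int numPages = (numBits + PAGE_BITS - 1) / PAGE_BITS;
        pages = new byte[numPages][];
        owned = new boolean[numPages];
        for (int i = 0; i < numPages; i++) {
            pages[i] = new byte[PAGE_BYTES];
            owned[i] = true;
        }
    }

    public void set(int bit) {
        int page = bit / PAGE_BITS;
        if (!owned[page]) {                   // copy-on-write: clone only this page
            pages[page] = pages[page].clone();
            owned[page] = true;
        }
        int offset = bit - page * PAGE_BITS;
        pages[page][offset >> 3] |= (byte) (1 << (offset & 7));
    }

    public boolean get(int bit) {
        int page = bit / PAGE_BITS;
        int offset = bit - page * PAGE_BITS;
        return (pages[page][offset >> 3] & (1 << (offset & 7))) != 0;
    }

    @Override
    public PagedBitVector clone() {
        try {
            PagedBitVector copy = (PagedBitVector) super.clone();
            copy.pages = pages.clone();               // share the page contents...
            copy.owned = new boolean[pages.length];   // ...but don't own any of them
            // the original must also stop writing shared pages in place
            java.util.Arrays.fill(owned, false);
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}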
        Jason Rutherglen added a comment -

        This issue has somewhat changed course from relying on DocIdSet
        to having a BitVector implementation that is backed by a
        byte[][] instead of a byte[]. This will allow modification of
        parts of the array without allocating the entire array for
        clone/copy-on-write. This saves on byte[] allocation for large
        segments.

        We'll also need to benchmark vs. BitVector byte[].

        Michael McCandless added a comment -

        I don't think we should block 2.9 for this.

        Michael McCandless added a comment -

        I'm still creating performance tests; however, the system simply uses
        skipTo to obtain the next deleted doc, and the next deleted doc is
        cached in SegmentTermDocs. So there shouldn't be a slowdown?

        I was actually thinking more generally that this is the natural
        tradeoff one makes with realtime search. EG flushing a new segment
        every document or two will necessarily give worse indexing throughput
        than bulk indexing w/ large RAM buffer, but gives much faster
        turnaround on searching.

        But it sounds like you're talking specifically about the performance of
        switching to iterator access to deleted docs. I think the larger
        number of [harder-for-cpu-to-predict] if statements may be the cause
        of the slowdown once the percentage of deletes gets high enough? Also, if the
        underlying skipTo is a linear scan over a non-sparse representation
        then that's further cost.

        Michael McCandless added a comment -

        > I'd rather do realtime incremental column stride fields in general with norms being a specific case.

        That would be even better!

        Jason Rutherglen added a comment -

        > For realtime search, I think we can accept some slowdown of
        > search performance in exchange for very low latency turnaround when
        > adding/deleting docs.

        I'm still creating performance tests, however the system is simply
        using skipTo to obtain the next deleted doc; the next deleted doc is
        cached in SegmentTermDocs. So there shouldn't be a slowdown?

        Jason Rutherglen added a comment -

        > we also eventually need an incremental-copy-on-write data structure for norms.

        I'd rather do realtime incremental column stride fields in general with norms being a specific case.

        > if you want to bias normal relevance sorting to mixin some measure

        This seems to fit in with allowing alternative search algorithms to TF/IDF.

        Michael McCandless added a comment -

        In addition to deletions, we also eventually need an
        incremental-copy-on-write data structure for norms. Full
        copy-on-write for norms (which LUCENE-1314 will enable) is even more
        costly than full copy-on-write for deletions since each norm is 1 byte
        vs 1 bit for deletions.

        Though it seems less common that people are setting norms, so it's
        probably lower priority.

        But I can still see "real-time norm changing" as being quite useful,
        e.g. if you want to bias normal relevance sorting to mix in some measure
        of popularity (e.g. recording how often each document is clicked), it'd
        be good to have real-time norm updating for that.
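
        The same paging trick sketched earlier for deleted docs would apply
        here, only with a byte per document instead of a bit. A rough
        illustration, with invented names and an arbitrary page size (this is
        not a Lucene API):

        import java.util.Arrays;

        public class PagedNorms {
          private static final int PAGE_SIZE = 1024;        // norms (bytes) per page; arbitrary for this sketch
          private final byte[][] pages;
          private final boolean[] owned;                    // true if this instance may write the page in place

          public PagedNorms(byte[] norms) {                 // wrap an existing maxDoc-sized norms array
            int numPages = (norms.length + PAGE_SIZE - 1) / PAGE_SIZE;
            pages = new byte[numPages][];
            owned = new boolean[numPages];
            for (int i = 0; i < numPages; i++) {
              int start = i * PAGE_SIZE;
              pages[i] = Arrays.copyOfRange(norms, start, Math.min(start + PAGE_SIZE, norms.length));
              owned[i] = true;
            }
          }

          private PagedNorms(byte[][] sharedPages) {
            this.pages = sharedPages;
            this.owned = new boolean[sharedPages.length];
          }

          public byte get(int doc) {
            return pages[doc / PAGE_SIZE][doc % PAGE_SIZE];
          }

          public void set(int doc, byte norm) {             // e.g. folding a click-based popularity boost into the norm
            int page = doc / PAGE_SIZE;
            if (!owned[page]) {
              pages[page] = pages[page].clone();            // copy only the 1024-byte page being changed
              owned[page] = true;
            }
            pages[page][doc % PAGE_SIZE] = norm;
          }

          public PagedNorms copyOnWriteClone() {
            Arrays.fill(owned, false);                      // after cloning, both sides copy a page before writing it
            return new PagedNorms(pages.clone());
          }
        }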

        Michael McCandless added a comment -

        Also, an in-RAM binary tree representation can be nicely transactional (there
        was a recent thread on java-dev about this), so that on clone we would
        make a very low cost clone of the deleted docs which we could then change
        w/o affecting the original. I.e. the 'copy on write' is still done, but it's not all
        done up front – each insert would copy only the nodes it needs, so the cost
        is amortized.
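
        One way to picture this: a persistent (path-copying) search tree of
        deleted docIDs, where an insert allocates new copies of only the nodes
        on the path from the root and shares everything else with the previous
        version. A bare sketch that ignores balancing (a real structure would
        want a red-black or similar variant):

        final class PersistentIntTree {
          private static final class Node {
            final int value;
            final Node left, right;
            Node(int value, Node left, Node right) { this.value = value; this.left = left; this.right = right; }
          }

          private final Node root;
          private PersistentIntTree(Node root) { this.root = root; }
          static PersistentIntTree empty() { return new PersistentIntTree(null); }

          /** Returns a new tree containing docID; the original tree is untouched. */
          PersistentIntTree insert(int docID) { return new PersistentIntTree(insert(root, docID)); }

          private static Node insert(Node n, int docID) {
            if (n == null) return new Node(docID, null, null);
            if (docID < n.value) return new Node(n.value, insert(n.left, docID), n.right);   // copy path node
            if (docID > n.value) return new Node(n.value, n.left, insert(n.right, docID));   // copy path node
            return n;                                                                        // already present
          }

          boolean contains(int docID) {
            Node n = root;
            while (n != null) {
              if (docID == n.value) return true;
              n = docID < n.value ? n.left : n.right;
            }
            return false;
          }
        }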

        Michael McCandless added a comment -

        > For Lucene, I think the SegmentReader should lazily create an internal
        > structure to hold the deleted doc IDs on the first search.

        This is basically doing the copy-on-write, which for realtime search
        we're wanting to avoid. But as long as this is a sparse structure
        (sorted list of deleted docIDs, assuming not many deletes accumulate
        in RAM) it should be OK.

        I also think for Lucene we could leave the index format unchanged
        (which means commit() is still more costly than it need be, but I'm
        not sure that's too serious), and use tombstones/list-of-sorted-docIDs
        representation only in RAM.

        For realtime search, I think we can accept some slowdown of search
        performance in exchange for very low latency turnaround when
        adding/deleting docs.

        But I think these decisions (the approach we take here) are very much
        dependent on what we learn from the performance tests from
        LUCENE-1476.
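
        The sparse in-RAM structure mentioned above could be as simple as a
        sorted array of the buffered deleted docIDs, built lazily on the first
        search and checked with binary search, and folded into a dense bit set
        once it grows too large. A minimal sketch with illustrative names:

        import java.util.Arrays;
        import java.util.BitSet;

        class SparseDeletedDocs {
          private final int[] sortedDeletedDocIDs;   // sorted, duplicate-free

          SparseDeletedDocs(int[] deletedDocIDs) {
            int[] copy = deletedDocIDs.clone();
            Arrays.sort(copy);
            this.sortedDeletedDocIDs = copy;
          }

          boolean isDeleted(int doc) {
            return Arrays.binarySearch(sortedDeletedDocIDs, doc) >= 0;
          }

          /** Fold these deletions into a dense bit set once there are too many of them. */
          void applyTo(BitSet dense) {
            for (int doc : sortedDeletedDocIDs) {
              dense.set(doc);
            }
          }
        }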

        Marvin Humphrey added a comment -

        > A tombstone merge policy needs to be defined to determine when to
        > merge tombstone DocIdSets into a new deleted docs BitVector as too
        > many tombstones would eventually be detrimental to performance.

        For Lucene, I think the SegmentReader should lazily create an internal
        structure to hold the deleted doc IDs on the first search. It would either be
        an integer array (if there are few deletions) or a BitVector (if there are
        many), and it would be created by recording the output of a priority queue
        merging multiple tombstone streams.

        Subsequent calls would not require the priority queue, but would use an
        iterator wrapper around the shared int array / BitVector.

        For Lucy/KS, if we are to stick with the "cheap IndexReader" model, we'll want
        to keep using the priority queue. Thus, it will be important to keep the
        deletion rate for the whole index down, in order to minimize priority queue
        iteration costs. One possibility is to have the default merge policy
        automatically consolidate any segment as soon as its deletion rate climbs over
        10%. That's pretty aggressive, but it keeps the search-time deletions
        iteration costs reasonably low, and it's in the spirit of "few writes, many
        reads".

        > A probable implementation will merge tombstones based on the number of
        > tombstones and the total number of documents in the tombstones. The merge
        > policy may be set in the clone/reopen methods or on the IndexReader.

        Would it make sense to realize a new integer array / BitVector on the first
        search after any new tombstone rows are added which affect the segment in
        question?
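
        A sketch of the priority-queue merge described above, assuming each
        tombstone stream is a sorted, duplicate-free int[] of deleted docIDs.
        The class and method names are invented; the real choice between an
        int array and a BitVector for the merged result would depend on how
        many deletions come out of the merge.

        import java.util.Arrays;
        import java.util.PriorityQueue;

        class TombstoneMerger {
          /** Merge sorted, duplicate-free tombstone arrays into one sorted, duplicate-free array. */
          static int[] merge(int[][] tombstoneStreams) {
            // Queue entries are {docID, streamIndex, positionInStream}, ordered by docID.
            PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
            int total = 0;
            for (int i = 0; i < tombstoneStreams.length; i++) {
              total += tombstoneStreams[i].length;
              if (tombstoneStreams[i].length > 0) {
                pq.add(new int[] { tombstoneStreams[i][0], i, 0 });
              }
            }
            int[] merged = new int[total];
            int count = 0;
            int last = -1;
            while (!pq.isEmpty()) {
              int[] top = pq.poll();
              if (top[0] != last) {                 // drop duplicates across streams
                merged[count++] = top[0];
                last = top[0];
              }
              int stream = top[1], pos = top[2] + 1;
              if (pos < tombstoneStreams[stream].length) {
                pq.add(new int[] { tombstoneStreams[stream][pos], stream, pos });
              }
            }
            return count == merged.length ? merged : Arrays.copyOf(merged, count);
          }
        }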


          People

          • Assignee: Unassigned
          • Reporter: Jason Rutherglen
          • Votes: 0
          • Watchers: 3


              Time Tracking

              • Original Estimate: 168h
              • Remaining Estimate: 168h
              • Time Spent: Not Specified
