Lucene - Core
LUCENE-2324

Per thread DocumentsWriters that write their own private segments

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Realtime Branch
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      See LUCENE-2293 for motivation and more details.

      I'm copying here Mike's summary he posted on 2293:

      Change the approach for how we buffer in RAM to a more isolated
      approach, whereby IW has N fully independent RAM segments
      in-process and when a doc needs to be indexed it's added to one of
      them. Each segment would also write its own doc stores and
      "normal" segment merging (not the inefficient merge we now do on
      flush) would merge them. This should be a good simplification in
      the chain (eg maybe we can remove the *PerThread classes). The
      segments can flush independently, letting us make much better
      concurrent use of IO & CPU.
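
      For illustration, a toy sketch of that design (invented class names, not the actual IndexWriter/DocumentsWriter code): the writer holds N fully independent per-thread buffers, each document lands in exactly one of them, and each buffer flushes its own segment when it gets big enough.

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.atomic.AtomicInteger;

      /** Toy model of the DWPT idea: N private in-RAM segments, flushed independently. */
      class ToyWriter {

        /** Stand-in for a DocumentsWriterPerThread: a fully private in-RAM segment. */
        static class ToyDWPT {
          final List<String> bufferedDocs = new ArrayList<>();
          long ramUsedBytes;

          synchronized void addDocument(String doc) {
            bufferedDocs.add(doc);
            ramUsedBytes += doc.length() * 2L;          // crude RAM estimate
          }

          /** Writes this buffer as its own segment; no cross-thread merge happens at flush time. */
          synchronized String flush(int segmentNumber) {
            String segmentName = "_" + segmentNumber + " (" + bufferedDocs.size() + " docs)";
            bufferedDocs.clear();
            ramUsedBytes = 0;
            return segmentName;
          }
        }

        private final ToyDWPT[] perThread;
        private final AtomicInteger nextSegment = new AtomicInteger();
        private final AtomicInteger round = new AtomicInteger();

        ToyWriter(int numThreadStates) {
          perThread = new ToyDWPT[numThreadStates];
          for (int i = 0; i < perThread.length; i++) perThread[i] = new ToyDWPT();
        }

        /** Each doc goes to exactly one DWPT; round-robin here stands in for thread affinity. */
        void addDocument(String doc) {
          ToyDWPT dwpt = perThread[Math.floorMod(round.getAndIncrement(), perThread.length)];
          dwpt.addDocument(doc);
          if (dwpt.ramUsedBytes > 64) {                 // tiny flush threshold, just for the demo
            System.out.println("flushed " + dwpt.flush(nextSegment.getAndIncrement()));
          }
        }

        public static void main(String[] args) {
          ToyWriter w = new ToyWriter(3);
          for (int i = 0; i < 20; i++) w.addDocument("doc body number " + i);
        }
      }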

      1. ASF.LICENSE.NOT.GRANTED--lucene-2324.patch
        168 kB
        Michael Busch
      2. lucene-2324.patch
        310 kB
        Michael Busch
      3. LUCENE-2324.patch
        6 kB
        Jason Rutherglen
      4. LUCENE-2324.patch
        2 kB
        Jason Rutherglen
      5. LUCENE-2324.patch
        16 kB
        Jason Rutherglen
      6. LUCENE-2324.patch
        61 kB
        Jason Rutherglen
      7. LUCENE-2324-SMALL.patch
        18 kB
        Jason Rutherglen
      8. LUCENE-2324-SMALL.patch
        12 kB
        Jason Rutherglen
      9. LUCENE-2324-SMALL.patch
        7 kB
        Jason Rutherglen
      10. LUCENE-2324-SMALL.patch
        6 kB
        Jason Rutherglen
      11. LUCENE-2324-SMALL.patch
        3 kB
        Jason Rutherglen
      12. test.out
        131 kB
        Jason Rutherglen
      13. test.out
        56 kB
        Jason Rutherglen
      14. test.out
        33 kB
        Jason Rutherglen
      15. test.out
        25 kB
        Jason Rutherglen

        Issue Links

          Activity

          Simon Willnauer added a comment -

          we land this on trunk via LUCENE-3023

          Simon Willnauer added a comment -

          guys I opened LUCENE-3023 to land on trunk! can I close this and we iterate on LUCENE-3023 from now on?

          simon

          Jason Rutherglen added a comment -

          Ie, if a given term X occurs in 6 DWPTs (today) then we merge-sort the docIDs from the postings of that term, which is costly. (The "normal" merge that will merge these DWPTs after this issue lands just appends by docID).

          Right, this is the same principal motivation behind implementing DWPTs for use with realtime search, eg, the doc-id interleaving is too expensive to be performed at query time.

          Michael McCandless added a comment -

          The slowdown could have been due to the merge sort by docID that we do today on flush.

          Ie, if a given term X occurs in 6 DWPTs (today) then we merge-sort the docIDs from the postings of that term, which is costly. (The "normal" merge that will merge these DWPTs after this issue lands just appends by docID).

          So maybe after this lands we'll see only faster performance the larger the RAM buffer. That would be nice!
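
          To make the difference concrete, a small self-contained sketch (not Lucene code): flushing the old way needs a k-way merge of one term's docIDs across the in-RAM buffers, while merging DWPT-flushed segments only shifts each segment's docIDs by a docBase and appends.

          import java.util.ArrayList;
          import java.util.List;
          import java.util.PriorityQueue;

          /** Illustration of why per-term docID merge-sort at flush is costlier than appending at merge. */
          class PostingsMergeDemo {

            /** Old-style flush: one term's docIDs are interleaved across buffers, so a k-way merge is needed. */
            static List<Integer> mergeSortPostings(List<List<Integer>> postingsPerBuffer) {
              PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
              for (int i = 0; i < postingsPerBuffer.size(); i++) {
                if (!postingsPerBuffer.get(i).isEmpty()) {
                  pq.add(new int[] {postingsPerBuffer.get(i).get(0), i, 0});   // {docID, buffer, position}
                }
              }
              List<Integer> merged = new ArrayList<>();
              while (!pq.isEmpty()) {
                int[] top = pq.poll();
                merged.add(top[0]);
                List<Integer> src = postingsPerBuffer.get(top[1]);
                if (top[2] + 1 < src.size()) {
                  pq.add(new int[] {src.get(top[2] + 1), top[1], top[2] + 1});
                }
              }
              return merged;
            }

            /** DWPT-style merge: each flushed segment owns its own docID space, so postings just append after a docBase shift. */
            static List<Integer> appendPostings(List<List<Integer>> postingsPerSegment, int[] segmentSizes) {
              List<Integer> merged = new ArrayList<>();
              int docBase = 0;
              for (int i = 0; i < postingsPerSegment.size(); i++) {
                for (int docID : postingsPerSegment.get(i)) merged.add(docBase + docID);
                docBase += segmentSizes[i];
              }
              return merged;
            }

            public static void main(String[] args) {
              // Today: one global docID space split across thread states -> interleaved per-term postings.
              System.out.println(mergeSortPostings(List.of(List.of(0, 3, 7), List.of(1, 6), List.of(2, 5))));
              // With DWPTs: each flushed segment has its own docID space; merging shifts and appends.
              System.out.println(appendPostings(List.of(List.of(0, 3), List.of(1, 2), List.of(0, 5)), new int[] {8, 4, 6}));
            }
          }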

          Jason Rutherglen added a comment -

          I think making a different data structure to hold low-DF terms would actually be a big boost in RAM efficiency. The RAM-per-unique-term is fairly high...

          However we're not sure why a largish 1+ GB RAM buffer seems to slow down? If we're round-robin indexing against the DWPTs I think they'll have a similar number of unique terms as today, even though each DWPT will be smaller in total size since each contains 1/Nth of the docs.

          Michael McCandless added a comment -

          Concurrency today is similar to DWPT in that we simply write into multiple segments in RAM.

          But then, on flush, we do a merge (in RAM) of these segments and write a single on-disk segment.

          Vs this change which instead writes N on-disk segments and lets "normal" merging merge them.

          I think making a different data structure to hold low-DF terms would actually be a big boost in RAM efficiency. The RAM-per-unique-term is fairly high...

          Jason Rutherglen added a comment -

          Because 1) the RAM efficiency ought to scale up very well, as you see a given term in more and more docs (hmm, though, maybe not, because from Zipf's law, half your terms will be singletons no matter how many docs you index), and 2) less merging is required.

          I'm not sure how we handled concurrency on the terms hash before, however with DWPTs there won't be contention regardless. It'd be nice if we could build 1-2 GB segments in RAM; I think that would greatly reduce the number of merges that are required downstream. Eg, then there's less need for merging by size, and most merges would be caused by the number/percentage of deletes. If it turns out the low-DF terms are causing the slowdown, maybe there is a different hashing system that could be used.

          Michael McCandless added a comment -

          Is the max optimal DWPT size related to the size of the terms hash, or is it likely something else?

          Bigger really should be better I think.

          Because 1) the RAM efficiency ought to scale up very well, as you see a given term in more and more docs (hmm, though, maybe not, because from Zipf's law, half your terms will be singletons no matter how many docs you index), and 2) less merging is required.

          I'm not sure why in the past perf seemed to taper off and maybe get worse after the RAM buffer was over 256 MB... we should definitely re-test.

          Jason Rutherglen added a comment -

          Is the max optimal DWPT size related to the size of the terms hash, or is it likely something else?

          Michael McCandless added a comment -

          One nice side effect of DWPT is it should allow us to get over the 2GB RAM buffer limit, assuming you use multiple threads.

          Ie I can set my RAM buffer to 10 GB, and if I'm using 5 threads, it should work.

          Not sure it's really meaningful in practice, since in past tests I haven't seen any gains over ~128 or 256 MB buffer... but maybe that's changed now.

          Simon Willnauer added a comment -

          Outstanding issues are to fix the updateDocument() problems, and finish flush-by-RAM (LUCENE-2573).

          Seems like LUCENE-2573 is more isolated than the updateDocument() issue so I think I can spend time on that one without interfering with what you are working on. I might need some time to get into what has been done so far, might come back here or on the list if I have questions.

          Michael Busch added a comment -

          Can anyone gimme a quick statement about what is left here or what the status of this issue is? I am at the point where I need to do some rather big changes to DocValues which I would not need if we have DWPT so I might rather help here before wasting time.

          I think it's very close! The new deletes approach is implemented, and various bugs are fixed. Also the latest trunk is merged in (including LUCENE-2881).
          Outstanding issues are to fix the updateDocument() problems, and finish flush-by-RAM (LUCENE-2573).

          Other than TestStressIndexing2 and TestNRTThreads (updateDocument problem) and a few tests that rely on flush-by-RAM, all core and contrib tests are passing now.

          Michael Busch added a comment -

          Somehow, we have to let each DWPT have some privacy, but, the field name -> number binding should be "global". I think Simon is going to open a separate issue to make something possible along these lines...

          This is done now (LUCENE-2881) and merged into the RT branch.

          The current plan w/ deletes is that a delete gets buffered 1) into the global pool (stored in DW and pushed whenever any DWPT flushes), as well as 2) per DWPT. The per-DWPT pools apply only to the segment flushed from that DWPT, while the global pool applies during coalescing (ie to all "prior" segments).

          I implemented and committed this approach. It's looking pretty good - almost all tests pass. Only TestStressIndexing2 is sometimes failing - but only when updateDocument() is called, not when I modify the test to only use add, delete-by-term and delete-by-query.

          To avoid the full-stop, I think during the flush we can have two global delete pools. We carefully sweep all DWPTs and flush each, in succession. Any DWPT not yet flushed is free to continue indexing as normal, putting deletes into the first global pool, flushing as normal. But, a DWPT that has been flushed by the "sweeper" must instead put deletes for an updateDocument carefully into the 2nd pool, and not buffer the delete into DWPTs not yet flushed.

          I haven't done this yet - it might fix the failing test I described.
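
          A hedged sketch of the two-pool bookkeeping described above (names invented; it also omits the invariant that no swept, "post commit point" DWPT may flush again until the sweeper has visited every remaining "pre commit point" DWPT):

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          /** Sketch of the two global delete pools used while a sweeper flushes DWPTs one by one. */
          class TwoPoolSketch {
            enum DwptState { NOT_YET_SWEPT, SWEPT }

            private final List<String> preCommitPool = new ArrayList<>();    // applies to the commit being built
            private final List<String> postCommitPool = new ArrayList<>();   // carried over to after the commit
            private final Map<String, DwptState> dwptState = new HashMap<>();

            /** Route the delete half of an updateDocument() based on whether its DWPT was already swept. */
            synchronized void bufferDelete(String dwptName, String deleteTerm) {
              if (dwptState.getOrDefault(dwptName, DwptState.NOT_YET_SWEPT) == DwptState.NOT_YET_SWEPT) {
                preCommitPool.add(deleteTerm);
              } else {
                postCommitPool.add(deleteTerm);   // must not leak into segments flushed before the commit point
              }
            }

            /** The sweeper marks a DWPT as flushed for this commit point. */
            synchronized void markSwept(String dwptName) {
              dwptState.put(dwptName, DwptState.SWEPT);
            }
          }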

          Simon Willnauer added a comment -

          Can anyone gimme a quick statement about what is left here or what the status of this issue is? I am at the point where I need to do some rather big changes to DocValues which I would not need if we have DWPT so I might rather help here before wasting time.

          Jason Rutherglen added a comment -

          I had to read this a few times, yes it's very elegant as we're skipping the postings that otherwise would be deleted immediately after flush, and we're reusing the terms map already in DWPT.

          Michael McCandless added a comment -

          OK, I opened LUCENE-2897.

          Michael McCandless added a comment -

          Alternatively, we could keep a separate hash (as we have right now), but then, on flushing the segment, we apply the deletes as we flush.

          I'll open a separate issue for this – I think it's a low hanging fruit.

          Michael McCandless added a comment -

          Actually, instead of buffering the delete term against each DWPT, and pushing the buffered delete packets when we flush the DWPT...

          Why don't we simply resolve the delete "live"?

          See, TermsHashPerField holds all indexed terms in a hash table, so, lookup is very fast (much faster than what we do today, where segment first gets flushed, then we load an SR and use terms dict to seek to the right term).

          We could eg add a parallel int[] array (ie a new int field, per unique term) that'd hold the docIDUpto. On flush, we'd then skip any docIDs less than the docIDUpto for that term.

          Alternatively, we could keep a separate hash (as we have right now), but then, on flushing the segment, we apply the deletes as we flush.
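
          A minimal sketch of that parallel-array idea (hypothetical names, not the real TermsHashPerField): each unique term carries a docIDUpto, and flush simply skips postings below it.

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          /** Sketch of resolving buffered delete terms "live" against the in-RAM terms hash. */
          class LiveDeleteSketch {
            static final class TermState {
              final List<Integer> postings = new ArrayList<>();   // docIDs this term occurs in
              int docIDUpto;                                      // docIDs below this are deleted (0 = none)
            }

            private final Map<String, TermState> termsHash = new HashMap<>();
            private int nextDocID;

            void addDocument(String... terms) {
              int docID = nextDocID++;
              for (String term : terms) {
                termsHash.computeIfAbsent(term, t -> new TermState()).postings.add(docID);
              }
            }

            /** Delete-by-term: resolved against the hash right away instead of being carried to flush. */
            void bufferDeleteTerm(String term) {
              TermState state = termsHash.get(term);
              if (state != null) {
                state.docIDUpto = nextDocID;   // everything added so far under this term is deleted
              }
            }

            /** On flush, deleted postings are never written, rather than written and then marked deleted. */
            Map<String, List<Integer>> flush() {
              Map<String, List<Integer>> segment = new HashMap<>();
              termsHash.forEach((term, state) -> {
                List<Integer> kept = new ArrayList<>();
                for (int docID : state.postings) {
                  if (docID >= state.docIDUpto) kept.add(docID);
                }
                segment.put(term, kept);
              });
              return segment;
            }
          }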

          Jason Rutherglen added a comment -

          Even if we do full stop for commit and getReader, we've still gained concurrency on the batch indexing case.

          Right, I'm still worried this case has update doc concurrency issues.

          The current plan w/ deletes is that a delete gets buffered 1) into the global pool (stored in DW and pushed whenever any DWPT flushes), as well as 2) per DWPT. The per-DWPT pools apply only to the segment flushed from that DWPT, while the global pool applies during coalescing (ie to all "prior" segments).

          Ok, this is very clear now.

          We also must then ensure that no "post commit point" DWPT is allowed to flush until all "pre commit point" DWPTs have been visited by the sweeper.

          Ah, this's the key that should solve the edge cases.

          Michael McCandless added a comment -

          Where are we gaining flush concurrency then?

          Even if we do full stop for commit and getReader, we've still gained concurrency on the batch indexing case.

          So, I think we can actually avoid full stop for commit/getReader.

          The current plan w/ deletes is that a delete gets buffered 1) into the global pool (stored in DW and pushed whenever any DWPT flushes), as well as 2) per DWPT. The per-DWPT pools apply only to the segment flushed from that DWPT, while the global pool applies during coalescing (ie to all "prior" segments).

          To avoid the full-stop, I think during the flush we can have two global delete pools. We carefully sweep all DWPTs and flush each, in succession. Any DWPT not yet flushed is free to continue indexing as normal, putting deletes into the first global pool, flushing as normal. But, a DWPT that has been flushed by the "sweeper" must instead put deletes for an updateDocument carefully into the 2nd pool, and not buffer the delete into DWPTs not yet flushed.

          Basically, as the sweeper visits each DWPT, it's segregating them into "pre commit point" and "post commit point".

          We also must then ensure that no "post commit point" DWPT is allowed to flush until all "pre commit point" DWPTs have been visited by the sweeper.

          Jason Rutherglen added a comment -

          Jason I don't think special tracking of updateDocument deletes will work. EG, the document I'm replacing could be in another DWPT?

          Right. I think that's why we'd need to keep track of [unfortunately] the delTerms of all DWPTs per DWPT. Then when a DWPT flushes its deletes and documents, it'll flush delTerms to the other DWPTs. Whereas today, in the updateDoc call we're flushing the delTerm to all DWPTs (this could be too early), which would logically seem to break atomicity.

          we have to do a full stop on getReader or commit?

          Update doc isn't sync'd so we'd need to stop it as well? Where are we gaining flush concurrency then?

          Michael McCandless added a comment -

          Ahh, I see the problem now – because we don't do a full sync on flushing all DWPTs for getReader (or, commit), we can see the delete part of an updateDocument sneak into the "commit" without the corresponding add.

          I think this means we have to do a full stop on getReader or commit?

          Jason I don't think special tracking of updateDocument deletes will work. EG, the document I'm replacing could be in another DWPT?

          Jason Rutherglen added a comment -

          I think we have a logical issue with updateDoc in that we're adding the delTerm
          to all DWPTs, however we're flushing DWPTs un-synced. This means that a
          delTerm would be applied to a segment prior to the DWPT with the new doc being
          flushed. This can easily happen because even though getReader syncs on flush,
          doFlush is not synced, meaning another concurrent flush may begin, merges may
          be performed, all without the new doc being made available. The solution is
          probably as simple as treating updateDoc deletes as special and storing them
          in a Map<SegmentInfo/DWPT,<Term,Integer>> that belongs to the DWPT that
          received the new doc (instead of adding the delTerm to each DWPT).

          This updateDocDeletesMap would be flushed only when the DWPT [with the updated
          doc] is flushed, and if a DWPT no longer exists, we can add the
          delTerm to the last segment.

          Jason Rutherglen added a comment -

          The compilation errors are gone, but TestNRTThreads and TestStressIndexing2 are still failing. I think we need to implement Mike's idea:

          https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12984285&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12984285

          then retest.

          Is a test deadlocking somewhere? ant hasn't returned.

          Simon Willnauer added a comment -

          Somehow, we have to let each DWPT have some privacy, but, the field name -> number binding should be "global". I think Simon is going to open a separate issue to make something possible along these lines...

          Yep, I opened LUCENE-2881 for this

          Michael McCandless added a comment -

          I'm getting compilation errors after updating the RT branch – eg, BlockTermState is missing?

          Michael McCandless added a comment -

          One problem I realized in discussing stuff w/ Simon....

          The DWPT-private FieldInfos we now make are dangerous since they break bulk merging of stored fields and term vectors, I think?

          Somehow, we have to let each DWPT have some privacy, but, the field name -> number binding should be "global". I think Simon is going to open a separate issue to make something possible along these lines...

          I'll make a test case that asserts bulk merging is "working" w/ threaded indexing... would be good to know it's working properly today, too
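
          Roughly what such a global binding could look like (hypothetical classes; the real work went into LUCENE-2881): one synchronized registry hands out field numbers, while each DWPT keeps its own private per-field state.

          import java.util.HashMap;
          import java.util.Map;

          /** One shared registry hands out field numbers so segments from different DWPTs stay bulk-mergeable. */
          class GlobalFieldNumbers {
            private final Map<String, Integer> nameToNumber = new HashMap<>();

            /** Returns the existing number for this field name, or assigns the next free one. */
            synchronized int addOrGet(String fieldName) {
              Integer number = nameToNumber.get(fieldName);
              if (number == null) {
                number = nameToNumber.size();
                nameToNumber.put(fieldName, number);
              }
              return number;
            }
          }

          /** Per-DWPT view: field properties stay private, but the numbering comes from the shared registry. */
          class PrivateFieldInfos {
            private final GlobalFieldNumbers global;
            private final Map<String, Integer> localNumbers = new HashMap<>();

            PrivateFieldInfos(GlobalFieldNumbers global) {
              this.global = global;
            }

            int fieldNumber(String fieldName) {
              return localNumbers.computeIfAbsent(fieldName, global::addOrGet);
            }
          }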

          Jason Rutherglen added a comment -

          the cost of replaying the log, assuming the log is "smallish"

          This is recording and replaying the doc-ids? How/when does a previous BV become 'free' to be used by the next reader? What if they're open at the same time? And if it's a previous previous reader that's been closed, won't that be quite a few docids to save? Eg, a delete-by-query has removed thousands of docs, I guess we'd use System.arraycopy then. The most usual case is updateDocument with [N]RT, which'd generate few doc-ids.

          System.arraycopy, while fast, is still O(N)

          Right, the larger segments will really adversely affect performance, as they do today, however the indexing is so much slower with NRT + clone that it's not noticeable.

          Using RT/NRT shouldn't slow down searching

          Right! The cost needs to be in the indexing and/or reopen threads.

          Michael McCandless added a comment -

          Right, it'll eat into the RAM buffer but it's not extreme (or is it?!).

          I think this could be acceptable (2 bytes per doc) as long as it's only for the docIDs in IW's RAM buffer, and not for docs in flushed segments?

          I did propose that a while back, and I'm not sure why, but I don't think you were a big fan: LUCENE-1574

          Ugh, you're right! Back then I wasn't a fan.... but, back then I didn't realize we could also reuse the contents of the bit vector (not just the allocated RAM), using a replay log.

          Would this also be used for DW's deletes?

          It's tempting – but let's first see how it works out for the flushed segments.

          The paged approach I think'll have issues in a low reader latency enviro, ie, create overhead from all the changes. Whereas an array is fast to change, and fast to copy.

          You mean paged BV right? I think that, and more generally any transactional data structure (eg like Zoie's wrapped bloom filter / HashSet approach) is too much added cost for searching. Using RT/NRT shouldn't slow down searching, ie I prefer the cost be front loaded into the reopen than backloaded into all searches.

          Couldn't we simply use System.arraycopy and be done?

          Well... System.arraycopy, while fast, is still O(N). Yes, it has a small constant in front, but for a large index that cost will start to dominate. Vs the cost of replaying the log, assuming the log is "smallish", is linear in the number of deletes since this BV's last reader. Still I expect we'll need a hybrid approach – if the number of deletes in the log is too many then we fallback to System.arraycopy.

          Jason Rutherglen added a comment -

          Sorry, pre-coffee math, a short is 2 bytes! * 500,000 = ~1 MB. That's a little less painful.

          Jason Rutherglen added a comment -

          How would we handle wraparound (in a concurrent way)? Also, 16 fold increase in RAM usage is not cheap!

          Instantiate a new array, and the next reader's seqid is set to 0, while a second value is incremented to guarantee uniqueness of the reader.

          A short's 8 bytes * 500,000 docs (in the RAM buffer, is that a lot?) = ~4 MB? Right, it'll eat into the RAM buffer but it's not extreme (or is it?!).

          we drop the BV, let GC recycle it, allocate a new BV (same size), copy in nearly the same bits that we just discarded, set a few more bits.

          Right, that's probably our best option for DF, BV, norms, and any other similar array. I did propose that a while back, and I'm not sure why, but I don't think you were a big fan: LUCENE-1574

          Would this also be used for DW's deletes? At 16 times the minimum size of the sequence-ids then the pooled approach would allow the equivalent of 16 BVs! The paged approach I think'll have issues in a low reader latency enviro, ie, create overhead from all the changes. Whereas an array is fast to change, and fast to copy.

          also tracked their "state" ie what "delete gen" they were at, and then when we need a new BV for that same segment, we pull a free one, catch it up

          Couldn't we simply use System.arraycopy and be done?
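
          A minimal sketch of the sequence-id idea (hypothetical code; the wraparound question raised in this exchange is deliberately left out): each delete stamps the doc with the current sequence id, and a reader only remembers the sequence id it was opened at, so no per-reader clone of the array is needed.

          /** Sketch of sequence-id deletes: 2 bytes per doc instead of one cloned bit vector per reader. */
          class SeqIdDeletesSketch {
            private final short[] deleteSeq;   // index = docID; 0 = live, otherwise the sequence id of the delete
            private short nextSeq = 1;

            SeqIdDeletesSketch(int maxDoc) {
              deleteSeq = new short[maxDoc];
            }

            /** Records a delete and returns the sequence id it happened at. */
            synchronized short delete(int docID) {
              deleteSeq[docID] = nextSeq;
              return nextSeq++;
            }

            /** Sequence id a newly opened reader remembers as its snapshot. */
            synchronized short currentSnapshot() {
              return (short) (nextSeq - 1);
            }

            /** Deleted from the point of view of a reader opened at readerSeq? No cloning required. */
            boolean isDeleted(int docID, short readerSeq) {
              short stamp = deleteSeq[docID];
              return stamp != 0 && stamp <= readerSeq;
            }
          }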

          Michael McCandless added a comment -

          If we implemented sequence-id deletes using a short[], then we're only increasing the RAM usage by 16 times, though we then do not need to clone which can generate excessive garbage (in a high flush [N]RT enviro).

          How would we handle wraparound (in a concurrent way)? Also, 16 fold increase in RAM usage is not cheap!

          I think, instead, we should recycle the bit vectors? See, what we do now is really quite silly: we drop the BV, let GC recycle it, allocate a new BV (same size), copy in nearly the same bits that we just discarded, set a few more bits.

          If instead we had a pool that'd hold recently freed BVs (for a given segment), but, also tracked their "state" ie what "delete gen" they were at, and then when we need a new BV for that same segment, we pull a free one, catch it up (replay the deletes that had arrived since it was created), and use it, that's very fast? Ie the cost is about as good as it can be – incremental to the number of deletes actually recorded. And the added RAM is "a few bits" per doc, where exactly how many "a few" is is decided by your app, ie, how many in-flight readers it "tends" to keep open.
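
          A hedged sketch of that recycle-and-catch-up idea (invented names, not the real BitVector/SegmentReader code): freed vectors return to a pool along with the delete generation they are current to, and checkout replays only the deletes that arrived since.

          import java.util.ArrayDeque;
          import java.util.ArrayList;
          import java.util.BitSet;
          import java.util.Deque;
          import java.util.List;

          /** Sketch of pooling per-segment deleted-docs vectors and catching them up by delete generation. */
          class DeletedDocsPool {
            static final class PooledBits {
              final BitSet bits;
              int deleteGen;                     // how many entries of the delete log are already applied
              PooledBits(int maxDoc) { bits = new BitSet(maxDoc); }
            }

            private final List<Integer> deleteLog = new ArrayList<>();   // deleted docIDs in arrival order
            private final Deque<PooledBits> freeList = new ArrayDeque<>();
            private final int maxDoc;

            DeletedDocsPool(int maxDoc) {
              this.maxDoc = maxDoc;
            }

            synchronized void recordDelete(int docID) {
              deleteLog.add(docID);
            }

            /** Checks out a vector reflecting all deletes so far, replaying only the delta since it was freed. */
            synchronized PooledBits checkOut() {
              PooledBits bv = freeList.isEmpty() ? new PooledBits(maxDoc) : freeList.pop();
              for (int gen = bv.deleteGen; gen < deleteLog.size(); gen++) {
                bv.bits.set(deleteLog.get(gen));   // catch up: cost is linear in the number of new deletes
              }
              bv.deleteGen = deleteLog.size();
              return bv;
            }

            /** When the reader using this vector closes, the vector goes back into the pool. */
            synchronized void release(PooledBits bv) {
              freeList.push(bv);
            }
          }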

          Jason Rutherglen added a comment -

          We could/should go do it right now

          Nice!

          I think the buffered deletes will work fine for non-sequential merging - we'd do the same coalescing we do now, only applying deletes on-demand to the to-be-merged segs, etc.

          I think this is going to make IW deletes even more hairy and hard to understand! Though if we keep the option of using a BV for deletes then there's probably no choice. If we implemented sequence-id deletes using a short[], then we're only increasing the RAM usage by 16 times, though we then do not need to clone which can generate excessive garbage (in a high flush [N]RT enviro).

          Michael McCandless added a comment -

          In fact, with shared doc-store gone, what's holding up non-sequential merging?

          Nothing really! We could/should go do it right now... I think it should be trivial. Then, we should fixup our default MP to behave more like BSMP!! Immense segments are merged only pair wise, and no inadvertent optimizing...

          I think the buffered deletes will work fine for non-sequential merging – we'd do the same coalescing we do now, only applying deletes on-demand to the to-be-merged segs, etc.

          We just have to make sure the merged segment is appended to the end of the index (well, what was the end as of when the merge kicked off); this way I think we can continue w/ the invariant that buffered deletes apply to all segments to their "left"?

          Jason Rutherglen added a comment -

          When a delete arrives, we should buffer in each DWPT, but also buffer into the "global" deletes pool (held in DocumentsWriter).

          This'll work, however it seems like it's going to be a temporary solution if we implement sequence-ids properly and/or implement non-sequential merges. In fact, with shared doc-store gone, what's holding up non-sequential merging?

          Michael McCandless added a comment -

          OK I think Michael's example can be solved, with a small change to the delete buffering.

          When a delete arrives, we should buffer in each DWPT, but also buffer into the "global" deletes pool (held in DocumentsWriter).

          Whenever any DWPT is flushed, that global pool is pushed.

          Then, the buffered deletes against each DWPT are carried (as usual) along w/ the segment that's flushed from that DWPT, but those buffered deletes only apply to the docs in that one segment.

          The pushed deletes from the global pool apply to all prior segments (ie, they "coalesce").

          This way, the deletes that will be applied to the already flushed segments are aggressively pushed.

          Separately, I think we should relax the error semantics for updateDocument: if an aborting exception occurs (eg disk full while flushing a segment), then it's possible that the "delete" from an updateDocument will have applied but the "add" did not. Outside of error cases, of course, updateDocument will continue to be atomic (ie a commit() can never split the delete & add). Then the updateDocument case is handled as just an [atomic wrt flush] add & delete.
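
          A small sketch of that buffering scheme (class names invented): a delete goes both into each live DWPT's private buffer and into the global pool; when a DWPT flushes, its private deletes travel with the new segment while the pushed global pool is coalesced against all prior segments.

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          /** Sketch of buffering deletes per DWPT plus a global pool pushed on every DWPT flush. */
          class DeleteBuffersSketch {
            static final class FlushedDeletes {
              final List<String> forNewSegment;      // apply only to docs in the just-flushed segment
              final List<String> forPriorSegments;   // coalesced against every earlier segment
              FlushedDeletes(List<String> forNewSegment, List<String> forPriorSegments) {
                this.forNewSegment = forNewSegment;
                this.forPriorSegments = forPriorSegments;
              }
            }

            private final Map<String, List<String>> perDwptDeletes = new HashMap<>();   // dwpt name -> delete terms
            private final List<String> globalPool = new ArrayList<>();

            /** deleteDocuments(term): buffer into every live DWPT and into the global pool. */
            synchronized void bufferDelete(String term, Iterable<String> liveDwpts) {
              for (String dwpt : liveDwpts) {
                perDwptDeletes.computeIfAbsent(dwpt, k -> new ArrayList<>()).add(term);
              }
              globalPool.add(term);
            }

            /** On DWPT flush: private deletes travel with the new segment, the global pool is pushed. */
            synchronized FlushedDeletes onDwptFlush(String dwpt) {
              List<String> segmentPrivate = perDwptDeletes.remove(dwpt);
              List<String> pushedGlobal = new ArrayList<>(globalPool);
              globalPool.clear();
              return new FlushedDeletes(segmentPrivate == null ? new ArrayList<String>() : segmentPrivate, pushedGlobal);
            }
          }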

          Jason Rutherglen added a comment -

          we are wanting to allow out-of-order-merges soon

          Then the coalescing of deletes won't work.

          I'm starting to think there's no way around sequenceIds? Even without RT.

          This sounds about right, however would the sequence-ids be for all segments (instead of only for DW)?

          Michael McCandless added a comment -

          I like the grocery store analogy! Yes, just like that

          Every DWPT has its own private FieldInfos. When a segment is flushed the DWPT uses its private FI and then it updates the original DW.fieldInfos (from IW), which is a synchronized call.

          OK that sounds good.

          Hmm, given that IW.flush() isn't synchronized anymore I assume this can lead into a problem? E.g. the SegmentMerger gets a FieldInfos that's "newer" than the list of segments it's trying to flush?

          Yes... or, the FieldInfos changes (due to flush) while SegmentMerger is still merging. Probably SR should make a deep copy of the FieldInfos when it starts?

          DW has a SegmentDeletes (pendingDeletes) which gets pushed to the last segment. We only add delTerms to DW.pendingDeletes if we couldn't push it to any DWPT. Btw. I think the whole pushDeletes business isn't working correctly yet, I'm looking into it. I need to understand the code that coalesces the deletes better.

          OK

          How can it figure out which docs the deletes apply to? _1 and _2 are probably gone a long time ago. If we apply the deletes to all of _3 this would be a mistake too.

          Hmmm... I think you're right (this is a problem).

          I think we also have problems w/ updateDocument? That call is
          supposed be atomic (ie the del & addDoc can never be separately
          committed), but, I think if one DWPT (holding the del term) gets
          flushed but another (holding the del term and the added doc) aborts
          and then you commit, you could see the del "make it" but not the
          addDocument?

          Finally, we are wanting to allow out-of-order-merges soon... so
          that eg BSMP becomes much simpler to implement & bring to core,
          but, these buffered deletes also make that more complicated.

          Gonna have to mull on this...

          Michael Busch added a comment -

          So I'm wondering about the following problem with deletes:

          Suppose we open a new IW on an existing index with 2 segments _1 and _2. IW is set to maxBufferedDocs=1000. The app starts indexing with two threads, so two DWPTs are created. DWPT1 starts working on _3, DWPT2 on _4. Both "remember" that they must apply their deletes only to segments _1 and _2. After adding 500 docs thread 2 stops indexing for an hour, but thread 1 keeps working. While thread 2 is sleeping several segment flushes (_3, _5, _6, etc) happen.

          Now thread 2 wakes up again and adds another 500 docs, and also some deletes, so DWPT2 finally has to flush. How can it figure out which docs the deletes apply to? _1 and _2 are probably gone a long time ago. If we apply the deletes to all of _3 this would be a mistake too.

          I'm starting to think there's no way around sequenceIds? Even without RT.

          Michael Busch added a comment -

          Why does DW.anyDeletions need to be sync'd?

          Hmm good point. Actually only the call to DW.pendingDeletes.any() needs to be synced, but not the loop that calls the DWPTs.

          In ThreadAffinityDWTP... it may be better if we had a single queue,
          where threads wait in line, if no DWPT is available? And when a DWPT
          finishes it then notifies any waiting threads? (Ie, instead of queue-per-DWPT).

          Whole Foods instead of Safeway?
          Yeah that would be fairer. A large doc (= a full cart) wouldn't block unlucky other docs. I'll make that change, good idea!

          I see the fieldInfos.update(dwpt.getFieldInfos()) (in
          DW.updateDocument) - is there a risk that two threads bring a new
          field into existence at the same time, but w/ different config? Eg
          one doc omitsTFAP and the other doesn't? Or, on flush, does each DWPT
          use its private FieldInfos to correctly flush the segment? (Hmm: do
          we seed each DWPT w/ the original FieldInfos created by IW on init?).

          Every DWPT has its own private FieldInfos. When a segment is flushed the DWPT uses its private FI and then it updates the original DW.fieldInfos (from IW), which is a synchronized call.

          The only consumer of DW.getFieldInfos() is SegmentMerger in IW. Hmm, given that IW.flush() isn't synchronized anymore I assume this can lead into a problem? E.g. the SegmentMerger gets a FieldInfos that's "newer" than the list of segments it's trying to flush?

          How are we handling the case of open IW, do delete-by-term but no added docs?

          DW has a SegmentDeletes (pendingDeletes) which gets pushed to the last segment. We only add delTerms to DW.pendingDeletes if we couldn't push it to any DWPT. Btw. I think the whole pushDeletes business isn't working correctly yet, I'm looking into it. I need to understand the code that coalesces the deletes better.

          In DW.deleteTerms... shouldn't we skip a DWPT if it has no buffered docs?

          Yeah, I did that already, but not committed yet.

          Michael McCandless added a comment -

          The branch is looking very nice!! Very clean

          Random comments:

          Why does DW.anyDeletions need to be sync'd?

          Missing headers on at least DocumentsWriterPerThreadPool,
          ThreadAffinityDWTP.

          IWC.setIndexerThreadPool's javadoc is stale.

          On ThreadAffinityDWTP... it may be better if we had a single queue,
          where threads wait in line, if no DWPT is available? And when a DWPT
          finishes it then notifies any waiting threads? (Ie, instead of queue-per-DWPT).

          I see the fieldInfos.update(dwpt.getFieldInfos()) (in
          DW.updateDocument) – is there a risk that two threads bring a new
          field into existence at the same time, but w/ different config? Eg
          one doc omitsTFAP and the other doesn't? Or, on flush, does each DWPT
          use its private FieldInfos to correctly flush the segment? (Hmm: do
          we seed each DWPT w/ the original FieldInfos created by IW on init?).

          How are we handling the case of open IW, do delete-by-term but no
          added docs?

          Does DW.pushDeletes really need to sync on IW? BufferedDeletes is
          sync'd already.

          DW.substractFlushedDocs is mis-spelled (not sure it's used though).

          In DW.deleteTerms... shouldn't we skip a DWPT if it has no buffered
          docs?

          Michael Busch added a comment -

          I ran a quick perf test here: I built the 10M Wikipedia index,
          Standard codec, using 6 threads. Trunk took 541.6 sec; RT took 518.2
          sec (only a bit faster), but the test wasn't really fair because it
          flushed @ docCount=12870.

          Thanks for running the tests!
          Hmm that's a bit disappointing - we were hoping for more speedup.
          Flushing by docCount is currently per DWPT, so every initial segment
          in your test had 12870 docs. I guess there's a lot of merging happening.

          Maybe you could rerun with higher docCount?

          But I can't test flush by RAM - that's not working yet on RT right?

          True. I'm going to add that soonish. There's one thread-safety bug
          related to deletes that needs to be fixed too.

          Then I ran a single-threaded test. Trunk took 1097.1 sec and RT took
          1040.5 sec - a bit faster! Presumably in the noise (we don't expect
          a speedup?), but excellent that it's not slower...

          Yeah I didn't expect much speedup - cool! Maybe because some
          code is gone, like the WaitQueue, not sure how much overhead that
          added in the single-threaded case.

          I think we lost infoStream output on the details of flushing? I can't
          see when which DWPTs are flushing...

          Oh yeah, good point, I'll add some infoStream messages to DWPT!

          Jason Rutherglen added a comment -

          I can't test flush by RAM - that's not working yet on RT right?

          Right, we're only flushing by doc count, so we could be flushing segments that are too small? However, we can see some of the concurrency gains by not sync'ing on IW and allowing document updates to continue while flushing.

          Michael McCandless added a comment -

          I ran a quick perf test here: I built the 10M Wikipedia index,
          Standard codec, using 6 threads. Trunk took 541.6 sec; RT took 518.2
          sec (only a bit faster), but the test wasn't really fair because it
          flushed @ docCount=12870.

          But I can't test flush by RAM – that's not working yet on RT right?

          (The search results matched, which is nice!)

          Then I ran a single-threaded test. Trunk took 1097.1 sec and RT took
          1040.5 sec – a bit faster! Presumably in the noise (we don't expect
          a speedup?), but excellent that it's not slower...

          I think we lost infoStream output on the details of flushing? I can't
          see when which DWPTs are flushing...

          Jason Rutherglen added a comment -

          Ok, TestNRTThreads works after 10+ iterations. TestStressIndexing2 works most of the time; however, with enough iterations, eg, "ant test-core -Dtestcase=TestStressIndexing2 -Dtests.iter=30", it fails. I think that deletes are sneaking in because we're not sync'ed on DW as we're flushing the DWPT. Ideally some assertions would pick this up.

          Jason Rutherglen added a comment -

          Looks like TestNRTThreads is still sometimes failing; if I move the sync around then it passes but TestStressIndexing2 fails.

          Jason Rutherglen added a comment -

          Very nice! Looks like we needed all kinds of IW syncs? I noticed that in addition to TestStressIndexing2, TestNRTThreads was also failing. The attached patch fixes both by adding a sync on DW for deletes (and the update doc delete term). Time to add the RAM usage?

          Michael Busch added a comment -

          My last commit yesterday made almost all test cases pass.

          The ones that test flush-by-ram are still failing. Also TestStressIndexing2 still fails. The reason has to do with how deletes are pushed into bufferedDeletes. E.g. if I call addDocument() instead of updateDocument() in TestStressIndexing.IndexerThread then the test passes.

          I need to look more into that problem, but otherwise it's looking good and we're pretty close!

          Jason Rutherglen added a comment -

          Here's a potential plan for the locking. I think we still need a global lock for flush status, abort, and deletes. DocumentsWriterPerThreadPool has the getAndLock method so maybe we should add a global lock on it.

          1) Add DWPTP.setFlushStatus(DWPT)
          2) Add DWPTP.setAbortStatus(DWPT)
          3) Add DWPTP.getFlushStatus(DWPT)
          4) Add DWPTP.getAbortStatus(DWPT)
          5) Add DWPTP.getAndLock(Thread requestingThread, DocumentsWriter documentsWriter, Document doc, Term delTerm)
          6) Add DWPTP.[deleteTerm,deleteQuery]

          Where each of these methods acquires a lock on DWPTP.

          I think it'll look somewhat like FlushControl; however, the common parameter will be an idx to the appropriate DWPT, eg, getFlushPending(int idx). Alternatively, instead of placing this in DWPTP, DW is another possible candidate.
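
          Roughly, the pool-level locking described above might look like the following (a hypothetical sketch with made-up bodies, not code from the branch): the pool owns the per-DWPT flush/abort flags and guards them with its own monitor, keyed by DWPT index.

            // Hypothetical sketch of the proposed DWPTP-level locking; not branch code.
            class DocumentsWriterPerThreadPoolSketch {
              private final boolean[] flushPending;
              private final boolean[] aborting;

              DocumentsWriterPerThreadPoolSketch(int maxThreadStates) {
                flushPending = new boolean[maxThreadStates];
                aborting = new boolean[maxThreadStates];
              }

              synchronized void setFlushStatus(int dwptIdx, boolean pending) {
                flushPending[dwptIdx] = pending;
              }

              synchronized boolean getFlushStatus(int dwptIdx) {
                return flushPending[dwptIdx];
              }

              synchronized void setAbortStatus(int dwptIdx, boolean abort) {
                aborting[dwptIdx] = abort;
              }

              synchronized boolean getAbortStatus(int dwptIdx) {
                return aborting[dwptIdx];
              }
            }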

          Jason Rutherglen added a comment -

          Maybe getAndLock should accept a delTerm, lock every non-flushing, non-aborting DWPT with numDocs > 0, add the delTerm to each of them, and then unlock each locked DWPT? This is analogous to how trunk adds the delTerm in the synced DW.getThreadState method?
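
          A minimal sketch of that fan-out, assuming a hypothetical slot type that extends ReentrantLock (these are not the actual branch classes):

            // Hypothetical sketch; not code from the branch.
            import java.util.ArrayList;
            import java.util.List;
            import java.util.concurrent.locks.ReentrantLock;

            class DwptSlot extends ReentrantLock {
              volatile boolean flushing, aborting;
              volatile int numDocsInRAM;
              final List<String> bufferedDeleteTerms = new ArrayList<>();
            }

            class DeleteFanout {
              // lock each eligible DWPT one at a time, record the delete, release the lock
              static void applyDeleteTerm(List<DwptSlot> slots, String delTerm) {
                for (DwptSlot slot : slots) {
                  slot.lock();
                  try {
                    if (!slot.flushing && !slot.aborting && slot.numDocsInRAM > 0) {
                      slot.bufferedDeleteTerms.add(delTerm);
                    }
                  } finally {
                    slot.unlock();
                  }
                }
              }
            }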

          Jason Rutherglen added a comment -

          I just noticed this. I think having this line outside of any locking is probably not good for the concurrency of updateDoc, meaning all kinds of things, like flushes, can sneak in before we get to adding the delete to all DWPTs? This is part of what was tricky with LUCENE-2680; we had to keep the locking on DW for updateDoc. Maybe to test whether this is an issue we can assert the count of the deletes, ala FlushControl (which was added, I think, to ensure concurrency correctness)?

            // delete term from other DWPTs later, so that this thread
            // doesn't have to lock multiple DWPTs at the same time
            if (delTerm != null) {
              deleteTerm(delTerm, perThread);
            }
          
          Jason Rutherglen added a comment -

          I removed ThreadState, and DWPT now extends ReentrantLock.

          I added assertions, such as assert numDocsInRAM > 0 in the delete-terms path, which fails. I think we should focus on cleaning up the locking, ie, do we need to use synchronized in DWPT? Perhaps everything should use the RL?

          Jason Rutherglen added a comment -

          DW.deleteTerms iterates over the DWPTs without acquiring the ThreadState lock; instead DWPT.deleteTerms is synced (on the DWPT). I think if a flush is occurring then deletes can get in at the same time? I don't think BufferedDeletes supports that?

          Jason Rutherglen added a comment -

          In DW.flushAllThreads we're accessing indexWriter.segmentInfos while we're not synced on IW, so the segment infos vector may be changing as we're accessing it. I'm not sure how we can reasonably solve this; I don't think cloning the segment infos will work. In trunk, doFlush is sync'ed on IW and so doesn't run into these problems. Perhaps for the flush-all-threads case we should simply sync on IW?

          Jason Rutherglen added a comment -

          The TestStressIndexing2 errors remind me of what I saw when working on LUCENE-2680. I'll take a look. They weren't there in the previous revisions of this branch.

          Jason Rutherglen added a comment -

          Are you also merging trunk in? svn up yields a lot of updates.

          There are new test failures in: TestSnapshotDeletionPolicy

          Michael Busch added a comment -

          I just committed fixes for some failing tests.
          Eg. the addIndexes() problem is now fixed. The problem was that I had accidentally removed the following line in DW.addIndexes():

           // Update SI appropriately
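           // setDocStore() also calls clearFiles(), emptying the SegmentInfo's cached file names so they can be refilled with the new segment name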
           info.setDocStore(info.getDocStoreOffset(), newDsName, info.getDocStoreIsCompoundFile());
          

          info.setDocStore() calls clearFiles(), which empties a SegmentInfo-local cache of all filenames that belong to the corresponding segment. Since addIndexes() changes the segment name, it is important to refill that cache with the new file names.

          This was a sneaky bug. We should probably call clearFiles() explicitly there in addIndexes(). For now I added a comment.

          Jason Rutherglen added a comment -

          Also, why are we always (well, likely) assigning the DWPT to a different thread state if tryLock returns false? If there's a lot of contention (eg, far more incoming threads than DWPTs), then won't the thread assignment code become a hotspot?

          In ThreadAffinityDocumentsWriterThreadPool.clearThreadBindings(ThreadState perThread) we're actually clearing the entire map. When this is called in IW.flush (which is unsynced on IW), if there are multiple concurrent flushes, then perhaps a single DWPT is in use by multiple threads. To safeguard against this and perhaps more easily add an assertion, maybe we should lock on the DWPT rather than the ThreadState?

          Jason Rutherglen added a comment -

          The threadBindings hashmap is a ConcurrentHashMap and the getActivePerThreadsIterator() is threadsafe I believe.

          Sorry, yes, CHM is used and it all looks thread safe, but there must be multiple threads accessing a single DWPT at the same time for some of these errors to occur.

          Michael Busch added a comment -

          as we're iterating over ThreadStates and calling put on a non-concurrent hashmap while not holding a lock?

          The threadBindings hashmap is a ConcurrentHashMap and the getActivePerThreadsIterator() is threadsafe I believe.

          Jason Rutherglen added a comment -

          Also multiple threads can call DocumentsWriterPerThread.addDocument and that's resulting in this:

          [junit] java.lang.AssertionError: omitTermFreqAndPositions:false postings.docFreqs[termID]:0
              [junit]     at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:143)
              [junit]     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:234)
              [junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:91)
              [junit]     at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:274)
              [junit]     at org.apache.lucene.index.DocumentsWriterPerThread.addDocument(DocumentsWriterPerThread.java:184)
              [junit]     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:374)
              [junit]     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1403)
              [junit]     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1375)
          
          Jason Rutherglen added a comment -

          I'm taking a guess here; however, the ThreadAffinityDocumentsWriterThreadPool.getAndLock method looks a little suspicious, as we're iterating over ThreadStates and calling put on a non-concurrent hashmap while not holding a lock?

          Jason Rutherglen added a comment -

          Here's the latest test.out. There's a lot of these:

          [junit] junit.framework.AssertionFailedError: IndexFileDeleter doesn't know about file _5.fdt
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1156)
              [junit] 	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1088)
              [junit] 	at org.apache.lucene.index.IndexWriter.filesExist(IndexWriter.java:3273)
              [junit] 	at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:3321)
              [junit] 	at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2339)
              [junit] 	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2410)
              [junit] 	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1083)
              [junit] 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1027)
              [junit] 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:991)
              [junit] 	at org.apache.lucene.index.TestAddIndexes.testMergeAfterCopy(TestAddIndexes.java:432)
          
          Jason Rutherglen added a comment -

          I think there's something wrong with TermVectors - several related test cases fail. We need to investigate more.

          Yes, there are far more than in the last revision! I don't think it's just TV, however.

          Michael Busch added a comment -

          DWPT.perDocAllocator and freeLevel can be removed?

          done.

          DWPT's RecyclingByteBlockAllocator -> DirectAllocator?

          done. Also removed more recycling code.

          I don't think we need FlushControl anymore as the RAM tracking should occur in DW and there's no need for IW to [globally] wait for flushes.

          I removed flushControl from DW.

          I'm curious if the file not found errors are gone.

          I think there's something wrong with TermVectors - several related test cases fail. We need to investigate more.

          Jason Rutherglen added a comment -

          look at ThreadAffinityDocumentsWriterThreadPool. There I'm able to use things like tryLock() and getQueueLength().

          Makes sense, I had only read the DocumentsWriterPerThreadPool part.

          • DWPT.perDocAllocator and freeLevel can be removed?
          • DWPT's RecyclingByteBlockAllocator -> DirectAllocator?
          • Looks like the deletes handling is updated to the patch
          • I don't think we need FlushControl anymore as the RAM tracking should occur in DW and there's no need for IW to [globally] wait for flushes.
          • The locking is clearer now; I can see DW.updateDocument locks the ThreadState, as does flushAllThreads.

          I'll reincorporate the RAM tracking, and then will try the unit tests again. I'm curious if the file not found errors are gone.

          Michael Busch added a comment -

          How do I currently get the ..er.. current version?

          Just do 'svn up' on the RT branch.

          Regardless of everything else, I'd ask you not to extend random things

          This was a conscious decision, not random. Extending ReentrantLock is not an uncommon pattern, e.g. ConcurrentHashMap.Segment does exactly that. ThreadState basically is nothing but a lock that has a reference to the corresponding DWPT it protects.

          I encourage you to look at the code.

          Earwin Burrfoot added a comment -

          Maan, this comment list is infinite.
          How do I currently get the ..er.. current version? Latest branch + latest Jason's patch?

          Regardless of everything else, I'd ask you not to extend random things, at least if you can't say is-a about them.
          DocumentsWriterPerThreadPool.ThreadState IS A ReentrantLock? No. So you're better off encapsulating it rather than extending.
          Same can be applied to SegmentInfos that extends Vector :/

          Michael Busch added a comment -

          Really? That makes synchronized seem simpler?

          Well, look at ThreadAffinityDocumentsWriterThreadPool. There I'm able to use things like tryLock() and getQueueLength().
          Also, DocumentsWriterPerThreadPool has a getAndLock() method that can be used by DW for addDocument(), whereas DW.flush(), which needs to iterate the DWPTs, can lock the individual DWPTs directly. I think it's simpler, but I'm open to other suggestions of course.
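
          A stripped-down sketch of that pattern (hypothetical names and fields, not the branch code): the per-thread slot is itself the lock, so the addDocument path can tryLock() a preferred slot and fall back to any other free one, while a flush simply lock()s the slots it wants to take out of production. The caller unlocks the slot when it's done with it.

            // Hypothetical sketch; not code from the branch.
            import java.util.concurrent.locks.ReentrantLock;

            class ThreadStateSketch extends ReentrantLock {
              final Object dwpt;   // the DocumentsWriterPerThread this lock guards

              ThreadStateSketch(Object dwpt) {
                this.dwpt = dwpt;
              }
            }

            class PoolSketch {
              private final ThreadStateSketch[] states;

              PoolSketch(int maxThreadStates) {
                states = new ThreadStateSketch[maxThreadStates];
                for (int i = 0; i < states.length; i++) {
                  states[i] = new ThreadStateSketch(new Object());
                }
              }

              // addDocument path: prefer the thread's affine slot, else take any free one,
              // else wait on the preferred slot; the caller must unlock() afterwards
              ThreadStateSketch getAndLock(int preferredIdx) {
                ThreadStateSketch preferred = states[preferredIdx];
                if (preferred.tryLock()) {
                  return preferred;
                }
                for (ThreadStateSketch s : states) {
                  if (s.tryLock()) {
                    return s;
                  }
                }
                preferred.lock();
                return preferred;
              }
            }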

          What about the memory used, eg, the non-use of byte[] recycling? I guess it'll be cleared on flush.

          Yeah, sure. That is independent of whether they're all created upfront or not. But yeah, after flush or abort we need to clear the DWPT's state to make sure it's not consuming unused RAM (as you described in your earlier comment).

          Jason Rutherglen added a comment -

          DocumentsWriterPerThreadPool.ThreadState now extends ReentrantLock, which means that standard methods like lock() and unlock() can be used to reserve a DWPT for a task.

          Really? That makes synchronized seem simpler?

          the max. number of DWPTs allowed (config.maxThreadStates) is instantiated up-front.

          What about the memory used, eg, the non-use of byte[] recycling? I guess it'll be cleared on flush.

          fix RAM tracking and flush-by-RAM

          I created a BytesUsed object that cascades the changes to parent BytesUsed objects, this allows each individual SD, DWPT, DW, etc to keep track of their bytes used, while also propagating the changes to the higher level objects, eg, SD -> DWPT, DWPT -> DW.
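
          Roughly, the cascading counter could look like this (a hypothetical sketch of the idea, not the actual patch): an AtomicLong-backed counter forwards every delta to its parent, so the SD-, DWPT- and DW-level totals stay consistent without extra locking.

            // Hypothetical sketch of the cascading BytesUsed idea; not the actual patch.
            import java.util.concurrent.atomic.AtomicLong;

            class BytesUsed {
              private final AtomicLong bytes = new AtomicLong();
              private final BytesUsed parent;   // null for the top-level (DW) counter

              BytesUsed(BytesUsed parent) {
                this.parent = parent;
              }

              void add(long delta) {
                bytes.addAndGet(delta);
                if (parent != null) {
                  parent.add(delta);   // propagate SD -> DWPT -> DW
                }
              }

              long get() {
                return bytes.get();
              }
            }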

          Michael Busch added a comment -

          I made some progress with the concurrency model, especially removing the need for various locks to make everything easier.

          • DocumentsWriterPerThreadPool.ThreadState now extends ReentrantLock, which means that standard methods like lock() and unlock() can be used to reserve a DWPT for a task.
          • The max. number of DWPTs allowed (config.maxThreadStates) is instantiated up-front. Creating a DWPT is cheap, so this is not a performance concern; this makes it easier to push config changes to the DWPTs without synchronizing on the pool and without having to worry about newly created DWPTs getting the same config settings.
          • DocumentsWriterPerThreadPool.getActivePerThreadsIterator() gives the caller a static snapshot of the active DWPTs at the time the iterator was acquired, e.g. for flushAllThreads() or DW.abort(). Here synchronizing on the pool isn't necessary either.
          • deletes are now pushed to DW.pendingDeletes() if no active DWPTs are present.

          TODOs:

          • fix remaining testcases that still fail
          • fix RAM tracking and flush-by-RAM
          • write new testcases to test thread pool, thread assignment, etc
          • review if all cases that were discussed in the recent comments here work as expected (likely not)
          • performance testing and code cleanup
          Michael McCandless added a comment -

          Lost in abbreviations - Can you remind me what 'ES' is?

          I read it as ExecutorService, ie, a thread pool.

          Yes, sorry that's what I meant.

          Ie someday IW can take an ES too and farm things out to it when it could make use of concurrency (like flush the world). But that's for later </dream>.

          Michael Busch added a comment -

          Mainly that we could have DWPT(s) lying around unused, consuming [recycled] RAM, eg, from a sudden drop in the number of incoming threads after a flush. This is a "drop the code, and put it back in if that was a bad idea" solution.

          Ah thanks, got it.

          Or simply stop recycling any RAM, so that a just-flushed DWPT is an empty shell.

          +1

          Jason Rutherglen added a comment -

          Lost in abbreviations - Can you remind me what 'ES' is?

          I read it as ExecutorService, ie, a thread pool.

          I'm not sure I understand what the problem here with recycling RAM is. Could someone elaborate?

          Mainly that we could have DWPT(s) lying around unused, consuming [recycled] RAM, eg, from a sudden drop in the number of incoming threads after a flush. This is a "drop the code, and put it back in if that was a bad idea" solution.

          Michael Busch added a comment -

          I think aborting a flush should only lose the docs in that one DWPT (as it is today).

          Yeah, I'm convinced now I don't want the "nuke the world" approach. Btw, Mike, you're very good at giving things intuitive names.

          I think on commit if we hit an aborting exception flushing a given DWPT, we throw it then & there.

          Yes sounds good.

          Any segs already flushed remain flushed (but not committed). Any segs not yet flushed remain not yet flushed...

          If the segments are flushed, will they then be deleted? Or will they be made available in a subsequent and completely successful commit?

          The aborting exception might be thrown due to a disk-full situation. This can be fixed and commit() called again, which then would flush the remaining DWPTs and commit all flushed segments.
          Otherwise, those flushed segments will be orphaned and deleted sometime later by a different IW because they don't belong to any SegmentInfos.

          Michael Busch added a comment -

          Longer term c) would be great, or, if IW has an ES then it'd send multiple flush jobs to the ES.

          Lost in abbreviations - Can you remind me what 'ES' is?

          But, you're right: maybe we should sometimes "prune" DWPTs. Or simply stop recycling any RAM, so that a just-flushed DWPT is an empty shell.

          I'm not sure I understand what the problem here with recycling RAM is. Could someone elaborate?

          Jason Rutherglen added a comment -

          Any segs already flushed remain flushed (but not committed). Any segs not yet flushed remain not yet flushed...

          If the segments are flushed, will they then be deleted? Or will they be made available in a subsequent and completely successful commit?

          Michael McCandless added a comment -

          I think on commit if we hit an aborting exception flushing a given DWPT, we throw it then & there.

          Any segs already flushed remain flushed (but not committed). Any segs not yet flushed remain not yet flushed...

          Jason Rutherglen added a comment -

          Once flush is triggered, the thread doing the flushing is free to flush any DWPT.

          OK.

          OK let's start there and put back re-use only if we see a real perf issue?

          I think that's best. Balancing RAM isn't implemented in the branch, and we can't predict the future usage of DWPT(s) (which could languish consuming RAM with byte[]s well after they're flushed due to a sudden drop in the number of calling threads external to IW).

          But it's really a "nuke the world" option which scares me. EG it could be a looong indexing session (app doesn't call commit() until the end) and we could be throwing away a lot of progress.

          Right. Another option is, on commit, to try to flush all segments, meaning even if one DWPT/segment aborts, we continue on with the other DWPTs (ie, a best effort). Then perhaps throw an exception with a report of which segment flushes succeeded, or simply return a report object detailing what happened during commit (somewhat expert usage though). Either way I think we need to give a few options to the user, then choose a default and see if it sticks. The default should probably be "best effort".

          Michael McCandless added a comment -

          I think with B
          we're saying even if the calling thread is bound to DWPT #1, if DWPT #2 is
          greater in size and the aggregate RAM usage exceeds the max, using the calling
          thread, we take DWPT #2 out of production, flush, and return it?

          Right – the thread affinity has nothing to do with which thread gets to flush which DWPT. Once flush is triggered, the thread doing the flushing is free to flush any DWPT.
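
          In pseudo-Java, the trigger for case B might look roughly like this (a hypothetical sketch of the policy being discussed, not branch code): when the aggregate RAM across all DWPTs exceeds the limit, the calling thread flushes whichever DWPT is currently largest, regardless of its own affinity.

            // Hypothetical sketch of the flush-by-RAM trigger; not branch code.
            class FlushByRamSketch {
              interface Dwpt {
                long bytesUsed();
                void flushAndClear();   // take out of production, flush, return to the pool
              }

              // called by the indexing thread after it has added its doc to its own DWPT
              static void maybeFlush(Iterable<Dwpt> dwpts, long totalBytes, long ramBufferBytes) {
                if (totalBytes <= ramBufferBytes) {
                  return;
                }
                Dwpt biggest = null;
                for (Dwpt d : dwpts) {
                  if (biggest == null || d.bytesUsed() > biggest.bytesUsed()) {
                    biggest = d;
                  }
                }
                if (biggest != null) {
                  biggest.flushAndClear();
                }
              }
            }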

          Maybe we can simply throw out the DWPT
          and put recycling byte[]s and/or pooling DWPTs back in later if it's necessary?

          OK let's start there and put back re-use only if we see a real perf issue?

          What I meant was the following situation: Suppose we have two DWPTs and IW.commit() is called. The first DWPT finishes flushing successfully, is returned to the pool and idle again. The second DWPT flush fails with an aborting exception.

          Hmm, tricky. I think I'd lean towards keeping segment 1. Discarding it would be inconsistent w/ aborts hit during the "flushed by RAM" case? EG if seg 1 was flushed due to RAM usage, succeeds, and then later seg 2 is flushed due to RAM usage, but aborts. In this case we would still keep seg 1?

          I think aborting a flush should only lose the docs in that one DWPT (as it is today).

          Remember, a call to commit may succeed in flushing seg 1 to disk and updating the in-memory segment infos, but on hitting the aborting exc for seg 2, it will throw that to the caller, not having committed any change to the index. Exceptions thrown during the prepareCommit (phase 1) part of commit mean nothing is changed in the index.

          Alternatively... we could abort the entire IW session (as eg we handle OOME today) if ever an aborting exception was hit? This might be cleaner? But it's really a "nuke the world" option which scares me. EG it could be a looong indexing session (app doesn't call commit() until the end) and we could be throwing away a lot of progress.

          Jason Rutherglen added a comment -

          I think segment 1 shouldn't be committed, ie. a global flush should be all or nothing. This means we would have to delay the commit of the segments until all DWPTs flushed successfully.

          If a DWPT aborts during flush, we simply throw an exception, however we still keep the successfully flushed segment(s). If there's an abort on any DWPT during commit then we throw away any successfully flushed segments as well. I think that makes sense, eg, all or nothing.

          Michael Busch added a comment -

          I think start simple - the addDocument always happens? Ie it's never coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since flush is single threaded, there should always be a free DWPT?

          Yeah I agree. The change I'll make then is to not have the global lock and return a DWPT immediately to the pool and set it to 'idle' after its flush completed.

          I think we should continue what we do today? Ie, if it's an 'aborting' exception, then the entire segment held by that DWPT is discarded? And we then throw this exc back to caller (and don't try to flush any other segments)?

          What I meant was the following situation: Suppose we have two DWPTs and IW.commit() is called. The first DWPT finishes flushing successfully, is returned to the pool and idle again. The second DWPT flush fails with an aborting exception. Should the segment of the first DWPT make it into the index or not? I think segment 1 shouldn't be committed, ie. a global flush should be all or nothing. This means we would have to delay the commit of the segments until all DWPTs flushed successfully.

          Michael Busch added a comment -

          I think the risk is a new DWPT likely will have been created during flush, which'd make the returning DWPT inutile.

          The DWPT will not be removed from the pool, just marked as busy during flush, the same way its state is busy (currently called "non-idle" in the code) during addDocument(). So no new DWPT would be created during flush if the maxThreadState limit was already reached.
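
          Roughly, the pool behavior being described looks like the following sketch (ThreadState,
          the busy flag and maxThreadStates are illustrative names, not the branch's exact code):
          a flushing DWPT stays in the pool but is marked busy, so no replacement is created once
          the cap is reached.

          import java.util.ArrayList;
          import java.util.List;

          // Illustrative sketch of the pool semantics described above.
          class ThreadStatePool {
            static class ThreadState { boolean busy; }   // busy inside addDocument() or while flushing

            private final List<ThreadState> states = new ArrayList<>();
            private final int maxThreadStates;

            ThreadStatePool(int maxThreadStates) { this.maxThreadStates = maxThreadStates; }

            synchronized ThreadState acquire() {
              for (ThreadState s : states) {
                if (!s.busy) { s.busy = true; return s; }
              }
              if (states.size() < maxThreadStates) {   // only grow while below the cap
                ThreadState s = new ThreadState();
                s.busy = true;
                states.add(s);
                return s;
              }
              return null;                             // everything busy or flushing: no new DWPT is created
            }

            synchronized void release(ThreadState s) { s.busy = false; }   // idle again once flush completes
          }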

          Jason Rutherglen added a comment -

          To further clarify, we also no longer have global aborts? Each abort only applies to an individual DWPT?

          Jason Rutherglen added a comment -

          the "flush the world" case? (Ie the app calls IW.commit or
          IW.getReader). In this case the thread just one by one pulls all DWPTs that
          have any indexed docs out of production, flushes them, clears them, and returns
          them to production?

          The 2 cases are: A) flush every DWPT sequentially (aka flush the world) and
          B) flush by RAM usage when adding docs or deleting. A is clear! I think with B
          we're saying that even if the calling thread is bound to DWPT #1, if DWPT #2 is
          greater in size and the aggregate RAM usage exceeds the max, then, using the calling
          thread, we take DWPT #2 out of production, flush it, and return it?
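
          Case B might look something like this sketch, with Dwpt, bytesUsed() and flushAndReturn()
          as made-up names: when the aggregate crosses the budget, the calling thread flushes
          whichever DWPT is currently the largest, not necessarily its own.

          import java.util.List;

          // Illustrative sketch of flush-by-RAM picking the biggest DWPT.
          class FlushByRam {
            interface Dwpt {
              long bytesUsed();
              void flushAndReturn() throws Exception;   // take out of production, flush, return to the pool
            }

            static void maybeFlush(List<Dwpt> dwpts, long ramBudgetBytes) throws Exception {
              long total = 0;
              Dwpt biggest = null;
              for (Dwpt d : dwpts) {
                long used = d.bytesUsed();
                total += used;
                if (biggest == null || used > biggest.bytesUsed()) {
                  biggest = d;
                }
              }
              if (total > ramBudgetBytes && biggest != null) {
                biggest.flushAndReturn();   // done on the calling thread, even if it was bound to another DWPT
              }
            }
          }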

          The behavior of calling IW.close while other threads are still adding
          docs has never been defined (and, shouldn't be) except that we won't corrupt
          your index, and we'll get all docs indexed before .close was called, committed.
          So I think even for this case we don't need a global lock.

          Great, that simplifies and clarifies that we do not require a global lock.

          But, you're right: maybe we should sometimes "prune" DWPTs. Or simply
          stop recycling any RAM, so that a just-flushed DWPT is an empty shell.

          I'm not sure how we'd prune, typically object pools have a separate eviction
          thread, I think that's going overboard? Maybe we can simply throw out the DWPT
          and put recycling byte[]s and/or pooling DWPTs back in later if it's necessary?

          Michael McCandless added a comment -

          And there's the case of the thread calling flush doesn't yet have a DWPT, it's going to need to get one assigned to it, however the one assigned may not be the max ram consumer. What'll we do then? If the user explicitly called flush we can a) do nothing b) flush (the max ram consumer) thread's DWPT, however that gets hairy with wait notifies (almost like the global lock?).

          Wait – why would the thread calling flush need to have a DWPT assigned to it? You're talking about the "flush the world" case? (Ie the app calls IW.commit or IW.getReader). In this case the thread just one by one pulls all DWPTs that have any indexed docs out of production, flushes them, clears them, and returns them to production?

          Michael McCandless added a comment -

          The proposed change is simply the thread calling add doc will flush its DWPT if needed, take it offline while doing so, and return it when completed.

          Wait – this is the "addDocument" case right? (I thought we were still talking about the "flush the world" case...).

          I think the risk is a new DWPT likely will have been created during flush, which'd make the returning DWPT inutile?

          A new DWPT will have been created only if more than one thread is indexing docs right? In which case this is fine? Ie the old DWPT (just flushed) will just go back into rotation, and when another thread comes in it can take it?

          But, you're right: maybe we should sometimes "prune" DWPTs. Or simply stop recycling any RAM, so that a just-flushed DWPT is an empty shell.

          However I think we may still need the global lock for close, eg, today we're preventing the user from adding docs during close, after this issue is merged that behavior would change?

          Well, the threads still adding docs will hit AlreadyClosedException? (But, that's just "best effort"). The behavior of calling IW.close while other threads are still adding docs has never been defined (and, shouldn't be) except that we won't corrupt your index, and we'll get all docs indexed before .close was called, committed. So I think even for this case we don't need a global lock.

          Jason Rutherglen added a comment -

          And there's the case of the thread calling flush doesn't yet have a DWPT, it's going to need to get one assigned to it, however the one assigned may not be the max ram consumer. What'll we do then? If the user explicitly called flush we can a) do nothing b) flush (the max ram consumer) thread's DWPT, however that gets hairy with wait notifies (almost like the global lock?).

          Jason Rutherglen added a comment -

          As soon as a DWPT is pulled from production for flushing, it loses all thread affinity and becomes unavailable until its flush finishes. When a thread needs a DWPT, it tries to pick the one it last had (affinity) but if that one's busy, it picks a new one. If none are available but we are below our max DWPT count, it spins up a new one?

          Right.

          With the proposed approach, all docs added (or in the process of being added) will make it into the flushed segments once the flush returns; newly added docs after the flush call started may or not make it. But this is fine? I mean, if the app has stronger requirements then it should externally sync?

          Ok. The proposed change is simply the thread calling add doc will flush its DWPT if needed, take it offline while doing so, and return it when completed. I think the risk is a new DWPT likely will have been created during flush, which'd make the returning DWPT inutile?

          Why would we lose them? Wouldn't that DWPT just go back into rotation once the flush is done?

          Yes, we just need to change the existing code a bit then.

          However I think we may still need the global lock for close, eg, today we're preventing the user from adding docs during close, after this issue is merged that behavior would change?

          Michael McCandless added a comment -

          What if the user wants a guaranteed hard flush of all state up to the point of
          the flush call (won't they want this sometimes with getReader)? If we're
          flushing sequentially (without pausing all threads) we're removing that? Maybe
          we'll need to give the option of global lock/stop or sequential flush?

          What's a "hard flush"?

          With the proposed approach, all docs added (or in the process of being added) will make it into the flushed segments once the flush returns; newly added docs after the flush call started may or not make it. But this is fine? I mean, if the app has stronger requirements then it should externally sync?

          Also I think we need to clear the thread bindings of a DWPT just prior to the flush of the DWPT?

          Right.

          As soon as a DWPT is pulled from production for flushing, it loses all thread affinity and becomes unavailable until its flush finishes. When a thread needs a DWPT, it tries to pick the one it last had (affinity) but if that one's busy, it picks a new one. If none are available but we are below our max DWPT count, it spins up a new one?

          Then, what happens to reusing the DWPT if we're flushing it, and we spin a new
          DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
          recycling?

          Why would we lose them? Wouldn't that DWPT just go back into rotation once the flush is done?
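
          The affinity rule quoted above could be sketched roughly as follows (Dwpt, the inUse flag
          and the ThreadLocal are illustrative, not the branch's code): a thread prefers the DWPT it
          used last time, falls back to any idle one, and only spins up a new one while under the cap.

          import java.util.ArrayList;
          import java.util.List;

          // Illustrative sketch of affinity-based DWPT checkout.
          class AffinityPool {
            static class Dwpt { boolean inUse; }

            private final List<Dwpt> pool = new ArrayList<>();
            private final int maxDwpts;
            private final ThreadLocal<Dwpt> lastUsed = new ThreadLocal<>();

            AffinityPool(int maxDwpts) { this.maxDwpts = maxDwpts; }

            synchronized Dwpt checkout() {
              Dwpt preferred = lastUsed.get();
              if (preferred != null && !preferred.inUse) {      // affinity: same DWPT as last time
                preferred.inUse = true;
                return preferred;
              }
              for (Dwpt d : pool) {                             // otherwise any idle DWPT
                if (!d.inUse) { d.inUse = true; lastUsed.set(d); return d; }
              }
              if (pool.size() < maxDwpts) {                     // otherwise spin up a new one
                Dwpt d = new Dwpt();
                d.inUse = true;
                pool.add(d);
                lastUsed.set(d);
                return d;
              }
              return null;                                      // all busy or flushing: caller waits
            }

            synchronized void checkin(Dwpt d) { d.inUse = false; }  // back into rotation after flush
          }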

          Jason Rutherglen added a comment -

          Also, don't we need the global lock for commit/close?

          Jason Rutherglen added a comment -

          So all that's guaranteed after the global flush() returns is that all
          state present prior to when flush() is invoked, is moved to disk. Ie if addDocs
          are still happening concurrently then the DWPTs will start filling up again
          even while the "global flush" runs. That's fine.

          What if the user wants a guaranteed hard flush of all state up to the point of
          the flush call (won't they want this sometimes with getReader)? If we're
          flushing sequentially (without pausing all threads) we're removing that? Maybe
          we'll need to give the option of global lock/stop or sequential flush?

          Also I think we need to clear the thread bindings of a DWPT just prior to the
          flush of the DWPT? Otherwise (when multiple threads are mapped to a single
          DWPT) the other threads will wait on the [main] DWPT flush when they should be
          spinning up a new DWPT?

          Then, what happens to reusing the DWPT if we're flushing it, and we spin up a new
          DWPT (effectively replacing the old DWPT), eg, we're going to lose the byte[]
          recycling? Maybe we need to share and sync the byte[] pooling between DWPTs,
          or will that noticeably affect indexing performance?

          Michael McCandless added a comment -

          I guess we don't really need the global lock. A thread performing the "global flush" could still acquire each thread state before it starts flushing, but return a threadState to the pool once that particular threadState is done flushing?

          Good question... we could (in theory) also flush them concurrently? But, since we don't "own" the threads in IW, we can't easily do that, so I think no global lock, go through all DWPTs w/ current thread and flush, sequentially? So all that's guaranteed after the global flush() returns is that all state present prior to when flush() is invoked, is moved to disk. Ie if addDocs are still happening concurrently then the DWPTs will start filling up again even while the "global flush" runs. That's fine.

          A related question is: Do we want to piggyback on multiple threads when a global flush happens? Eg. Thread 1 called commit, Thread 2 shortly afterwards addDocument(). When should addDocument() happen?
          a) After all DWPTs finished flushing?
          b) After at least one DWPT finished flushing and is available again?
          c) Or should Thread 2 be used to help flushing DWPTs in parallel with Thread 1?

          a) is currently implemented, but I think not really what we want.
          b) is probably best for RT, because it means the lowest indexing latency for the new document to be added.
          c) probably means the best overall throughput (depending even on hardware like disk speed, etc)

          I think start simple – the addDocument always happens? Ie it's never coordinated w/ the ongoing flush. It picks a free DWPT like normal, and since flush is single threaded, there should always be a free DWPT?

          Longer term c) would be great, or, if IW has an ES then it'd send multiple flush jobs to the ES.

          For whatever option we pick, we'll have to carefully think about error handling. It's quite straightforward for a) (just commit all flushed segments to SegmentInfos when the global flush completed successfully). But for b) and c) it's unclear what should happen if a DWPT flush fails after some have already completed successfully.

          I think we should continue what we do today? Ie, if it's an 'aborting' exception, then the entire segment held by that DWPT is discarded? And we then throw this exc back to caller (and don't try to flush any other segments)?
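
          The single-threaded "flush the world" loop being discussed might look like this sketch
          (Pool, checkoutWhenIdle and checkin are invented names): the flushing thread takes DWPTs
          out of production one at a time and returns each as soon as its flush finishes, so
          concurrent addDocument() calls can always find a free DWPT.

          import java.util.List;

          // Illustrative sketch of a sequential, single-threaded global flush.
          class SequentialGlobalFlush {
            interface Dwpt {
              boolean hasDocs();
              void flush() throws Exception;          // an aborting exception here stops the whole loop
            }
            interface Pool {
              List<Dwpt> snapshot();                  // the DWPTs that exist right now
              void checkoutWhenIdle(Dwpt d);          // may wait for a thread to finish addDocument()
              void checkin(Dwpt d);                   // back into rotation, idle again
            }

            static void flushTheWorld(Pool pool) throws Exception {
              for (Dwpt d : pool.snapshot()) {
                if (!d.hasDocs()) {
                  continue;                           // nothing buffered, nothing to flush
                }
                pool.checkoutWhenIdle(d);             // at most one DWPT is out of production at a time
                try {
                  d.flush();
                } finally {
                  pool.checkin(d);                    // returned immediately, not held until the global flush ends
                }
              }
            }
          }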

          Jason Rutherglen added a comment -

          RAM accounting is slightly improved; we had two variables in DW keeping track of it, and the extraneous one is removed. However, we're still OOMing in: ant test-core -Dtestcase=TestIndexWriter -Dtestmethod=testIndexingThenDeleting

          Jason Rutherglen added a comment -

          When we're OOMing, the RAM used reported by DW is 0. We're probably not adding to the RAM-used value somewhere.
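
          For reference, the accounting being debugged boils down to something like this sketch
          (method names invented): every add/delete path must credit the shared counter and every
          DWPT flush must debit it, otherwise the reported value stays at 0 and flush-by-RAM never
          triggers.

          import java.util.concurrent.atomic.AtomicLong;

          // Illustrative sketch of global RAM accounting across DWPTs.
          class RamAccounting {
            private final AtomicLong ramBytesUsed = new AtomicLong();

            void onDocConsumed(long bytes)    { ramBytesUsed.addAndGet(bytes); }   // addDocument path
            void onDeleteBuffered(long bytes) { ramBytesUsed.addAndGet(bytes); }   // delete path
            void onDwptFlushed(long bytes)    { ramBytesUsed.addAndGet(-bytes); }  // after a DWPT flush

            boolean shouldFlush(long ramBufferSizeBytes) {
              return ramBytesUsed.get() >= ramBufferSizeBytes;
            }
          }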

          Jason Rutherglen added a comment -

          actually what should be happening currently if the (default)
          ThreadAffinityThreadPool is used. I've to check the code again and maybe write
          a test specifically for that.

          Let's try to test it, though I'm not immediately sure how the test case'd look.

          let's add seqIDs back after the DWPT changes are done and in trunk.

          Right.

          True, the only global lock that locks all thread states happens when
          flushAllThreads is called. This is called when IW explicitly triggers a flush,
          e.g. on close/commit. However, maybe this is not the right approach?

          I think this is fine for the DWPT branch, as flush, commit, and close are
          explicitly issued, blocking commands from the user. If we implemented something
          more complex now, it wouldn't carry over to RT because the DWPTs don't require
          flushing to search on them. Which leads to the main drawback probably being for
          NRT, eg, getReader. Hmm... In that case a stop-the-world flush does affect
          overall indexing performance. Perhaps we can add a flush that doesn't block all
          DWPTs in a separate issue after the DWPT branch is merged to trunk, if there's user
          need? Or perhaps it's easy to implement; I'm still trying to get a feel for the
          lock progression in the branch.

          In the indexing many documents case, the DWPTs'll be flushed by the tiered RAM
          system. It's the bulk add case where we don't want to block all threads/DWPTs
          at once, eg, I think our main goal is to fix Mike's performance test, with NRT
          being secondary or even a distraction.

          But for b) and c) it's unclear what should happen if a DWPT flush fails
          after some completed already successfully before.

          Right, all that'd be solved if we bulk-moved IW to a Scala-like asynchronous
          queuing model. However, it's probably a bit too much to do right now. Perhaps in
          the bulk add-many-docs case we'll need a callback for errors? No, because the
          add-doc method call that triggers the flush will report any exception(s).

          Michael Busch added a comment -

          I believe we can drop the delete in that case. We only need to buffer into DWPTs that have at least 1 doc.

          Yeah sounds right.

          If a given DWPT is flushing then we pick another? Ie the binding logic would naturally avoid DWPTs that are not available - either because another thread has it, or it's flushing. But it would prefer to use the same DWPT it used last time, if possible (affinity).

          This is actually what should be happening currently if the (default) ThreadAffinityThreadPool is used. I've to check the code again and maybe write a test specifically for that.

          Also: I thought we don't have sequence IDs anymore? (At least, for landing DWPT; after that (for "true RT") we need something like sequence IDs?).

          True, sequenceIDs are gone since the last merge. And yes, I still think we'll need them for RT. Even for the non-RT case sequenceIDs would have nice benefits. If methods like addDocument(), deleteDocuments(), etc. return the sequenceID they'd define a strict ordering on those operations and make it transparent for the application, which would be beneficial for document tracking and log replay.

          But anyway, let's add seqIDs back after the DWPT changes are done and in trunk.
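
          The benefit described above amounts to something like this sketch (the long return values
          are hypothetical here, not the current API): every operation hands back a monotonically
          increasing id, so the application gets a strict ordering it can use for document tracking
          and log replay.

          import java.util.concurrent.atomic.AtomicLong;

          // Illustrative sketch of sequence IDs returned by write operations.
          class SequenceIds {
            private final AtomicLong seqId = new AtomicLong();

            long addDocument(Object doc)      { /* index the doc */ return seqId.incrementAndGet(); }
            long deleteDocuments(Object term) { /* buffer delete  */ return seqId.incrementAndGet(); }
            long commit()                     { /* flush + fsync  */ return seqId.incrementAndGet(); }
          }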

          We shouldn't do global waiting anymore - this is what's great about DWPT.

          However we'll have global waiting for the flush all threads case. I think that can move down to DW though. Or should it simply be a sync in/on IW?

          True, the only global lock that locks all thread states happens when flushAllThreads is called. This is called when IW explicitly triggers a flush, e.g. on close/commit.
          However, maybe this is not the right approach? I guess we don't really need the global lock. A thread performing the "global flush" could still acquire each thread state before it starts flushing, but return a threadState to the pool once that particular threadState is done flushing?

          A related question is: Do we want to piggyback on multiple threads when a global flush happens? Eg. Thread 1 called commit, Thread 2 shortly afterwards addDocument(). When should addDocument() happen?
          a) After all DWPTs finished flushing?
          b) After at least one DWPT finished flushing and is available again?
          c) Or should Thread 2 be used to help flushing DWPTs in parallel with Thread 1?

          a) is currently implemented, but I think not really what we want.
          b) is probably best for RT, because it means the lowest indexing latency for the new document to be added.
          c) probably means the best overall throughput (depending even on hardware like disk speed, etc)

          For whatever option we pick, we'll have to carefully think about error handling. It's quite straightforward for a) (just commit all flushed segments to SegmentInfos when the global flush completed successfully). But for b) and c) it's unclear what should happen if a DWPT flush fails after some have already completed successfully.

          Jason Rutherglen added a comment -

          I added a FlushPolicy class; deletes record the RAM used to the DWPT and to DW. Recording to DW is for the global RAM used. The TIW OOM is still occurring, however. The delete calls to FlushControl are gone; I'm not sure what's going to be left of it.

          I'm not sure we should queue. I wonder how much this'd slow down the single threaded case?

          Yes, that's a good point.

          We shouldn't do global waiting anymore - this is what's great about DWPT.

          However we'll have global waiting for the flush all threads case. I think that can move down to DW though. Or should it simply be a sync in/on IW?

          I thought we don't have sequence IDs anymore?

          The seqid lock was there, however it was removed in the last update. I think we need to clearly document the various locks and sync points as right now it's not clear enough to prevent deadlock situations.

          Michael McCandless added a comment -

          Right if the DWPT's flushing we can skip it. In the queue model we'd consume and locate
          the existing DWPTs, adding the delete to each DWPT not flushing.

          We have an issue open for the app to state that the delete will not apply to any docs indexed in the current session... once we do that, then, the deletes don't need to be buffered on any DWPTs; just on the "latest" already flushed segment in the index.

          However in the
          zero DWPT case we still need to record a delete somewhere, most likely we'd
          need to create a zero doc DWPT? Oh wait, we need to add the delete to the last
          segment? Ah, I can fix that in the existing code (eg, fix the reopen test case
          failures).

          Yes, to the last flushed segment.
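
          One reading of that routing, as a sketch (Dwpt, bufferDelete and defaultDeletes are
          illustrative names): the delete is buffered into every DWPT that already holds at least
          one doc, and when none qualify it falls into a default deletes buffer that gets applied
          to the last flushed segment at flush time.

          import java.util.List;

          // Illustrative sketch of delete routing across DWPTs.
          class DeleteRouting {
            interface Dwpt { int numDocs(); void bufferDelete(String term); }
            interface Deletes { void add(String term); }   // "default" deletes held by DW

            static void delete(String term, List<Dwpt> dwpts, Deletes defaultDeletes) {
              boolean buffered = false;
              for (Dwpt d : dwpts) {
                if (d.numDocs() > 0) {          // skip empty (and, per the discussion, flushing) DWPTs
                  d.bufferDelete(term);
                  buffered = true;
                }
              }
              if (!buffered) {
                defaultDeletes.add(term);       // zero-DWPT case: applied to the last flushed segment
              }
            }
          }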

          Michael McCandless added a comment -

          We're going to great lengths it seems to emulate a producer consumer queue (eg,
          ordering of calls with sequence ids, thread pooling) without actually
          implementing one. A fixed size blocking queue would simply block threads as
          needed and would probably look cleaner in code. We could still implement thread
          affinities though I simply can't see most applications requiring affinity, so
          perhaps we can avoid it for now and put it back in later?

          I'm not sure we should queue. I wonder how much this'd slow down the single threaded case?

          Also: I thought we don't have sequence IDs anymore? (At least, for landing DWPT; after that (for "true RT") we need something like sequence IDs?).

          I think thread/doc-class affinity is fairly important. Docs compress better if they are indexed together with similar docs.

          I'm just not sure we still need FC's global waiting during flush, that'd seem to go away because the RAM usage tracking is in DW.

          We shouldn't do global waiting anymore – this is what's great about DWPT.

          However once the affinity DWPT flush completed, we'd need logic to revert back to the original?

          I don't think so? I mean a DWPT post-flush is a clean slate. Some other thread/doc-class can stick to it.

          I think the 5% model of LUCENE-2573 may typically yield flushing that occurs in
          near intervals of each other, ie, it's going to slow down the aggregate
          indexing if they're flushing on top of each other. Maybe we should start at 60%
          then the multiple of 40% divided by maxthreadstate - 1? Ideally we'd
          statistically optimize the flush interval per machine, eg, SSDs and RAM disks
          will likely require only a small flush percentage interval.

          Yeah we'll have to run tests to try to gauge the best "default" policy. And you're right that it'll depend on the relative strength of IO vs CPU on the machine. Fast IO system means we can flush "later".

          Jason Rutherglen added a comment -

          Here's a new test.out, I'll look at TestCheckIndex which should probably work.

          "IndexFileDeleter doesn't know about file" seems odd. We're OOMing in TestIndexWriter because we're not flushing by RAM (eg, it currently defaults to return false).

          Jason Rutherglen added a comment -

          Same as the last patch, except a default deletes buffer is added to DW, to which deletes are added when there are no available DWPTs. On flush-all-threads, the default deletes are applied to the last segment with no doc limit.

          TestIndexReaderReopen now passes.

          Jason Rutherglen added a comment -

          I believe we can drop the delete in that case. We only need to buffer
          into DWPTs that have at least 1 doc.

          Right if the DWPT's flushing we can skip it. In the queue model we'd consume and locate
          the existing DWPTs, adding the delete to each DWPT not flushing. However in the
          zero DWPT case we still need to record a delete somewhere, most likely we'd
          need to create a zero doc DWPT? Oh wait, we need to add the delete to the last
          segment? Ah, I can fix that in the existing code (eg, fix the reopen test case
          failures).

          Jason Rutherglen added a comment -

          We're going to great lengths it seems to emulate a producer consumer queue (eg,
          ordering of calls with sequence ids, thread pooling) without actually
          implementing one. A fixed size blocking queue would simply block threads as
          needed and would probably look cleaner in code. We could still implement thread
          affinities though I simply can't see most applications requiring affinity, so
          perhaps we can avoid it for now and put it back in later?

          I think flush control must be global? Ie when we've used too much RAM we
          start flushing?

          Right, it should. I'm just not sure we still need FC's global waiting during
          flush, that'd seem to go away because the RAM usage tracking is in DW. If we
          record the new incremental RAM used (which I think we do) per add/update/delete
          then we can enable a pluggable user defined flush policy.

          If a given DWPT is flushing then we pick another? Ie the binding logic
          would naturally avoid DWPTs that are not available - either because another
          thread has it, or it's flushing. But it would prefer to use the same DWPT it
          used last time, if possible (affinity).

          However once the affinity DWPT flush completed, we'd need logic to revert back
          to the original?

          I think the 5% model of LUCENE-2573 may typically yield flushing that occurs in
          near intervals of each other, ie, it's going to slow down the aggregate
          indexing if they're flushing on top of each other. Maybe we should start at 60%
          then the multiple of 40% divided by maxthreadstate - 1? Ideally we'd
          statistically optimize the flush interval per machine, eg, SSDs and RAM disks
          will likely require only a small flush percentage interval.
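
          One way to read the 60%/40% idea is the sketch below (purely illustrative arithmetic,
          assuming maxThreadStates > 1): with a 64 MB budget and 5 thread states the trigger points
          land at 38, 44, 51, 57 and 64 MB, so the DWPT flushes are staggered instead of piling on
          top of each other.

          // Illustrative sketch of staggered flush trigger points.
          class StaggeredFlushTriggers {
            static double[] triggerFractions(int maxThreadStates) {
              double[] triggers = new double[maxThreadStates];
              double step = 0.40 / (maxThreadStates - 1);     // 40% divided by maxThreadStates - 1
              for (int i = 0; i < maxThreadStates; i++) {
                triggers[i] = 0.60 + i * step;                // 0.60, 0.70, 0.80, 0.90, 1.00 for 5 states
              }
              return triggers;
            }

            public static void main(String[] args) {
              long budgetMB = 64;
              for (double f : triggerFractions(5)) {
                System.out.println((long) (f * budgetMB) + " MB");   // prints 38, 44, 51, 57, 64
              }
            }
          }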

          Michael McCandless added a comment -
          Another model we could implement is a straight queuing. This'd give us total
          ordering on all IW calls. Documents, deletes, and flushes would be queued up
          and executed asynchronously. For example in today's DWPT code we will still
          block document additions while flushing because we're tying a thread to a given
          DWPT. If a thread's DWPT is flushing, wouldn't we want to simply assign the doc
          add to a different non-flushing DWPT to gain full efficiency? This seems more
          easily doable with a queuing model. If we want synchronous flushing then we'd
          place a flush event in the queue and wait for it to complete executing. How
          does this sound?

          I think we should have to add queueing to all incoming ops...

          If a given DWPT is flushing then we pick another? Ie the binding logic would naturally avoid DWPTs that are not available – either because another thread has it, or it's flushing. But it would prefer to use the same DWPT it used last time, if possible (affinity).

          Michael McCandless added a comment -

          Taking a step back, I'm not sure flush control should be global, as flushing is
          entirely per thread now?

          I think flush control must be global? Ie when we've used too much RAM we start flushing?

          If we're adding a delete term for every DWPT, if one
          is flushing do we wait or do we simply queue it up? I don't think we can wait
          in the delete call for a DWPT to completely flush?

          I believe we can drop the delete in that case. We only need to buffer into DWPTs that have at least 1 doc.

          Jason Rutherglen added a comment -

          Another model we could implement is a straight queuing. This'd give us total
          ordering on all IW calls. Documents, deletes, and flushes would be queued up
          and executed asynchronously. For example in today's DWPT code we will still
          block document additions while flushing because we're tying a thread to a given
          DWPT. If a thread's DWPT is flushing, wouldn't we want to simply assign the doc
          add to a different non-flushing DWPT to gain full efficiency? This seems more
          easily doable with a queuing model. If we want synchronous flushing then we'd
          place a flush event in the queue and wait for it to complete executing. How
          does this sound?
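
          For what it's worth, a bare-bones version of that queuing model might look like this
          (entirely hypothetical, not how the branch is structured): adds, deletes and flushes
          become events on a fixed-size blocking queue, worker threads drain it onto non-flushing
          DWPTs, and a synchronous flush just enqueues a flush event and waits on its latch.

          import java.util.concurrent.ArrayBlockingQueue;
          import java.util.concurrent.BlockingQueue;
          import java.util.concurrent.CountDownLatch;

          // Illustrative sketch of a queued indexing model.
          class IndexingQueue {
            abstract static class Op { final CountDownLatch done = new CountDownLatch(1); }
            static class AddDoc extends Op { final Object doc; AddDoc(Object doc) { this.doc = doc; } }
            static class DeleteTerm extends Op { final String term; DeleteTerm(String t) { this.term = t; } }
            static class Flush extends Op { }

            private final BlockingQueue<Op> queue = new ArrayBlockingQueue<>(1024);  // blocks producers when full

            void submit(Op op) throws InterruptedException { queue.put(op); }

            void flushAndWait() throws InterruptedException {
              Flush f = new Flush();
              queue.put(f);
              f.done.await();              // total ordering: everything queued before this has run
            }

            // Worker loop (one or more threads): pull the next op and apply it to a free, non-flushing DWPT.
            void workerLoop() throws InterruptedException {
              while (true) {
                Op op = queue.take();
                // ... dispatch the op to a non-flushing DWPT here ...
                op.done.countDown();
              }
            }
          }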

          Jason Rutherglen added a comment -

          Perhaps it's best to place the RAM tracking into FlushControl, where the RAM
          consumed by delete queries, delete terms, and added documents can be recorded, so that
          the proper flush decision may be made in it, a central global object. To get this idea
          working we'd need to implement LUCENE-2573 in FlushControl. I'll likely get
          started on this.

          Jason Rutherglen added a comment -

          I think we can get DWPTs working sans LUCENE-2573 for now. We can do this by setting a hard max buffer size, max buffered deletes, etc., per DWPT. The values will be the IWC values divided by the max thread states.
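
          That division could be as simple as the following sketch (field and parameter names are
          made up; the real limits live on IndexWriterConfig): each DWPT gets an equal slice of the
          global limits as its own hard cap.

          // Illustrative sketch of per-DWPT hard limits derived from the global config.
          class PerDwptLimits {
            final double ramBufferSizeMBPerDwpt;
            final int maxBufferedDocsPerDwpt;
            final int maxBufferedDeleteTermsPerDwpt;

            PerDwptLimits(double ramBufferSizeMB, int maxBufferedDocs, int maxBufferedDeleteTerms,
                          int maxThreadStates) {
              this.ramBufferSizeMBPerDwpt        = ramBufferSizeMB / maxThreadStates;
              this.maxBufferedDocsPerDwpt        = maxBufferedDocs / maxThreadStates;
              this.maxBufferedDeleteTermsPerDwpt = maxBufferedDeleteTerms / maxThreadStates;
            }
          }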

          Jason Rutherglen added a comment -

          Taking a step back, I'm not sure flush control should be global, as flushing is
          entirely per thread now? If we're adding a delete term for every DWPT, if one
          is flushing do we wait or do we simply queue it up? I don't think we can wait
          in the delete call for a DWPT to completely flush?

          So we'll likely need to apply deletes in a PerThreadTask that's executed on each
          existing DWPT. Guaranteeing concurrency seems a little odd here: what if
          a new DWPT is spun up while we're deleting in another thread? Perhaps we should
          simply spin up the max-thread-state number of DWPTs on DW init? This way we
          always have > 0 available, and we can perhaps lock on seq id when adding
          deletes to all DWPTs (it's a fast call). We may want to simply skip any
          flushing DWPTs when adding deletes?

          In browsing the code, FlushControl isn't used in very many places. This'll get
          a little bit more fleshed out when we integrate LUCENE-2573.

          Jason Rutherglen added a comment -

          I added an assertion showing the lack of DWPTs when delete is called.

          deleteTermNoWait (which skips flush control) is called in update doc and now deadlock doesn't occur when executing TestIndexWriter.

          Jason Rutherglen added a comment -

          Also I think the deadlock is happening in DW update doc where we're calling delete term across all DWPTs. How will we guarantee point-in-timeness when RT is turned on? I guess with the sequence ids?

          Jason Rutherglen added a comment -

          Looks like the problem with TestIndexReaderReopen is that the test opens an IW and tries to queue deletes; however, because there isn't an existing DWPT (at least when flush is called), the deletes haven't been recorded, and so they're not applied to the index. Perhaps we need to init DW with at least 1 DWPT?

          Jason Rutherglen added a comment -

          Small patch fixing the num deletes test null pointer.

          The TestIndexReaderReopen failure seems to have something to do with flushing deletes.

          Jason Rutherglen added a comment -

          Here's ant test-core output. Looks like it's deadlocking in TestIndexWriter? There are some IR.reopen failures, a null pointer, and a delete count I'll look at.

          Jason Rutherglen added a comment -

          Also, it'd be great if we could summarize the changes trunk -> DWPT branch.

          Jason Rutherglen added a comment -

          As per Michael B's email, I'll start on integrating deletes and flush-by-RAM aka LUCENE-2573. Along the way the deadlock issue could be uncovered - which test is it occurring on?

          Jason Rutherglen added a comment -

          The merge is rather hard, as you might imagine

          I'd hold off until LUCENE-2680 is committed to trunk, otherwise the merge work may be redundant.

          commit to trunk and merge trunk into realtime again

          I think it's pretty much ready for trunk as the tests pass.

          Merging realtime/DWPT branch back to trunk should be easier if we try to wrap up DWPT in a short period of time, eg, plow ahead until it's merge-able.

          Michael Busch added a comment -

          I started merging yesterday the latest trunk into realtime.

          As part of this I want to clean up the branch a bit and remove unnecessary changes (like refactorings) to make the merge back into trunk less difficult. When I'm done with the merge we should patch LUCENE-2680 into realtime. (or commit to trunk and merge trunk into realtime again)

          Michael Busch added a comment -

          I started merging yesterday the latest trunk into realtime. The merge is rather hard, as you might imagine
          But I'm down from 600 compile errors to ~100. I can try to finish it this weekend.

          But I don't want to block you, if you want to go the patch route and have time now don't wait for me.

          Jason Rutherglen added a comment -

          Michael,

          I don't have commit access to the realtime branch. A patch is probably easier
          for other people to work on and look at than the branch? Also, I have some time
          to work on integrating DWPT and the changes since the branch creation are
          fairly significant and probably require mostly manual merging.

          Michael Busch added a comment -

          Not sure if that's much easier though, because what you said is true: the realtime branch currently is basically the DWPT branch.

          Jason Rutherglen added a comment -

          > and then merge realtime back into trunk

          I'd prefer to simply create a patch of the DWPT changes, which granted'll be large, however we're through most of the hurdles and at this point, the patch'll be straightforward to implement and test using the existing unit tests?

          Michael Busch added a comment -

          Ideally we should merge trunk into realtime after LUCENE-2680 is committed, get everything working there, and then merge realtime back into trunk?

          I agree that it totally makes sense to get DWPT into trunk as soon as possible (ie. not wait until all realtime stuff is done).

          Jason Rutherglen added a comment -

          Now that LUCENE-2680 is implemented we can get deletes working properly in the
          realtime, which is really the DWPT branch at this point. Given the significant
          changes made since it's creation, rather than continue with a realtime branch,
          I propose we patch into trunk the DWPT changes.

          Jason Rutherglen added a comment -

          I opened an issue for the deletes LUCENE-2655

          Jason Rutherglen added a comment - - edited

          Simon, on second thought, let's go ahead and factor out BytesHash. Do you want to submit a patch for the realtime branch and post it here, or should I?

          Jason Rutherglen added a comment -

          Simon, I think the BytesHash being factored is useful, though not a must have for committing the flush by DWPT code.

          Mike, I need to finish the unit tests for LUCENE-2573.

          Michael, what is the issue with deletes? We don't need deletes to use sequence ids yet? Maybe we should open a separate issue to make deletes work for the realtime/DWPT branch?

          Simon Willnauer added a comment -

          BTW, randomly, LUCENE-2186 has already roughly factored out the BytesRef hash from TermsHashPerField... so if we want also to reuse that here, we can share.

          do you guys still need the BytesHash factored out? I'm currently working on splitting standalone parts out of LUCENE-2186. If so, that would be the next on the list... let me know

          I just posted some details about the concurrency bottleneck of flush at http://chbits.blogspot.com/2010/09/lucenes-indexing-is-fast.html

          nice post mike

          Michael McCandless added a comment -

          I just posted some details about the concurrency bottleneck of flush at http://chbits.blogspot.com/2010/09/lucenes-indexing-is-fast.html

          It's pretty bad – w/ only 6 threads, I see flush taking 54% of the time, ie for 54% of the time all threads are blocked on indexing while the flush runs.

          Michael Busch added a comment -

          Is this near-committable?

          I think we need to:

          • merge trunk and make tests pass
          • finish flushing by RAM
          • make deletes work again

          Then it should be ready to commit. Sorry, I was so busy the last few weeks that I couldn't make much progress.

          Michael McCandless added a comment -

          Is this because indexing stops while the DWPT segment is being flushed to disk or are you referring to a different sync?

          I'm talking about Lucene trunk today (ie before this patch).

          Yes, because indexing of all 20 threads is blocked while a single thread moves the RAM buffer to disk. But, with this patch, each thread will privately move its own RAM buffer to disk, not blocking the rest.

          With 20 threads I'm seeing ~4 seconds of concurrent indexing and then 6-8 seconds to flush (w/ 256 MB RAM buffer).

          Jason Rutherglen added a comment -

          I think the sync'd flush is a big bottleneck

          Is this because indexing stops while the DWPT segment is being flushed to disk or are you referring to a different sync?

          Michael McCandless added a comment -

          Is this near-committable? Ie "just" the DWPT cutover? This part seems separable from making each DWPT's buffer searchable?

          I'm running some tests w/ 20 indexing threads and I think the sync'd flush is a big bottleneck...

          Michael McCandless added a comment -

          BTW, randomly, LUCENE-2186 has already roughly factored out the BytesRef hash from TermsHashPerField... so if we want also to reuse that here, we can share.

          Michael Busch added a comment -

          Is it possible that as part of this issue (or this effort), you'll think of opening PTDW for easier extensions (such as Parallel Indexing)?

          Yeah I'd like to make some progress on parallel indexing too. I think now that DWPT is roughly working I can start thinking about what further changes are necessary in the indexer.

          Shai Erera added a comment -

          Is it possible that as part of this issue (or this effort), you'll think of opening PTDW for easier extensions (such as Parallel Indexing)? If it can easily be done by marking some methods as protected instead of (package-)private, then I think it's worth it. We can mark them as lucene.internal or something ... And perhaps even allow setting the DW on IW, for really expert use?

          If you think that should be part of the overhaul refactoring you once suggested (Michael B.), then I'm ok with that as well. Just thinking that it's better to strike while the iron is still hot.

          Michael McCandless added a comment -

          Yes, you're right! So, after this change, an exception while processing stored fields / term vectors must be handled as an aborting exception.

          Michael Busch added a comment -

          Well, though, if we did write it "live", since the doc is marked deleted, it would be "harmless" right?

          Only difference is it took up some disk space (which should be minor given that it's not "normal" to have lots of docs hitting non-aborting exceptions).

          Also... when the doc is large, the fact that we double buffer (write first to RAMFile then to the store) can be costly. Ie we consume 1X the stored-field size in transient RAM, vs writing directly to disk, I think?

          Yeah I totally agree that this would be a nice performance win - I got excited at first too about removing that extra buffering layer. But we probably have to handle exceptions a bit differently then? Because now all exceptions thrown in processDocument() are considered non-aborting, because nothing is actually written to disk in processDocument().

          We only abort when we encounter an exception in finishDocument(), because then the files might be corrupted.

          So we probably just have to catch the IOExceptions somewhere else to figure out if it was e.g. a (non-aborting) exception from the TokenStream vs. e.g. a disk full (aborting) exception from StoredFieldsWriter.

          Michael McCandless added a comment -

          I think we don't need PerDocBuffer, either. And this also simplifies RAM usage tracking!

          The nice thing about that buffer is that on
          non-aborting exceptions we can simply skip
          the whole document, because nothing gets
          written to the stores until finishDocument
          is called.

          Well, though, if we did write it "live", since the doc is marked deleted, it would be "harmless" right?

          Only difference is it took up some disk space (which should be minor given that it's not "normal" to have lots of docs hitting non-aborting exceptions).

          Also... when the doc is large, the fact that we double buffer (write first to RAMFile then to the store) can be costly. Ie we consume 1X the stored-field size in transient RAM, vs writing directly to disk, I think?

          Michael Busch added a comment -

          I think we don't need PerDocBuffer, either.
          And this also simplifies RAM usage tracking!

          The nice thing about that buffer is that on
          non-aborting exceptions we can simply skip
          the whole document, because nothing gets
          written to the stores until finishDocument
          is called.

          Jason Rutherglen added a comment -

          I think we should just remove the doc stores

          Right, I think we should remove sharing doc stores between
          segments. And in general, RT apps will likely not want to use
          doc stores if they are performing numerous updates and/or
          deletes. We can explicitly state this in the javadocs.

          I'm thinking we could explore efficient deleted docs as sequence
          ids in a different issue, specifically storing them in a short[]
          and wrapping around.
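
          A minimal sketch of that short[]-with-wrap-around idea, assuming delete sequence ids are stored as short offsets from a moving base; all names are hypothetical and the re-anchoring step that handles wrap-around is only described in a comment, not implemented:

          {code:java}
          // Hedged sketch of "deleted docs as sequence ids in a short[]": 2 bytes per doc
          // instead of a long[] or a per-reader BitSet. All names are hypothetical.
          // Wrap-around would be handled by re-anchoring 'base' once every open reader
          // has advanced past the old window; that re-anchoring step is omitted here.
          class SeqIdDeletes {
            private final short[] deleteOffsets; // 0 = live; otherwise (deleteSeq - base)
            private long base;                   // sequence id the short offsets are relative to

            SeqIdDeletes(int maxDoc, long base) {
              this.deleteOffsets = new short[maxDoc];
              this.base = base;
            }

            void delete(int docId, long deleteSeq) {
              long offset = deleteSeq - base;
              if (offset <= 0 || offset > Short.MAX_VALUE) {
                throw new IllegalStateException("offset out of range; base must be re-anchored first");
              }
              deleteOffsets[docId] = (short) offset;
            }

            /** A reader opened at readerSeq only sees deletes with a sequence id at or below its own. */
            boolean isDeleted(int docId, long readerSeq) {
              short offset = deleteOffsets[docId];
              return offset != 0 && base + offset <= readerSeq;
            }
          }
          {code}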

          Yonik Seeley added a comment -

          Yeah that's pretty much how TermsHashPerField works. I agree with Mike, let's reuse that code.

          Do we even need to maintain a hash over it though, or can we simply keep a list (and allow dup terms until it's time to apply them)?

          Michael Busch added a comment -

          Perhaps a single byte[] with length prefixes (like the field cache has). A single int could then represent a term (it would just be an offset into the byte[], which is field-specific, so no need to store the field each time).

          Yeah that's pretty much how TermsHashPerField works. I agree with Mike,
          let's reuse that code.

          Hmm, but... this opto only helps in that we don't have to merge the
          doc stores if we merge segments that already share their doc stores.
          But if (say) I have 2 threads indexing, and I'm indexing lots of docs
          and each DWPT has written 5 segments, we will then merge these 10
          segments, and must merge the doc stores at that point. So the sharing
          isn't really buying us much (just not closing old files & opening new
          ones, which is presumably negligible)?

          Yeah that's true. I agree it won't help much. I think we should just
          remove the doc stores, great simplification (which should also make
          parallel indexing a bit easier ).

          Another thing: it looks like finishFlushedSegment is sync'd on the IW
          instance, but, it need not be sync'd for all of that? EG
          readerPool.get(), applyDeletes, building the CFS, may not need to be
          inside the sync block?

          Thanks for the hint. I need to carefully go over all the synchronization,
          there are likely more problems.

          Michael McCandless added a comment -

          I still see usage of docStoreOffset, but aren't we doing away with shared doc stores with the cutover to DWPT?

          Do we want all segments that one DWPT writes to share the same
          doc store, i.e. one doc store per DWPT, or remove doc stores
          entirely?

          Oh good question... a single DWPT can in fact continue to share doc
          store across the segments it flushes.

          Hmm, but... this opto only helps in that we don't have to merge the
          doc stores if we merge segments that already share their doc stores.
          But if (say) I have 2 threads indexing, and I'm indexing lots of docs
          and each DWPT has written 5 segments, we will then merge these 10
          segments, and must merge the doc stores at that point. So the sharing
          isn't really buying us much (just not closing old files & opening new
          ones, which is presumably negligible)?

          I think you can further simplify DocumentsWriterPerThread.DocWriter; in fact I think you can remove it & all subclasses in consumers!

          I agree! Now that a high number of testcases pass it's less scary
          to modify even more code - will do this next.

          Also, we don't need separate closeDocStore; it should just be closed during flush.

          OK sounds good.

          Super

          I like the ThreadAffinityDocumentsWriterThreadPool; it's the default right (I see some tests explicitly setting it on IWC; not sure why)?

          It's actually only TestStressIndexing2 and it sets it to use a different
          number of max thread states than the default.

          Ahh OK great.

          We should make the in-RAM deletes impl somehow pluggable?

          Do you mean so that it's customizable how deletes are handled?

          Actually I was worried about the long[] sequenceIDs (adding 8 bytes
          RAM per buffered doc) – this could be a biggish hit to RAM efficiency
          for small docs.

          E.g. doing live deletes vs. lazy deletes on flush?
          I think that's a good idea. E.g. at Twitter we'll do live deletes always
          to get the lowest latency (and we don't have too many deletes),
          but that's probably not the best default for everyone.
          So I agree that making this customizable is a good idea.

          Yeah, this too

          Actually deletions today are not applied on flush – they continue to
          be buffered beyond flush, and then get applied just before a merge
          kicks off. I think we should keep this (as an option and probably as
          the default) – it's important for apps w/ large indices that don't use
          NRT (and don't pool readers) because it's costly to open readers.

          So it sounds like we should support "lazy" (apply-before-merge like
          today) and "live" (live means resolve deleted Term/Query -> docID(s)
          synchronously inside deleteDocuments, right?).

          Live should also be less performant because of less temporal locality
          (vs lazy).
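
          For illustration only, a hedged sketch of what such a lazy-vs-live choice could look like; the enum below is made up and is not an existing Lucene API:

          {code:java}
          // Hedged sketch of a configurable delete-application policy. The enum is
          // hypothetical; neither it nor a setter for it exists in IndexWriterConfig.
          enum DeleteApplicationMode {
            /** Buffer deleted terms and resolve them to docIDs only just before a merge
             *  (today's behaviour: better temporal locality, cheaper per delete call). */
            LAZY,
            /** Resolve the deleted term to docIDs synchronously inside deleteDocuments
             *  (lowest latency for RT apps, but more work per call and less locality). */
            LIVE
          }
          {code}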

          It'd also be nice to have a more efficient data structure to buffer the
          deletes. With many buffered deletes the java hashmap approach
          will not be very efficient. Terms could be written into a byte pool,
          but what should we do with queries?

          I agree w/ Yonik: let's worry only about delete by Term (not Query)
          for now.

          Maybe we could reuse (factor out) TermsHashPerField's custom hash
          here, for the buffered Terms? It efficiently maps a BytesRef --> int.

          Another thing: it looks like finishFlushedSegment is sync'd on the IW
          instance, but, it need not be sync'd for all of that? EG
          readerPool.get(), applyDeletes, building the CFS, may not need to be
          inside the sync block?

          Yonik Seeley added a comment -

          It'd also be nice to have a more efficient data structure to buffer the deletes. With many buffered deletes the java hashmap approach will not be very efficient. Terms could be written into a byte pool, but what should we do with queries?

          IMO, terms are an order of magnitude more important than queries. Most deletes will be by some sort of unique id, and will be in the same field.

          Perhaps a single byte[] with length prefixes (like the field cache has). A single int could then represent a term (it would just be an offset into the byte[], which is field-specific, so no need to store the field each time).

          We could then build a treemap or hashmap that natively used an int[]... but that may not be necessary (depending on how deletes are applied). Perhaps a sort could be done right before applying, and duplicate terms could be handled at that time.

          Anyway, I'm only casually following this issue, but it's looking like really cool stuff!
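
          A hedged sketch of the single-byte[]-with-length-prefixes layout; the TermBytePool class and its methods are made up for illustration and are not TermsHashPerField:

          {code:java}
          // Hedged sketch: buffered delete terms for one field are appended into a single
          // growing byte[] with a 2-byte length prefix, so each term is then represented
          // by a single int offset. Names are hypothetical.
          class TermBytePool {
            private byte[] pool = new byte[1024];
            private int used = 0;

            /** Appends the term bytes and returns the int that now represents the term. */
            int add(byte[] termBytes) {
              int needed = 2 + termBytes.length;  // 2-byte length prefix + the term itself
              if (used + needed > pool.length) {
                pool = java.util.Arrays.copyOf(pool, Math.max(pool.length * 2, used + needed));
              }
              int offset = used;
              pool[used++] = (byte) (termBytes.length >>> 8);
              pool[used++] = (byte) termBytes.length;
              System.arraycopy(termBytes, 0, pool, used, termBytes.length);
              used += termBytes.length;
              return offset;
            }

            /** Reads the term stored at the given offset back out of the pool. */
            byte[] get(int offset) {
              int len = ((pool[offset] & 0xFF) << 8) | (pool[offset + 1] & 0xFF);
              return java.util.Arrays.copyOfRange(pool, offset + 2, offset + 2 + len);
            }
          }
          {code}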

          Michael Busch added a comment -

          Thanks, Mike - great feedback! (as always)

          I still see usage of docStoreOffset, but aren't we doing away with
          shared doc stores with the cutover to DWPT?

          Do we want all segments that one DWPT writes to share the same
          doc store, i.e. one doc store per DWPT, or remove doc stores
          entirely?

          I think you can further simplify DocumentsWriterPerThread.DocWriter;
          in fact I think you can remove it & all subclasses in consumers!

          I agree! Now that a high number of testcases pass it's less scary
          to modify even more code - will do this next.

          Also, we don't need separate closeDocStore; it should just be closed
          during flush.

          OK sounds good.

          I like the ThreadAffinityDocumentsWriterThreadPool; it's the default
          right (I see some tests explicitly setting it on IWC; not sure why)?

          It's actually only TestStressIndexing2 and it sets it to use a different
          number of max thread states than the default.

          We should make the in-RAM deletes impl somehow pluggable?

          Do you mean so that it's customizable how deletes are handled?
          E.g. doing live deletes vs. lazy deletes on flush?
          I think that's a good idea. E.g. at Twitter we'll do live deletes always
          to get the lowest latency (and we don't have too many deletes),
          but that's probably not the best default for everyone.
          So I agree that making this customizable is a good idea.

          It'd also be nice to have a more efficient data structure to buffer the
          deletes. With many buffered deletes the java hashmap approach
          will not be very efficient. Terms could be written into a byte pool,
          but what should we do with queries?

          Michael McCandless added a comment -

          This is looking awesome Michael! I love the removal of *PerThread –
          they are all logically absorbed into DWPT, so everything is now per
          thread.

          I still see usage of docStoreOffset, but aren't we doing away with
          shared doc stores with the cutover to DWPT?

          I think you can further simplify DocumentsWriterPerThread.DocWriter;
          in fact I think you can remove it & all subclasses in consumers! The
          consumers can simply directly write their files. The only reason this
          class was created was because we have to interleave docs when writing
          the doc stores; this is no longer needed since doc stores are again
          private to the segment. I think we don't need PerDocBuffer, either.
          And this also simplifies RAM usage tracking!

          Also, we don't need separate closeDocStore; it should just be closed
          during flush.

          I like the ThreadAffinityDocumentsWriterThreadPool; it's the default
          right (I see some tests explicitly setting it on IWC; not sure why)?

          We should make the in-RAM deletes impl somehow pluggable?

          Jason Rutherglen added a comment -

          Implement logic to discard deletes from the deletes
          buffer

          Michael, where in the code is this supposed to occur?

          Implement flush-by-ram logic

          I'll make a go of this.

          Maybe change delete logic: currently deletes are applied
          when a segment is flushed. Maybe we can keep it this way in the
          realtime-branch though, because that's most likely what we want
          to do once the RAM buffer is searchable and deletes are cheaper
          as they can then be done in-memory before flush

          I think we'll keep things this way for this issue (ie, per
          thread document writers), however for LUCENE-2312 I think we'll
          want to implement foreground deletes (eg, updating the deleted
          docs sequences int[]).

          Jason Rutherglen added a comment -

          Looks like we're not using MergeDocIDRemapper anymore?

          Michael Busch added a comment -

          We need to update the indexing chain comment in DocumentsWriterPerThread

          There's a lot of code cleanup to do. I just wanted to checkpoint what I have so far.

          Before I/we forget, maybe we can describe what we discussed about RT features such as the terms dictionary (the AtomicIntArray linked list, possible usage of a btree), the multi-level skip list (static levels), and other such features at: https://issues.apache.org/jira/browse/LUCENE-2312

          Yeah, I will. But first I need to catch up on sleep

          Jason Rutherglen added a comment -

          We need to update the indexing chain comment in DocumentsWriterPerThread

          Jason Rutherglen added a comment -

          Michael, thanks for posting and committing the patch. I'll be taking a look.

          Before I/we forget, maybe we can describe what we discussed about RT features such as the terms dictionary (the AtomicIntArray linked list, possible usage of a btree), the multi-level skip list (static levels), and other such features at: https://issues.apache.org/jira/browse/LUCENE-2312

          Michael Busch added a comment -

          OK, I committed to the branch. I'll try tomorrow to merge trunk into the branch. I was already warned that there will most likely be lots of conflicts - so help is welcome!

          Michael Busch added a comment -

          Finally a new version of the patch! (Sorry for keeping you guys waiting...)

          It's not done yet, but it compiles (against realtime branch!) and >95% of the core test cases pass.

          Work done in addition to last patch:

          • Added DocumentsWriterPerThread
          • Reimplemented big parts of DocumentsWriter
          • Added DocumentsWriterThreadPool, which is an extension point for different pool implementations. The default impl is
            the ThreadAffinityDocumentsWriterThreadPool, which does what the old code did (try to assign a DWPT always to
            the same thread); a minimal sketch of this idea follows this comment. It should be easy now to add
            Document#getSourceID() and another pool that can assign threads based on the sourceID.
          • Initial implementation of sequenceIDs. Currently they're only used to keep track of deletes and not for
            e.g. NRT readers yet.
          • Lots of other changes here and there.

          TODOs:

          • Implement flush-by-ram logic
          • Implement logic to discard deletes from the deletes buffer
          • Finish sequenceID handling: IW#commit() and IW#close() should return ID of last flushed sequenceID
          • Maybe change delete logic: currently deletes are applied when a segment is flushed. Maybe we can keep it this way
            in the realtime-branch though, because that's most likely what we want to do once the RAM buffer is searchable and
            deletes are cheaper as they can then be done in-memory before flush
          • Fix unit tests (mostly exception handling and thread safety)
          • New test cases, e.g. for sequenceID testing
          • Simplify code: In some places I copied code around, which can probably be further simplified
          • I started removing some of the old setters/getters in IW which are not in IndexWriterConfig - need to finish that,
            or revert those changes and use a different patch
          • Fix nocommits
          • Performance testing

          I'm planning to commit this soon to the realtime branch, even though it's obviously not done yet. But it's a big
          patch and changes will be easier to track with an svn history.
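
          A minimal sketch of the thread-affinity idea mentioned in the list above (each indexing thread keeps getting the same per-thread state back); the ThreadAffinityPool and StateFactory names are illustrative only, not the code committed to the branch:

          {code:java}
          import java.util.concurrent.ConcurrentHashMap;

          // Hedged sketch of the idea behind ThreadAffinityDocumentsWriterThreadPool:
          // bind one state object (e.g. a DWPT) to each indexing thread.
          interface StateFactory<S> {
            S newState();
          }

          class ThreadAffinityPool<S> {
            private final ConcurrentHashMap<Thread, S> affinity = new ConcurrentHashMap<Thread, S>();
            private final StateFactory<S> factory;

            ThreadAffinityPool(StateFactory<S> factory) {
              this.factory = factory;
            }

            /** Returns the state bound to the calling thread, creating it lazily. */
            S acquire() {
              Thread me = Thread.currentThread();
              S state = affinity.get(me);
              if (state == null) {            // only the owning thread ever writes its own entry,
                state = factory.newState();   // so a plain put is race-free here
                affinity.put(me, state);
              }
              return state;
            }
          }
          {code}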

          Michael Busch added a comment -

          I just created a new branch for this (and related RT features) here:
          https://svn.apache.org/repos/asf/lucene/dev/branches/realtime_search

          I'll commit the latest version of my patch soon (have to update to trunk and make it compile)...

          Jason Rutherglen added a comment -

          Michael B, what's the status? I'm thinking of working on it.

          Jason Rutherglen added a comment -

          We just have to explain this in the javadocs. It's a change that we should also mention in the backwards-compatibility section.

          We're punting eh?

          Jason Rutherglen added a comment -

          This means that we don't have to remap deletes
          anymore.

          Good.

          the pooled SegmentReaders that read the flushed segments
          can use arrays of sequenceIDs instead of BitSets?

          Ok, that'd be an improvement. In order to understand all the
          logic, I'd need to see a prototype; I can't really regurgitate
          what I've read thus far. Are there any concurrency issues with
          the seq arrays? Like say if a reader is iterating a posting
          list? I guess not because we're always incrementing?

          safeSeqID is always smaller than any buffered doc or
          buffered delete in DW or DWPT.

          Can you elaborate on this? Wouldn't safeSeqID be greater than
          buffered doc or deleted doc due to add
          docs and interleaving?

          Also, can we modify the terminology? Perhaps "committed seq id" is
          better - to me that's a little more clear. Maybe it'd be good to
          start a wiki on this so we can have a master doc we're all
          referring to (kind of a Solr thing that, now that the projects are
          merged, we can try for?)

          Michael Busch added a comment -

          No, I think the consensus is to have a DWPT maxBufferedDocs flush trigger, not a global one.

          I think we should just keep it simple. If you set maxBufferedDocs to 1000, then every thread will flush at 1000 docs.
          If you set maxThreadStates to 5, then you could at the same time in theory have 5000 docs in memory.

          We just have to explain this in the javadocs. It's a change that we should also mention in the backwards-compatibility section.
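
          A worked example of those semantics, using plain arithmetic rather than any IndexWriter API:

          {code:java}
          // Worked example: with maxBufferedDocs applied per thread state, the worst-case
          // number of docs buffered in RAM is the product of the two settings.
          public class WorstCaseBufferedDocs {
            public static void main(String[] args) {
              int maxBufferedDocsPerDWPT = 1000;
              int maxThreadStates = 5;
              System.out.println("worst case buffered docs: " + (maxBufferedDocsPerDWPT * maxThreadStates)); // 5000
            }
          }
          {code}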

          Jason Rutherglen added a comment -

          I think the consensus is to have a DWPT maxBufferedDocs flush trigger, not a global one.

          Ok, so we'll need to implement dynamic resizing of the maxBufferedDocs per DWPT in the case where the maxNumberThreadStates changes?

          Michael Busch added a comment -

          No, I think the consensus is to have a DWPT maxBufferedDocs flush trigger, not a global one.

          Jason Rutherglen added a comment -

          It'd sort of "match" the current behaviour, in that you get segments flushed to the index with that many docs.

          It seems like the consensus is that most users don't use the maxBufferedDocs feature; however, for the ones that do, we can stop the world and flush all segments? That way we can focus on the flush-by-RAM use case?

          Michael McCandless added a comment -

          I'm not sure I understand how this would help for ParallelReader?
          I think you can't use multi-threaded indexing even today, because you
          have no control over the order in which the docs will make it into the
          index.

          Well, to set up indexes for PR today, you have to run IndexWriter in a very degraded state – flush by doc count, use a single thread, turn off concurrent merging (use SMS), use LogDocMergePolicy.

          So having maxBufferedDocs per DWPT seems tempting to me. Then you know
          that each written segment will have exactly a size of maxBufferedDocs,
          so this is much more predictable. And if you index with a single
          thread only the behavior is identical to a "global" maxBufferedDocs
          flush trigger.

          Yeah, maybe that'd be sufficient...? It'd sort of "match" the current behaviour, in that you get segments flushed to the index with that many docs.

          Michael Busch added a comment -

          The problem with maxBufferedDocs is that you can only enforce it by "stopping the world"

          Why not just stop the world for maxBufferedDocs/maxBufferedDelDocs?
          It should be pretty simple to implement?

          Obviously you'll take a perf hit if you flush by these counts, so, you
          really should flush by RAM instead... But some apps need this (eg
          setting up indices for ParallelReader, until we get SliceWriter
          working...).

          I'm not sure I understand how this would help for ParallelReader?
          I think you can't use multi-threaded indexing even today, because you
          have no control over the order in which the docs will make it into the
          index.

          So having maxBufferedDocs per DWPT seems tempting to me. Then you know
          that each written segment will have exactly a size of maxBufferedDocs,
          so this is much more predictable. And if you index with a single
          thread only the behavior is identical to a "global" maxBufferedDocs
          flush trigger.

          Michael McCandless added a comment -

          The problem with maxBufferedDocs is that you can only enforce it by "stopping the world"

          Why not just stop the world for maxBufferedDocs/maxBufferedDelDocs?
          It should be pretty simple to implement?

          Obviously you'll take a perf hit if you flush by these counts, so, you
          really should flush by RAM instead... But some apps need this (eg
          setting up indices for ParallelReader, until we get SliceWriter
          working...).

          Michael Busch added a comment -

          I think we simply keep track of maxBufferedDocs and
          maxBufferedDeleteTerms globally. We can't deprecate because
          that'd be a pain for users.

          The problem with maxBufferedDocs is that you can only enforce it by
          "stopping the world", i.e. preventing other DWPTs from adding docs while
          one DWPT is flushing. Unless we also introduce the 90%-95% etc flush
          triggers for maxBufferedDocs.

          The other option would be to have maxBufferedDocs per DWPT. Then
          the theoretical total limit of docs in RAM would be
          maxNumberThreadStates * maxBufferedDocs.

          Michael Busch added a comment - - edited

          The deletes aren't entirely in the foreground, only the RAM
          buffer deletes. Deletes to existing segments would use the
          existing clone and delete mechanism.

          I think deletes should all be based on the new sequenceIDs. Today the buffered
          deletes depend on a value that can change (numDocs changes when segments are
          merged). But with sequenceIDs they're absolute: each delete operation will
          have a sequenceID assigned, and the ordering of all write operations (add,
          update, delete) is unambiguously defined by the sequenceIDs. Remember that
          addDocument/updateDocument/deleteDocuments will all return the sequenceID.

          This means that we don't have to remap deletes anymore. Also the pooled
          SegmentReaders that read the flushed segments can use arrays of sequenceIDs
          instead of BitSets? Of course that needs more memory, but even if you add 1M
          docs without ever calling IW.commit/close you only need 8MB - I think that's
          acceptable. And this size is independent on how many times you call
          reopen/clone on the realtime readers, because they can all share the same
          deletes array.

          We can also modify IW.commit/close to return the latest sequenceID. This would
          be very nice for document tracking. E.g. when you hit an aborting exception
          after you flushed already some segments, then even though you must discard
          everything in the DW's buffer you can still call commit/close to commit
          everything that doesn't have to be discarded. IW.commit/close would in that
          case return the sequenceID of the latest write operation that was successfully
          committed, i.e. that would be visible to an IndexReader. Though we have to be
          careful here: multiple segments can have interleaving sequenceIDs, so we must
          discard every segment that has one or more sequenceIDs greater than the lowest
          one in the DW. So we still need the push-deletes logic, that keeps RAM deletes
          separate from the flushed ones until flushing was successful.

          DW/IW need to keep track of the largest sequenceID that is safe, i.e. that
          could be committed even if DW hits an aborting exception. Some invariants:

          • All deletes with a sequenceID smaller than or equal to safeSeqID have already
            been applied to the deletes arrays of the flushed segments.
          • All deletes with sequenceID greater than safeSeqID are in a deletesInRAM
            buffer.
          • safeSeqID is always smaller than any buffered doc or buffered delete in DW
            or DWPT.
          • safeSeqID is always equal to the maximum sequenceID of one, and only one,
            flushed segment.
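
          A rough sketch of the sequence-ID bookkeeping described above (class and
          method names here are hypothetical, not part of any patch):

            import java.util.concurrent.atomic.AtomicLong;

            // Sketch only: one global counter hands out sequenceIDs for add/update/delete,
            // and safeSeqID advances once a flushed segment's deletes have been applied.
            class SequenceIDs {
              private final AtomicLong nextSeqID = new AtomicLong();
              private volatile long safeSeqID = -1;  // nothing committed yet

              // returned by addDocument/updateDocument/deleteDocuments
              long next() {
                return nextSeqID.incrementAndGet();
              }

              // Called once a segment flush succeeded and its buffered deletes were pushed;
              // per the invariants above, safeSeqID becomes that segment's max sequenceID.
              synchronized void advanceSafeSeqID(long flushedSegmentMaxSeqID) {
                if (flushedSegmentMaxSeqID > safeSeqID) {
                  safeSeqID = flushedSegmentMaxSeqID;
                }
              }

              // what IW.commit/close would report
              long safeSeqID() {
                return safeSeqID;
              }
            }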

          When IW.close/commit is called then safeSeqID is returned. If no aborting
          exception occurred it equals the highest sequenceID ever assigned (during that
          IW "session"). In any case it's always the sequenceID of the latest write
          operation an IndexReader will "see" that you open after IW.close/commit
          finished.

          But this would be nice for apps to track which docs made it successfully into
          the index. Apps can then externally keep a log to figure out what they have to
          replay in case of exceptions.

          Does all this make sense? This is very complex stuff, I wouldn't be
          surprised if there's something I didn't think about.

          Jason Rutherglen added a comment -

          flushing by maxBufferedDeleteTerms is also tricky,
          because that needs to be a flush across all DWPTs

          I think we simply keep track of maxBufferedDocs and
          maxBufferedDeleteTerms globally. We can't deprecate because
          that'd be a pain for users.

          With searchable RAM buffers and deletes in the foreground
          this trigger should actually not be necessary anymore?

          The deletes aren't entirely in the foreground, only the RAM
          buffer deletes. Deletes to existing segments would use the
          existing clone and delete mechanism. I asked about foreground
          deletes to existing segments before, and we agreed that it's not
          a good idea due to locality of terms/postings.

          We would probably apply deletes when the realtime
          IndexReader is (re)opened?

          Right, that's how the existing apply-deletes logic basically works.
          That'd be the same. Oh ok, I see; perhaps the global deletes
          manager could cache the deletes for the existing segments, while the
          DWPTs keep track of their deletes separately. We probably
          won't apply segment or DWPT deletes in the foreground? We'd
          cache them separately, and update RAM consumption and unique
          term/query counts globally.

          Michael Busch added a comment -

          How shall we actually handle flushing by maxBufferedDocs? It'd be the easiest to just make this a per-DWPT flush trigger. Of course that'd be a change in runtime behavior, but I think that's acceptable?

          And flushing by maxBufferedDeleteTerms is also tricky, because that needs to be a flush across all DWPTs? With searchable RAM buffers and deletes in the foreground this trigger should actually not be necessary anymore? We would probably apply deletes when the realtime IndexReader is (re)opened?

          Michael McCandless added a comment -

          Sounds like we use the current thread affinity system (ie, a
          hash map), that when the max threads is reached, new threads get
          kind of round robined onto existing DWPTs?

          Yeah something along those lines... and clearing out all mappings for
          a given DWPT when it flushes.
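
          Very roughly, something like this sketch (ThreadBinder is a made-up name here,
          not an API in the branch):

            import java.util.Iterator;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.atomic.AtomicInteger;

            // Sketch only: the same thread keeps hitting the same DWPT; once all thread
            // states are taken, new threads are round-robined onto existing DWPTs, and
            // all mappings for a DWPT are cleared when that DWPT flushes.
            class ThreadBinder {
              private final int maxThreadStates;
              private final ConcurrentHashMap<Thread,Integer> affinity = new ConcurrentHashMap<Thread,Integer>();
              private final AtomicInteger nextState = new AtomicInteger();

              ThreadBinder(int maxThreadStates) {
                this.maxThreadStates = maxThreadStates;
              }

              int dwptFor(Thread t) {
                Integer state = affinity.get(t);
                if (state == null) {
                  // round robin over the existing thread states
                  state = (nextState.getAndIncrement() & 0x7fffffff) % maxThreadStates;
                  affinity.put(t, state);
                }
                return state.intValue();
              }

              // clear bindings for a flushed DWPT so threads can re-bind afterwards
              void onFlushed(int state) {
                Iterator<Map.Entry<Thread,Integer>> it = affinity.entrySet().iterator();
                while (it.hasNext()) {
                  if (it.next().getValue().intValue() == state) {
                    it.remove();
                  }
                }
              }
            }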

          if RAM usage grows too much beyond your first trigger and before that first flush has finished, start a 2nd DWPT flushing, etc.

          What's the definition of the "too much" portion of the above
          statement?

          We could do something simple, eg, at 90% RAM used, you flush your
          first DWPT. At 110% RAM used, you flush all DWPTs. And take linear
          steps in between?

          EG if I have 5 DWPTs, I'd flush first one at 90%, 2nd at 95%, 3rd at
          100%, 4th at 105% and 5th at 110%.

          Of course, if flushing is fast, then RAM is quickly freed up, then we
          only flush 1 DWPT at a time... we only need these tiers to
          self-regulate RAM consumed from ongoing indexing vs time it takes to
          do the flush.
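
          As a concrete sketch of those tiers (just to illustrate the math, not actual
          code from the branch):

            // Sketch only: with n DWPTs, the first flush trigger is at 90% of the RAM
            // budget and the last at 110%, with linear steps in between; returns how
            // many DWPTs should currently be flushing.
            class FlushTiers {
              static int dwptsToFlush(long ramUsedBytes, long ramBudgetBytes, int numDWPTs) {
                if (numDWPTs <= 0) {
                  return 0;
                }
                double used = (double) ramUsedBytes / ramBudgetBytes;
                int flushing = 0;
                for (int k = 1; k <= numDWPTs; k++) {
                  double trigger = numDWPTs == 1
                      ? 0.90
                      : 0.90 + 0.20 * (k - 1) / (numDWPTs - 1);
                  if (used >= trigger) {
                    flushing = k;   // e.g. 5 DWPTs: 90%, 95%, 100%, 105%, 110%
                  }
                }
                return flushing;
              }
            }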

          Michael McCandless added a comment -

          I think the reason why we have two different APIs in mind (you: sourceID, I:
          expert thread binder API) is that we're having different goals with
          them?

          Yes, I think you're right!

          You want to make the out-of-the-box indexing performance as good as
          possible, and users should have to set a minimum amount of
          easy-to-understand parameters (such as buffer size in MB). I think
          that's the right thing to do of course. (though that doesn't prevent
          us from adding an expert API in addition, as we always have)

          Right – simple things should be simple and complex things should be
          possible.

          I'm thinking a lot about real-time indexing and the searchable RAM buffer
          these days, so the thread-binder API could help you to have more control over
          where your docs will actually end up and which reader will see them. But I
          think too that this API would be very "expert" and not many people would use
          it.

          In fact now I want both

          Ie, make it possible (optional) to declare the sourceID, and IW
          optimizes based on this hint.

          But, also, letting advanced apps directly control individual DWPTs.
          (I think a new experimental Indexer interface can work well here...).

          We can do this as a separate issue... it's fairly orthogonal.

          Yeah I was just thinking the same - I agree.

          OK I'll open this...

          Is this sync really so bad? First, we should move all allocators/pools to per-DWPT, so they don't need to be sync'd.

          OK cool that we agree on that. I was worried you wanted to have global pools
          too, if it's only the single long it's not very complicated, I agree.

          Yeah let's not do global pools (anymore!)...

          Sorry if I'm being annoying

          No, you're not! You're asking good questions (as usual)!

          My goal is to have a default indexing chain that isn't slower than the one we
          have today, but searchable and that very fast. That's not trivial, but I think
          we can do it!

          This is an awesome goal, and I agree it's very reachable. Though the
          deletes/sequence ID/merging/NRT interaction is going to be fun...

          I'll implement the global flush trigger and make all pools DWPT-local. The
          explicit thread-binder or sourceID APIs we can worry about later, as we agreed
          above.

          OK thanks.

          Michael Busch added a comment -

          I still think this "zero sync'd code" at the cost of perf loss /
          exposing per-DWPT details is taking things too far. You're cutting
          into the bone...

          No worries - I haven't started implementing the RAM management part yet.

          I don't think we should have apps setting per-DWPT RAM limits, or
          even expose to apps how IW manages threads (this is an impl. detail).

          I think we should keep the approach we have today - you set the
          overall RAM limit and IW internally manages flushing when that
          allotted RAM is full.

          I think the reason why we have two different APIs in mind (you: sourceID, I:
          expert thread binder API) is that we're having different goals with them? You
          want to make the out-of-the-box indexing performance as good as possible, and
          users should have to set a minimum amount of easy-to-understand parameters
          (such as buffer size in MB). I think that's the right thing to do of course.
          (though that doesn't prevent us from adding an expert API in addition, as we
          always have)

          I'm thinking a lot about real-time indexing and the searchable RAM buffer
          these days, so the thread-binder API could help you to have more control over
          where your docs will actually end up and which reader will see them. But I
          think too that this API would be very "expert" and not many people would use
          it.

          We can do this as a separate issue... it's fairly orthogonal.

          Yeah I was just thinking the same - I agree.

          Is this sync really so bad? First, we should move all
          allocators/pools to per-DWPT, so they don't need to be sync'd.

          OK cool that we agree on that. I was worried you wanted to have global pools
          too, if it's only the single long it's not very complicated, I agree.

          We're still gonna need sync'd code, anyway (global sequence ID,
          grabbing a DWPT), right? We can put this "go flush a DWPT" logic in
          the same block if we really have to? It feels like we're going to
          great lengths (cutting into the bone) to avoid a trivial cost (the
          minor complexity of managing flushing based on aggregate RAM used).

          Sorry if I'm being annoying

          Yeah sure, there will be several sync'd spots. If we don't share any data
          structures between threads that hold indexed (and in the future searchable)
          data I'm happy.

          I haven't spent as much time as you thinking about the current RAM management
          yet and the current code that ensures thread safety - still learning some
          parts of the code. I do appreciate all your patient feedback!

          This isn't great... I mean it's weird that on adding say a 3rd
          indexing thread I suddenly see a flush triggered even though I'm
          nowhere near the RAM limit. Then, later, if I cut back to using only
          2 threads, I still only ever use up to 2/3rd of my RAM buffer. IW's
          API really shouldn't have such "surprising" behavior where how many /
          which threads come through it so drastically affect its flushing
          behavior.

          Yeah I don't really like that either. Let's not do that. I hadn't thought about
          that disadvantage at first, added this point later to the list, and never
          really liked it. (and knew you would complain about it)

          My goal is to have a default indexing chain that isn't slower than the one we
          have today, but searchable and that very fast. That's not trivial, but I think
          we can do it!

          I'll implement the global flush trigger and make all pools DWPT-local. The
          explicit thread-binder or sourceID APIs we can worry about later, as we agreed
          above.

          Jason Rutherglen added a comment -

          I get Michael B's motivation; however, Mike M's global RAM limit
          is probably better from a user perspective, and it would only be a
          sync on a long RAM-used variable, so we should be good?

          Sounds like we use the current thread affinity system (ie, a
          hash map), that when the max threads is reached, new threads get
          kind of round robined onto existing DWPTs?

          if RAM usage grows too much beyond your first trigger and
          before that first flush has finished, start a 2nd DWPT flushing,
          etc.

          What's the definition of the "too much" portion of the above
          statement?

          Michael McCandless added a comment -

          I still think this "zero sync'd code" at the cost of perf loss /
          exposing per-DWPT details is taking things too far. You're cutting
          into the bone...

          I don't think we should have apps setting per-DWPT RAM limits, or
          even expose to apps how IW manages threads (this is an impl. detail).

          I think we should keep the approach we have today – you set the
          overall RAM limit and IW internally manages flushing when that
          allotted RAM is full.

          E.g. if someone has such an app where different threads
          index docs of different sizes, then the DW that indexes big docs can be given
          more memory?

          Hmm this isn't really fair – the app in general can't predict how
          many docs of each type will come in, how IW allocates RAM for
          different kinds of docs (this is an impl detail), etc.

          What I'm mainly trying to avoid is synchronization points between the
          different DWPTs. For example, currently the same ByteBlockAllocator is shared
          between the different threads, so all its methods need to be synchronized.

          I understand the motivation, but...

          Is this sync really so bad? First, we should move all
          allocators/pools to per-DWPT, so they don't need to be sync'd.

          Then, all that needs to be sync'd is the tracking of net RAM used (a
          single long), and then the logic to pick the DWPT(s) to flush? So
          then each DWPT would allocate its own RAM (unsync'd), track its own
          RAM used (unsync'd), and update the total (in tiny sync block) after
          the update (add/del) is serviced?
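
          Something along these lines, as a sketch of how small that shared piece could
          be (names hypothetical, not the actual patch):

            import java.util.concurrent.atomic.AtomicLong;

            // Sketch only: each DWPT allocates and counts its own bytes unsynchronized,
            // and only publishes the delta to one shared counter after servicing an
            // add/update/delete.
            class GlobalRAMTracker {
              private final AtomicLong netRamUsed = new AtomicLong();
              private final long ramBudgetBytes;

              GlobalRAMTracker(long ramBudgetBytes) {
                this.ramBudgetBytes = ramBudgetBytes;
              }

              // deltaBytes is positive for allocations, negative for freed RAM; returns
              // true if the caller should go pick a DWPT to flush.
              boolean addDeltaAndCheck(long deltaBytes) {
                return netRamUsed.addAndGet(deltaBytes) >= ramBudgetBytes;
              }

              void onSegmentFlushed(long bytesFreed) {
                netRamUsed.addAndGet(-bytesFreed);
              }
            }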

          We're still gonna need sync'd code, anyway (global sequence ID,
          grabbing a DWPT), right? We can put this "go flush a DWPT" logic in
          the same block if we really have to? It feels like we're going to
          great lengths (cutting into the bone) to avoid a trivial cost (the
          minor complexity of managing flushing based on aggregate RAM used).

          1. Expose a ThreadBinder API for controlling number of DWPT instances and
            thread affinity of DWPTs explicitly. (We can later decide if we want to also
            support such an affinity after a segment was flushed, as Tim is asking for.
            But that should IMO not be part of this patch.)
          2. Also expose an API for specifying the RAM buffer size per DWPT.

          I don't think we should expose so much.

          I think, instead, we should add an optional method to Document (eg
          set/getSourceID or something), that'd reference which "source" this
          doc comes from. The app would set it, optionally, as a "hint" to IW.

          The source ID should not be publically tied to DWPT – how IW
          optimizes based on this "hint" from the app is really an impl detail.
          Yes, today we'll use it for DWPT affinity; tomorrow, who knows. EG,
          that source ID need not be less than the max DWPTs.

          When source ID isn't provided we'd fallback to the same "best guess"
          we have today (same thread = same source ID).

          The javadoc would be something like "as a hint to IW, to possibly
          improve its indexing performance, if you have docs from different
          sources you should set the source ID on your Document". And
          how/whether IW makes use of this information is "under the hood"...
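
          For illustration only, since none of this exists in Lucene, the binding behind
          such a hint might look something like:

            // Sketch only: sourceID is a purely optional, hypothetical hint; how (or
            // whether) IW uses it would stay an implementation detail.
            class SourceIDBinder {
              private final int maxThreadStates;   // assumed > 0

              SourceIDBinder(int maxThreadStates) {
                this.maxThreadStates = maxThreadStates;
              }

              // Map a source ID (e.g. "table_a") onto one of the DWPTs; a null hint
              // falls back to the caller's thread identity, as today.
              int dwptFor(String sourceID) {
                Object key = sourceID != null ? sourceID : Thread.currentThread();
                return (key.hashCode() & 0x7fffffff) % maxThreadStates;
              }
            }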

          We can do this as a separate issue... it's fairly orthogonal.

          Allow flushing in parallel (multiple DWPTs can flush at the same time).

          +1

          This would be a natural way to protect against too much RAM usage
          while flush(es) are happening. Start one flush going, but keep
          indexing docs into the other DWTPs... if RAM usage grows too much
          beyond your first trigger and before that first flush has finished,
          start a 2nd DWPT flushing, etc. This is naturally self-regulating,
          since the "mutator" threads are tied up doing the flushing...

          The DWPT RAM value must be updateable. E.g. when you first start indexing
          only one DWPT should be created with the max RAM. Then when multiple threads
          are used for adding documents another DWPT should be added and the RAM
          value of the already existing one should be reduced, and possibly a flush of that
          DWPT needs to be triggered.

          This isn't great... I mean it's weird that on adding say a 3rd
          indexing thread I suddenly see a flush triggered even though I'm
          nowhere near the RAM limit. Then, later, if I cut back to using only
          2 threads, I still only ever use up to 2/3rd of my RAM buffer. IW's
          API really shouldn't have such "surprising" behavior where how many /
          which threads come through it so drastically affect its flushing
          behavior.

          Jason Rutherglen added a comment -

          Michael B, sounds good. The approach outlined is straightforward and covers the edge cases.

          Michael Busch added a comment - - edited

          But... could we allow an add/updateDocument call to express this
          affinity, explicitly? If you index homogenous docs you wouldn't use
          it, but, if you index drastically different docs that fall into clear
          "categories", expressing the affinity can get you a good gain in
          indexing throughput.

          This may be the best solution, since then one could pass the affinity
          even through a thread pool, and then we would fallback to thread
          binding if the document class wasn't declared?

          I would like this if we then also added an API that can be used to specify the
          per-DWPT RAM size. E.g. if someone has such an app where different threads
          index docs of different sizes, then the DW that indexes big docs can be given
          more memory?

          What I'm mainly trying to avoid is synchronization points between the
          different DWPTs. For example, currently the same ByteBlockAllocator is shared
          between the different threads, so all its methods need to be synchronized.

          The other DWs would keep indexing. That's the beauty of this
          approach... a flush of one DW doesn't stop all other DWs from
          indexing, unlike today.

          And you want to serialize the flushing right? Ie, only one DW flushes
          at a time (the others keep indexing).

          Hmm I suppose flushing more than one should be allowed (OS/IO have
          a lot of concurrency, esp since IO goes into write cache)... perhaps
          that's the best way to balance index vs flush time? EG we pick one to
          flush @ 90%, if we cross 95% we pick another to flush, another at
          100%, etc.

          Oh I don't want to disallow flushing in parallel! I think it makes perfect
          sense to allow more than one DW to flush at the same time. If each DWPT has a
          private max buffer size, then it can decide on its own when it's time to
          flush.

          Hmm I suppose flushing more than one should be allowed (OS/IO have
          a lot of concurrency, esp since IO goes into write cache)... perhaps
          that's the best way to balance index vs flush time? EG we pick one to
          flush @ 90%, if we cross 95% we pick another to flush, another at
          100%, etc.

          If we allow flushing in parallel and also allow specifying the max RAM per
          DWPT, then there doesn't even have to be any cross-thread RAM tracking? Each
          DWPT could just flush when its own buffer is full?

          So let's summarize:

          1. Expose a ThreadBinder API for controlling number of DWPT instances and
            thread affinity of DWPTs explicitly. (We can later decide if we want to also
            support such an affinity after a segment was flushed, as Tim is asking for.
            But that should IMO not be part of this patch.)
          2. Also expose an API for specifying the RAM buffer size per DWPT.
          3. Allow flushing in parallel (multiple DWPTs can flush at the same time). A
            DWPT flushes when its buffer is full, independent of what the other DWPTs are
            doing.
          4. The default implementation of the ThreadBinder API assigns threads to DWPT
            randomly and gives each DWPT 1/n-th of the overall memory.
          5. The DWPT RAM value must be updateable. E.g. when you first start indexing
            only one DWPT should be created with the max RAM. Then when multiple threads
            are used for adding documents another DWPT should be added and the RAM
            value of the already existing one should be reduced, and possibly a flush of that
            DWPT needs to be triggered.

          How does this sound?
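
          To make point 5 concrete, a small sketch of the resize (hypothetical helper,
          not part of any patch):

            // Sketch only: when the number of DWPTs changes, each DWPT's private budget
            // becomes 1/n of the overall limit, and a DWPT already above the new budget
            // is asked to flush.
            class PerDWPTBudget {
              private final long totalBudgetBytes;

              PerDWPTBudget(long totalBudgetBytes) {
                this.totalBudgetBytes = totalBudgetBytes;
              }

              long budgetPerDWPT(int numDWPTs) {
                return totalBudgetBytes / Math.max(1, numDWPTs);
              }

              // e.g. going from 1 DWPT to 2 halves the budget; a DWPT whose buffer is
              // already over the new value should trigger its own flush.
              boolean mustFlushAfterResize(long dwptBytesUsed, int numDWPTs) {
                return dwptBytesUsed >= budgetPerDWPT(numDWPTs);
              }
            }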

          Tim Smith added a comment -

          Probably if you really want to keep the segments segregated like that, you should in fact index to separate indices?

          That's what I'm currently thinking I'll have to do.

          However, it would be ideal if I could either subclass IndexWriter or use IndexWriter directly with this affinity concept (potentially writing my own segment merger that is affinity aware).
          That makes it so I can easily use near-real-time indexing, since only one IndexWriter will be in the mix, and it also makes managing deletes and a whole host of other issues with multiple indexes disappear.
          It also lets me configure memory settings across all "affinity groups" instead of having to dynamically create them, each with their own memory bounds.

          Michael McCandless added a comment -

          I would love to be able to explicitly define a "segment" affinity for the documents I'm feeding.

          This would then allow me to say:
          all docs from table a have affinity 1
          all docs from table b have affinity 2

          Right, this is exactly what affinity would be good for – so IW would
          try to send "table a" docs their own DW(s) and "table b" docs to their
          own DW(s), which should give faster indexing than randomly binding to
          DWs.

          But:

          This would ideally result in indexing documents from each table into a different segment (obviously, I would then need to be able to have segment merging be affinity aware so that optimize/merging would only merge segments that share an affinity).

          This part I was not proposing

          The affinity would just be an optimization hint in creating the
          initial flushed segments, so IW can speed up indexing.

          Probably if you really want to keep the segments segregated like that,
          you should in fact index to separate indices?

          Jason Rutherglen added a comment -

          only one DW flushes at a time (the others keep indexing).

          I think it's best to simply flush at 90% for now. We already
          exceed the RAM buffer size because of over-allocation? Perhaps
          we can view the RAM buffer size as a rough guideline rather than a hard
          and fast limit because, let's face it, we're using Java, which is
          about as inexact when it comes to RAM consumption as it gets?
          Also, hopefully it would move the patch along faster, and more
          complex algorithms could easily be added later.

          Tim Smith added a comment -

          But... could we allow an add/updateDocument call to express this affinity, explicitly?

          I would love to be able to explicitly define a "segment" affinity for the documents I'm feeding.

          This would then allow me to say:
          all docs from table a have affinity 1
          all docs from table b have affinity 2

          This would ideally result in indexing documents from each table into a different segment (obviously, I would then need to be able to have segment merging be affinity aware so that optimize/merging would only merge segments that share an affinity).

          Michael McCandless added a comment -

          The usual design is a queued ingestion pipeline, where a pool of indexer threads take docs out of a queue and feed them to an IndexWriter, I think?

          Mainly, because I think apps with such an affinity that you describe are very rare?

          Hmm I suspect it's not that rare.... yes, one design is a single
          indexing queue w/ a dedicated thread pool only for indexing, but a push
          model is equally valid, where your app already has separate threads (or
          thread pools) servicing different content sources, so when a doc
          arrives at one of those source-specific threads, it's that thread that
          indexes it, rather than handing off to a separate pool.

          Lucene is used in a very wide variety of apps – we shouldn't optimize
          the indexer on such hard app specific assumptions.

          And if a user really has so different docs, maybe the right answer would be to have more than one single index?

          Hmm but the app shouldn't have to resort to this... (it doesn't have
          to today).

          But... could we allow an add/updateDocument call to express this
          affinity, explicitly? If you index homogenous docs you wouldn't use
          it, but, if you index drastically different docs that fall into clear
          "categories", expressing the affinity can get you a good gain in
          indexing throughput.

          This may be the best solution, since then one could pass the affinity
          even through a thread pool, and then we would fallback to thread
          binding if the document class wasn't declared?

          I mean this is virtually identical to "having more than one index",
          since the DW is like its own index. It just saves some of the
          copy-back/merge cost of addIndexes...
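
          As a purely hypothetical illustration of what such an explicit affinity could look like on the routing side (none of these classes or methods exist in Lucene, and the hashing scheme is just one possible choice): documents carrying the same app-supplied key always land in the same DWPT, falling back to thread binding when no key is given.

          // Hypothetical sketch only: route documents to a DWPT by an app-supplied
          // affinity key, falling back to the calling thread when no key is given.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class AffinityRoutingSketch {
            static class Dwpt { }                              // stand-in for a real DWPT

            private final Dwpt[] pool;
            private final Map<Object, Integer> keyToSlot = new ConcurrentHashMap<>();

            AffinityRoutingSketch(int numDwpts) {
              pool = new Dwpt[numDwpts];
              for (int i = 0; i < numDwpts; i++) pool[i] = new Dwpt();
            }

            /** Route by the app-supplied key if present, else fall back to the calling thread. */
            Dwpt route(Object affinityKey) {
              Object key = affinityKey != null ? affinityKey : Thread.currentThread();
              int slot = keyToSlot.computeIfAbsent(key,
                  k -> Math.floorMod(k.hashCode(), pool.length));
              return pool[slot];
            }
          }

          A caller could then pass keys like "tableA" / "tableB" and get per-table private segments, at least until normal merging combines them.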

          Even if today an app utilizes the thread affinity, this only results in maybe somewhat faster indexing performance, but the benefits would be lost after flushing/merging.

          Yes this optimization is only about the initial flush, but, it's
          potentially sizable. Merging matters less since typically it's not
          the bottleneck (happens in the BG, quickly enough).

          On the right apps, thread affinity can make a huge difference. EG if
          you allow up to 8 thread states, and the threads are indexing content
          w/ highly divergent terms (eg, one language per thread, or, docs w/
          very different field names), in the worst case you'll be up to 1/8 as
          efficient since each term must now be copied in up to 8 places
          instead of one. We have a high per-term RAM cost (reduced thanks to
          the parallel arrays, but, still high).

          If we assign docs randomly to available DocumentsWriterPerThreads, then we should on average make good use of the overall memory?

          It really depends on the app – if the term space is highly thread
          dependent (above examples) you can end up flushing much more frequently for
          a given RAM buffer.

          Alternatively we could also select the DWPT from the pool of available DWPTs that has the highest amount of free memory?

          Hmm... this would be a kinda costly binder? You'd need a pqueue?
          Thread affinity (or the explicit affinity) is a single
          map/array/member lookup. But it's an interesting idea...

          If you do have a global RAM management, how would the flushing work? E.g. when a global flush is triggered because all RAM is consumed, and we pick the DWPT with the highest amount of allocated memory for flushing, what will the other DWPTs do during that flush? Wouldn't we have to pause the other DWPTs to make sure we don't exceed the maxRAMBufferSize?

          The other DWs would keep indexing. That's the beauty of this
          approach... a flush of one DW doesn't stop all other DWs from
          indexing, unlike today.

          And you want to serialize the flushing right? Ie, only one DW flushes
          at a time (the others keep indexing).

          Hmm I suppose flushing more than one should be allowed (OS/IO have
          a lot of concurrency, esp since IO goes into write cache)... perhaps
          that's the best way to balance index vs flush time? EG we pick one to
          flush @ 90%, if we cross 95% we pick another to flush, another at
          100%, etc.

          Of course we could say "always flush when 90% of the overall memory is consumed", but how would we know that the remaining 10% won't fill up during the time the flush takes?

          Regardless of the approach for document -> DW binding, this is an
          issue (ie it's non-differentiating here)? Ie the other DWs continue
          to consume RAM while one DW is flushing. I think the low/high water
          mark is an OK solution here? Or the tiered flushing (I think I like
          that better).
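
          Purely as an illustration of the tiered/water-mark idea above (the thresholds and class names below are invented for the sketch, not actual Lucene code):

          // Hypothetical sketch of tiered flush triggering: the higher the total RAM
          // usage climbs, the more DWPTs are asked to flush concurrently.
          import java.util.List;

          class TieredFlushSketch {
            static class Dwpt { long ramUsed; boolean flushing; }  // stand-in for a DWPT

            private final long ramBudget;                          // bytes, from the config
            private final double[] tiers = {0.90, 0.95, 1.00};     // assumed thresholds

            TieredFlushSketch(long ramBudget) { this.ramBudget = ramBudget; }

            /** How many DWPTs should be flushing right now, given total RAM used. */
            int flushesWanted(long totalRamUsed) {
              int n = 0;
              for (double t : tiers) {
                if (totalRamUsed >= (long) (t * ramBudget)) n++;
              }
              return n;
            }

            /** Pick the largest DWPT that is not already flushing, or null if none. */
            static Dwpt pickNext(List<Dwpt> dwpts) {
              Dwpt best = null;
              for (Dwpt d : dwpts) {
                if (!d.flushing && (best == null || d.ramUsed > best.ramUsed)) best = d;
              }
              return best;
            }
          }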

          Having a fully decoupled memory management is compelling I think, mainly because it makes everything so much simpler. A DWPT could decide itself when it's time to flush, and the other ones can keep going independently.

          I'm all for simplifying things, which you've already nicely done here,
          but not if it's at the cost of a non-trivial potential indexing perf
          loss. We're already taking a perf hit here, since the doc stores
          can't be shared... I think that case is justifiable (good
          simplification).

          Michael Busch added a comment -

          It's for performance. I expect there are apps where a given
          thread/pool indexes certain kind of docs, ie, the app threads
          themselves have "affinity" for docs with similar term distributions.
          In which case, it's best (most RAM efficient) if those docs w/
          presumably similar term stats are sent back to the same DW. If you
          mix in different term stats into one buffer you get worse RAM
          efficiency.

          I do see your point, but I feel like we shouldn't optimize/make compromises for this use case. Mainly, because I think apps with such an affinity that you describe are very rare? The usual design is a queued ingestion pipeline, where a pool of indexer threads take docs out of a queue and feed them to an IndexWriter, I think? In such a world the threads wouldn't have an affinity for similar docs.

          And if a user really has so different docs, maybe the right answer would be to have more than one single index? Even if today an app utilizes the thread affinity, this only results in maybe somewhat faster indexing performance, but the benefits would be lost after flushing/merging.

          If we assign docs randomly to available DocumentsWriterPerThreads, then we should on average make good use of the overall memory? Alternatively we could also select the DWPT from the pool of available DWPTs that has the highest amount of free memory?

          Having a fully decoupled memory management is compelling I think, mainly because it makes everything so much simpler. A DWPT could decide itself when it's time to flush, and the other ones can keep going independently.

          If you do have a global RAM management, how would the flushing work? E.g. when a global flush is triggered because all RAM is consumed, and we pick the DWPT with the highest amount of allocated memory for flushing, what will the other DWPTs do during that flush? Wouldn't we have to pause the other DWPTs to make sure we don't exceed the maxRAMBufferSize?
          Of course we could say "always flush when 90% of the overall memory is consumed", but how would we know that the remaining 10% won't fill up during the time the flush takes?

          Michael McCandless added a comment -

          This is awesome Michael! Much simpler... no more FreqProxMergeState, nor the logic to interleave/synchronize writing to the doc stores. I like it!

          Jason Rutherglen added a comment -

          Michael, nice! I guess I should've spent more time removing the PerThread classes, but now we're pretty much there. Indeed the simplification should make things a lot better. I guess I'll wait for the next patch to work on something.

          Michael Busch added a comment -

          The patch removes all *PerThread classes downstream of DocumentsWriter.

          This simplifies a lot of the flushing logic in the different consumers. The patch also removes FreqProxMergeState, because we don't have to interleave posting lists from different threads anymore of course. I really like these simplifications!

          There is still a lot to do: The changes in DocumentsWriter and IndexWriter are currently just experimental to make everything compile. Next I will introduce DocumentsWriterPerThread and implement the sequenceID logic (which was discussed here in earlier comments) and the new RAM management. I also want to go through the indexing chain once again - there are probably a few more things to clean up or simplify.

          The patch compiles and actually a surprising number of tests pass. Only multi-threaded tests seem to fail,
          which is not very surprising, considering I removed all thread-handling logic from DocumentsWriter.

          So this patch isn't working yet - just wanted to post my current progress.

          Michael Busch added a comment -

          Sorry, Jason, I got sidetracked with LUCENE-2329 and other things at work. I'll try to write the sequence ID stuff asap. However, there's more we need to do here that is sort of independent of the deleted docs problem. E.g. removing all the downstream perThread classes.

          We should work with the flex code from now on, as the flex branch will be merged into trunk soon.

          Jason Rutherglen added a comment -

          Michael B, I was going to wait for your prototype before continuing...

          Michael Busch added a comment -

          To clarify, the apply-deletes doc id up to will be the flushed doc count saved per term/query per DW, though it won't be saved, it'll be derived from the sequence id int array where the action has been encoded into the seq id int?

          Yeah, that's the idea. Let's see if it works.

          Jason Rutherglen added a comment -

          You only need one additional int per doc in the DWPTs, and the global map for the delete terms.

          Ok, let's give it a try; it'll be clearer with the prototype.

          To clarify, the apply-deletes doc id up to will be the flushed doc count saved per term/query per DW, though it won't be saved, it'll be derived from the sequence id int array where the action has been encoded into the seq id int?

          Michael Busch added a comment -

          I'm not sure we need that level of complexity just yet? How
          would we make the transaction log memory efficient?

          Is that really so complex? You only need one additional int per doc in the DWPTs, and the global map for the delete terms. You don't need to buffer the actual terms per DWPT. I thought that's quite efficient? But I'm totally open to other ideas.

          I can try tonight to code a prototype of this - I don't think it would be very complex actually. But of course there might be complications I haven't thought of.

          Are there other uses you foresee?

          Not really for the "transaction log" as you called it. I'd remove that log once we switch to deletes in the FG (when the RAM buffer is searchable). But a nice thing would be for add/update/delete to return the seqID, and also if the RAMReader in the future had an API to check up to which seqID it's able to "see". Then it's very clear to the user of the API where a given reader is at.
          For this to work we have to assign the seqID at the end of a call. E.g. when adding a large document, which takes a long time to process, it should get the seqID assigned after the "work" is done and right before the addDocument() call returns.
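
          As a tiny, purely illustrative sketch of that last point (none of these names are real Lucene code): the seqID comes from a single global counter and is assigned only after the heavy work is done, right before the call returns.

          // Hypothetical sketch: a global sequence ID, assigned at the end of each call.
          import java.util.concurrent.atomic.AtomicLong;

          class SequenceIdSketch {
            private final AtomicLong nextSeqId = new AtomicLong();

            /** Returns the seqID for this add, assigned only after the doc is processed. */
            long addDocument(Object doc) {
              indexIntoDwpt(doc);                 // potentially slow: analysis, inversion, etc.
              return nextSeqId.incrementAndGet(); // assigned right before returning
            }

            /** Returns the seqID for this delete. */
            long deleteDocuments(Object termOrQuery) {
              bufferDelete(termOrQuery);
              return nextSeqId.incrementAndGet();
            }

            private void indexIntoDwpt(Object doc) { /* omitted in this sketch */ }
            private void bufferDelete(Object termOrQuery) { /* omitted in this sketch */ }
          }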

          Jason Rutherglen added a comment -

          Michael B.: What you're talking about here:
          https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12850792&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12850792 is a transaction log?

          I'm not sure we need that level of complexity just yet? How
          would we make the transaction log memory efficient? Are there
          other uses you foresee? Maybe there's a simpler solution for the
          BufferedDeletes.Num per DW problem that could make use of global
          sequence ids? I'd prefer to continue to use the per term/query
          max doc id. There aren't performance issues with concurrently
          accessing and updating maps, so a global sync lock as the DW map
          values are updated should be OK?

          Michael McCandless added a comment -

          Mike, can you explain what the advantages of this kind of thread affinity are? I was always wondering why the DocumentsWriter code currently makes efforts to assign a ThreadState always to the same Thread? Is that being done for performance reasons?

          It's for performance. I expect there are apps where a given
          thread/pool indexes certain kind of docs, ie, the app threads
          themselves have "affinity" for docs with similar term distributions.
          In which case, it's best (most RAM efficient) if those docs w/
          presumably similar term stats are sent back to the same DW. If you
          mix in different term stats into one buffer you get worse RAM
          efficiency.

          Also, for better RAM efficiency you want fewer DWs... because we get
          more RAM efficiency the higher the freq of the terms... but of course
          you want more DWs for better CPU efficiency whenever that many threads
          are running at once.

          Net/net CPU efficiency should trump RAM efficiency, I think, so if
          there is a conflict we should favor CPU efficiency.

          Though, thread affinity doesn't seem that CPU costly to implement?
          Look up the DW your thread first used... if it's free, seize it. If
          it's not, fall back to any DW that's free.
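
          A minimal sketch of that lookup, for illustration only (the real ThreadBinder/DWPT code will differ): remember the DWPT a thread used last, reuse it if it is free, otherwise take any free one.

          // Hypothetical sketch of thread-affinity binding with fallback to any free DWPT.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class ThreadBinderSketch {
            static class Dwpt { boolean busy; }                // stand-in for a real DWPT

            private final Dwpt[] pool;
            private final Map<Thread, Dwpt> lastUsed = new ConcurrentHashMap<>();

            ThreadBinderSketch(Dwpt[] pool) { this.pool = pool; }

            synchronized Dwpt acquire() {
              Dwpt preferred = lastUsed.get(Thread.currentThread());
              if (preferred != null && !preferred.busy) {      // affinity hit: one map lookup
                preferred.busy = true;
                return preferred;
              }
              for (Dwpt d : pool) {                            // fallback: any free DWPT
                if (!d.busy) {
                  d.busy = true;
                  lastUsed.put(Thread.currentThread(), d);
                  return d;
                }
              }
              return null;                                     // caller would wait and retry
            }

            synchronized void release(Dwpt d) { d.busy = false; }
          }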

          Michael McCandless added a comment -

          Yeah I think we're gonna need the global sequenceID in some form – my Options 1 or 2 can't work because the interleaving issue (as seen/required by the app) is a global thing.

          Michael Busch added a comment -

          However, in the apply deletes
          method how would we know which doc to stop deleting at? How
          would the seq id map to a DW's doc id?

          We could have a global deletes-map that stores seqID -> DeleteAction. DeleteAction either contains a Term or a Query, and in addition an int "flushCount" (I'll explain in a bit what flushCount is used for.)

          Each DocumentsWriterPerThread would have a growing array that contains each seqID that "affected" that DWPT, i.e. the seqIDs of all deletes, plus the seqIDs of the adds/updates performed by that particular DWPT. One bit of a seqID in that array can indicate if it's a delete or add/update.

          When it's time to flush we sort the array by increasing seqID and then loop a single time through it to find the seqIDs of all DeleteActions. During the loop we count the number of adds/updates to determine the number of docs the DeleteActions affect. After applying the deletes the DWPT makes a synchronized call to the global deletes-map and increments the flushCount int for each applied DeleteAction. If flushCount==numThreadStates (== number of DWPT instances) the corresponding DeleteAction entry can be removed, because it was applied to all DWPTs.

          I think this should work? Or is there a simpler solution?
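
          In code form, the bookkeeping described above might look roughly like this sketch (all names are invented for illustration):

          // Hypothetical sketch: global seqID -> DeleteAction map with a per-action flushCount.
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class GlobalDeletesSketch {
            static class DeleteAction {
              final Object termOrQuery;
              int flushCount;                       // number of DWPTs that have applied it
              DeleteAction(Object termOrQuery) { this.termOrQuery = termOrQuery; }
            }

            private final Map<Long, DeleteAction> deletes = new ConcurrentHashMap<>();
            private final int numDwpts;             // == numThreadStates above

            GlobalDeletesSketch(int numDwpts) { this.numDwpts = numDwpts; }

            void record(long seqId, Object termOrQuery) {
              deletes.put(seqId, new DeleteAction(termOrQuery));
            }

            /** Called by a DWPT during flush, after it has applied the delete for this seqID. */
            synchronized void markApplied(long seqId) {
              DeleteAction a = deletes.get(seqId);
              if (a != null && ++a.flushCount == numDwpts) {
                deletes.remove(seqId);              // applied by every DWPT: safe to drop
              }
            }
          }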

          Jason Rutherglen added a comment -

          I mentioned the idea of having a sequence ID, that gets
          incremented on add, delete, update. What if we had even with
          separate DWs a global sequence ID? The sequence ID would tell
          you unambiguously which action happened when. The
          add/update/delete methods could return the sequenceID that was
          assigned to that particular action.

          I think adding the global sequence id makes sense and would be
          simple to add (eg, AtomicLong). However, in the apply deletes
          method how would we know which doc to stop deleting at? How
          would the seq id map to a DW's doc id?

          If we have independent buffered deletes per DW-thread, then how
          will we keep track of the memory usage? eg, we'll be flushing
          DWs independently, but will not want to double count the same
          term/query's size for memory tracking? I'm not sure how we'd do
          that without having a global deletes manager that individual DWs
          interact with (maybe have a ref count per term/query?). The deletes
          manager would have the size/ram usage method, not individual
          DWs. The DWs would need to keep a hash map of term ->
          BufferedDeletes.Num.

          Michael Busch added a comment -

          I think for this same reason the ThreadBinder should have affinity

          Mike, can you explain what the advantages of this kind of thread affinity are? I was always wondering why the DocumentsWriter code currently makes efforts to assign a ThreadState always to the same Thread? Is that being done for performance reasons?

          Michael Busch added a comment -

          Yes, we would need to buffer terms/queries per DW and also per DW the BufferedDeletes.Num. The docID spaces in two DWs will be completely independent of each other after this change.

          One potential problem that we (I think) even today have is the following: If you index with multiple threads, and then call e.g. deleteDocuments(Term) with one of the indexer threads while you keep adding documents with the other threads, it's not clear to the caller when exactly the deleteDocuments(Term) will happen. It depends on the thread scheduling.

          Going back to the idea I mentioned here:
          https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841407&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841407

          I mentioned the idea of having a sequence ID, that gets incremented on add, delete, update. What if we had even with separate DWs a global sequence ID? The sequence ID would tell you unambiguously which action happened when. The add/update/delete methods could return the sequenceID that was assigned to that particular action.

          Then we could e.g. track the delete terms globally together with the sequenceID of the corresponding delete call, while we still apply deletes during flush. Since sequenceIDs enforce a strict ordering we can figure out how many docs per DW we need to apply the delete terms to.

          Later when we switch to real-time deletes (when the RAM is searchable) we will simply store the sequenceIDs in the deletes int[] array which I mentioned in my comment on LUCENE-2293.

          Does this make sense?

          Jason Rutherglen added a comment -

          I don't think we should delete in FG - I suspect this'll
          give net/net worse performance, due to loss of locality. It also
          means you must always keep readers available, which is an
          unnecessary cost for non-NRT apps.

          Right, I agree it's best to buffer the deleted terms/queries...
          The comments below touch on why this came up as a thought.

          deletesInRAM are those deletes done during the current
          segment. deletesFlushed absorbs deletesInRAM on successful
          segment flush. We have to segregate the two for proper recovery
          if we fail to flush the RAM buffer, eg say you hit a disk full
          while flushing a new segment, and then you close your IW
          successfully. We have to make sure in that case that
          deletesInRAM are discarded.

          Ok, this helps...

          The design issue I'm running into is the buffered deletes "num
          limit" which seems to be the highest doc id of DW at the time
          the delete docs method is called? Then when apply deletes is
          called, deletes are made only up to "num limit"? This is to
          ensure that documents (with for example a del term) that are
          added after the delete docs call, are not deleted when apply
          deletes is called, unless another call to del docs is made again
          (at which point the "num limit" is set to the current DW max doc
          id).

          If the above is true, in order to not remap deletes on each
          merge, would we need to maintain this "num limit" variable per
          DW? I don't think the global remap is useful with per thread DWs
          because a DW could fail, and we wouldn't know the order in
          segment infos it could/would've been placed? Whereas with a
          single DW, we know the new segment will be placed last in
          segment infos.

          We could maintain a global deleted terms/queries queue, but then
          we'd also need to maintain a term -> "num limit" map per DW? It
          seems a bit redundant, but maybe it's ok?

          Michael McCandless added a comment -

          I don't think we should delete in FG – I suspect this'll give net/net worse performance, due to loss of locality. It also means you must always keep readers available, which is an unnecessary cost for non-NRT apps.

          deletesInRAM are those deletes done during the current segment. deletesFlushed absorbs deletesInRAM on successful segment flush. We have to segregate the two for proper recovery if we fail to flush the RAM buffer, eg say you hit a disk full while flushing a new segment, and then you close your IW successfully. We have to make sure in that case that deletesInRAM are discarded.

          Jason Rutherglen added a comment -

          Mike, can you describe the difference between DW's deletesInRAM and deletesFlushed? I have some idea, however maybe some clarification will help.

          Jason Rutherglen added a comment -

          In regards to remap deletes. I looked at buffered deletes in
          depth when implementing LUCENE-2047. I think deletes are fairly
          simple, though there's all this logic to manage them because
          they're buffered. I'm liking deleting in the foreground and
          queuing the deleted doc ids per segment as a way to avoid
          remapping absolute doc ids in commit merge. Hopefully this
          simplifies things?

          Jason Rutherglen added a comment -

          The patch is not committable.

          The basics are here. We can worry about max threads later. For now there's a doc writer per thread. I still don't fully understand remap deletes, so it's commented out. There's a flushedDocCount per DW, and a global one in IW.

          IWC is accessed directly where possible.

          When running TestIndexWriter, the DW resumeAllThreads assertion fails.

          If the total ram buffer usage is too large, and the current thread's DW is the largest, it's flushed. The next test case can be about the memory management.

          Michael Busch added a comment -

          Not all apps index only 140 character docs from all threads

          What a luxury!

          I think for this same reason the ThreadBinder should have affinity, ie, try to schedule the same thread to the same DW, assuming it's free. If it's not free and another DW is free you should use the other one.

          If you didn't have such an affinity but used a random assignment of DWs to threads, would that balance the RAM usage across DWs without global RAM management?

          Michael McCandless added a comment -

          Yes, doFlushInternal should no longer be sync'd. It should have a small sync'd block at the end where it 1) inserts the new SegmentInfo into the in-memory segments, and 2) computes the new base for remapping the deletes. I think it can then release the sync while it does the remapping of buffered deletes?

          Michael McCandless added a comment -

          I was thinking we were going to do that... having a fixed number of DocumentsWriterPerThread instances, and a ThreadBinder that lets a thread wait if the perthread is not available. You don't need to interleave docIds then?

          I think we'd have a fixed max? And then we'd create a new DW when all existing ones are in-use and we're not yet at the max?

          Eg if it's a thread pool of size 50 that's indexing, but, the rate of docs is very slow such that in practice only one of these 50 threads is indexing at once, we'd only use one DW. And we'd flush when that DW hits the RAM limit.

          Michael McCandless added a comment -

          But is that a realistic scenario? That the "avg. document size per thread" differs significantly in an application?

          I think this could happen, eg if an app uses different threads for indexing different sources of docs. Not all apps index only 140 character docs from all threads

          I think for this same reason the ThreadBinder should have affinity, ie, try to schedule the same thread to the same DW, assuming it's free. If it's not free and another DW is free you should use the other one.

          Michael Busch added a comment -

          I'm not sure how we'd enforce the number of threads? Or we'd
          have to re-implement the wait system implemented in DW?

          I was thinking we were going to do that... having a fixed number of DocumentsWriterPerThread instances, and a ThreadBinder that lets a thread wait if the perthread is not available. You don't need to interleave docIds then?

          Michael Busch added a comment -

          But if 1 thread tends to index lots of biggish docs... don't we want to allow it to use up more than 1/nth?
          Ie we don't want to flush unless total RAM usage has hit the limit?

          Sure that'd be the disadvantage. But is that a realistic scenario? That the "avg. document size per thread" differs significantly in an application?

          Jason Rutherglen added a comment -

          Also, I think we can remove the sync on doFlushInternal?

          Jason Rutherglen added a comment -

          Of course then you'd need two config parameters: number of concurrent threads and buffer size per thread.

          I'm not sure how we'd enforce the number of threads? Or we'd
          have to re-implement the wait system implemented in DW? In
          practice, each thread's DW will probably have roughly the same
          ram buffer size so if they each have the same max size, that'd
          be ok. I don't think we can limit the number of threads though,
          because we'd need to then implement interleaving doc ids?

          Michael McCandless added a comment -

          But if 1 thread tends to index lots of biggish docs... don't we want to allow it to use up more than 1/nth?

          Ie we don't want to flush unless total RAM usage has hit the limit?

          Michael Busch added a comment -

          The easiest would be if each DocumentsWriterPerThread had a fixed buffer size, then they can flush fully independently and you don't need to manage RAM globally across threads.

          Of course then you'd need two config parameters: number of concurrent threads and buffer size per thread.

          Jason Rutherglen added a comment -

          Following up on the previous comment, if the current thread (the one calling add doc) is also the one that needs to do the flushing, then only the thread attached to the doc writer with the greatest ram usage can/should do the flushing?

          Jason Rutherglen added a comment -

          Currently the doc writer manages the ram buffer size; however,
          this needs to be implemented across doc writers for this issue
          to be complete. IW addDoc returns doFlush from DW. I don't think
          doFlush will be useful anymore?

          A slightly different memory management needs to be designed.
          Right now we allow the user to set the max ram buffer size and
          when the doc writer's buffers exceed the ram limit, the buffer
          is flushed and the process is complete.

          With this issue, the flush logic probably needs to be bumped up
          into IW, and flushing becomes a multi-docwriter ram usage
          examination. For starters, if the aggregate ram usage of all doc
          writers exceeds the IWC defined ram buffer size, we need to
          schedule flushing the doc writer with the greatest ram usage? I
          wonder if there's something I'm missing here in regards to
          synchronization issues with DW?
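
          As a sketch of that aggregate check (the class and field names are made up for illustration): sum the RAM used across all doc writers and, if the configured buffer size is exceeded, pick the largest one to flush.

          // Hypothetical sketch: flush the largest doc writer once aggregate RAM exceeds the limit.
          import java.util.List;

          class AggregateRamSketch {
            static class Dwpt { long ramBytesUsed; }           // stand-in for a per-thread DW

            /** Returns the DW to flush, or null if aggregate usage is still under the limit. */
            static Dwpt maybePickFlush(List<Dwpt> writers, long ramBufferSizeBytes) {
              long total = 0;
              Dwpt largest = null;
              for (Dwpt d : writers) {
                total += d.ramBytesUsed;
                if (largest == null || d.ramBytesUsed > largest.ramBytesUsed) largest = d;
              }
              return total > ramBufferSizeBytes ? largest : null;
            }
          }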

          Michael McCandless added a comment -

          I think the process of making the doc id absolute is simply adding up the previous segments num docs to be the base?

          Right.

          Option 2 would use reader cloning?

          I don't think so – I think it'd have to pull a SegmentReader for every segment every time we flush a new segment, to resolve the deletions. In the non-pooled case that'd be a newly opened SegmentReader for every segment in the index every time a new segment is flushed.

          Jason Rutherglen added a comment -

          Mike, let's do option 1. I think the process of making the doc id absolute is simply adding up the previous segments' num docs to be the base?

          Option 2 would use reader cloning?

          Michael McCandless added a comment -

          Good question...

          When we buffer delete Term/Query we record the current docID as of when that delete had arrived (so that interleaved delete/adds are resolved properly). The docID we record is "absolute" (ie, adds in the base flushedDocCount), so that we can decouple when deletes are materialized (moved into the deletedDocs BitVectors) from when new segments are flushed.

          I think we have a couple options.

          Option 1 is to use a relative (within the current segment) docID when the deleted Term/Query/docID is first buffered, but then make it absolute only when the segment is finally flushed.

          Option 2 is to use a relative docID, but do away with the decoupling, ie force deletions to always flush at the same time the segment is flushed.

          I think I like option 1 the best – I suspect the decoupling gains us performance as it allows us to batch up more deletions (doing deletions in batch gets better locality, and also means opening/closing readers less often, in the non-pooling case).
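
          A minimal sketch of option 1, just to make the bookkeeping concrete (class and method names are made up for illustration; this is not the actual patch):

          import java.util.HashMap;
          import java.util.Map;

          class BufferedDeletes {
            // term text -> relative docID (within the in-RAM segment) up to which the delete applies
            private final Map<String, Integer> deleteTerms = new HashMap<>();

            void bufferDeleteTerm(String term, int relativeDocID) {
              deleteTerms.put(term, relativeDocID);
            }

            // At flush time the writer learns its absolute base (the sum of maxDoc
            // of all previously flushed segments) and rebases every buffered delete.
            Map<String, Integer> toAbsolute(int flushedDocCountBase) {
              Map<String, Integer> absolute = new HashMap<>();
              for (Map.Entry<String, Integer> e : deleteTerms.entrySet()) {
                absolute.put(e.getKey(), flushedDocCountBase + e.getValue());
              }
              return absolute;
            }
          }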

          Jason Rutherglen added a comment -

          I'm a little confused in the flushedDocCount, remap deletes conversion portions of DocWriter. flushedDocCount is used as a global counter, however when we move to per thread doc writers, it won't be global anymore. Is there a different (easier) way to perform remap deletes?

          Michael Busch added a comment -

          Awesome!

          Jason Rutherglen added a comment -

          Michael, I'm working on a patch and will post one (hopefully) shortly.

          Michael Busch added a comment -

          Hey Jason,

          Disregard my patch here. I just experimented with removal of pooling, but then did LUCENE-2329 instead. TermsHash and TermsHashPerThread are now much simpler, because all the pooling code is gone after 2329 was committed. Should make it a little easier to get this patch done.

          Sure, it'd be awesome if you could provide a patch here. I can help you; we should just post patches here frequently so that we don't both work on the same areas.

          Jason Rutherglen added a comment -

          Actually, I just browsed the patch again, I don't think it implements private doc writers as of yet?

          I think you're right, we can get this issue completed. LUCENE-2312's path looks clear at this point. Shall I take a whack at it?

          Jason Rutherglen added a comment -

          Michael, I'm guessing this patch needs to be updated as per LUCENE-2329?

          Michael Busch added a comment -

          All tests pass, but I have to review whether the memory consumption calculation still works correctly with the changes. Not sure if the JUnit tests cover that?

          Also haven't done any performance testing yet.

          Michael Busch added a comment -

          Michael, Agreed, can you outline how you think we should proceed then?

          Sorry for not responding earlier...

          I'm currently working on removing the PostingList object pooling, because that makes TermsHash and TermsHashPerThread much simpler. I have written the patch and all tests pass, though I haven't done performance testing yet. Making TermsHash and TermsHashPerThread smaller will also make the patch here (which will remove them) easier. I'll post the patch soon.

          Next steps I think here are to make everything downstream of DocumentsWriter single-threaded (removal of the *PerThread classes). Then we need to write the DocumentsWriterThreadBinder and think about how to apply deletes, commits and rollbacks to all DocumentsWriter instances.
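
          A rough sketch of the thread-binding piece, assuming each indexing thread is bound to one per-thread writer so everything downstream stays single-threaded (names are illustrative only, not the actual patch):

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.function.Supplier;

          class DocumentsWriterThreadBinder<W> {
            private final ConcurrentHashMap<Thread, W> bindings = new ConcurrentHashMap<>();
            private final Supplier<W> writerFactory;

            DocumentsWriterThreadBinder(Supplier<W> writerFactory) {
              this.writerFactory = writerFactory;
            }

            // Returns the writer bound to the calling thread, creating one on first use.
            W writerForCurrentThread() {
              return bindings.computeIfAbsent(Thread.currentThread(), t -> writerFactory.get());
            }

            // Global operations (applying deletes, commit, rollback) have to visit
            // every per-thread writer.
            Iterable<W> allWriters() {
              return bindings.values();
            }
          }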

          Jason Rutherglen added a comment -

          Michael, Agreed, can you outline how you think we should proceed then?

          Michael Busch added a comment -

          I think we all agree that we want to have a single-writer-thread, multi-reader-thread model. Only then can the thread-safety problems in LUCENE-2312 be reduced to visibility (no write-locking). So I think making this change first makes the most sense. It involves a bit of boring refactoring work, unfortunately.

          Jason Rutherglen added a comment -

          Actually TermsHashField doesn't need to be concurrent: it's only being written to, and the terms concurrent skip list (formerly a btree) holds the reference to the posting list. So I think we're good there, because the terms enum never accesses the terms hash. Nice!

          Jason Rutherglen added a comment -

          NormsWriterPerField has a growing norm byte array; we'd need a way to read/write lock it...

          I think we have concurrency issues in the TermsHash table? Maybe it'd need to be rewritten to use ConcurrentHashMap?

          Jason Rutherglen added a comment -

          Michael,

          For LUCENE-2312, I think the searching isn't going to be an
          issue, I've got basic per thread doc writers working (though not
          thoroughly tested). I didn't see a great need to rework all the
          classes, which even if we did, I'm not sure helps with the byte
          array read write issues? I'd prefer to get a proof of concept
          more or less working, then refine it from there. I think there's
          two main design/implementation issues before we can roll
          something out:

          1) A new skip list implementation that writes a new skip entry
          at fixed intervals (ie, single level); see the sketch at the
          end of this comment. Right now in trunk we have a multi-level
          skip list that requires knowing the number of docs ahead of
          time.

          2) Figure out the low -> high levels of byte/char/int array
          visibility to reader threads. The main challenge here is that
          the DW-related code that uses these arrays is really hard for
          me to understand well enough to know what can be changed
          without breaking bunches of other stuff as a side effect. If
          there were a Directory-like class abstraction we could simply
          override and reimplement, we could do that; maybe there is
          one, I'm not sure yet.

          However, if reworking the PerThread classes somehow makes the
          tie-in to the IO system (eg, the byte array pooling) more
          abstracted and easier, then I'm all for it.
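
          Here is the kind of single-level skip list I have in mind for 1), purely as a sketch; the names and the interval handling are made up for illustration, this is not the trunk skip list code:

          import java.util.ArrayList;
          import java.util.List;

          class SingleLevelSkipWriter {
            private final int interval;
            private int docCount;
            private final List<long[]> skips = new ArrayList<>(); // {lastDocID, pointerIntoPostings}

            SingleLevelSkipWriter(int interval) {
              this.interval = interval;
            }

            // Called once per indexed doc for this term; `pointer` is the current
            // write position in the postings buffer. No total doc count is needed
            // up front because a skip entry is appended every `interval` docs.
            void onDoc(int docID, long pointer) {
              docCount++;
              if (docCount % interval == 0) {
                skips.add(new long[] { docID, pointer });
              }
            }

            // A reader scans the skip entries for the last skip point <= target,
            // then resumes a linear scan of the postings from that pointer.
            long seek(int targetDocID) {
              long pointer = 0;
              for (long[] skip : skips) {
                if (skip[0] <= targetDocID) {
                  pointer = skip[1];
                } else {
                  break;
                }
              }
              return pointer;
            }
          }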

          Michael Busch added a comment -

          Shall we not first try to remove the downstream *PerThread classes and make the DocumentsWriter single-threaded, without locking? Then we add a PerThreadDocumentsWriter and a DocumentsWriterThreadBinder; IW talks to the DWTB, which talks to the PerThreadDWs. We can pick other names.

          When that's done we can think about what kind of locking/synchronization/volatile stuff we need for LUCENE-2312.

          Jason Rutherglen added a comment -

          Are there going to be issues with the char array buffers as well (ie, will we need to also flush them for concurrency?)

          Jason Rutherglen added a comment -

          Carrying over from LUCENE-2312. I'm proposing that, for starters, we have a byte slice writer: lock, move or copy the bytes from the writable byte pool/writer to a read-only byte block pool, then unlock. This sounds like a fairly self-contained thing that can be unit tested at a low level.

          Mike, can you add a bit as to how this could work? Also, what is the IntBlockPool used for?
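
          To make the proposal a bit more concrete, something along these lines is what I mean; ByteBlockPublisher is a made-up name and this is only a sketch of the lock/copy/unlock idea, not the actual DW byte pools:

          import java.util.ArrayList;
          import java.util.List;

          class ByteBlockPublisher {
            private final Object lock = new Object();
            private final List<byte[]> readOnlyPool = new ArrayList<>();

            // Called by the indexing thread once a slice is complete: copy the
            // writable block and, under a short lock, publish the frozen copy.
            void publish(byte[] writableBlock, int length) {
              byte[] frozen = new byte[length];
              System.arraycopy(writableBlock, 0, frozen, 0, length);
              synchronized (lock) {
                readOnlyPool.add(frozen);   // now visible to reader threads
              }
            }

            // Reader threads take a snapshot of the published blocks and can then
            // read them without any further locking.
            List<byte[]> snapshot() {
              synchronized (lock) {
                return new ArrayList<>(readOnlyPool);
              }
            }
          }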

          Michael McCandless added a comment -

          Seems like it's 8 bytes

          Object header is two words, so that's 16 bytes for 64-bit arch. (probably 12 for 64-bit+CompressedOops?)

          Right, and the pointer'd also be 8 bytes (but compact int stays at 4
          bytes) so net/net on 64bit JRE savings would be 16-20 bytes per term.

          Another thing we could do if we cutover to parallel arrays is to
          switch to packed ints. Many of these fields are horribly wasteful as
          ints, eg docFreq or lastPosition.
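
          For illustration only, a tiny packed-ints sketch; this is not Lucene's actual packed ints implementation, just the general idea of storing each small counter in bitsPerValue bits (assuming bitsPerValue < 64 and values that fit in that many bits):

          class PackedIntArray {
            private final long[] blocks;
            private final int bitsPerValue;

            PackedIntArray(int count, int bitsPerValue) {
              this.bitsPerValue = bitsPerValue;
              this.blocks = new long[(count * bitsPerValue + 63) / 64];
            }

            void set(int index, long value) {
              long bitPos = (long) index * bitsPerValue;
              int block = (int) (bitPos >>> 6);
              int offset = (int) (bitPos & 63);
              // Clear the bits at this position, then OR in the new value.
              long mask = ((1L << bitsPerValue) - 1) << offset;
              blocks[block] = (blocks[block] & ~mask) | (value << offset);
              // Handle a value spilling into the next long.
              int spill = offset + bitsPerValue - 64;
              if (spill > 0) {
                long highMask = (1L << spill) - 1;
                blocks[block + 1] = (blocks[block + 1] & ~highMask) | (value >>> (bitsPerValue - spill));
              }
            }

            long get(int index) {
              long bitPos = (long) index * bitsPerValue;
              int block = (int) (bitPos >>> 6);
              int offset = (int) (bitPos & 63);
              long value = blocks[block] >>> offset;
              int spill = offset + bitsPerValue - 64;
              if (spill > 0) {
                value |= blocks[block + 1] << (bitsPerValue - spill);
              }
              return value & ((1L << bitsPerValue) - 1);
            }
          }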

          Michael McCandless added a comment -

          Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.

          Or 2 sets of parallel arrays (one for the singletons).... or, something.

          So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it?

          And also the GC cost.

          But it seems like specializing singleton fields will be the bigger win.

          I was wondering if it makes sense to make these kinds of experiments (pooling vs. non-pooling) with the flex code?

          Last I tested (a while back now) indexing perf was the same – need to
          test again w/ recent changes (eg terms index is switching to packed
          ints). For pooling vs not I'd just do the experiment on trunk?

          And most of this change (changing how postings data is buffered in
          RAM) is "above" flex I expect.

          But if for some reason you need to start changing index postings
          format then you should probably do that on flex.

          Earwin Burrfoot added a comment -

          > Seems like it's 8 bytes
          Object header is two words, so that's 16 bytes for 64-bit arch. (probably 12 for 64-bit+CompressedOops?)

          Also, GC time is (roughly) linear in the number of objects on the heap, so replacing a single huge array of objects with a few huge primitive arrays for their fields does wonders for your GC delays.

          Michael Busch added a comment -

          Sounds great - let's test it in practice.

          I have to admit that I need to catch up a bit on the flex branch. I was wondering if it makes sense to make these kinds of experiments (pooling vs. non-pooling) with the flex code? Is it as fast as trunk already, or are there related nocommits left that affect indexing performance? I would think not much of the flex changes should affect the in-memory indexing performance (in TermsHash*).

          Michael Busch added a comment - - edited

          Reply to Mike's comment on LUCENE-2293: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12845263&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845263

          I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed).

          EG we could have a separate class dedicated to the singleton case. When term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting.

          Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.

          I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...).

          And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int[] lastDocIDs, etc. Then to look up, say, the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access.

          Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs).

          Though, the price you pay is dereferencing multiple times, once for each array?
          And how much RAM would we save? The pointer for the PostingList object (4-8 bytes), plus the size of the object header - how much is that in Java?

          Seems like it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

          So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it?
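
          A small sketch of the parallel-arrays layout being discussed; the field names only loosely follow FreqProxTermsWriter.PostingList, and the whole class is illustration, not the patch:

          import java.util.Arrays;

          class ParallelPostingsArrays {
            int[] intStarts = new int[8];
            int[] byteStarts = new int[8];
            int[] lastDocIDs = new int[8];
            int[] docFreqs = new int[8];
            private int size;

            // Assigns the next dense termID (0, 1, 2, ...), growing all arrays together.
            int addTerm() {
              if (size == docFreqs.length) {
                int newLength = size * 2;
                intStarts = Arrays.copyOf(intStarts, newLength);
                byteStarts = Arrays.copyOf(byteStarts, newLength);
                lastDocIDs = Arrays.copyOf(lastDocIDs, newLength);
                docFreqs = Arrays.copyOf(docFreqs, newLength);
              }
              lastDocIDs[size] = -1;  // no doc seen yet for this term
              return size++;
            }

            // Per-term state is a plain array lookup instead of a field on a
            // per-term object; in this sketch docFreqs[termID] counts the distinct
            // docs the term has appeared in so far.
            void termSeenInDoc(int termID, int docID) {
              if (lastDocIDs[termID] != docID) {
                lastDocIDs[termID] = docID;
                docFreqs[termID]++;
              }
            }
          }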

          Michael McCandless added a comment -

          Sounds great – let's test it in practice.

          Michael Busch added a comment -

          Here is an interesting article about allocation/deallocation on modern JVMs:
          http://www.ibm.com/developerworks/java/library/j-jtp09275.html

          And here is a snippet that mentions how pooling is generally not faster anymore:


          Allocation in JVMs was not always so fast – early JVMs indeed had poor allocation and garbage collection performance, which is almost certainly where this myth got started. In the very early days, we saw a lot of "allocation is slow" advice – because it was, along with everything else in early JVMs – and performance gurus advocated various tricks to avoid allocation, such as object pooling. (Public service announcement: Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.) However, a lot has happened since the JDK 1.0 days; the introduction of generational collectors in JDK 1.2 has enabled a much simpler approach to allocation, greatly improving performance.



            People

            • Assignee:
              Michael Busch
            • Reporter:
              Michael Busch
            • Votes:
              1
            • Watchers:
              8
