Lucene - Core
LUCENE-2293

IndexWriter has hard limit on max concurrency

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      DocumentsWriter has this nasty hardwired constant:

      private final static int MAX_THREAD_STATE = 5;
      

      which I probably should have attached a //nocommit to the moment I
      wrote it.

      That constant sets the max number of thread states to 5. This means,
      if more than 5 threads enter IndexWriter at once, they will "share"
      only 5 thread states, meaning we gate CPU concurrency to 5 running
      threads inside IW (each thread must first wait for the last thread to
      finish using the thread state before grabbing it).

      This is bad because modern hardware can make use of more than 5
      threads. So I think an immediate fix is to make this settable
      (expert), and increase the default (8?).

      It's tricky, though, because the more thread states, the less RAM
      efficiency you have, meaning the worse indexing throughput. So you
      shouldn't up and set this to 50: you'll be flushing too often.
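
      For illustration only, here is a minimal sketch of how a bounded,
      settable pool of thread states gates indexing concurrency. The class
      and method names are hypothetical, not Lucene's actual API:

      // Hypothetical sketch: a bounded pool of thread states, similar in
      // spirit to DocumentsWriter's hardwired MAX_THREAD_STATE, but with
      // the limit settable by the application (expert).
      import java.util.HashMap;
      import java.util.Map;

      class ThreadStatePool {
        static final class ThreadState {
          int numBound;   // threads currently bound to this state
          // ... per-thread RAM buffers would live here ...
        }

        private final ThreadState[] states;
        private final Map<Thread, ThreadState> binding =
            new HashMap<Thread, ThreadState>();

        ThreadStatePool(int maxThreadStates) {   // the "expert" setting
          states = new ThreadState[maxThreadStates];
          for (int i = 0; i < maxThreadStates; i++) {
            states[i] = new ThreadState();
          }
        }

        // Bind the calling thread to a state; with more active threads than
        // states, several threads share one state and serialize on it.
        synchronized ThreadState acquire() {
          ThreadState ts = binding.get(Thread.currentThread());
          if (ts == null) {
            ts = states[0];
            for (ThreadState s : states) {   // pick the least contended state
              if (s.numBound < ts.numBound) {
                ts = s;
              }
            }
            binding.put(Thread.currentThread(), ts);
            ts.numBound++;
          }
          return ts;
        }
      }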

      But... I think a better fix is to re-think how threads write state
      into DocumentsWriter. Today, a single docID stream is assigned across
      threads (eg one thread gets docID=0, next one docID=1, etc.), and each
      thread writes to a private RAM buffer (living in the thread state),
      and then on flush we do a merge sort. The merge sort is inefficient
      (does not currently use a PQ)... and, wasteful because we must
      re-decode every posting byte.

      I think we could change this, so that threads write to private RAM
      buffers, with a private docID stream, but then instead of merging on
      flush, we directly flush each thread as its own segment (and, allocate
      private docIDs to each thread). We can then leave merging to CMS
      which can already run merges in the BG without blocking ongoing
      indexing (unlike the merge we do in flush, today).

      This would also allow us to separately flush thread states. Ie, we
      need not flush all thread states at once – we can flush one when it
      gets too big, and then let the others keep running. This should be a
      good concurrency gain since it uses IO & CPU resources "throughout"
      indexing instead of "big burst of CPU only" then "big burst of IO
      only" that we have today (flush today "stops the world").

      One downside I can think of is... docIDs would now be "less
      monotonic": today, if N threads are indexing, you roughly get
      in-time-order assignment of docIDs, but with this change all of one
      thread state would get docIDs 0..N, the next thread state would get
      N+1..M, etc. However, a single thread would still get monotonic
      assignment of docIDs.
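
      For illustration of the "private RAM segments" idea, here is a rough
      sketch in which each per-thread writer owns its docID stream and
      flushes itself as its own segment. All names are hypothetical; this
      is not the actual indexing chain:

      // Hypothetical sketch: one writer per thread, private docIDs,
      // independent flush; background merges later combine the segments.
      class PrivateSegmentWriter {
        private int nextDocID;        // private docID stream, per writer
        private long ramBytesUsed;
        private final long ramLimitBytes;

        PrivateSegmentWriter(long ramLimitBytes) {
          this.ramLimitBytes = ramLimitBytes;
        }

        // Called by the single thread bound to this writer.
        void addDocument(Object doc) {
          int docID = nextDocID++;    // monotonic within this writer
          // ... invert doc into this writer's private RAM buffer under docID ...
          ramBytesUsed += estimateRamUsed(doc);
          if (ramBytesUsed >= ramLimitBytes) {
            flushAsSegment();         // flush only this writer; others keep indexing
          }
        }

        private void flushAsSegment() {
          // ... write this writer's buffer as a new segment; merging is left
          // to the merge scheduler (e.g. CMS) running in the background ...
          nextDocID = 0;
          ramBytesUsed = 0;
        }

        private long estimateRamUsed(Object doc) {
          return 1024;                // placeholder estimate
        }
      }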

      1. LUCENE-2293.patch
        6 kB
        Michael McCandless


          Activity

          Michael Busch added a comment -

          Good timing - a couple days ago I was thinking about how threading could be changed in the indexer.

          The other downside is that you would have to buffer deleted docs and queries separately for each thread state, because you have to keep the private docID? So that would need a bit more memory.

          Couldn't we make the DocumentsWriter and all related down-stream classes single-threaded then? The IndexWriter (or a new class) would have the doc queue, basically a load balancer, that multiple DocumentsWriter instances would pull from as soon as they are done inverting the previous document?

          This would allow us to simplify the indexer chain a lot - we could get rid of all the *PerThread classes. We'd also then have to separate the doc stores from the DocumentsWriter, so that multiple DocumentsWriter instances could share them (which is what I'd like to do for LUCENE-2026 anyway).

          Jason Rutherglen added a comment -

          Mike, good one! Would having a doc id stream per thread make implementing a searchable RAM buffer easier?

          Earwin Burrfoot added a comment -

          The IndexWriter (or a new class) would have the doc queue, basically a load balancer, that multiple DocumentsWriter instances would pull from as soon as they are done inverting the previous document?

          I hope we won't lose monotonic docIDs for a singlethreaded indexation somewhere along that path.

          Michael Busch added a comment -

          I hope we won't lose monotonic docIDs for a singlethreaded indexation somewhere along that path.

          No. The order in the single threaded case won't be different from today with the changes Mike is proposing.

          Shai Erera added a comment -

          The IndexWriter (or a new class) would have the doc queue, basically a load balancer, that multiple DocumentsWriter instances would pull from as soon as they are done inverting the previous document?

          Today, DW enforces thread binding - the same thread will always receive the same ThreadState. This allows applications that distribute the documents between threads based on some criterion to get locality of the Documents indexed by each thread. I can't think of why an application would rely on that, but still it's something that happens.

          Also, in the pull approach, Lucene would introduce another place where it allocates threads. Not only would we need to allow setting that concurrency level, we'd also need to allow overriding how a thread is instantiated. That will change the way applications are written today - I assume lots of applications that are multi-threaded rely on the multiple threads to index the documents. But now those threads won't do anything besides register a document in a queue. Therefore such applications will need to move to single-threaded indexing (because multi-threaded gives them nothing), and control the threads IW allocates.

          I personally prefer to leave multi-threaded indexing to the application. If it anyway contains a queue of incoming documents (from the outside) and allocates threads to process them in parallel (for example to parse rich text documents, fetch content from remote machines etc.), we wouldn't want them to do all this just to waste those threads at the end and let IW control another level of concurrency.

          Another downside of such an approach is that it breaks backward compatibility in a new way we've never considered. If the application allocates threads from a pool, and we introduce a new IW/DW w/ concurrency level=3 (for example), then the application will suddenly spawn more threads than it intended to. Perhaps it chose to use SMS, or overrode CMS to handle the threads allocation, but it's definitely not ready to handle another thread allocator.

          Another thing is that the queue cannot be of just Document objects, but a DocAndOp objects to account for add/delete/updates ... another complication.

          My preference is to keep the queue to the application.

          The other downside is that you would have to buffer deleted docs and queries separately for each thread state

          Just for clarity - you'll need to do it with the queue approach as well, right? I mean, a DW which pulled an operation from the queue, which is a DELETE op, will need to cache that DELETE so that it will be executed on all documents that were indexed up until flush. So that does not save anything vs. if we change DW to flush by ThreadState.

          Instead, I prefer to take advantage of the application's concurrency level in the following way:

          • Each thread will continue to write documents to a ThreadState. We'll allow changing the MAX_LEVEL, so if an app wants to get more concurrency, it can.
            • MAX_LEVEL will set the number of ThreadState objects available.
          • All threads will obtain memory buffers from a pool which will be limited by IW's RAM limit.
          • When a thread finishes indexing a document and realizes the pool has been exhausted, it flushes its ThreadState.
            • At that moment, that ThreadState is pulled out of the 'active' list and is flushed. When it's done, its used buffers are reclaimed and it is put back in the active list.
            • New threads that come in will simply pick a ThreadState from the pool (but we'll bind them to that instance until it's flushed) and add documents to them.
            • That way, we hijack an application thread to do the flushing, which is anyway what happens today.

          That way we are less likely to reach a state like Mike described - "big burst of CPU only" then "big burst of IO only" - and more likely to balance the two.

          If the application wants to be single threaded, we allow it to be like that all the way through, not introducing more thread allocations. Otherwise, we let it control its concurrency level and use it to our needs.
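
          A minimal sketch of the buffer pool described above (hypothetical names, not actual Lucene code): ThreadStates draw fixed-size buffers from a pool bounded by IW's RAM limit, and a thread that finds the pool exhausted flushes its own ThreadState and recycles that ThreadState's buffers back into the pool:

          // Hypothetical sketch of a shared buffer pool bounded by the RAM limit.
          import java.util.ArrayDeque;
          import java.util.Deque;

          class BufferPool {
            private static final int BUFFER_SIZE = 32 * 1024;
            private final Deque<byte[]> free = new ArrayDeque<byte[]>();

            BufferPool(long ramLimitBytes) {
              for (long used = 0; used + BUFFER_SIZE <= ramLimitBytes; used += BUFFER_SIZE) {
                free.push(new byte[BUFFER_SIZE]);
              }
            }

            // Returns null when the pool is exhausted -- the calling indexing
            // thread then flushes its own ThreadState (hijacking the application
            // thread) and recycles that ThreadState's buffers.
            synchronized byte[] allocate() {
              return free.isEmpty() ? null : free.pop();
            }

            synchronized void recycle(byte[] buffer) {
              free.push(buffer);
            }
          }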

          Michael Busch added a comment -

          Also, in the pull approach, Lucene would introduce another place where it allocates threads.

          What I described is not much different from what's happening today. DocumentsWriter already has a WaitQueue that ensures the docs are written in the right order.

          I simply tried to suggest a way to refactor our classes... functionally the same as what Mike suggested. I shouldn't have said "pulled from" (the queue).

          Shai Erera added a comment -

          What I described is not much different from what's happening today.

          Maybe I didn't understand then:

          basically a load balancer, that multiple DocumentsWriter instances would pull from as soon as they are done inverting the previous document?

          Who adds documents to that queue and what are the DW instances? The way I read it, I understood those are different threads than the application threads. If I misunderstood that, could you please clarify?

          Also, I thought that having each thread write to a different ThreadState does not ensure documents are written in order, but that when DW finally flushes, the different ThreadStates are merged together and one segment is written, which somehow restores the ordering ...

          If only WaitQueue was documented ...

          I obviously don't know that part of the code as well as you. So if I misunderstood your meaning, I'd appreciate if you clarify it for me. What I would like to avoid is having Lucene allocate indexing threads on its own.

          Also, is my proposal above different than what you suggest?

          Michael Busch added a comment -

          Sorry - after reading my comment again I can see why it was confusing. Loadbalancer wasn't a very good analogy.

          I totally agree that Lucene should still piggyback on the application's threads and not start its own thread for document inversion.

          Today, as you said, the DocumentsWriter manages a certain number of thread states, has the WaitQueue, and does its own memory management.

          What I was thinking was that it would be simpler if the DocumentsWriter was only used by a single thread. The IndexWriter would have multiple DocumentsWriters and do the thread binding (+waitqueue). This would make the code in DocumentsWriter and the downstream classes simpler. The side-effect is that each DocumentsWriter would manage its own memory.

          Also, I thought that each thread writes to different ThreadState does not ensure documents are written in order, but that finally when DW flushes, the different ThreadStates are merged together and one segment is written, somehow restores the orderness ...

          Stored fields are written to an on-disk stream (docstore) in order. The WaitQueue takes care of finishing the docs in the right order.
          The postings are written into TermHashes per threadstate in parallel. The doc ids are in increasing order, but can have gaps. E.g. Threadstate 1 inverts doc 1 and 3, Threadstate 2 inverts doc 2. When it's time to flush the whole buffer these different TermHash postingslists get interleaved.

          Shai Erera added a comment -

          Ok so I think I understand now. You propose to change IW to bind a Thread to a DW, instead of that being done inside DW. And therefore it will simplify DW's code ... I wonder if that won't complicate IW code in return? Perhaps we'll gain a lot of simplification on DW, so a bit of complexity on IW will be ok.

          If we do that ... why not rename DW to SegmentWriter? If each DW will eventually flush its own Segment, the name would make more sense.

          BTW, I was thinking that an application can emulate this sort of thing even today (well ... to some extent - w/o deletes). It can create an IW for each indexing thread and at the end call addIndexes. What we'd need to introduce on IW to make it efficient though is something like addRawIndexes, which will just update the segments file about the new segments, but won't attempt to merge them and clean deletes out of them.
          I think I want this API anyway for being able to add segments faster to an index, if e.g. you don't care about the merges at the moment ... but that is a separate issue.
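
          For illustration, a rough sketch of that emulation: each indexing thread fills its own Directory with its own IndexWriter, and the main writer folds them in at the end with addIndexes (the exact addIndexes signature has varied across Lucene versions, so treat this as a sketch). The addRawIndexes method mentioned above is hypothetical and does not exist yet:

          import java.io.IOException;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.store.Directory;

          class PrivateIndexCombiner {
            // perThreadDirs: one Directory per indexing thread, each already
            // filled by that thread's own (now closed) IndexWriter.
            static void combine(IndexWriter mainWriter, Directory[] perThreadDirs)
                throws IOException {
              // Copies/merges the private segments into the main index. The
              // proposed addRawIndexes would instead only register the new
              // segments, skipping merges and delete resolution.
              mainWriter.addIndexes(perThreadDirs);
              mainWriter.commit();
            }
          }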

          Then I think what I proposed is more or less the same as you propose, therefore I'm fine with that approach. When a DW/SW realizes it exhausted its memory pool, it just flushes and new threads will bind to other DW/SW.

          Thanks for the explanation on WaitQueue.

          Earwin Burrfoot added a comment -

          I wonder if that won't complicate IW code in return? Perhaps we'll gain a lot of simplification on DW, so a bit of complexity on IW will be ok.

          That will get rid of all that *PerThread insanity for each DW component, if I'm getting it right. That's -13 classes. Yay for the issue!

          On a random sidenote, can we group things like these into subpackages? Having 132 files in oal.index is somewhat intimidating when trying to read/understand things.

          Michael McCandless added a comment -

          I agree IW should not spawn its own threads. It should piggy back on
          incoming threads.

          On whether we can remove the "perThread" layer throughout the chain –
          that would be compelling. But, we should scrutinize what that layer
          does throughout the current chain to assess what we might lose.

          But, I was proposing a bigger change (call it "private RAM segments"):
          there would be multiple DWs, each one writing to its own private RAM
          segment (each one getting private docID assignment) and its own doc
          stores.

          There would be no more WaitQueue in IW.

          Each DW would flush its own segment privately. They would not all
          flush at once (merging their postings) like we must do today because
          they "share" a single docID space.

          As I understand it, this would be a step towards how Lucy handles
          concurrency during indexing. Ie, it'd make the DWs nearly fully
          independent from one another, and then IW is just there to dispatch/do
          merging/etc. (In Lucy each writer is a separate process, I think –
          VERY independent).

          We could do both changes, too (remove the "perThread" layer of
          indexing chain and switch to private RAM segments) – I think they
          are actually orthogonal.

          The other downside is that you would have to buffer deleted docs and queries separately for each thread state, because you have to keep the private docID? So that would need a bit more memory.

          Right.

          Mike, good one! Would having a doc id stream per thread make implementing a searchable RAM buffer easier?

          Yes – they would just appear like sub segments.

          I hope we won't lose monotonic docIDs for a singlethreaded indexation somewhere along that path.

          We won't.

          Instead, I prefer to take advantage of the application's concurrency level in the following way:

          • Each thread will continue to write documents to a ThreadState. We'll allow changing the MAX_LEVEL, so if an app wants to get more concurrency, it can.
          • MAX_LEVEL will set the number of ThreadState objects available.
          • All threads will obtain memory buffers from a pool which will be limited by IW's RAM limit.
          • When a thread finishes indexing a document and realizes the pool has been exhausted, it flushes its ThreadState.
          • At that moment, that ThreadState is pulled out of the 'active' list and is flushed. When it's done, its used buffers are reclaimed and it is put back in the active list.
          • New threads that come in will simply pick a ThreadState from the pool (but we'll bind them to that instance until it's flushed) and add documents to them.
          • That way, we hijack an application thread to do the flushing, which is anyway what happens today.

          +1 – this I think matches what I was thinking.

          If only WaitQueue was documented

          Sorry

          But WaitQueue would go away with this change. We would no longer have
          shared doc stores!

          Shai Erera added a comment -

          Perhaps instead of buffering the delete Terms/Queries somewhere central, when a delete by term is performed by a certain DW, it can register it immediately on all existing DWs. Each DW will record the doc ID up until which this term delete should be executed, and when it's time to flush, will apply all the deletes that were accumulated on itself. It'll be like doing parallel segment deletes (but maybe I'm too into Parallel Indexing).

          This should not affect any documents that were added to any DW after the delete happened, and if we simply do it (synced) across all active DWs, I think we should be fine?

          Jason Rutherglen added a comment -

          But WaitQueue would go away with this change. We would no longer have shared doc stores!

          Cool, most of the DW code is intuitive except the shared doc stores, because it's hard to see when a doc store ends. Also the interleaving is a bit difficult to visualize. I look forward to checking out DW after this change.

          Michael McCandless added a comment -

          Yes, I think each DW will have to record its own buffered delete Term/Query, mapping to its docID at the time the delete arrived.

          Syncing across all of them would work but may be overkill. I think we could instead have a lock-free collection (need not even be FIFO – the order doesn't matter) into which we add all Term/Query that are deleted. Then, any time a thread hits that DW to add a document, it must first service that queue, by popping out all Term/Query stored in it and enrolling them in the un-synchronized map of Term/Query -> docID.
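
          A minimal sketch of that scheme with hypothetical class names (ConcurrentLinkedQueue and Term are real APIs; the rest is illustrative): deletes land in a lock-free collection, and the thread currently bound to this DW drains it into the DW's un-synchronized map before adding the next document:

          import java.util.HashMap;
          import java.util.Map;
          import java.util.concurrent.ConcurrentLinkedQueue;
          import org.apache.lucene.index.Term;

          class PerWriterDeletes {
            // Lock-free; any thread (even one bound to another DW) may publish here.
            private final ConcurrentLinkedQueue<Term> pending =
                new ConcurrentLinkedQueue<Term>();
            // Only touched by the thread bound to this DW, so no synchronization.
            private final Map<Term, Integer> bufferedDeletes =
                new HashMap<Term, Integer>();
            private int nextDocID;

            void publishDelete(Term term) {
              pending.add(term);
            }

            // Called by the thread currently bound to this DW.
            void addDocument(Object doc) {
              Term t;
              while ((t = pending.poll()) != null) {   // service the queue first
                bufferedDeletes.put(t, nextDocID);     // applies to docs < nextDocID
              }
              int docID = nextDocID++;
              // ... invert doc into this DW's private buffer under docID ...
            }
          }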

          Michael Busch added a comment -

          But, I was proposing a bigger change (call it "private RAM segments"):
          there would be multiple DWs, each one writing to its own private RAM
          segment (each one getting private docID assignment) and its own doc
          stores.

          Cool! I wasn't sure if you wanted to give them private doc stores too. +1, I like it.

          Michael McCandless added a comment -

          Cool! I wasn't sure if you wanted to give them private doc stores too. +1, I like it.

          I wasn't sure either. Ie, I forgot about that aspect of my proposal until it was raised in the discussion... but I think that'd be necessary.

          This will be a perf hit, when building up a big new index. But since doc stores now merge by bulk copy (when there are no deletions) hopefully the impact isn't too much. And, hopefully it's more than made up for by the improvement in IO/CPU interleaved concurrency.

          I'll work out a patch to at least make the hardwired 5 configurable... but does anyone out there wanna work out the "private RAM segments"?

          Michael Busch added a comment -

          Yes, I think each DW will have to record its own buffered delete Term/Query, mapping to its docID at the time the delete arrived.

          I think in the future deletes in DW could work like this:

           • DW of course keeps track of a private sequence id, which gets incremented in the add, delete, and update calls
          • a DW has a getReader() call, the reader can search the ram buffer
           • when DW.getReader() gets called, the new reader remembers the current seqID at the time it was opened - let's call it RAMReader.seqID; if such a reader gets reopened, simply its seqID gets updated.
           • we keep a growing int array sized to DW's maxDoc, which replaces the usual deletes bitset
          • when DW.updateDocument() or .deleteDocument() needs to delete a doc we do that right away, before inverting the new doc. We can do that by running a query using a RAMReader to find all docs that must be deleted. Instead of flipping a bit in a bitset, for each hit we now keep track of when it was deleted:
           // init each slot in the deletes array with NOT_DELETED
           static final int NOT_DELETED = Integer.MAX_VALUE;
          ...
          Arrays.fill(deletes, NOT_DELETED);
          
          ...
          
          public void deleteDocument(Query q) {
            reopen RAMReader
            run query q using RAMReader
            for each hit {
              int hitDocId = ...
              if (deletes[hitDocId] == NOT_DELETED) {
                deletes[hitDocId] = DW.seqID;
              }
            }
          ...
            DW.seqID++;
          }
          

          Now, no matter how often you (re)open RAMReaders, they can share the deletes array. No cloning like with the BitSet approach would be necessary.

          When the RAMReader iterates posting lists it's as simple as this to treat deleted docs correctly. Instead of doing this in RAMTermDocs.next():

            if (deletedDocsBitSet.get(doc)) {
              skip this doc
            }
          

          we can now do:

            if (deletes[doc] < ramReader.seqID) {
              skip this doc
            }
          

          Here is an example:
          1. Add 3 docs with DW.addDocument()
          2. User opens ramReader_a
          3. Delete doc 1
          4. User opens ramReader_b

           After 1: DW.seqID = 2; deletes[] = {MAX_VALUE, MAX_VALUE, MAX_VALUE}
           After 2: ramReader_a.seqID = 2
           After 3: DW.seqID = 3; deletes[] = {MAX_VALUE, 2, MAX_VALUE}
           After 4: ramReader_b.seqID = 3

           Note that both ramReader_a and ramReader_b share the same deletes[] array. Now when ramReader_a is used to read posting lists, it will not treat doc 1 as deleted, because (deletes[1] < ramReader_a.seqID) = (2 < 2) = false; but ramReader_b will see it as deleted, because (deletes[1] < ramReader_b.seqID) = (2 < 3) = true.

          What do you think about this approach for the future when we have a searchable DW buffer?

          Shai Erera added a comment -

          What about the following scenario:

          1. A document is added w/ term A to DW1
          2. A document is added w/ term A to DW2 (by another thread)
          3. A deleteDocuments(Term-A) is issued against DW1 (could even be a DW3, where A does not exist)

          I thought that when (3) happens, the delete-by-term needs to be issued against all DWs, so that later when they apply their deletes they'll remember to do so. Issuing that against all DWs will record the docID of each DW up until which the delete should apply.

          We could move to doing the delete right-away, by reopening a DW reader, and we could move to storing deletes in int[] rather than bit set. But I'm not sure I understand how your proposal will handle the scenario I've described.

          Also, I don't see the advantage of moving to store the deletes in int[] rather than bitset ... is it just to avoid calling the get(doc)?

          Michael Busch added a comment -

          I thought that when (3) happens, the delete-by-term needs to be issued against all DWs, so that later when they apply their deletes they'll remember to do so. Issuing that against all DWs will record the docID of each DW up until which the delete should apply.

          Yes, you still need to apply deletes on all DWs. My approach is not different in that regard.

          Also, I don't see the advantage of moving to store the deletes in int[] rather than bitset ... is it just to avoid calling the get(doc)?

          The big advantage is that all (re)opened readers can share the single int[] array. If you use a bitset you need to clone it for each reader. With the int[] reopening becomes basically free from a deletes perspective.

          Michael Busch added a comment -

          The big advantage is that all (re)opened readers can share the single int[] array.

          Dirty reads will be a problem with sharing the array. An AtomicIntegerArray could be used. We need to experiment how expensive that would be.
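
          A sketch of that (illustrative names; AtomicIntegerArray is the real java.util.concurrent.atomic class): the plain int[] becomes an AtomicIntegerArray so concurrently reopened RAM readers can safely share one deletes array, with NOT_DELETED and the seqID comparison following the scheme described above:

          import java.util.concurrent.atomic.AtomicIntegerArray;

          class SharedDeletes {
            static final int NOT_DELETED = Integer.MAX_VALUE;
            // slot = docID, value = seqID at which the doc was deleted
            private final AtomicIntegerArray deletes;

            SharedDeletes(int maxDoc) {
              deletes = new AtomicIntegerArray(maxDoc);
              for (int i = 0; i < maxDoc; i++) {
                deletes.set(i, NOT_DELETED);
              }
            }

            // Writer: record the deletion only once, tagged with the current seqID.
            void markDeleted(int docID, int seqID) {
              deletes.compareAndSet(docID, NOT_DELETED, seqID);
            }

            // Reader: a doc is deleted for this reader iff it was deleted before
            // the reader's point-in-time seqID.
            boolean isDeleted(int docID, int readerSeqID) {
              return deletes.get(docID) < readerSeqID;
            }
          }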

          Shai Erera added a comment -

          But if each DW maintains its own doc IDs, separately from the others, what will be stored in the int[]? DW1 deleted docID 0 (its own 0) and DW4 deleted the same. The two documents are not the same one ... no?

          Won't this complicate the entire solution? What I liked about keeping each DW separate (and call it SegmentWriter) is that it really operates on its own. When a delete happens on IW, it is synced so that it could be registered on all DWs. But besides that, the DWs don't know about each other nor care. Code should be really simple that way - the only thing that will be shared is the pool of buffers.

          I guess I'm missing something, because you're far more knowledgeable in this code than I am ...

          Michael Busch added a comment -

          But if each DW maintains its own doc IDs, separately from the others, what will be stored in the int[]? DW1 deleted docID 0 (its own 0) and DW4 deleted the same. The two documents are not the same one ... no?

          In DW you don't delete by docID. You can only delete by term or query. You have to run the (term)query in all DWs to determine if any of the DWs have one or more matching docs that have to be deleted.

          Today the queries and/or terms are buffered, along with the maxDocID at the time the delete or update was called. They are applied just after the DW buffer was flushed to a segment, because that's the first time the docs are searchable and the delete queries can be executed.

          In the future, when we can search the DW buffer(s), you can apply the deletes right away. Using this int[] approach for deletes will avoid the need of cloning bitsets in each reopen.

          Michael Busch added a comment -

          Won't this complicate the entire solution? What I liked about keeping each DW separate (and call it SegmentWriter) is that it really operates on its own. When a delete happens on IW, it is synced so that it could be registered on all DWs. But besides that, the DWs don't know about each other nor care. Code should be really simple that way - the only thing that will be shared is the pool of buffers.

          What I'm proposing is not different or makes it more complicated. Either way, you have to apply all deletes on all DWs, because you delete by query or term.

          This might not be the right time for this proposal, because it'll only work with searchable DW buffers. But I wanted to mention this idea already, so that we can keep it in mind. And hopefully we can work on searchable DW buffers soon.

          but does anyone out there wanna work out the "private RAM segments"?

          I would like to try to help, but I'm likely not going to have enough time right now to write an entire patch for this big change myself.

          Michael McCandless added a comment -

          I think in the future deletes in DW could work like this:

          This approach looks great Michael! Allows you to efficiently share a single int[] for deletions across many reopened RAM segment readers.

          How will we handle flushing? Ie, when a RAM segment is flushed to disk, it'd have to remain alive so long as a reader is still using it? Which means we can't recycle its buffers until all open readers using it are closed? Or... we could forcefully somehow cut it over to the identical now-on-disk segment?

          This is a great approach for speeding up NRT – NRT readers will no longer have to flush. It's similar in spirit to LUCENE-1313, but that issue is still flushing segments (but, into an intermediate RAMDir).

          Shai Erera added a comment -

          Michael - I see that we were on the same page. Probably I misread your description. I know that in IW deletes are applied by Term/Query and I thought that that delete should be registered on all DWs so they can apply it later. I'm glad that you think like that as well.

          So about the int[], would that be of the size of the index (flushed and unflushed) segments? Suppose that:

          • I've indexed 5 documents, flushed. (IDs 0-4)
          • Indexed 2 on DW1. (IDs 0,1)
          • Indexed 2 on DW2. (IDs 0,1)
          • Delete by term which affects: flushed IDs 1, 4, DW1-0, DW2 - 0, 1

          Would the int[] be of size 9, and the deleted IDs be 1, 4, 5, 7, 8? How would DW1-0 be mapped to 5, and DW2-0,1 be mapped to 7 and 8? Will the int[] initially be of size 5 and, after DW1 flushes, expand to 7 with ID=5 set (and afterwards expand to 9 with IDs 7,8)? If so then I understand.

          Why would using an int[] be any better than sharing a bitset?

          Michael Busch added a comment -

          This is a great approach for speeding up NRT - NRT readers will no longer have to flush. It's similar in spirit to LUCENE-1313, but that issue is still flushing segments (but, into an intermediate RAMDir).

          I agree! Thinking further about this: each (re)opened RAM segment reader needs to also remember the maxDoc of the corresponding DW at the time it was (re)opened. This way we can prevent a RAM reader from reading posting lists beyond that maxDoc, even if the writer thread keeps building the lists in parallel. This allows us to guarantee the point-in-time requirements.

          Also, the PostingList objects we store in the TermHash already contain a lastDocID (if I remember correctly). So when a RAM reader termEnum iterates the dictionary it can skip all terms where term.lastDocID > RAMReader.maxDoc.

          It's quite neat that all we have to do in reopen then is to update ramReader.maxDoc and ramReader.seqID.
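
          For example, a tiny sketch (hypothetical names, following the scheme above) of such a reopen and of the point-in-time visibility check it enables:

          // Sketch only: reopen just captures the writer's current maxDoc and seqID.
          class RAMReaderView {
            volatile int maxDoc;   // docs with docID >= maxDoc are invisible to this reader
            volatile int seqID;    // deletions at/after this seqID are invisible

            void reopen(int writerMaxDoc, int writerSeqID) {
              this.maxDoc = writerMaxDoc;
              this.seqID = writerSeqID;
            }

            // A doc is visible iff it existed at reopen time and was not deleted
            // before this reader's seqID (deletedAtSeqID is NOT_DELETED otherwise).
            boolean isVisible(int docID, int deletedAtSeqID) {
              return docID < maxDoc && deletedAtSeqID >= seqID;
            }
          }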

          Of course one big thing is still missing: keeping the term dictionary sorted. In order to implement the full IndexReader interface, specifically TermEnum, it's necessary to give each RAM reader a point-in-time sorted dictionary. At least in one direction, as a TermEnum only seeks forward.

          I think we have two options here: Either we try to keep the dictionary always sorted, whenever a term is added. I guess then we'd have to implement a b-tree or something similar?

          The second option I can think of is to add a "nextTerm" pointer to TermHash.Postinglist. This allows us to build up a linked list across all terms. When a ramReader is opened we would sort all terms - not by changing their position in the hash, but by building the singly linked list in sorted order.

          When a new reader gets (re)opened we need to merge-sort the new terms into the linked list. I think it's easy to implement this lock-free. E.g. if you have the linked list a->c and you want to add b in the middle, you set b->c before changing a->c. Then it's undefined whether an in-flight older reader would see term b. The old reader must not return b, since b was added after the old reader was (re)opened. So either case is fine: either it doesn't see b because the link wasn't updated yet, or it sees it but doesn't return it, because b.lastDocID > ramReader.maxDoc.
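
          A tiny sketch of that publication order, with TermNode standing in for the per-term object (names and fields here are illustrative):

          class TermNode {
            final String term;
            volatile int lastDocID;   // readers filter with lastDocID > reader.maxDoc
            volatile TermNode next;   // sorted singly linked list across all terms

            TermNode(String term) {
              this.term = term;
            }

            // Insert b between a and a.next: publish b's outgoing link first,
            // then flip a's link. An older in-flight reader either misses b
            // entirely or sees it and skips it, because b.lastDocID exceeds
            // that reader's maxDoc.
            static void insertAfter(TermNode a, TermNode b) {
              b.next = a.next;  // step 1: b -> c
              a.next = b;       // step 2: a -> b
            }
          }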

          The downside is that we will have to pay the price of sorting in reader.reopen, which should be cheap though if readers are reopened frequently. I'm not sure whether this linked-list approach is more or less compelling than something like a b-tree?

          Btw: Shall we open a new "searchable DW buffer" issue or continue using this issue for these discussions?

          Michael Busch added a comment -

          So about the int[], would that be of the size of the index (flushed and unflushed) segments? Suppose that:

          Each DW would have its own int[]. The size would correspond to the number of docs the DW has in its buffer.

          I've indexed 5 documents, flushed. (IDs 0-4)
          Indexed 2 on DW1. (IDs 0,1)
          Indexed 2 on DW2. (IDs 0,1)
          Delete by term which affects: flushed IDs 1, 4, DW1-0, DW2 - 0, 1
          Would the int[] be of size 9, with deleted IDs 1, 4, 5, 7, 8? How would DW1-0 be mapped to 5, and DW2-0,1 be mapped to 7 and 8? Would the int[] initially be of size 5, then expand to 7 after DW1 flushes (with ID=5 set), and afterwards expand to 9 (with IDs 7, 8)? If so, then I understand.

          DW1 will have an int[] of size 2, and DW2 will also have a separate int[] of size 2.

          I think you were thinking of one big int[] across the entire index? I believe the whole approach will make sense once you think of the int[]s as per-RAM-segment.
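
          A sketch of that per-DW structure; storing the delete's sequence id per slot is an assumption for illustration, and all names here are made up:

          class RAMSegmentDeletes {
            // One slot per buffered doc in this DW, indexed by the DW-private
            // docID and grown as the buffer grows; 0 means "live".
            private final int[] deleteSeqIDs;

            RAMSegmentDeletes(int bufferedDocCount) {
              deleteSeqIDs = new int[bufferedDocCount];
            }

            void markDeleted(int dwLocalDocID, int seqID) {
              deleteSeqIDs[dwLocalDocID] = seqID;
            }

            // A reader opened at snapshotSeqID ignores deletes applied later.
            boolean isDeleted(int dwLocalDocID, int snapshotSeqID) {
              int s = deleteSeqIDs[dwLocalDocID];
              return s != 0 && s <= snapshotSeqID;
            }
          }

          In the example above, DW1 and DW2 would each hold a two-slot array indexed 0..1; the remapping to global docIDs (5..8) only happens at flush time, when each DW becomes its own segment and is appended to the index as usual.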

          Shai Erera added a comment -

          Thanks Michael. If the int[]s are per DW then it's starting to make sense. I will read this issue again, especially the last several posts.

          I wanted to propose earlier to move this discussion to a separate issue and close this one by allowing better control over the concurrency level. So a +1 from me on that.

          Jason Rutherglen added a comment -

          I think we have two options here: Either we try to keep the dictionary always sorted, whenever a term is added. I guess then we'd have to implement a b-tree or something similar?

          The int[] for deletes makes sense, I guess because we're assuming the number of docs in the RAM buffer won't be too large. Can't we simply instantiate a new terms array (merging in the new terms) for each reopen?

          Won't we need to wait for flex and this issue to be completed before tackling this?

          Michael McCandless added a comment -

          We'll add this (max internal concurrency, now hardwired to 5) to IWC once it's in...

          Jason Rutherglen added a comment -

          but does anyone out there wanna work out the "private RAM segments"?

          I didn't see this before; I figured private RAM segments were on the roadmap for this issue. It sounds like it'll be a different one?

          Mike, can you outline what would need to change? It seems like large amounts of code could be removed (e.g. FreqProxFieldMergeState)? The *PerThread classes? If so, I think it would go over my head (because I don't have a mental mapping of how all the classes tie together).

          Michael McCandless added a comment -

          I think this issue has these steps:

          • Allow the 5 to be changed (trivial first step) – I'll do this
            after LUCENE-2294 is in
          • Change the approach for how we buffer in RAM to a more isolated
            approach, whereby IW has N fully independent RAM segments
            in-process and when a doc needs to be indexed it's added to one of
            them. Each segment would also write its own doc stores and
            "normal" segment merging (not the inefficient merge we now do on
            flush) would merge them. This should be a good simplification in
            the chain (eg maybe we can remove the *PerThread classes). The
            segments can flush independently, letting us make much better
            concurrent use of IO & CPU.
          • Enable NRT readers to directly search these RAM segments. This
            entails recording deletes on the RAM segments as an int[]. We
            need to solve the Term sorting issue... (b-tree, or, simply
            sort-on-demand the first time a query needs it, though that cost
            increases the larger your RAM segments get, ie, not incremental to
            the # docs you just added). Also, we have to solve what happens
            to a reader using a RAM segment that's been flushed. Perhaps we
            don't reuse RAM at that point, ie, rely on GC to reclaim once all
            readers using that RAM segment have closed. We should do this
            part under a separate issue (LUCENE-2312).
          Jason Rutherglen added a comment -

          Change the approach for how we buffer in RAM to a more isolated approach

          Would we reuse the DocumentsWriter class, and assign one to each thread? Then start to rework DW on down the code tree, removing the per-thread logic? Or do we need to do something more dramatic?

          Michael McCandless added a comment -

          Probably one DW instance per thread? Seems like that'd work?

          And possibly remove *PerThread throughout the default indexing chain?
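
          A rough sketch of that per-thread mapping; the no-arg DocumentsWriter construction and the surrounding bookkeeping are assumptions for illustration only:

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class PerThreadDocWriters {
            private final Map<Thread, DocumentsWriter> writers =
                new ConcurrentHashMap<Thread, DocumentsWriter>();

            DocumentsWriter get() {
              Thread t = Thread.currentThread();
              DocumentsWriter dw = writers.get(t);
              if (dw == null) {
                // Each thread gets private buffers and a private docID stream;
                // flush can later iterate writers.values() independently.
                dw = new DocumentsWriter();
                writers.put(t, dw);
              }
              return dw;
            }
          }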

          Michael McCandless added a comment -

          Simple patch, just adds maxThreadStates setting to IndexWriterConfig.
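
          A usage sketch of the new knob; the setter name follows the patch summary above and could differ slightly in what's finally committed:

          import java.io.File;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.FSDirectory;
          import org.apache.lucene.util.Version;

          public class MaxThreadStatesExample {
            public static void main(String[] args) throws Exception {
              Directory dir = FSDirectory.open(new File("/tmp/test-index"));
              IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_40,
                  new StandardAnalyzer(Version.LUCENE_40));
              conf.setMaxThreadStates(8);  // raise the old hardwired limit of 5
              IndexWriter writer = new IndexWriter(dir, conf);
              writer.close();
            }
          }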

          Jason Rutherglen added a comment -

          Probably one DW instance per thread? Seems like that'd work?

          Ok

          And possibly remove *PerThread throughout the default indexing chain?

          I like removing this; there are many per-thread loops right now, and it's not easy to glance at the code and know what's going on.

          Michael Busch added a comment -

          but does anyone out there wanna work out the "private RAM segments"?

          Shall we use this issue for the private RAM segments? Or do you want to commit the simple patch, close this one and open a new issue?

          Michael Busch added a comment -

          I'm tempted to get rid of the pooling for PostingList objects. The objects are very small, and Java has done a good job with object creation and GC since 1.5. I've even read that the JVM guys think pooling can be slower than not pooling.

          Also, the GC performance problems I've seen so far have mostly involved a large number of long-lived objects - it makes the mark phase of garbage collection very long. Pooling, of course, puts you in exactly that situation.

          So what do you think about removing the pooling of the PostingList objects?

          Michael McCandless added a comment -

          Or do you want to commit the simple patch, close this one and open a new issue?

          How about a new issue?

          Also, I've mostly seen gc performance problems so far if there were a big number of long-living objects - So what do you think about removing the pooling of the PostingList objects?

          It's not only the GC cost; it's also the cost of init'ing these objects, e.g. filling in 0s for all the fields when we're going to overwrite them anyway.

          But, let's test on modern JREs to confirm this? I do agree pooling adds code complexity, so, if it's not buying us anything (or very little) we should remove it.

          The worst case should be docs with many unique terms... Though... to reduce our per-unique-term RAM cost, we may want to move away from a separate postings object per term to parallel arrays. We could also do something different for singleton terms vs the rest (if Zipf's law applies, half the terms should be singletons; if it doesn't, you could have many more singleton terms...). I'd do this as an experimental indexing chain.

          Michael Busch added a comment -

          How about a new issue?

          OK, will open one.

          (if Zipf's law is applying, half the terms should be singletons; if it's not, you could have many more singleton terms...)

          Yeah we should utilize our knowledge of term distribution to optimize in-memory postings. For example, currently a nice optimization would be to store the first posting in the PostingList object and only allocate slices once you see the second occurrence (similar to the pulsing codec)?

          Though... to reduce our per-unique-term RAM cost, we may want to move away from separate postings object per term to parallel arrays.

          What exactly do you mean by parallel arrays? Parallel to the termsHash array? Then the termsHash array would no longer be an array of PostingList objects, but an array of pointers into the char[] array? And you'd have e.g. a parallel int[] array for df, another int[] for pointers into the postings byte pool, etc.? Something like that?

          Michael Busch added a comment -

          OK I opened LUCENE-2324. We can close this one after you committed your patch, Mike.

          Michael McCandless added a comment -

          For example, currently a nice optimization would be to store the first posting in the PostingList object and only allocate slices once you see the second occurrence (similar to the pulsing codec)?

          I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed).

          E.g. we could have a separate class dedicated to the singleton case. When a term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again, it's switched to the full posting.
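
          A hedged sketch of what a dedicated singleton entry might look like next to the full per-term state (all class and field names are illustrative):

          class SingletonPosting {
            int textStart;  // offset of the term's chars in the shared char pool
            int docID;      // the single doc containing the term
            int position;   // its single position (enough to rebuild the prox entry)
          }

          class FullPosting {
            int textStart, intStart, byteStart;
            int lastDocID, docFreq, lastDocCode, lastPosition;
          }

          Promotion would copy the buffered singleton occurrence into freshly allocated slices and then continue as today, so only terms that actually repeat ever pay for the full record.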

          What exactly do you mean with parallel arrays? Parallel to the termHash array? Then the termsHash array would not be an array of PostingList objects anymore, but an array of pointers into the char[] array? And you'd have e.g. a parallel int[] array for df, another int[] for pointers into the postings byte pool, etc? Something like that?

          I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...).

          And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int[] lastDocIDs, etc. Then to look up, say, the lastDocID for a given postingID you just read lastDocIDs[postingID]. If we're worried about oversized allocation overhead, we can make these arrays paged... but that'd slow down each access.
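
          A sketch of that layout: essentially one primitive array per field that today lives on FreqProxTermsWriter.PostingList, indexed by a dense per-term id (class and field names here are illustrative):

          class ParallelPostingArrays {
            final int[] textStarts;    // into the shared char pool
            final int[] intStarts;     // into the int pool
            final int[] byteStarts;    // into the byte pool
            final int[] docFreqs;
            final int[] lastDocIDs;
            final int[] lastDocCodes;
            final int[] lastPositions;

            ParallelPostingArrays(int size) {
              textStarts = new int[size];
              intStarts = new int[size];
              byteStarts = new int[size];
              docFreqs = new int[size];
              lastDocIDs = new int[size];
              lastDocCodes = new int[size];
              lastPositions = new int[size];
            }

            // Lookup is a single array read instead of chasing an object pointer.
            int lastDocID(int termID) {
              return lastDocIDs[termID];
            }
          }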

          Michael Busch added a comment -

          I'll reply on LUCENE-2324.


            People

            • Assignee:
              Michael McCandless
            • Reporter:
              Michael McCandless
            • Votes:
              3
            • Watchers:
              3
