Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've been thinking for a while about refactoring the IndexWriter into
      two main components.

      One could be called a SegmentWriter; as the name says, its job would
      be to write one particular index segment. The default implementation,
      just as today, would provide methods to add documents and would flush
      when its buffer is full.
      Other SegmentWriter implementations would do things like appending or
      copying external segments [what addIndexes*() currently does].

      The second component's job would be to manage writing the segments
      file and merging/deleting segments. It would know about
      DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
      provide hooks that allow users to manage external data structures and
      keep them in sync with Lucene's data during segment merges.

      API-wise, there are things we have to figure out, such as where the
      updateDocument() method would fit in, because its deletion part
      affects all segments, whereas the new document is only being added to
      the new segment.

      Of course these would be lower-level APIs for things like parallel
      indexing and related use cases. That's why we should still provide
      easy-to-use APIs like today for people who don't need to care about
      per-segment ops during indexing. So the current IndexWriter could
      probably keep most of its APIs and delegate to the new classes.
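
      For illustration, a minimal sketch of how the split could look. All
      names below (SegmentWriter, SegmentManager, MergeListener) are
      hypothetical and do not exist in Lucene; this is only meant to make
      the shape of the proposal concrete.

        import java.io.IOException;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.index.SegmentInfo;

        // Hypothetical interfaces only -- nothing below exists in Lucene today.
        // One component writes a single segment...
        interface SegmentWriter {
          void addDocument(Document doc) throws IOException;
          /** Flush buffered docs and return the finished segment's metadata. */
          SegmentInfo flush() throws IOException;
        }

        // ...and a second component owns the segments file, merging and
        // deleting segments, and offers hooks so external data structures
        // can stay in sync during merges.
        interface SegmentManager {
          void register(SegmentInfo newSegment) throws IOException;
          void commit() throws IOException;
          void addMergeListener(MergeListener listener);

          /** Hypothetical callback fired when segments are merged away. */
          interface MergeListener {
            void onMerge(SegmentInfo[] merged, SegmentInfo result);
          }
        }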

        Issue Links

          Activity

          John Wang added a comment -

          +1

          Michael McCandless added a comment -

          +1! IndexWriter has become immense.

          I think we should also pull out ReaderPool?

          Michael Busch added a comment -

          I think we should also pull out ReaderPool?

          +1!

          Earwin Burrfoot added a comment -

          We need an ability to see segment write (and probably deleted doc list write) as a discernible atomic operation. Right now it looks like several file writes, and we can't, say - redirect all files belonging to a certain segment to another Directory (well, in a simple manner). 'Something' should sit between a Directory (or several Directories) and IndexWriter.

          If we could do this, the current NRT search implementation will be largely obsoleted, innit? Just override the default impl of 'something' and send smaller segments to ram, bigger to disk, copy ram segments to disk asynchronously if we want to. Then we can use your granma's IndexReader and IndexWriter, totally decoupled from each other, and have blazing fast addDocument-commit-reopen turnaround.
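
          For illustration, a rough sketch of what such a 'something' could
          look like -- purely an assumption, not an existing Lucene class: a
          per-segment router that decides which Directory a whole segment's
          files go to, so the segment write stays atomic per destination.

            import org.apache.lucene.store.Directory;

            // Hypothetical router sitting between IndexWriter and one or more
            // Directories; the rest of the Directory plumbing is omitted.
            public class SegmentRouter {
              private final Directory ramDir;   // small, short-lived segments
              private final Directory diskDir;  // large, long-lived segments
              private final long ramThresholdBytes;

              public SegmentRouter(Directory ramDir, Directory diskDir,
                                   long ramThresholdBytes) {
                this.ramDir = ramDir;
                this.diskDir = diskDir;
                this.ramThresholdBytes = ramThresholdBytes;
              }

              /** All files of one segment go to the same Directory. */
              public Directory choose(String segmentName, long estimatedSizeBytes) {
                return estimatedSizeBytes <= ramThresholdBytes ? ramDir : diskDir;
              }
            }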

          Earwin Burrfoot added a comment -

          Oh, forgive me if I just said something stupid

          Michael McCandless added a comment -

          I think what you're describing is in fact the approach that LUCENE-1313 is taking; it's doing the switching internally between the main Dir & a private RAM Dir.

          But in my testing so far (LUCENE-2061), it doesn't seem like it'll help performance much. Ie, the OS generally seems to do a fine job putting those segments in RAM, itself. Ie, by maintaining a write cache. The weirdness is: that only holds true if you flush the segments when they are tiny (once per second, every 100 docs, in my test) – not yet sure why that's the case. I'm going to re-run perf tests on a more mainstream OS (my tests are all OpenSolaris) and see if that strangeness still happens.

          But I think you still need to not do commit() during the reopen.

          I do think refactoring IW so that there is a separate component that keeps track of segments in the index, may simplify NRT, in that you can go to that source for your current "segments file" even if that segments file is uncommitted. In such a world you could do something like IndexReader.open(SegmentState) and it would be able to open (and, reopen) the real-time reader. It's just that it's seeing changes to the SegmentState done by the writer, even if they're not yet committed.
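
          A sketch of the kind of API being floated here -- SegmentState and
          IndexReader.open(SegmentState) are hypothetical, nothing like them
          exists in Lucene today:

            import java.util.List;
            import org.apache.lucene.index.SegmentInfo;

            // Hypothetical: the writer-maintained "source" for the current
            // segments, whether or not they are committed to a segments_N file.
            interface SegmentState {
              /** Live view of the index, possibly including uncommitted segments. */
              List<SegmentInfo> currentSegments();
              /** Generation of the last durable commit, for callers that care. */
              long lastCommittedGeneration();
            }

            // Imagined usage (not a real API): IndexReader.open(segmentState)
            // would build SegmentReaders from this live view instead of reading
            // segments_N from disk.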

          Earwin Burrfoot added a comment -

          If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call. That sync call on memory-based Directory is noop.

          And no, you really should commit() to be able to see stuff on reopen(). My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

          The original purpose of current NRT code was.. well.. let me remember.. NRT search! With per-segment caches and sync lag defeated you get the delay between doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled??
          Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

          *: Okay, there remains an issue of merges that piggyback on commits, so writing and committing one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go, fix your mergepolicies and have a thread that merges asynchronously.
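
          (For reference, merges already run on background threads via the
          MergeScheduler; a minimal sketch, assuming the pre-IndexWriterConfig
          setter from the 2.x/3.0 era -- newer versions configure this through
          IndexWriterConfig instead:)

            import org.apache.lucene.index.ConcurrentMergeScheduler;
            import org.apache.lucene.index.IndexWriter;

            // ConcurrentMergeScheduler performs merges in background threads and
            // is already the default; this only makes the choice explicit.
            void useBackgroundMerges(IndexWriter writer) {
              writer.setMergeScheduler(new ConcurrentMergeScheduler());
            }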

          Michael McCandless added a comment -

          If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call.

          I agree, per-segment searching was the most important step towards
          NRT. It's a great step forward...

          But the fsync call is a killer, so avoiding it in the NRT path is
          necessary. It's also very OS/FS dependent.

          That sync call on memory-based Directory is noop.

          Until you need to spill over to disk because your RAM buffer is full?

          Also, if IW.commit() is called, I would expect any changes in RAM
          should be committed to the real dir (stable storage)?

          And, going through RAM first will necessarily be a hit on indexing
          throughput (Jake estimates 10% hit in Zoie's case). Really, our
          current approach goes through RAM as well, in that OS's write cache
          (if the machine has spare RAM) will quickly accept the small index
          files & write them in the BG. It's not clear we can do better than
          the OS here...

          And no, you really should commit() to be able to see stuff on reopen(). My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

          Uh, this is an API that clearly states that its purpose is to search
          the uncommitted changes. If you really want to be "pure"
          transactional, don't use this API

          The original purpose of current NRT code was.. well.. let me remember.. NRT search! With per-segment caches and sync lag defeated you get the delay between doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled?? Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

          In fact I favor our current approach because of its simplicity.

          Have a look at LUCENE-1313 (adds RAMDir as you're discussing), or,
          Zoie, which also adds the RAMDir and backgrounds resolving deleted
          docs – they add complexity to Lucene that I don't think is warranted.

          My general feeling at this point is with per-segment searching, and
          fsync avoided, NRT performance is excellent.

          We've explored a number of possible tweaks to improve it –
          writing first to RAMDir (LUCENE-1313), resolving deletes in the
          foreground (LUCENE-2047), using paged BitVector for deletions
          (LUCENE-1526), Zoie (buffering segments in RAM & backgrounds resolving
          deletes), etc., but, based on testing so far, I don't see the
          justification for the added complexity.

          *: Okay, there remains an issue of merges that piggyback on commits, so writing and committing one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go, fix your mergepolicies and have a thread that merges asynchronously.

          This already runs in the BG by default. But warming the reader on the
          merged segment (before lighting it) is important (IW does this today).

          Earwin Burrfoot added a comment - - edited

          Until you need to spill over to disk because your RAM buffer is full?

          No, buffer is there only to decouple indexing from writing. Can be spilled over asynchronously without waiting for it to be filled up.

          Okay, we agree on a zillion things, except simplicity of the current NRT, and the approach to commit().

          Good commit() behaviour consists of two parts:
          1. Everything commit()ed is guaranteed to be on disk.
          2. Until commit() is called, reading threads don't see new/updated records.

          Now we want more speed, and are ready to sacrifice something if needed.
          You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

          I say it's better to sacrifice write guarantee. In the rare case the process/machine crashes, you can reindex last few minutes' worth of docs. Now you don't have to hack into IW and write specialized readers. Hence, simplicity. You have only one straightforward writer, you have only one straightforward reader (which is nicely immutable and doesn't need any synchronization code).

          In fact you don't even need to sacrifice write guarantee. What was the reason for it? The only one I can come up with is - the thread that does writes and sync() is different from the thread that calls commit(). But, commit() can return a Future.
          So the process goes as:

          • You index docs, nobody sees them, nor deletions.
          • You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
          • Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, ram cache or not, OS/raid controller cache or not.

          For back-compat purposes we can use another name for that Future-returning-commit(), and current commit() will just call this new method and wait on the future returned.
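
          (A rough sketch of the Future-returning shape, built on top of
          today's IndexWriter.commit() -- the method names are made up, and
          this only shows the Future part; the "visible immediately, durable
          later" split would need support inside the writer itself:)

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;
            import org.apache.lucene.index.IndexWriter;

            // Hypothetical wrapper: push the durable commit/fsync onto a
            // background thread and hand the caller a Future to wait on only
            // if durability matters to them.
            public class AsyncCommitter {
              private final IndexWriter writer;
              private final ExecutorService syncThread =
                  Executors.newSingleThreadExecutor();

              public AsyncCommitter(IndexWriter writer) {
                this.writer = writer;
              }

              /** Completes once the changes are actually on stable storage. */
              public Future<?> commitAsync() {
                return syncThread.submit(new Runnable() {
                  public void run() {
                    try {
                      writer.commit(); // does the fsync
                    } catch (Exception e) {
                      throw new RuntimeException(e);
                    }
                  }
                });
              }

              /** Back-compat style commit(): block until the sync finishes. */
              public void commit() throws Exception {
                commitAsync().get();
              }
            }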

          Okay, with that I'm probably shutting up on the topic until I can back myself up with code. Sadly, my current employer is happy with update lag in tens of seconds

          Marvin Humphrey added a comment -

          > I say it's better to sacrifice write guarantee.

          I don't grok why sync is the default, especially given how sketchy hardware
          drivers are about obeying fsync:

          But, beware: some hardware devices may in fact cache writes even during
          fsync, and return before the bits are actually on stable storage, to give the
          appearance of faster performance.

          IMO, it should have been an option which defaults to false, to be enabled only by
          users who have the expertise to ensure that fsync() is actually doing what
          it advertises. But what's done is done (and Lucy will probably just do something
          different.)

          With regard to Lucene NRT, though, turning sync() off would really help. If and
          when some sort of settings class comes about, an enableSync(boolean enabled)
          method seems like it would come in handy.

          Jake Mannix added a comment -

          Now we want more speed, and are ready to sacrifice something if needed.

          You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

          Chiming in here that of course, you don't need (ie there is a choice) to hack into the IW to do this. Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping speedy indexing and query performance. It just keeps the RAMDir outside encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

          The API for this kind of thing doesn't have to be tightly coupled, and I would agree with you that it shouldn't be.
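
          (For illustration, the general shape of such a user-land setup --
          not Zoie's actual code, just a sketch that combines a disk reader
          and a RAM reader using the 2.9/3.x-era IndexReader.open(Directory)
          and MultiReader APIs; deletes against already-flushed docs still
          need their own handling:)

            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.index.MultiReader;
            import org.apache.lucene.store.Directory;

            // The indexing system owns both directories and hands out a
            // combined view over durable and freshly indexed segments.
            public class CombinedReaderFactory {
              private final Directory fsDir;   // large, durable segments
              private final Directory ramDir;  // recent, not yet on disk

              public CombinedReaderFactory(Directory fsDir, Directory ramDir) {
                this.fsDir = fsDir;
                this.ramDir = ramDir;
              }

              public IndexReader openCombined() throws Exception {
                IndexReader diskReader = IndexReader.open(fsDir);
                IndexReader ramReader = IndexReader.open(ramDir);
                return new MultiReader(new IndexReader[] { diskReader, ramReader });
              }
            }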

          Michael McCandless added a comment -

          Until you need to spill over to disk because your RAM buffer is full?

          No, buffer is there only to decouple indexing from writing. Can be spilled over asynchronously without waiting for it to be filled up.

          But this is where things start to get complex... the devil is in the
          details here. How do you carry over your deletes? This spillover
          will take time – do you block all indexing while that's happening
          (not great)? Do you do it gradually (start spillover when half full,
          but still accept indexing)? Do you throttle things if index rate
          exceeds flush rate? How do you recover on exception?

          NRT today lets the OS's write cache decide how to use RAM to speed up
          writing of these small files, which keeps things a lot simpler for us.
          I don't see why we should add complexity to Lucene to replicate what
          the OS is doing for us (NOTE: I don't really trust the OS in the
          reverse case... I do think Lucene should read into RAM the data
          structures that are important).

          You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

          Now you don't have to hack into IW and write specialized readers.

          Probably we'll just have to disagree here... NRT isn't a hack

          IW is already hanging onto completely normal segments. Ie, the index
          has been updated with these segments, just not yet published so
          outside readers can see it. All NRT does is let a reader see this
          private view.

          The readers that an NRT reader exposes are normal SegmentReaders –
          it's just that rather than consult a segments_N on disk to get the
          segment metadata, they're pulled from IW's uncommitted in-memory
          SegmentInfos instance.
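
          (For context, that NRT reader is obtained roughly like this, using
          IndexWriter.getReader() as it exists in the 2.9/3.x line; later
          versions expose the same thing through different entry points:)

            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.index.IndexWriter;
            import org.apache.lucene.search.IndexSearcher;

            // Near-real-time reader: sees the writer's uncommitted segments
            // without requiring a commit()/fsync first.
            void searchUncommitted(IndexWriter writer) throws Exception {
              IndexReader nrtReader = writer.getReader();
              try {
                IndexSearcher searcher = new IndexSearcher(nrtReader);
                // ... run queries over docs that are indexed but not committed ...
              } finally {
                nrtReader.close();
              }
            }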

          Yes we've talked about the "hot innards" solution – an IndexReader
          impl that can directly search DW's ram buffer – but that doesn't look
          necessary today, because performance of NRT is good with the simple
          solution we have now.

          NRT reader also gains performance by carrying over deletes in RAM. We
          should eventually do the same thing with norms & field cache. No
          reason to write to disk, then right away read again.

          • You index docs, nobody sees them, nor deletions.
          • You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
          • Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, ram cache or not, OS/raid controller cache or not.

          But this is not a commit, if docs/deletes are written down into RAM?
          Ie, commit could return, then the machine could crash, and you've lost
          changes? Commit should go through to stable storage before returning?
          Maybe I'm just missing the big picture of what you're proposing
          here...

          Also, you can build all this out on top of Lucene today? Zoie is a
          proof point of this. (Actually: how does your proposal differ from
          Zoie? Maybe that'd help shed light...).

          I say it's better to sacrifice write guarantee. In the rare case the process/machine crashes, you can reindex last few minutes' worth of docs.

          It is not that simple – if you skip the fsync, and OS crashes/you
          lose power, your index can easily become corrupt. The resulting
          CheckIndex -fix can easily need to remove large segments.

          The OS's write cache makes no guarantees on the order in which the
          files you've written find their way to disk.

          Another option (we've discussed this) would be journal file approach
          (ie transaction log, like most DBs use). You only have one file to
          fsync, and you replay to recover. But that'd be a big change for
          Lucene, would add complexity, and can be accomplished outside of
          Lucene if an app really wants to...

          Let me try turning this around: in your componentization of
          SegmentReader, why does it matter who's tracking which components are
          needed to make up a given SR? In the IndexReader.open case, it's a
          SegmentInfos instance (obtained by loading the segments_N file from disk).
          In the NRT case, it's also a SegmentInfos instance (the one IW is
          privately keeping track of and only publishing on commit). At the
          component level, creating the SegmentReader should be no different?

          Michael McCandless added a comment -

          > I say it's better to sacrifice write guarantee.

          I don't grok why sync is the default, especially given how sketchy hardware
          drivers are about obeying fsync:

          But, beware: some hardware devices may in fact cache writes even during
          fsync, and return before the bits are actually on stable storage, to give the
          appearance of faster performance.

          It's unclear how often this scare-warning is true in practice (scare
          warnings tend to spread very easily without concrete data); it's in
          the javadocs for completeness sake. I expect (though have no data to
          back this up...) that most OS/IO systems "out there" do properly
          implement fsync.

          IMO, it should have been an option which defaults to false, to be enabled only by
          users who have the expertise to ensure that fsync() is actually doing what
          it advertises. But what's done is done (and Lucy will probably just do something
          different.)

          I think that's a poor default (trades safety for performance), unless
          Lucy eg uses a transaction log so you can concretely bound what's lost
          on crash/power loss. Or, if you go back to autocommitting I guess...

          If we did this in Lucene, you can have unbounded corruption. It's not
          just the last few minutes of updates...

          So, I don't think we should even offer the option to turn it off. You
          can easily subclass your FSDir impl and make sync() a no-op if you
          really want to...
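
          (For completeness, a sketch of that subclass -- assuming the
          3.x-era Directory.sync(String) signature; later versions use
          sync(Collection<String>) instead:)

            import java.io.File;
            import java.io.IOException;
            import org.apache.lucene.store.NIOFSDirectory;

            // Unsafe by design: skips fsync entirely, trading crash safety
            // for commit speed.
            public class NoSyncDirectory extends NIOFSDirectory {
              public NoSyncDirectory(File path) throws IOException {
                super(path);
              }

              @Override
              public void sync(String name) {
                // intentionally a no-op -- on power loss the index may be lost
                // or corrupted
              }
            }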

          With regard to Lucene NRT, though, turning sync() off would really help. If and
          when some sort of settings class comes about, an enableSync(boolean enabled)
          method seems like it would come in handy.

          You don't need to turn off sync for NRT – that's the whole point. It
          gives you a reader without syncing the files. Really, this is your
          safety tradeoff – it means you can commit less frequently, since the
          NRT reader can search the latest updates. But, your app has
          complete control over how it wants to trade safety for performance.

          Michael McCandless added a comment -

          Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping speedy indexing and query performance. It just keeps the RAMDir outside encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

          Right, one can always not use NRT and build their own layers on top.

          But, Zoie has a lot of code to accomplish this – the devil really is
          in the details to "simply write first to a RAMDir". This is why I'd
          like Earwin to look @ Zoie and clarify his proposed approach, in
          contrast...

          Actually, here's a question: how quickly can Zoie turn around a
          commit()? Seems like it must take more time than Lucene, since it does
          extra stuff (flush RAM buffers to disk, materialize deletes) before
          even calling IW.commit.

          At the end of the day, any NRT system has to trade safety for
          performance (bypass the sync call in the NRT reader)....

          The API for this kind of thing doesn't have to be tightly coupled, and I would agree with you that it shouldn't be.

          I don't consider NRT today to be a tight coupling (eg, the pending
          refactoring of IW would nicely separate it out). If we implement the
          IR that searches DW's RAM buffer, then I'd agree

          Marvin Humphrey added a comment -

          > I think that's a poor default (trades safety for performance), unless
          > Lucy eg uses a transaction log so you can concretely bound what's lost
          > on crash/power loss. Or, if you go back to autocommitting I guess...

          Search indexes should not be used for canonical data storage – they should be
          built on top of canonical data storage. Guarding against power failure
          induced corruption in a database is an imperative. Guarding against power
          failure induced corruption in a search index is a feature, not an imperative.

          Users have many options for dealing with the potential for such corruption.
          You can go back to your canonical data store and rebuild your index from
          scratch when it happens. In a search cluster environment, you can rsync a
          known-good copy from another node. Potentially, you might enable
          fsync-before-commit and keep your own transaction log. However, if the time
          it takes to rebuild or recover an index from scratch would have caused you
          unacceptable downtime, you can't possibly be operating in a
          single-point-of-failure environment where a power failure could take you down
          anyway – so other recovery options are available to you.

          Turning on fsync is only one step towards ensuring index integrity; other
          steps involve making decisions about hard drives, RAID arrays, failover
          strategies, network and off-site backups, etc, and are outside of our domain
          as library authors. We cannot meet the needs of users who need guaranteed
          index integrity on our own.

          For everybody else, what turning on fsync by default achieves is to make an
          exceedingly rare event rarer. That's valuable, but not essential. My
          argument is that since the search indexes should not be used for canonical
          storage, and since fsync is not testably reliable and not sufficient on its
          own, it's a good engineering compromise to prioritize performance.

          > If we did this in Lucene, you can have unbounded corruption. It's not
          > just the last few minutes of updates...

          Wasn't that a possibility under autocommit as well? All it takes is for the
          OS to finish flushing the new snapshot file to persistent storage before it
          finishes flushing a segment data file needed by that snapshot, and for the
          power failure to squeeze in between.

          In practice, locality of reference is going to make the window very very
          small, since those two pieces of data will usually get written very close to
          each other on the persistent media.

          I've seen a lot more messages to our user lists over the years about data
          corruption caused by bugs and misconfigurations than by power failures.

          But really, that's as it should be. Ensuring data integrity to the degree
          required by a database is costly – it requires far more rigorous testing, and
          far more conservative development practices. If we accept that our indexes
          must never go corrupt, it will retard innovation.

          Of course we should work very hard to prevent index corruption. However, I'm
          much more concerned about stuff like silent omission of search results due to
          overzealous, overly complex optimizations than I am about problems arising
          from power failures. When a power failure occurs, you know it – so you get
          the opportunity to fsck the disk, run checkIndex(), perform data integrity
          reconciliation tests against canonical storage, and if anything fails, take
          whatever recovery actions you deem necessary.

          > You don't need to turn off sync for NRT - that's the whole point. It
          > gives you a reader without syncing the files.

          I suppose this is where Lucy and Lucene differ. Thanks to mmap and the
          near-instantaneous reader opens it has enabled, we don't need to keep a
          special reader alive. Since there's no special reader, the only way to get
          data to a search process is to go through a commit. But if we fsync on every
          commit, we'll drag down indexing responsiveness. Finishing the commit and
          returning control to client code as quickly as possible is a high priority for
          us.

          Furthermore, I don't want us to have to write the code to support a
          near-real-time reader hanging off of IndexWriter a la Lucene. The
          architectural discussions have made for very interesting reading, but the
          design seems to be tricky to pull off, and implementation simplicity in core
          search code is a high priority for Lucy. It's better for Lucy to kill two
          birds with one stone and concentrate on making all index opens fast.

          > Really, this is your safety tradeoff - it means you can commit less
          > frequently, since the NRT reader can search the latest updates. But, your
          > app has complete control over how it wants to trade safety for
          > performance.

          So long as fsync is an option, the app always has complete control, regardless
          of whether the default setting is fsync or no fsync.

          If a Lucene app wanted to increase NRT responsiveness and throughput, and if
          absolute index integrity wasn't a concern because it had been addressed
          through other means (e.g. multi-node search cluster), would turning off fsync
          speed things up under any of the proposed designs?

          Jason Rutherglen added a comment -

          I think large scale NRT installations may eventually require a
          distributed transaction log. The implementation details have yet
          to be determined; however, it could potentially solve the issue of
          data loss being discussed. One candidate is a combo of ZooKeeper
          + BookKeeper. I would venture to guess this could be implemented
          as a part of Solr, however we've got a lot of work to do for
          Solr to be reasonably NRT efficient (see the tracking issue
          SOLR-1606), and we're just starting on the Zookeeper
          implementation SOLR-1277...

          Michael McCandless added a comment -

          I think that's a poor default (trades safety for performance), unless Lucy eg uses a transaction log so you can concretely bound what's lost on crash/power loss. Or, if you go back to autocommitting I guess...

          Search indexes should not be used for canonical data storage - they should be
          built on top of canonical data storage.

          I agree with that, in theory, but I think in practice it's too
          idealistic to force/expect apps to meet that ideal.

          I expect for many apps it's a major cost to unexpectedly lose the
          search index on power loss / OS crash.

          Users have many options for dealing with the potential for such corruption.
          You can go back to your canonical data store and rebuild your index from
          scratch when it happens. In a search cluster environment, you can rsync a
          known-good copy from another node. Potentially, you might enable
          fsync-before-commit and keep your own transaction log. However, if the time
          it takes to rebuild or recover an index from scratch would have caused you
          unacceptable downtime, you can't possibly be operating in a
          single-point-of-failure environment where a power failure could take you down
          anyway - so other recovery options are available to you.

          Turning on fsync is only one step towards ensuring index integrity; other
          steps involve making decisions about hard drives, RAID arrays, failover
          strategies, network and off-site backups, etc, and are outside of our domain
          as library authors. We cannot meet the needs of users who need guaranteed
          index integrity on our own.

          Yes, high availability apps will already take their measures to
          protect the search index / recovery process, going beyond fsync.
          EG, making a hot backup of a Lucene index is now straightforward.

          For everybody else, what turning on fsync by default achieves is to make an
          exceedingly rare event rarer. That's valuable, but not essential. My
          argument is that since the search indexes should not be used for canonical
          storage, and since fsync is not testably reliable and not sufficient on its
          own, it's a good engineering compromise to prioritize performance.

          Losing power to the machine, or OS crash, or the user doing a hard
          power down because OS isn't responding, I think are not actually
          that uncommon in an end user setting. Think of a desktop app
          embedding Lucene/Lucy...

          If we did this in Lucene, you can have unbounded corruption. It's not just the last few minutes of updates...

          Wasn't that a possibility under autocommit as well? All it takes is for the
          OS to finish flushing the new snapshot file to persistent storage before it
          finishes flushing a segment data file needed by that snapshot, and for the
          power failure to squeeze in between.

          Not after LUCENE-1044... autoCommit simply called commit() at certain
          opportune times (after finishing big merges), which does the right thing
          (I hope!). The segments file is not written until all files it
          references are sync'd.

          In practice, locality of reference is going to make the window very very
          small, since those two pieces of data will usually get written very close to
          each other on the persistent media.

          Not sure about that – it depends on how effectively the OS's write cache
          "preserves" that locality.

          I've seen a lot more messages to our user lists over the years about data
          corruption caused by bugs and misconfigurations than by power failures.

          I would agree, though, I think it may be a sampling problem... ie
          people whose machines crashed and they lost the search index would
          often not raise it on the list (vs say a persistent config issue that keeps
          leading to corruption).

          But really, that's as it should be. Ensuring data integrity to the degree
          required by a database is costly - it requires far more rigorous testing, and
          far more conservative development practices. If we accept that our indexes
          must never go corrupt, it will retard innovation.

          It's not really that costly, with NRT – you can get a searcher on the
          index without paying the commit cost. And now you can call commit
          however frequently you need to. Quickly turning around a new
          searcher, and how frequently you commit, are now independent.

          Also, having the app explicitly decouple these two notions keeps the
          door open for future improvements. If we force absolutely all sharing
          to go through the filesystem then that limits the improvements we can
          make to NRT.
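
          A rough illustration of that decoupling, assuming the 2.9-era
          IndexWriter.getReader() API referred to in this thread (a sketch of the
          usage pattern, not a statement about defaults):

              import org.apache.lucene.index.IndexReader;
              import org.apache.lucene.index.IndexWriter;

              // Sketch: freshness (reopen) and durability (commit) run on
              // independent schedules.
              class NrtVersusCommit {
                // freshness path: cheap, does not fsync index files
                static IndexReader freshReader(IndexWriter writer) throws Exception {
                  return writer.getReader();   // sees segments not yet committed
                }

                // safety path: call on whatever schedule the app can afford
                static void makeDurable(IndexWriter writer) throws Exception {
                  writer.commit();             // syncs referenced files, then publishes segments_N
                }
              }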

          Of course we should work very hard to prevent index corruption. However, I'm
          much more concerned about stuff like silent omission of search results due to
          overzealous, overly complex optimizations than I am about problems arising
          from power failures. When a power failure occurs, you know it - so you get
          the opportunity to fsck the disk, run checkIndex(), perform data integrity
          reconciliation tests against canonical storage, and if anything fails, take
          whatever recovery actions you deem necessary.

          Well... I think search performance is important, and we should pursue it
          even if we risk bugs.

          You don't need to turn off sync for NRT - that's the whole point. It gives you a reader without syncing the files.

          I suppose this is where Lucy and Lucene differ. Thanks to mmap and the
          near-instantaneous reader opens it has enabled, we don't need to keep a
          special reader alive. Since there's no special reader, the only way to get
          data to a search process is to go through a commit. But if we fsync on every
          commit, we'll drag down indexing responsiveness. Finishing the commit and
          returning control to client code as quickly as possible is a high priority for
          us.

          NRT reader isn't that special – the only things different are 1) it
          loaded the segments_N "file" from IW instead of the filesystem, and 2)
          it uses a reader pool to "share" the underlying SegmentReaders with
          other places that have loaded them. I guess, if Lucy won't allow
          this, then, yes, forcing a commit in order to reopen is very costly,
          and so sacrificing safety is a tradeoff you have to make.

          Alternatively, you could keep the notion "flush" (an unsafe commit)
          alive? You write the segments file, but make no effort to ensure its
          durability (and also preserve the last "true" commit). Then a normal
          IR.reopen suffices...

          Furthermore, I don't want us to have to write the code to support a
          near-real-time reader hanging off of IndexWriter a la Lucene. The
          architectural discussions have made for very interesting reading, but the
          design seems to be tricky to pull off, and implementation simplicity in core
          search code is a high priority for Lucy. It's better for Lucy to kill two
          birds with one stone and concentrate on making all index opens fast.

          But shouldn't you at least give an option for index durability? Even
          if we disagree about the default?

          Really, this is your safety tradeoff - it means you can commit less frequently, since the NRT reader can search the latest updates. But, your app has complete control over how it wants to trade safety for performance.

          So long as fsync is an option, the app always has complete control,
          regardless of whether the default setting is fsync or no fsync.

          Well, it is an "option" in Lucene – "it's just software". I don't
          want to make it easy to be unsafe. Lucene shouldn't sacrifice safety
          of the index... and with NRT there's no need to make that tradeoff.

          If a Lucene app wanted to increase NRT responsiveness and throughput, and if
          absolute index integrity wasn't a concern because it had been addressed
          through other means (e.g. multi-node search cluster), would turning off fsync
          speed things up under any of the proposed designs?

          Yes, turning off fsync would speed things up – you could fall back to
          simple reopen and get good performance (NRT should still be faster
          since the readers are pooled). The "use RAMDir on top of Lucene"
          designs would be helped less since fsync is a noop in RAMDir.
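
          For what it's worth, a sketch of how an app could make that tradeoff
          explicitly, assuming a later FilterDirectory-style wrapper and the
          sync(Collection) signature (those are the only Lucene specifics assumed
          here):

              import java.util.Collection;
              import org.apache.lucene.store.Directory;
              import org.apache.lucene.store.FilterDirectory;

              // Sketch: an app that handles durability elsewhere (e.g. a replica
              // cluster) routes its writer through a Directory whose sync() is a no-op.
              class NoSyncDirectory extends FilterDirectory {
                NoSyncDirectory(Directory in) {
                  super(in);
                }

                @Override
                public void sync(Collection<String> names) {
                  // intentionally skip fsync: trades crash safety for commit latency
                }
              }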

          Hide
          Marvin Humphrey added a comment -

          >> Wasn't that a possibility under autocommit as well? All it takes is for the
          >> OS to finish flushing the new snapshot file to persistent storage before it
          >> finishes flushing a segment data file needed by that snapshot, and for the
          >> power failure to squeeze in between.
          >
          > Not after LUCENE-1044... autoCommit simply called commit() at certain
          > opportune times (after finishing big merges), which does the right thing (I
          > hope!). The segments file is not written until all files it references are
          > sync'd.

          FWIW, autoCommit doesn't really have a place in Lucy's
          one-segment-per-indexing-session model.

          Revisiting the LUCENE-1044 threads, one passage stood out:

          http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

          This is why in a db system, the only file that is sync'd is the log
          file - all other files can be made "in sync" from the log file - and
          this file is normally striped for optimum write performance. Some
          systems have special "log file drives" (some even solid state, or
          battery backed ram) to aid the performance.

          The fact that we have to sync all files instead of just one seems sub-optimal.

          Yet Lucene is not well set up to maintain a transaction log. The very act of
          adding a document to Lucene is inherently lossy even if all fields are stored,
          because doc boost is not preserved.

          > Also, having the app explicitly decouple these two notions keeps the
          > door open for future improvements. If we force absolutely all sharing
          > to go through the filesystem then that limits the improvements we can
          > make to NRT.

          However, Lucy has much more to gain going through the file system than Lucene
          does, because we don't necessarily incur JVM startup costs when launching a
          new process. The Lucene approach to NRT – specialized reader hanging off of
          writer – is constrained to a single process. The Lucy approach – fast index
          opens enabled by mmap-friendly index formats – is not.

          The two approaches aren't mutually exclusive. It will be possible to augment
          Lucy with a specialized index reader within a single process. However, A)
          there seems to be a lot of disagreement about just how to integrate that
          reader, and B) there seem to be ways to bolt that functionality on top of the
          existing classes. Under those circumstances, I think it makes more sense to
          keep that feature external for now.

          > Alternatively, you could keep the notion "flush" (an unsafe commit)
          > alive? You write the segments file, but make no effort to ensure its
          > durability (and also preserve the last "true" commit). Then a normal
          > IR.reopen suffices...

          That sounds promising. The semantics would differ from those of Lucene's
          flush(), which doesn't make changes visible.

          We could implement this by somehow marking a "committed" snapshot and a
          "flushed" snapshot differently, either by adding an "fsync" property to the
          snapshot file that would be false after a flush() but true after a commit(),
          or by encoding the property within the snapshot filename. The file purger
          would have to ensure that all index files referenced by either the last
          committed snapshot or the last flushed snapshot were off limits. A rollback()
          would zap all changes since the last commit().

          Such a scheme allows the top-level app to avoid the costs of fsync while
          maintaining its own transaction log – perhaps with the optimizations
          suggested above (separate disk, SSD, etc).
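
          A sketch of that purge rule (the names are illustrative, not an actual
          Lucy API): the union of files referenced by the last committed snapshot
          and the last flushed snapshot is off limits, and anything else may be
          deleted.

              import java.util.HashSet;
              import java.util.List;
              import java.util.Set;

              // Sketch of the proposed purger rule: protect every file referenced
              // by either the last committed or the last flushed snapshot.
              class SnapshotPurger {
                static Set<String> protectedFiles(List<String> committedRefs,
                                                  List<String> flushedRefs) {
                  Set<String> keep = new HashSet<String>(committedRefs);
                  keep.addAll(flushedRefs);          // union of both snapshots
                  return keep;
                }

                static boolean mayDelete(String fileName, Set<String> keep) {
                  return !keep.contains(fileName);   // rollback() re-reads the committed snapshot
                }
              }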

          Hide
          Michael McCandless added a comment -

          FWIW, autoCommit doesn't really have a place in Lucy's
          one-segment-per-indexing-session model.

          Well, autoCommit just means "periodically call commit". So, if you
          decide to offer a commit() operation, then autoCommit would just wrap
          that? But, I don't think autoCommit should be offered... app should
          decide.

          Revisiting the LUCENE-1044 threads, one passage stood out:

          http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

          This is why in a db system, the only file that is sync'd is the log
          file - all other files can be made "in sync" from the log file - and
          this file is normally striped for optimum write performance. Some
          systems have special "log file drives" (some even solid state, or
          battery backed ram) to aid the performance.

          The fact that we have to sync all files instead of just one seems sub-optimal.

          Yes, but, that cost is not on the reopen path, so it's much less
          important. Ie, the app can freely choose how frequently it wants to
          commit, completely independent from how often it needs to reopen.

          Yet Lucene is not well set up to maintain a transaction log. The very act of
          adding a document to Lucene is inherently lossy even if all fields are stored,
          because doc boost is not preserved.

          I don't see that those two statements are related.

          One can "easily" (meaning, it's easily decoupled from core) make a
          transaction log on top of Lucene – just serialize your docs/analyzer
          selection/etc to the log & sync it periodically.

          But, that's orthogonal to what Lucene does & doesn't preserve in its
          index (and, yes, Lucene doesn't precisely preserve boosts).
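
          As a concrete (if simplistic) sketch of that decoupled transaction log:
          append each serialized add/delete to one log file and fsync only that
          file; after a crash, replay re-feeds the operations to the writer.
          Everything below is illustrative, not an existing Lucene helper.

              import java.io.File;
              import java.io.RandomAccessFile;
              import java.nio.ByteBuffer;
              import java.nio.channels.FileChannel;

              // Sketch: the only file fsync'd on the hot path is the log itself.
              class DocLog {
                private final FileChannel channel;

                DocLog(File path) throws Exception {
                  channel = new RandomAccessFile(path, "rw").getChannel();
                  channel.position(channel.size());            // append to the end
                }

                void append(String serializedOp) throws Exception {
                  byte[] bytes = (serializedOp + "\n").getBytes("UTF-8");
                  channel.write(ByteBuffer.wrap(bytes));
                }

                void sync() throws Exception {
                  channel.force(true);                         // durable; index files are not fsync'd here
                }
              }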

          Also, having the app explicitly decouple these two notions keeps the door open for future improvements. If we force absolutely all sharing to go through the filesystem then that limits the improvements we can make to NRT.

          However, Lucy has much more to gain going through the file system than Lucene
          does, because we don't necessarily incur JVM startup costs when launching a
          new process. The Lucene approach to NRT - specialized reader hanging off of
          writer - is constrained to a single process. The Lucy approach - fast index
          opens enabled by mmap-friendly index formats - is not.

          The two approaches aren't mutually exclusive. It will be possible to augment
          Lucy with a specialized index reader within a single process. However, A)
          there seems to be a lot of disagreement about just how to integrate that
          reader, and B) there seem to be ways to bolt that functionality on top of the
          existing classes. Under those circumstances, I think it makes more sense to
          keep that feature external for now.

          Again: NRT is not a "specialized reader". It's a normal read-only
          DirectoryReader, just like you'd get from IndexReader.open, with the
          only difference being that it consulted IW to find which segments to
          open. Plus, it's pooled, so that if IW already has a given segment
          reader open (say because deletes were applied or merges are running),
          it's reused.

          We've discussed making it specialized (eg directly searching DW's RAM
          buffer, caching recently flushed segments in RAM, special
          incremental-copy-on-write data structures for deleted docs, etc.) but
          so far these changes don't seem worthwhile.

          The current approach to NRT is simple... I haven't yet seen
          performance gains strong enough to justify moving to "specialized
          readers".

          Yes, Lucene's approach must be in the same JVM. But we get important
          gains from this – reusing a single reader (the pool), carrying over
          merged deletions directly in RAM (and eventually field cache & norms
          too – LUCENE-1785).

          Instead, Lucy (by design) must do all sharing & access all index data
          through the filesystem (a decision, I think, could be dangerous),
          which will necessarily increase your reopen time. Maybe in practice
          that cost is small though... the OS write cache should keep everything
          fresh... but you still must serialize.

          Alternatively, you could keep the notion "flush" (an unsafe commit) alive? You write the segments file, but make no effort to ensure its durability (and also preserve the last "true" commit). Then a normal IR.reopen suffices...

          That sounds promising. The semantics would differ from those of Lucene's
          flush(), which doesn't make changes visible.

          We could implement this by somehow marking a "committed" snapshot and a
          "flushed" snapshot differently, either by adding an "fsync" property to the
          snapshot file that would be false after a flush() but true after a commit(),
          or by encoding the property within the snapshot filename. The file purger
          would have to ensure that all index files referenced by either the last
          committed snapshot or the last flushed snapshot were off limits. A rollback()
          would zap all changes since the last commit().

          Such a scheme allows the the top level app to avoid the costs of fsync while
          maintaining its own transaction log - perhaps with the optimizations
          suggested above (separate disk, SSD, etc).

          In fact, this would make Lucy's approach to NRT nearly identical to
          Lucene NRT.

          The only difference is, instead of getting the current uncommitted
          segments_N via RAM, Lucy uses the filesystem. And, of course
          Lucy doesn't pool readers. So this is really a Lucy-ification of
          Lucene's approach to NRT.

          So it has the same benefits as Lucene's NRT, ie, lets Lucy apps
          decouple decisions about safety (commit) and freshness (reopen
          turnaround time).

          Hide
          Marvin Humphrey added a comment -

          > Well, autoCommit just means "periodically call commit". So, if you
          > decide to offer a commit() operation, then autoCommit would just wrap
          > that? But, I don't think autoCommit should be offered... app should
          > decide.

          Agreed, autoCommit had benefits under legacy Lucene, but wouldn't be important
          now. If we did add some sort of "automatic commit" feature, it would mean
          something else: commit every change instantly. But that's easy to implement
          via a wrapper, so there's no point cluttering the primary index writer
          class to support such a feature.
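
          A sketch of that wrapper (the IndexWriter calls are real; the wrapper
          class itself is hypothetical):

              import org.apache.lucene.document.Document;
              import org.apache.lucene.index.IndexWriter;

              // Sketch: "auto-commit" as a thin layer outside the core writer.
              class AutoCommittingWriter {
                private final IndexWriter writer;

                AutoCommittingWriter(IndexWriter writer) {
                  this.writer = writer;
                }

                void addDocument(Document doc) throws Exception {
                  writer.addDocument(doc);
                  writer.commit();       // every change becomes durable immediately
                }
              }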

          > Again: NRT is not a "specialized reader". It's a normal read-only
          > DirectoryReader, just like you'd get from IndexReader.open, with the
          > only difference being that it consulted IW to find which segments to
          > open. Plus, it's pooled, so that if IW already has a given segment
          > reader open (say because deletes were applied or merges are running),
          > it's reused.

          Well, it seems to me that those two features make it special – particularly
          the pooling of SegmentReaders. You can't take advantage of that outside the
          context of IndexWriter:

          > Yes, Lucene's approach must be in the same JVM. But we get important
          > gains from this - reusing a single reader (the pool), carrying over
          > merged deletions directly in RAM (and eventually field cache & norms
          > too - LUCENE-1785).

          Exactly. In my view, that's what makes that reader "special": unlike ordinary
          Lucene IndexReaders, this one springs into being with its caches already
          primed rather than in need of lazy loading.

          But to achieve those benefits, you have to mod the index writing process.
          Those modifications are not necessary under the Lucy model, because the mere
          act of writing the index stores our data in the system IO cache.

          > Instead, Lucy (by design) must do all sharing & access all index data
          > through the filesystem (a decision, I think, could be dangerous),
          > which will necessarily increase your reopen time.

          Dangerous in what sense?

          Going through the file system is a tradeoff, sure – but it's pretty nice to
          design your low-latency search app free from any concern about whether
          indexing and search need to be coordinated within a single process.
          Furthermore, if separate processes are your primary concurrency model, going
          through the file system is actually mandatory to achieve best performance on a
          multi-core box. Lucy won't always be used with multi-threaded hosts.

          I actually think going through the file system is dangerous in a different
          sense: it puts pressure on the file format spec. The easy way to achieve IPC
          between writers and readers will be to dump stuff into one of the JSON files
          to support the killer-feature-du-jour – such as what I'm proposing with this
          "fsync" key in the snapshot file. But then we wind up with a bunch of crap
          cluttering up our index metadata files. I'm determined that Lucy will have a
          more coherent file format than Lucene, but with this IPC requirement we're
          setting our community up to push us in the wrong direction. If we're not
          careful, we could end up with a file format that's an unmaintainable jumble.

          But you're talking performance, not complexity costs, right?

          > Maybe in practice that cost is small though... the OS write cache should
          > keep everything fresh... but you still must serialize.

          Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records
          and 900 MB worth of sort cache data; opening a fresh searcher and loading all
          sort caches takes circa 21 ms.

          There's room to improve that further – we haven't yet implemented
          IndexReader.reopen() – but that was fast enough to achieve what we wanted to
          achieve.

          Hide
          Jason Rutherglen added a comment -

          Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records
          and 900 MB worth of sort cache data; opening a fresh searcher and loading all
          sort caches takes circa 21 ms.

          Marvin, very cool! Are you using the mmap module you mentioned at ApacheCon?

          Hide
          Marvin Humphrey added a comment -

          Yes, this is using the sort cache model worked out this spring on lucy-dev.
          The memory mapping happens within FSFileHandle (LUCY-83). SortWriter
          and SortReader haven't made it into the Lucy repository yet.

          Hide
          Michael McCandless added a comment -

          Again: NRT is not a "specialized reader". It's a normal read-only DirectoryReader, just like you'd get from IndexReader.open, with the only difference being that it consulted IW to find which segments to open. Plus, it's pooled, so that if IW already has a given segment reader open (say because deletes were applied or merges are running), it's reused.

          Well, it seems to me that those two features make it special - particularly
          the pooling of SegmentReaders. You can't take advantage of that outside the
          context of IndexWriter:

          OK, so maybe a little special. But really, that pooling should be
          factored out of IW. It's not writer specific.

          Yes, Lucene's approach must be in the same JVM. But we get important gains from this - reusing a single reader (the pool), carrying over merged deletions directly in RAM (and eventually field cache & norms too - LUCENE-1785).

          Exactly. In my view, that's what makes that reader "special": unlike ordinary
          Lucene IndexReaders, this one springs into being with its caches already
          primed rather than in need of lazy loading.

          But to achieve those benefits, you have to mod the index writing process.

          Mod the index writing, and the reader reopen, to use the shared pool.
          The pool in itself isn't writer specific.

          Really the pool is just like what you tap into when you call reopen –
          that method looks at the current "pool" of already opened segments,
          sharing what it can.

          Those modifications are not necessary under the Lucy model, because the mere act of writing the index stores our data in the system IO cache.

          But, that's where Lucy presumably takes a perf hit. Lucene can share
          these in RAM, not using the filesystem as the intermediary (eg we do
          that today with deletions; norms/field cache/eventual CSF can do the
          same.) Lucy must go through the filesystem to share.

          Instead, Lucy (by design) must do all sharing & access all index data through the filesystem (a decision, I think, could be dangerous), which will necessarily increase your reopen time.

          Dangerous in what sense?

          Going through the file system is a tradeoff, sure - but it's pretty nice to
          design your low-latency search app free from any concern about whether
          indexing and search need to be coordinated within a single process.
          Furthermore, if separate processes are your primary concurrency model, going
          through the file system is actually mandatory to achieve best performance on a
          multi-core box. Lucy won't always be used with multi-threaded hosts.

          I actually think going through the file system is dangerous in a different
          sense: it puts pressure on the file format spec. The easy way to achieve IPC
          between writers and readers will be to dump stuff into one of the JSON files
          to support the killer-feature-du-jour - such as what I'm proposing with this
          "fsync" key in the snapshot file. But then we wind up with a bunch of crap
          cluttering up our index metadata files. I'm determined that Lucy will have a
          more coherent file format than Lucene, but with this IPC requirement we're
          setting our community up to push us in the wrong direction. If we're not
          careful, we could end up with a file format that's an unmaintainable jumble.

          But you're talking performance, not complexity costs, right?

          Mostly I was thinking performance, ie, trusting the OS to make good
          decisions about what should be RAM resident, when it has limited
          information...

          But, also risky is that all important data structures must be
          "file-flat", though in practice that doesn't seem like an issue so
          far? The RAM resident things Lucene has – norms, deleted docs, terms
          index, field cache – seem to "cast" just fine to file-flat. If we
          switched to an FST for the terms index I guess that could get
          tricky...

          Wouldn't shared memory be possible for process-only concurrent models?
          Also, what popular systems/environments have this requirement (only
          process level concurrency) today?

          It's wonderful that Lucy can startup really fast, but, for most apps
          that's not nearly as important as searching/indexing performance,
          right? I mean, you start only once, and then you handle many, many
          searches / index many documents, with that process, usually?

          Maybe in practice that cost is small though... the OS write cache should keep everything fresh... but you still must serialize.

          Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records
          and 900 MB worth of sort cache data; opening a fresh searcher and loading all
          sort caches takes circa 21 ms.

          That's fabulously fast!

          But you really need to also test search/indexing throughput, reopen time
          (I think) once that's online for Lucy...

          There's room to improve that further - we haven't yet implemented
          IndexReader.reopen() - but that was fast enough to achieve what we wanted to
          achieve.

          Is reopen even necessary in Lucy?

          Hide
          Marvin Humphrey added a comment -

          > But, that's where Lucy presumably takes a perf hit. Lucene can share
          > these in RAM, not using the filesystem as the intermediary (eg we do
          > that today with deletions; norms/field cache/eventual CSF can do the
          > same.) Lucy must go through the filesystem to share.

          For a flush(), I don't think there's a significant penalty. The only extra
          costs Lucy will pay are the bookkeeping costs to update the file system state
          and to create the objects that read the index data. Those are real, but since
          we're skipping the fsync(), they're small. As far as the actual data, I don't
          see that there's a difference. Reading from memory mapped RAM isn't any
          slower than reading from malloc'd RAM.

          If we have to fsync(), there'll be a cost, but in Lucene you have to pay that
          same cost, too. Lucene expects to get around it with IndexWriter.getReader().
          In Lucy, we'll get around it by having you call flush() and then reopen a
          reader somewhere, often in another process.

          • In both cases, the availability of fresh data is decoupled from the fsync.
          • In both cases, the indexing process has to be careful about dropping data
            on the floor before a commit() succeeds.
          • In both cases, it's possible to protect against unbounded corruption by
            rolling back to the last commit.

          > Mostly I was thinking performance, ie, trusting the OS to make good
          > decisions about what should be RAM resident, when it has limited
          > information...

          Right, for instance because we generally can't force the OS to pin term
          dictionaries in RAM, as discussed a while back. It's not an ideal situation,
          but Lucene's approach isn't bulletproof either, since Lucene's term
          dictionaries can get paged out too.

          We're sure not going to throw away all the advantages of mmap and go back to
          reading data structures into process RAM just because of that.

          > But, also risky is that all important data structures must be "file-flat",
          > though in practice that doesn't seem like an issue so far?

          It's a constraint. For instance, to support mmap, string sort caches
          currently require three "files" each: ords, offsets, and UTF-8 character data.

          The compound file system makes the file proliferation bearable, though. And
          it's actually nice in a way to have data structures as named files, strongly
          separated from each other and persistent.
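
          To make that three-file layout concrete, a hedged sketch (in Java for
          consistency with the rest of this thread; the field names are
          illustrative, not Lucy's spec) of how a lookup composes ords, offsets,
          and the UTF-8 character data:

              import java.nio.ByteBuffer;

              // Sketch: doc id -> ord -> (start, end) offsets -> UTF-8 bytes -> String.
              class StringSortCache {
                private final int[] ords;           // doc id -> ordinal of its sorted value
                private final long[] offsets;       // ordinal -> start offset, plus one sentinel
                private final ByteBuffer charData;  // concatenated UTF-8 values (mmap'd in Lucy)

                StringSortCache(int[] ords, long[] offsets, ByteBuffer charData) {
                  this.ords = ords;
                  this.offsets = offsets;
                  this.charData = charData;
                }

                String valueForDoc(int docId) throws Exception {
                  int ord = ords[docId];
                  int start = (int) offsets[ord];
                  int end = (int) offsets[ord + 1];
                  byte[] utf8 = new byte[end - start];
                  ByteBuffer slice = charData.duplicate();
                  slice.position(start);
                  slice.get(utf8);
                  return new String(utf8, "UTF-8");  // the point where UTF-8 sanity checking matters
                }
              }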

          If we were willing to ditch portability, we could cast to arrays of structs in
          Lucy – but so far we've just used primitives. I'd like to keep it that way,
          since it would be nice if the core Lucy file format was at least theoretically
          compatible with a pure Java implementation. But Lucy plugins could break that
          rule and cast to structs if desired.

          > The RAM resident things Lucene has - norms, deleted docs, terms index, field
          > cache - seem to "cast" just fine to file-flat.

          There are often benefits to keeping stuff "file-flat", particularly when the
          file-flat form is compressed. If we were to expand those sort caches to
          string objects, they'd take up more RAM than they do now.

          I think the only significant drawback is security: we can't trust memory
          mapped data the way we can data which has been read into process RAM and
          checked on the way in. For instance, we need to perform UTF-8 sanity checking
          each time a string sort cache value escapes the controlled environment of the
          cache reader. If the sort cache value was instead derived from an existing
          string in process RAM, we wouldn't need to check it.

          > If we switched to an FST for the terms index I guess that could get
          > tricky...

          Hmm, I haven't been following that. Too much work to keep up with those
          giganto patches for flex indexing, even though it's a subject I'm intimately
          acquainted with and deeply interested in. I plan to look it over when you're
          done and see if we can simplify it.

          > Wouldn't shared memory be possible for process-only concurrent models?

          IPC is a platform-compatibility nightmare. By restricting ourselves to
          communicating via the file system, we save ourselves oodles of engineering
          time. And on really boring, frustrating work, to boot.

          > Also, what popular systems/environments have this requirement (only process
          > level concurrency) today?

          Perl's threads suck. Actually all threads suck. Perl's are just worse than
          average – and so many Perl binaries are compiled without them. Java threads
          suck less, but they still suck – look how much engineering time you folks
          blow on managing that stuff. Threads are a terrible programming model.

          I'm not into the idea of forcing Lucy users to use threads. They should be
          able to use processes as their primary concurrency model if they want.

          > It's wonderful that Lucy can startup really fast, but, for most apps that's
          > not nearly as important as searching/indexing performance, right?

          Depends.

          Total indexing throughput in both Lucene and KinoSearch has been pretty decent
          for a long time. However, there's been a large gap between average index
          update performance and worst case index update performance, especially when
          you factor in sort cache loading. There are plenty of applications that may
          not have very high throughput requirements but where it may not be acceptable
          for an index update to take several seconds or several minutes every once in a
          while, even if it usually completes faster.

          > I mean, you start only once, and then you handle many, many
          > searches / index many documents, with that process, usually?

          Sometimes the person who just performed the action that updated the index is
          the only one you care about. For instance, to use a feature request that came
          in from Slashdot a while back, if someone leaves a comment on your website,
          it's nice to have it available in the search index right away.

          Consistently fast index update responsiveness makes personalization of the
          customer experience easier.

          > But you really need to also test search/indexing throughput, reopen time
          > (I think) once that's online for Lucy...

          Naturally.

          > Is reopen even necessary in Lucy?

          Probably. If you have a boatload of segments and a boatload of fields, you
          might start to see file opening and metadata parsing costs come into play. If
          it turns out that for some indexes reopen() can knock down the time from say,
          100 ms to 10 ms or less, I'd consider that sufficient justification.

          Michael McCandless added a comment -

          But, that's where Lucy presumably takes a perf hit. Lucene can share these in RAM, not using the filesystem as the intermediary (eg we do that today with deletions; norms/field cache/eventual CSF can do the same.) Lucy must go through the filesystem to share.

          For a flush(), I don't think there's a significant penalty. The only extra
          costs Lucy will pay are the bookkeeping costs to update the file system state
          and to create the objects that read the index data. Those are real, but since
          we're skipping the fsync(), they're small. As far as the actual data, I don't
          see that there's a difference.

          But everything must go through the filesystem with Lucy...

          Eg, with Lucene, deletions are not written to disk until you commit.
          Flush doesn't write the del file, merging doesn't, etc. The deletes
          are carried in RAM. We could (but haven't yet – NRT turnaround time
          is already plenty fast) do the same with norms, field cache, terms
          dict index, etc.

          Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM.

          Right, for instance because we generally can't force the OS to pin term
          dictionaries in RAM, as discussed a while back. It's not an ideal situation,
          but Lucene's approach isn't bulletproof either, since Lucene's term
          dictionaries can get paged out too.

          As long as the page is hot... (in both cases!).

          But by using file-backed RAM (not malloc'd RAM), you're telling the OS
          it's OK if it chooses to swap it out. Sure, malloc'd RAM can be
          swapped out too... but that should be less frequent (and, we can
          control this behavior, somewhat, eg swappiness).

          It's similar to using a weak vs. a strong reference in Java. By using
          file-backed RAM you tell the OS it's fair game for swapping.

          If we have to fsync(), there'll be a cost, but in Lucene you have to pay that
          same cost, too. Lucene expects to get around it with IndexWriter.getReader().
          In Lucy, we'll get around it by having you call flush() and then reopen a
          reader somewhere, often in another process.

          In both cases, the availability of fresh data is decoupled from the fsync.
          In both cases, the indexing process has to be careful about dropping data
          on the floor before a commit() succeeds.
          In both cases, it's possible to protect against unbounded corruption by
          rolling back to the last commit.

          The two approaches are basically the same, so we get the same
          features.

          It's just that Lucy uses the filesystem for sharing, and Lucene shares
          through RAM.
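          A rough C sketch of that decoupling, with made-up file names and helpers
          (neither project's real internals): flush() makes the new snapshot visible
          to reopening readers without paying for durability, and commit() is where
          the fsync() lives.

              #include <fcntl.h>
              #include <stdio.h>
              #include <string.h>
              #include <unistd.h>

              /* flush(): write a new snapshot file naming the freshly written
               * segment.  Readers that reopen now can see it, but nothing has been
               * fsync'd, so a crash may still lose it.  All names are illustrative. */
              static int
              flush_snapshot(const char *dir, const char *snapshot_name,
                             const char *contents)
              {
                  char path[1024];
                  snprintf(path, sizeof(path), "%s/%s", dir, snapshot_name);
                  int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                  if (fd < 0) return -1;
                  ssize_t len = (ssize_t)strlen(contents);
                  int rc = (write(fd, contents, len) == len) ? 0 : -1;
                  close(fd);
                  return rc;
              }

              /* commit(): the durability step.  fsync the snapshot file, then the
               * directory so the new directory entry itself survives a crash. */
              static int
              commit_snapshot(const char *dir, const char *snapshot_name)
              {
                  char path[1024];
                  snprintf(path, sizeof(path), "%s/%s", dir, snapshot_name);
                  int fd = open(path, O_RDONLY);
                  if (fd < 0 || fsync(fd) != 0) { if (fd >= 0) close(fd); return -1; }
                  close(fd);
                  int dirfd = open(dir, O_RDONLY);
                  if (dirfd < 0 || fsync(dirfd) != 0) { if (dirfd >= 0) close(dirfd); return -1; }
                  close(dirfd);
                  return 0;
              }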

          We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that.

          I guess my confusion is what are all the other benefits of using
          file-backed RAM? You can efficiently use process only concurrency
          (though shared memory is technically an option for this too), and you
          have wicked fast open times (but, you still must warm, just like
          Lucene). What else? Oh maybe the ability to inform OS not to cache
          eg the reads done when merging segments. That's one I sure wish
          Lucene could use...
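          For what it's worth, that hint exists at the syscall level; a sketch of how
          a merge reader might use it (illustrative only, not code from either
          project):

              #define _XOPEN_SOURCE 600   /* for posix_fadvise */
              #include <fcntl.h>
              #include <unistd.h>

              /* Read an old segment for merging without letting it crowd live
               * search data out of the OS cache: declare the access pattern up
               * front, and drop the pages the merge has already consumed. */
              static void
              merge_read_hints(int fd, off_t consumed_bytes)
              {
                  /* We will stream the file once, start to finish. */
                  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

                  /* The bytes we've already merged won't be read again; let the
                   * kernel reclaim them instead of evicting hot search pages. */
                  posix_fadvise(fd, 0, consumed_bytes, POSIX_FADV_DONTNEED);
              }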

          In exchange you risk the OS making poor choices about what gets
          swapped out (LRU policy is too simplistic... not all pages are created
          equal), must downcast all data structures to file-flat, and must share
          everything through the filesystem (a perf hit for NRT).

          I do love how pure the file-backed RAM approach is, but I worry that
          down the road it'll result in erratic search performance in certain
          app profiles.

          But, also risky is that all important data structures must be "file-flat", though in practice that doesn't seem like an issue so far?

          It's a constraint. For instance, to support mmap, string sort caches
          currently require three "files" each: ords, offsets, and UTF-8 character data.

          Yeah, that you need 3 files for the string sort cache is a little
          spooky... that's 3X the chance of a page fault.

          The compound file system makes the file proliferation bearable, though. And
          it's actually nice in a way to have data structures as named files, strongly
          separated from each other and persistent.

          But the CFS construction must also go through the filesystem (like
          Lucene) right? So you still incur the IO load of creating the small
          files, then a 2nd pass to consolidate.

          I agree there's a certain design purity to having the files clearly
          separate out the elements of the data structures, but if it means
          erratic search performance... function over form?

          If we were willing to ditch portability, we could cast to arrays of structs in
          Lucy - but so far we've just used primitives. I'd like to keep it that way,
          since it would be nice if the core Lucy file format was at least theoretically
          compatible with a pure Java implementation. But Lucy plugins could break that
          rule and cast to structs if desired.

          Someday we could make a Lucene codec that interacts with a Lucy
          index... would be a good exercise to go through to see if the flex API
          really is "flex" enough...

          The RAM resident things Lucene has - norms, deleted docs, terms index, field cache - seem to "cast" just fine to file-flat.

          There are often benefits to keeping stuff "file-flat", particularly when the
          file-flat form is compressed. If we were to expand those sort caches to
          string objects, they'd take up more RAM than they do now.

          We're leaving them as UTF8 by default for Lucene (with the flex
          changes). Still, the terms index once loaded does have silly RAM
          overhead... we can cut that back a fair amount though.

          I think the only significant drawback is security: we can't trust memory
          mapped data the way we can data which has been read into process RAM and
          checked on the way in. For instance, we need to perform UTF-8 sanity checking
          each time a string sort cache value escapes the controlled environment of the
          cache reader. If the sort cache value was instead derived from an existing
          string in process RAM, we wouldn't need to check it.

          Sigh, that's a curious downside... so term decode intensive uses
          (merging, range queries, I guess maybe term dict lookup) take the
          brunt of that hit?

          If we switched to an FST for the terms index I guess that could get tricky...

          Hmm, I haven't been following that.

          There's not much to follow – it's all just talk at this point. I
          don't think anyone's built a prototype yet.

          Too much work to keep up with those giganto patches for flex indexing,
          even though it's a subject I'm intimately acquainted with and deeply
          interested in. I plan to look it over when you're done and see if we
          can simplify it.

          And then we'll borrow back your simplifications. Lather, rinse,
          repeat.

          Wouldn't shared memory be possible for process-only concurrent models?

          IPC is a platform-compatibility nightmare. By restricting ourselves to
          communicating via the file system, we save ourselves oodles of
          engineering time. And on really boring, frustrating work, to boot.

          I had assumed so too, but I was surprised that Python's
          multiprocessing module exposes a simple API for sharing objects from
          parent to forked child. It's at least a counter example (though, in
          all fairness, I haven't looked at the impl), ie, there seems to be
          some hope of containing shared memory under a consistent API.

          I'm just pointing out that "going through the filesystem" isn't the
          only way to have efficient process-only concurrency. Shared memory
          is another option, but, yes it has tradeoffs.

          Also, what popular systems/environments have this requirement (only process level concurrency) today?

          Perl's threads suck. Actually all threads suck. Perl's are just worse than
          average - and so many Perl binaries are compiled without them. Java threads
          suck less, but they still suck - look how much engineering time you folks
          blow on managing that stuff. Threads are a terrible programming model.

          I'm not into the idea of forcing Lucy users to use threads. They should be
          able to use processes as their primary concurrency model if they want.

          Yes, working with threads is a nightmare (eg have a look at Java's
          memory model). I think the jury is still out (for our species) just
          how, long term, we'll make use of concurrency with the machines. I
          think we may need to largely take "time" out of our programming
          languages, eg switch to much more declarative code, or
          something... wanna port Lucy to Erlang?

          But I'm not sure process-only concurrency, sharing only via
          file-backed memory, is the answer either.

          It's wonderful that Lucy can startup really fast, but, for most apps that's not nearly as important as searching/indexing performance, right?

          Depends.

          Total indexing throughput in both Lucene and KinoSearch has been pretty decent
          for a long time. However, there's been a large gap between average index
          update performance and worst case index update performance, especially when
          you factor in sort cache loading. There are plenty of applications that may
          not have very high throughput requirements but where it may not be acceptable
          for an index update to take several seconds or several minutes every once in a
          while, even if it usually completes faster.

          I mean, you start only once, and then you handle many, many searches / index many documents, with that process, usually?

          Sometimes the person who just performed the action that updated the index is
          the only one you care about. For instance, to use a feature request that came
          in from Slashdot a while back, if someone leaves a comment on your website,
          it's nice to have it available in the search index right away.

          Consistently fast index update responsiveness makes personalization of the
          customer experience easier.

          Turnaround time for Lucene NRT is already very fast, as is. After an
          immense merge, it'll be the worst, but if you warm the reader first,
          that won't be an issue.

          Using Zoie you can make reopen time insanely fast (much faster than I
          think necessary for most apps), but at the expense of some expected
          hit to searching/indexing throughput. I don't think that's the right
          tradeoff for Lucene.

          I suspect Lucy is making a similar tradeoff, ie, that search
          performance will be erratic due to page faults, at a smallish gain in
          reopen time.

          Do you have any hard numbers on how much time it takes Lucene to load
          from a hot IO cache, populating its RAM resident data structures? I
          wonder in practice what extra cost we are really talking about... it's
          RAM to RAM "translation" of data structures (if the files are hot).
          FieldCache we just have to fix to stop doing uninversion... (ie we
          need CSF).

          Is reopen even necessary in Lucy?

          Probably. If you have a boatload of segments and a boatload of fields, you
          might start to see file opening and metadata parsing costs come into play. If
          it turns out that for some indexes reopen() can knock down the time from say,
          100 ms to 10 ms or less, I'd consider that sufficient justification.

          OK. Then, you are basically pooling your readers. Ie, you do allow
          in-process sharing, but only among readers.

          Marvin Humphrey added a comment - - edited

          > I guess my confusion is what are all the other benefits of using
          > file-backed RAM? You can efficiently use process only concurrency
          > (though shared memory is technically an option for this too), and you
          > have wicked fast open times (but, you still must warm, just like
          > Lucene).

          Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
          Making process-only concurrency efficient isn't optional – it's a core
          concern.

          > What else? Oh maybe the ability to inform OS not to cache
          > eg the reads done when merging segments. That's one I sure wish
          > Lucene could use...

          Lightweight searchers mean architectural freedom.

          Create 2, 10, 100, 1000 Searchers without a second thought – as many as you
          need for whatever app architecture you just dreamed up – then destroy them
          just as effortlessly. Add another worker thread to your search server without
          having to consider the RAM requirements of a heavy searcher object. Create a
          command-line app to search a documentation index without worrying about
          daemonizing it. Etc.

          If your normal development pattern is a single monolithic Java process, then
          that freedom might not mean much to you. But with their low per-object RAM
          requirements and fast opens, lightweight searchers are easy to use within a
          lot of other development patterns. For example: lightweight searchers work
          well for maxing out multiple CPU cores under process-only concurrency.
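          As a sketch of that pattern (the searcher calls are stand-ins, not Lucy's
          API), process-only concurrency can be as plain as forking one worker per
          core, each opening its own cheap searcher over the same index files:

              #include <stdlib.h>
              #include <sys/wait.h>
              #include <unistd.h>

              /* Stand-ins for a lightweight searcher: opening one mostly just maps
               * index files, so the per-process cost is small. */
              static void *searcher_open(const char *index_path) { (void)index_path; return malloc(1); }
              static void  searcher_serve(void *searcher)        { (void)searcher; /* handle queries */ }
              static void  searcher_close(void *searcher)        { free(searcher); }

              int main(void)
              {
                  const int nworkers = 4;              /* e.g. one per CPU core */
                  for (int i = 0; i < nworkers; i++) {
                      pid_t pid = fork();
                      if (pid == 0) {                  /* child: its own searcher, its own RAM */
                          void *s = searcher_open("/path/to/index");
                          searcher_serve(s);
                          searcher_close(s);
                          _exit(0);
                      }
                  }
                  while (wait(NULL) > 0) { /* reap workers */ }
                  return 0;
              }

          Each child pays only the small per-searcher cost, and the hot file-backed
          index pages are shared among all of them through the OS cache.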

          > In exchange you risk the OS making poor choices about what gets
          > swapped out (LRU policy is too simplistic... not all pages are created
          > equal),

          The Linux virtual memory system, at least, is not a pure LRU. It utilizes a
          page aging algo which prioritizes pages that have historically been accessed
          frequently even when they have not been accessed recently:

          http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

          The default action when a page is first allocated, is to give it an
          initial age of 3. Each time it is touched (by the memory management
          subsystem) it's age is increased by 3 to a maximum of 20. Each time the
          Kernel swap daemon runs it ages pages, decrementing their age by 1.

          And while that system may not be ideal from our standpoint, it's still pretty
          good. In general, the operating system's virtual memory scheme is going to
          work fine as designed, for us and everyone else, and minimize memory
          availability wait times.
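          A toy model of that rule, using only the numbers from the quoted passage
          (everything else is simplification, not kernel code): a page with a history
          of touches needs many idle daemon passes before its age reaches zero.

              #include <stdio.h>

              /* Quoted rule: new pages start at age 3, each touch adds 3 (capped at
               * 20), and each swap-daemon pass subtracts 1. */
              enum { AGE_INITIAL = 3, AGE_TOUCH = 3, AGE_MAX = 20 };

              static int page_touch(int age)       { age += AGE_TOUCH; return age > AGE_MAX ? AGE_MAX : age; }
              static int page_daemon_pass(int age) { return age > 0 ? age - 1 : 0; }

              int main(void)
              {
                  int age = AGE_INITIAL;
                  for (int i = 0; i < 6;  i++) age = page_touch(age);        /* historically hot page  */
                  for (int i = 0; i < 15; i++) age = page_daemon_pass(age);  /* then a long idle spell */
                  printf("age after idle spell: %d\n", age);                 /* prints 5: still aged in */
                  return 0;
              }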

          When will swapping out the term dictionary be a problem?

          • For indexes where queries are made frequently, no problem.
          • For systems with plenty of RAM, no problem.
          • For systems that aren't very busy, no problem.
          • For small indexes, no problem.

          The only situation we're talking about is infrequent queries against large
          indexes on busy boxes where RAM isn't abundant. Under those circumstances, it
          might be noticeable that Lucy's term dictionary gets paged out somewhat
          sooner than Lucene's.

          But in general, if the term dictionary gets paged out, so what? Nobody was
          using it. Maybe nobody will make another query against that index until next
          week. Maybe the OS made the right decision.

          OK, so there's a vulnerable bubble where the query rate against
          a large index is neither too fast nor too slow, on busy machines
          where RAM isn't abundant. I don't think that bubble ought to drive major
          architectural decisions.

          Let me turn your question on its head. What does Lucene gain in return for
          the slow index opens and large process memory footprint of its heavy
          searchers?

          > I do love how pure the file-backed RAM approach is, but I worry that
          > down the road it'll result in erratic search performance in certain
          > app profiles.

          If necessary, there's a straightforward remedy: slurp the relevant files into
          RAM at object construction rather than mmap them. The rest of the code won't
          know the difference between malloc'd RAM and mmap'd RAM. The slurped files
          won't take up any more space than the analogous Lucene data structures; more
          likely, they'll take up less.

          That's the kind of setting we'd hide away in the IndexManager class rather
          than expose as prominent API, and it would be a hint to index components
          rather than an edict.
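          A sketch of what that hint could look like in C (the flag and the names are
          hypothetical): the caller gets a plain pointer either way, so the rest of
          the reader code genuinely can't tell the difference.

              #include <fcntl.h>
              #include <stdlib.h>
              #include <sys/mman.h>
              #include <sys/stat.h>
              #include <unistd.h>

              /* Hand back the bytes of an index file.  With `slurp` false, the pages
               * are file-backed and the OS may drop and re-fault them at will; with
               * `slurp` true, the data lives in malloc'd RAM and can only be
               * reclaimed through swap.  Callers see a plain pointer either way. */
              static const void *
              acquire_file(const char *path, int slurp, size_t *len_out)
              {
                  int fd = open(path, O_RDONLY);
                  if (fd < 0) return NULL;
                  struct stat st;
                  if (fstat(fd, &st) != 0) { close(fd); return NULL; }
                  *len_out = (size_t)st.st_size;

                  void *result;
                  if (slurp) {
                      result = malloc(*len_out);
                      /* single read() kept simple for the sketch; a real loader would loop */
                      if (result && read(fd, result, *len_out) != (ssize_t)*len_out) {
                          free(result);
                          result = NULL;
                      }
                  }
                  else {
                      result = mmap(NULL, *len_out, PROT_READ, MAP_SHARED, fd, 0);
                      if (result == MAP_FAILED) result = NULL;
                  }
                  close(fd);   /* a MAP_SHARED mapping outlives the descriptor */
                  return result;
              }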

          > Yeah, that you need 3 files for the string sort cache is a little
          > spooky... that's 3X the chance of a page fault.

          Not when using the compound format.

          > But the CFS construction must also go through the filesystem (like
          > Lucene) right? So you still incur IO load of creating the small
          > files, then 2nd pass to consolidate.

          Yes.

          > I think we may need to largely take "time" out of our programming
          > languages, eg switch to much more declarative code, or
          > something... wanna port Lucy to Erlang?
          >
          > But I'm not sure process only concurrency, sharing only via
          > file-backed memory, is the answer either

          I think relying heavily on file-backed memory is particularly appropriate for
          Lucy because the write-once file format works well with MAP_SHARED memory
          segments. If files were being modified and had to be protected with
          semaphores, it wouldn't be as sweet a match.

          Focusing on process-only concurrency also works well for Lucy because host
          threading models differ substantially and so will only be accessible via a
          generalized interface from the Lucy C core. It will be difficult to tune
          threading performance through that layer of indirection – I'm guessing beyond
          the ability of most developers since few will be experts in multiple host
          threading models. In contrast, expertise in process level concurrency will be
          easier to come by and to nourish.

          > Using Zoie you can make reopen time insanely fast (much faster than I
          > think necessary for most apps), but at the expense of some expected
          > hit to searching/indexing throughput. I don't think that's the right
          > tradeoff for Lucene.

          But as Jake pointed out early in the thread, Zoie achieves those insanely fast
          reopens without tight coupling to IndexWriter and its components. The
          auxiliary RAM index approach is well proven.

          > Do you have any hard numbers on how much time it takes Lucene to load
          > from a hot IO cache, populating its RAM resident data structures?

          Hmm, I don't spend a lot of time working with Lucene directly, so I might not
          be the person most likely to have data like that at my fingertips. Maybe that
          McCandless dude can help you out, he runs a lot of benchmarks.

          Or maybe ask the Solr folks? I see them on solr-user all the time talking
          about "MaxWarmingSearchers".

          > OK. Then, you are basically pooling your readers. Ie, you do allow
          > in-process sharing, but only among readers.

          Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for
          each new segment, but they would be private to each parent PolyReader. So if
          you reopened two IndexReaders at the same time after e.g. segment "seg_12"
          had been added, each would create a new, private SegReader for "seg_12".

          Edit: updated to correct assertions about virtual memory performance with
          small indexes.

          Michael McCandless added a comment -

          Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
          Making process-only concurrency efficient isn't optional - it's a core
          concern.

          OK

          Lightweight searchers mean architectural freedom.

          Create 2, 10, 100, 1000 Searchers without a second thought - as many as you
          need for whatever app architecture you just dreamed up - then destroy them
          just as effortlessly. Add another worker thread to your search server without
          having to consider the RAM requirements of a heavy searcher object. Create a
          command-line app to search a documentation index without worrying about
          daemonizing it. Etc.

          This is definitely neat.

          The Linux virtual memory system, at least, is not a pure LRU. It utilizes a
          page aging algo which prioritizes pages that have historically been accessed
          frequently even when they have not been accessed recently:

          http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

          Very interesting – thanks. So it also factors in how much the page
          was used in the past, not just how long it's been since the page was
          last used.

          When will swapping out the term dictionary be a problem?

          For indexes where queries are made frequently, no problem.
          For systems with plenty of RAM, no problem.
          For systems that aren't very busy, no problem.
          For small indexes, no problem.
          The only situation we're talking about is infrequent queries against large
          indexes on busy boxes where RAM isn't abundant. Under those circumstances, it
          might be noticeable that Lucy's term dictionary gets paged out somewhat
          sooner than Lucene's.

          Even smallish indexes can see the pages swapped out? I'd think at
          low-to-moderate search traffic, any index could be at risk, depending
          on whether other stuff on the machine wanting RAM or IO cache is
          running.

          But in general, if the term dictionary gets paged out, so what? Nobody was
          using it. Maybe nobody will make another query against that index until next
          week. Maybe the OS made the right decision.

          You can't afford many page faults until the latency becomes very
          apparent (until we're all on SSDs... at which point this may all be
          moot).

          Right – the metric that the swapper optimizes is overall efficient
          use of the machine's resources.

          But I think that's often a poor metric for search apps... I think
          consistency on the search latency is more important, though I agree it
          depends very much on the app.

          I don't like the same behavior in my desktop – when I switch to my
          mail client, I don't want to wait 10 seconds for it to swap the pages
          back in.

          Let me turn your question on its head. What does Lucene gain in return for
          the slow index opens and large process memory footprint of its heavy
          searchers?

          Consistency in the search time. Assuming the OS doesn't swap our
          pages out...

          And of course Java pretty much forces threads-as-concurrency (JVM
          startup time, hotspot compilation, are costly).

          If necessary, there's a straightforward remedy: slurp the relevant files into
          RAM at object construction rather than mmap them. The rest of the code won't
          know the difference between malloc'd RAM and mmap'd RAM. The slurped files
          won't take up any more space than the analogous Lucene data structures; more
          likely, they'll take up less.

          That's the kind of setting we'd hide away in the IndexManager class rather
          than expose as prominent API, and it would be a hint to index components
          rather than an edict.

          Right, this is how Lucy would force warming.

          Yeah, that you need 3 files for the string sort cache is a little spooky... that's 3X the chance of a page fault.

          Not when using the compound format.

          But, even within that CFS file, these three sub-files will not be
          local? Ie you'll still have to hit three pages per "lookup" right?

          I think relying heavily on file-backed memory is particularly appropriate for
          Lucy because the write-once file format works well with MAP_SHARED memory
          segments. If files were being modified and had to be protected with
          semaphores, it wouldn't be as sweet a match.

          Write-once is good for Lucene too.

          Focusing on process-only concurrency also works well for Lucy because host
          threading models differ substantially and so will only be accessible via a
          generalized interface from the Lucy C core. It will be difficult to tune
          threading performance through that layer of indirection - I'm guessing beyond
          the ability of most developers since few will be experts in multiple host
          threading models. In contrast, expertise in process level concurrency will be
          easier to come by and to nourish.

          I'm confused by this – eg Python does a great job presenting a simple
          threads interface and implementing it on major OSs. And it seems like
          Lucy would not need anything crazy-os-specific wrt threads?

          Do you have any hard numbers on how much time it takes Lucene to load from a hot IO cache, populating its RAM resident data structures?

          Hmm, I don't spend a lot of time working with Lucene directly, so I might not
          be the person most likely to have data like that at my fingertips. Maybe that
          McCandless dude can help you out, he runs a lot of benchmarks.

          Hmm I'd guess that field cache is slowish; deleted docs & norms are
          very fast; terms index is somewhere in between.

          Or maybe ask the Solr folks? I see them on solr-user all the time talking about "MaxWarmingSearchers".

          Hmm – not sure what's up with that. Looks like maybe it's the
          auto-warming that might happen after a commit.

          OK. Then, you are basically pooling your readers. Ie, you do allow in-process sharing, but only among readers.

          Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for
          each new segment, but they would be private to each parent PolyReader. So if
          you reopened two IndexReaders at the same time after e.g. segment "seg_12"
          had been added, each would create a new, private SegReader for "seg_12".

          You're right, you'd get two readers for seg_12 in that case. By
          "pool" I meant you're tapping into all the sub-readers that the
          existing reader has opened – the reader is your pool of sub-readers.

          Marvin Humphrey added a comment -

          > Very interesting - thanks. So it also factors in how much the page
          > was used in the past, not just how long it's been since the page was
          > last used.

          In theory, I think that means the term dictionary will tend to be favored over
          the posting lists. In practice... hard to say, it would be difficult to test.

          > Even smallish indexes can see the pages swapped out?

          Yes, you're right – the wait time to get at a small term dictionary isn't
          necessarily small. I've amended my previous post, thanks.

          > And of course Java pretty much forces threads-as-concurrency (JVM
          > startup time, hotspot compilation, are costly).

          Yes. Java does a lot of stuff that most operating systems can also do, but of
          course provides a coherent platform-independent interface. In Lucy we're
          going to try to go back to the OS for some of the stuff that Java likes to
          take over – provided that we can develop a sane genericized interface using
          configuration probing and #ifdefs.

          It's nice that as long as the box is up our OS-as-JVM is always running, so we
          don't have to worry about its (quite lengthy) startup time.

          > Right, this is how Lucy would force warming.

          I think slurp-instead-of-mmap is orthogonal to warming, because we can warm
          file-backed RAM structures by forcing them into the IO cache, using either the
          cat-to-dev-null trick or something more sophisticated. The
          slurp-instead-of-mmap setting would cause warming as a side effect, but the
          main point would be to attempt to persuade the virtual memory system that
          certain data structures should have a higher status and not be paged out as
          quickly.
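
          For what it's worth, a bare-bones C sketch of the two techniques being
          discussed (plain POSIX, invented function names – this is not Lucy code):
          warming an existing mmap'd region by touching one byte per page, versus
          slurping a file into heap memory so the structure lives in anonymous RAM
          rather than file-backed pages.

              #include <fcntl.h>
              #include <stdlib.h>
              #include <sys/stat.h>
              #include <unistd.h>

              /* Warm an existing mapping: fault each page in so it lands in the
               * IO cache (the cat-to-dev-null trick, done in-process). */
              static void warm_mapping(const char *mapped, size_t len) {
                  size_t page = (size_t)sysconf(_SC_PAGESIZE);
                  volatile char sink = 0;
                  for (size_t i = 0; i < len; i += page) {
                      sink ^= mapped[i];
                  }
                  (void)sink;
              }

              /* Slurp-instead-of-mmap: read the whole file into malloc'd RAM.
               * The rest of the code can't tell the difference, but these pages
               * are anonymous memory, which the VM treats differently from clean
               * file-backed pages (they must be written to swap before eviction). */
              static char *slurp_file(const char *path, size_t *len_out) {
                  int fd = open(path, O_RDONLY);
                  if (fd < 0) { return NULL; }
                  struct stat st;
                  if (fstat(fd, &st) != 0) { close(fd); return NULL; }
                  char *buf = malloc((size_t)st.st_size);
                  if (buf == NULL) { close(fd); return NULL; }
                  size_t off = 0;
                  while (off < (size_t)st.st_size) {
                      ssize_t got = read(fd, buf + off, (size_t)st.st_size - off);
                      if (got <= 0) { free(buf); close(fd); return NULL; }
                      off += (size_t)got;
                  }
                  close(fd);
                  *len_out = off;
                  return buf;
              }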

          > But, even within that CFS file, these three sub-files will not be
          > local? Ie you'll still have to hit three pages per "lookup" right?

          They'll be next to each other in the compound file because CompoundFileWriter
          orders them alphabetically. For big segments, though, you're right that they
          won't be right next to each other, and you could possibly incur as many as
          three page faults when retrieving a sort cache value.

          But what are the alternatives for variable width data like strings? You need
          the ords array anyway for efficient comparisons, so what's left are the
          offsets array and the character data.

          An array of String objects isn't going to have better locality than one solid
          block of memory dedicated to offsets and another solid block of memory
          dedicated to file data, and it's no fewer derefs even if the string object
          stores its character data inline – more if it points to a separate allocation
          (like Lucy's CharBuf does, since it's mutable).

          For each sort cache value lookup, you're going to need to access two blocks of
          memory.

          • With the array of String objects, the first is the memory block dedicated
            to the array, and the second is the memory block dedicated to the String
            object itself, which contains the character data.
          • With the file-backed block sort cache, the first memory block is the
            offsets array, and the second is the character data array.

          I think the locality costs should be approximately the same... have I missed
          anything?
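
          (Purely as an illustration of that layout – struct and function names
          are invented, not Lucy's – a lookup against the three file-backed
          blocks might look like this:)

              #include <stdint.h>
              #include <stddef.h>

              typedef struct {
                  const int32_t *ords;      /* one entry per document             */
                  const int64_t *offsets;   /* one entry per unique value, plus 1 */
                  const char    *chardata;  /* concatenated UTF-8 value bytes     */
              } StringSortCache;

              /* Fetch the sort value for doc_id: touches the ords block, the
               * offsets block, and the character-data block – the three
               * potential page faults discussed above. */
              static const char *
              sortcache_value(const StringSortCache *cache, int32_t doc_id, size_t *len) {
                  int32_t ord   = cache->ords[doc_id];
                  int64_t start = cache->offsets[ord];
                  int64_t end   = cache->offsets[ord + 1];
                  *len = (size_t)(end - start);
                  return cache->chardata + start;
              }

              /* Comparisons during collection only touch the ords block, which
               * is why the ords array is needed anyway. */
              static int
              sortcache_compare(const StringSortCache *cache, int32_t a, int32_t b) {
                  int32_t oa = cache->ords[a], ob = cache->ords[b];
                  return (oa > ob) - (oa < ob);
              }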

          > Write-once is good for Lucene too.

          Hellyeah.

          > And it seems like Lucy would not need anything crazy-os-specific wrt
          > threads?

          It depends on how many classes we want to make thread-safe, and it's not just
          the OS, it's the host.

          The bare minimum is simply to make Lucy thread-safe as a library. That's
          pretty close, because Lucy studiously avoided global variables whenever
          possible. The only problems that have to be addressed are the VTable_registry
          Hash, race conditions when creating new subclasses via dynamic VTable
          singletons, and refcounts on the VTable objects themselves.

          Once those issues are taken care of, you'll be able to use Lucy objects in
          separate threads with no problem, e.g. one Searcher per thread.

          However, if you want to share Lucy objects (other than VTables) across
          threads, all of a sudden we have to start thinking about "synchronized",
          "volatile", etc. Such constructs may not be efficient or even possible under
          some threading models.

          > Hmm I'd guess that field cache is slowish; deleted docs & norms are
          > very fast; terms index is somewhere in between.

          That jibes with my own experience. So maybe consider file-backed sort caches
          in Lucene, while keeping the status quo for everything else?

          > You're right, you'd get two readers for seg_12 in that case. By
          > "pool" I meant you're tapping into all the sub-readers that the
          > existing reader have opened - the reader is your pool of sub-readers.

          Each unique SegReader will also have dedicated "sub-reader" objects: two
          "seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12"
          PostingsReaders, etc. However, all those sub-readers will share the same
          file-backed RAM data, so in that sense they're pooled.

          Michael McCandless added a comment -

          Very interesting - thanks. So it also factors in how much the page was used in the past, not just how long it's been since the page was last used.

          In theory, I think that means the term dictionary will tend to be
          favored over the posting lists. In practice... hard to say, it would
          be difficult to test.

          Right... though I think the top "trunks" frequently used by the
          binary search will stay hot. But as you get deeper into the terms
          index, it's not as clear.

          And of course Java pretty much forces threads-as-concurrency (JVM startup time, hotspot compilation, are costly).

          Yes. Java does a lot of stuff that most operating systems can also do, but of
          course provides a coherent platform-independent interface. In Lucy we're
          going to try to go back to the OS for some of the stuff that Java likes to
          take over - provided that we can develop a sane genericized interface using
          configuration probing and #ifdefs.

          It's nice that as long as the box is up our OS-as-JVM is always running, so we
          don't have to worry about its (quite lengthy) startup time.

          OS as JVM is a nice analogy. Java of course gets in the way, too,
          like we cannot properly set IO priorities, we can't give hints to the
          OS to tell it not to cache certain reads/writes (ie segment merging),
          can't pin pages, etc.

          Right, this is how Lucy would force warming.

          I think slurp-instead-of-mmap is orthogonal to warming, because we can warm
          file-backed RAM structures by forcing them into the IO cache, using either the
          cat-to-dev-null trick or something more sophisticated. The
          slurp-instead-of-mmap setting would cause warming as a side effect, but the
          main point would be to attempt to persuade the virtual memory system that
          certain data structures should have a higher status and not be paged out as
          quickly.

          Whoops, sorry, I misread – now I understand. You can easily make
          certain files RAM resident, and then be like Lucene (except the data
          structures are more compact). Nice.

          But, even within that CFS file, these three sub-files will not be local? Ie you'll still have to hit three pages per "lookup" right?

          They'll be next to each other in the compound file because CompoundFileWriter
          orders them alphabetically. For big segments, though, you're right that they
          won't be right next to each other, and you could possibly incur as many as
          three page faults when retrieving a sort cache value.

          But what are the alternatives for variable width data like strings? You need
          the ords array anyway for efficient comparisons, so what's left are the
          offsets array and the character data.

          An array of String objects isn't going to have better locality than one solid
          block of memory dedicated to offsets and another solid block of memory
          dedicated to file data, and it's no fewer derefs even if the string object
          stores its character data inline - more if it points to a separate allocation
          (like Lucy's CharBuf does, since it's mutable).

          For each sort cache value lookup, you're going to need to access two blocks of
          memory.

          • With the array of String objects, the first is the memory block dedicated
            to the array, and the second is the memory block dedicated to the String
            object itself, which contains the character data.
          • With the file-backed block sort cache, the first memory block is the
            offsets array, and the second is the character data array.

          I think the locality costs should be approximately the same... have I missed
          anything?

          You're right, Lucene risks 3 (ord array, String array, String object)
          page faults on each lookup as well.

          Actually why can't ord & offset be one, for the string sort cache?
          Ie, if you write your string data in sort order, then the offsets are
          also in sort order? (I think we may have discussed this already?)
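
          (A toy sketch of what I mean, assuming the character data is written
          to disk in sort order and each value is length-prefixed – invented
          names, and the lucy-dev thread linked in the reply below covers the
          real trade-offs:)

              #include <stdint.h>

              /* Two blocks instead of three: the per-document entry stores a
               * byte offset directly, and because the values are laid out in
               * sort order, comparing offsets orders documents the same way
               * comparing ords would. */
              typedef struct {
                  const int64_t *doc_to_offset;  /* doc -> offset into chardata   */
                  const char    *chardata;       /* length-prefixed sorted values */
              } OffsetSortCache;

              static int
              compare_docs_by_offset(const OffsetSortCache *c, int32_t a, int32_t b) {
                  int64_t oa = c->doc_to_offset[a], ob = c->doc_to_offset[b];
                  return (oa > ob) - (oa < ob);
              }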

          And it seems like Lucy would not need anything crazy-os-specific wrt threads?

          It depends on how many classes we want to make thread-safe, and it's not just
          the OS, it's the host.

          The bare minimum is simply to make Lucy thread-safe as a library. That's
          pretty close, because Lucy studiously avoided global variables whenever
          possible. The only problems that have to be addressed are the VTable_registry
          Hash, race conditions when creating new subclasses via dynamic VTable
          singletons, and refcounts on the VTable objects themselves.

          Once those issues are taken care of, you'll be able to use Lucy objects in
          separate threads with no problem, e.g. one Searcher per thread.

          However, if you want to share Lucy objects (other than VTables) across
          threads, all of a sudden we have to start thinking about "synchronized",
          "volatile", etc. Such constructs may not be efficient or even possible under
          some threading models.

          OK it is indeed hairy. You don't want to have to create Lucy's
          equivalent of the JMM...

          Hmm I'd guess that field cache is slowish; deleted docs & norms are very fast; terms index is somewhere in between.

          That jibes with my own experience. So maybe consider file-backed sort caches
          in Lucene, while keeping the status quo for everything else?

          Perhaps, but it'd still make me nervous. When we get
          CSF (LUCENE-1231) online we should make it
          pluggable enough so that one could create an mmap impl.

          You're right, you'd get two readers for seg_12 in that case. By "pool" I meant you're tapping into all the sub-readers that the existing reader have opened - the reader is your pool of sub-readers.

          Each unique SegReader will also have dedicated "sub-reader" objects: two
          "seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12"
          PostingsReaders, etc. However, all those sub-readers will share the same
          file-backed RAM data, so in that sense they're pooled.

          OK

          Marvin Humphrey added a comment -

          > we can't give hints to the OS to tell it not to cache certain reads/writes
          > (ie segment merging),

          For what it's worth, we haven't really solved that problem in Lucy either.
          The sliding window abstraction we wrapped around mmap/MapViewOfFile largely
          solved the problem of running out of address space on 32-bit operating
          systems. However, there's currently no way to invoke madvise through Lucy's
          IO abstraction layer – it's a little tricky with compound files.

          Linux, at least, requires that the buffer supplied to madvise be page-aligned.
          So, say we're starting off on a posting list, and we want to communicate to
          the OS that it should treat the region we're about to read as MADV_SEQUENTIAL.
          If the start of the postings file is in the middle of a 4k page and the file
          right before it is a term dictionary, we don't want to indicate that that
          region should be treated as sequential.
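
          (For concreteness, the kind of rounding that would be needed – plain
          POSIX, invented helper name, not part of Lucy's IO layer: shrink the
          advised range inward to whole pages, so the hint can never spill onto
          a neighboring virtual file's bytes.)

              #include <stdint.h>
              #include <sys/mman.h>
              #include <unistd.h>

              /* Advise MADV_SEQUENTIAL only for pages that lie entirely inside
               * the virtual file's [start, start + len) region of the mapping.
               * Boundary pages shared with a neighboring file get no hint. */
              static void
              advise_sequential(void *map_base, size_t start, size_t len) {
                  size_t page = (size_t)sysconf(_SC_PAGESIZE);   /* power of two */
                  uintptr_t begin = (uintptr_t)map_base + start;
                  uintptr_t first = (begin + page - 1) & ~(uintptr_t)(page - 1);
                  uintptr_t last  = (begin + len) & ~(uintptr_t)(page - 1);
                  if (last > first) {
                      madvise((void *)first, (size_t)(last - first), MADV_SEQUENTIAL);
                  }
              }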

          I'm not sure how to solve that problem without violating the encapsulation of
          the compound file model. Hmm, maybe we could store metadata about the virtual
          files indicating usage patterns (sequential, random, etc.)? Since files are
          generally part of dedicated data structures whose usage patterns are known at
          index time.

          Or maybe we just punt on that use case and worry only about segment merging.
          Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell
          the OS that it's free to recycle any memory pages associated with it?

          > Actually why can't ord & offset be one, for the string sort cache?
          > Ie, if you write your string data in sort order, then the offsets are
          > also in sort order? (I think we may have discussed this already?)

          Right, we discussed this on lucy-dev last spring:

          http://markmail.org/message/epc56okapbgit5lw

          Incidentally, some of this thread replays our exchange at the top of
          LUCENE-1458 from a year ago. It was fun to go back and reread that: in the
          interim, we've implemented segment-centric search and memory-mapped field
          caches and term dictionaries, both of which were first discussed back then.

          Ords are great for low cardinality fields of all kinds, but become less
          efficient for high cardinality primitive numeric fields. For simplicity's
          sake, the prototype implementation of mmap'd field caches in KS always uses
          ords.
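
          (Roughly, the trade-off looks like this – invented names: for a
          high-cardinality numeric field nearly every value is unique, so the
          ord indirection adds a whole extra array and an extra memory hop per
          lookup compared with just storing the primitives doc-aligned.)

              #include <stdint.h>

              typedef struct {                  /* ord-based: doc -> ord -> value */
                  const int32_t *doc_to_ord;
                  const int64_t *values;        /* one entry per unique value     */
              } OrdNumericCache;

              typedef struct {                  /* direct: doc -> value           */
                  const int64_t *values;        /* one entry per document         */
              } DirectNumericCache;

              static int64_t
              ord_lookup(const OrdNumericCache *c, int32_t doc) {
                  return c->values[c->doc_to_ord[doc]];
              }

              static int64_t
              direct_lookup(const DirectNumericCache *c, int32_t doc) {
                  return c->values[doc];
              }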

          > You don't want to have to create Lucy's equivalent of the JMM...

          The more I think about making Lucy classes thread safe, the harder it seems.
          I'd like to make it possible to share a Schema across threads, for
          instance, but that means all its Analyzers, etc have to be thread-safe as
          well, which isn't practical when you start getting into contributed
          subclasses.

          Even if we succeed in getting Folders and FileHandles thread safe, it will be
          hard for the user to keep track of what they can and can't do across threads.
          "Don't share anything" is a lot easier to understand.

          We reap a big benefit by making Lucy's metaclass infrastructure thread-safe.
          Beyond that, seems like there's a lot of pain for little gain.

          Michael McCandless added a comment -

          For what it's worth, we haven't really solved that problem in Lucy either.
          The sliding window abstraction we wrapped around mmap/MapViewOfFile largely
          solved the problem of running out of address space on 32-bit operating
          systems. However, there's currently no way to invoke madvise through Lucy's
          IO abstraction layer - it's a little tricky with compound files.

          Linux, at least, requires that the buffer supplied to madvise be page-aligned.
          So, say we're starting off on a posting list, and we want to communicate to
          the OS that it should treat the region we're about to read as MADV_SEQUENTIAL.
          If the start of the postings file is in the middle of a 4k page and the file
          right before it is a term dictionary, we don't want to indicate that that
          region should be treated as sequential.

          I'm not sure how to solve that problem without violating the encapsulation of
          the compound file model. Hmm, maybe we could store metadata about the virtual
          files indicating usage patterns (sequential, random, etc.)? Since files are
          generally part of dedicated data structures whose usage patterns are known at
          index time.

          Or maybe we just punt on that use case and worry only about segment merging.

          Storing metadata seems OK. It'd be optional for codecs to declare that...

          Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell
          the OS that it's free to recycle any memory pages associated with it?

          It better!

          Actually why can't ord & offset be one, for the string sort cache? Ie, if you write your string data in sort order, then the offsets are also in sort order? (I think we may have discussed this already?)

          Right, we discussed this on lucy-dev last spring:

          http://markmail.org/message/epc56okapbgit5lw

          OK I'll go try to catch up... but I'm about to drop [sort of]
          offline for a week and a half! There's a lot of reading there! Should
          be a prereq that we first go back and re-read what we said "the last
          time"...

          Incidentally, some of this thread replays our exchange at the top of
          LUCENE-1458 from a year ago. It was fun to go back and reread that: in the
          interim, we've implemented segment-centric search and memory-mapped field
          caches and term dictionaries, both of which were first discussed back then.

          Nice!

          Ords are great for low cardinality fields of all kinds, but become less
          efficient for high cardinality primitive numeric fields. For simplicity's
          sake, the prototype implementation of mmap'd field caches in KS always uses
          ords.

          Right...

          You don't want to have to create Lucy's equivalent of the JMM...

          The more I think about making Lucy classes thread safe, the harder it seems.
          I'd like to make it possible to share a Schema across threads, for
          instance, but that means all its Analyzers, etc have to be thread-safe as
          well, which isn't practical when you start getting into contributed
          subclasses.

          Even if we succeed in getting Folders and FileHandles thread safe, it will be
          hard for the user to keep track of what they can and can't do across threads.
          "Don't share anything" is a lot easier to understand.

          We reap a big benefit by making Lucy's metaclass infrastructure thread-safe.
          Beyond that, seems like there's a lot of pain for little gain.

          Yeah. Threads are not easy.

          Tim A. added a comment -

          Hi,

          I am a Computer Science student from Germany. I would like to contribute to this project under GSoC 2012. I have very good experience in Java. I have some questions about this project; can someone help me? IRC or instant messenger?

          Thank You
          Tim

          Michael McCandless added a comment -

          Is there anyone who can volunteer to be a mentor for this issue...?

          Simon Willnauer added a comment -

          I would but I am so overloaded with other work right now. I can be the primary mentor if you could help when I am totally blocked.

          Hi Tim, since we are part of the Apache Foundation and an open source project, we make everything public. So if you have questions, please start a thread on the dev@l.a.o mailing list and I am happy to help you. For GSoC-internal or private issues while GSoC is running, we can use private communication.

          simon

          Tim A. added a comment -

          Hello Michael, hello Simon,

          thanks for the fast response.

          So if you have questions please go and start a thread on the dev@l.a.o [...]

          Okay, I will do this and start a thread. I have some specific questions about the task (refactoring IndexWriter).

          For example:
          1. Do unit tests exist for the code (IndexWriter.java)?
          2. Where can I find the code/software or the relevant component (svn, git, etc.)?
          3. Which IDE should I use for this project? Do you suggest Eclipse?
          4. What about coding style guides?
          5. [...]

          Mark Miller added a comment -

          Been a long time since this has seen action - pushing out of 4.1.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.


            People

            • Assignee:
              Michael Busch
              Reporter:
              Michael Busch
            • Votes:
              1
              Watchers:
              5
