Lucene - Core
LUCENE-1618

Allow setting the IndexWriter docstore to be a different directory

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Add an IndexWriter.setDocStoreDirectory method that allows doc
      stores to be placed in a different directory than the IW default
      dir.

      1. LUCENE-1618.patch
        0.7 kB
        Jason Rutherglen
      2. LUCENE-1618.patch
        1 kB
        Jason Rutherglen
      3. LUCENE-1618.patch
        8 kB
        Michael McCandless
      4. LUCENE-1618.patch
        8 kB
        Jason Rutherglen
      5. LUCENE-1618.patch
        6 kB
        Jason Rutherglen
      6. MemoryCachedDirectory.java
        7 kB
        Earwin Burrfoot

          Activity

          Jason Rutherglen created issue -
          Yonik Seeley made changes -
          Field Original Value New Value
          Link This issue is depended upon by LUCENE-1313 [ LUCENE-1313 ]
          Yonik Seeley added a comment -

          I can see how this would potentially be useful for realtime... but it seems like only IndexWriter could eventually fix the situation of having the docstore on disk and the rest of a segment in RAM. Which means that this API shouldn't be public?

          Michael McCandless added a comment -

          Yeah I also think this should be an "under the hood" (done only by NRT) optimization inside IndexWriter.

          The only possible non-NRT case I can think of is when users make temporary indices in RAM, it's possible one would want to write the docStore files to an FSDirectory (because they are so large) but keep postings, norms, deletes, etc in RAM. But going down that road opens up a can of worms... eg does segments_N somehow have to keep track of which dir has which parts of a segment? Suddenly IndexReader must also know to look in different dirs for different parts of a segment, etc.

          It might be cleaner to make a Directory impl that dispatches certain files to a RAMDir and others to an FSDir, so IndexWriter/IndexReader still see a single Directory API.
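
To make the dispatching idea concrete, here is a minimal sketch, assuming two in-memory maps as stand-ins for the FSDir and the RAMDir. The class `ExtensionSwitchSketch` and all of its members are hypothetical and are not part of Lucene's Directory API; the sketch only shows routing by file extension behind a single facade.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: two in-memory maps stand in for an FSDir and a RAMDir.
// None of these names come from Lucene's Directory API.
class ExtensionSwitchSketch {
    final Map<String, byte[]> disk = new HashMap<>(); // stand-in for the FSDir
    final Map<String, byte[]> ram = new HashMap<>();  // stand-in for the RAMDir
    private final Set<String> diskExtensions;         // extensions routed to "disk"

    ExtensionSwitchSketch(Set<String> diskExtensions) {
        this.diskExtensions = diskExtensions;
    }

    // Text after the last '.', or "" when the name has no dot (e.g. segments_1).
    static String extension(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    // Single routing decision shared by every operation.
    private Map<String, byte[]> storeFor(String name) {
        return diskExtensions.contains(extension(name)) ? disk : ram;
    }

    void write(String name, byte[] data) {
        storeFor(name).put(name, data);
    }

    byte[] read(String name) {
        return storeFor(name).get(name);
    }

    public static void main(String[] args) {
        ExtensionSwitchSketch dir = new ExtensionSwitchSketch(Set.of("fdt", "fdx"));
        dir.write("_0.fdt", new byte[] {1}); // doc store file
        dir.write("_0.frq", new byte[] {2}); // postings file
        System.out.println("disk: " + dir.disk.keySet());
        System.out.println("ram:  " + dir.ram.keySet());
    }
}
```

Because callers only ever see one object, nothing outside the facade needs to know which store holds which part of a segment.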

          Jason Rutherglen added a comment -

          > non-NRT case I can think of is when users make temporary indices in RAM

          Yes, and there could be others we don't know about.

          > it might be cleaner to make a Directory impl that dispatches certain files to a RAMDir and others to an FSDir

          Good idea. I'll try that method first. If this one works out, then the API will be public?

          Tim Smith added a comment -

          Would also further suggest that this Directory implementation would take one or more directories to store documents, along with one or more directories to store the index itself

          one of the directories should be explicitly marked for "reading" for each use

          this allows creating a Directory instance that will:

          • store documents to disk (reading from disk during searches)
          • write index to disk and ram (reading from RAM during searches)
          Michael McCandless added a comment -

          > it might be cleaner to make a Directory impl that dispatches certain files to a RAMDir and others to an FSDir

          > Good idea. I'll try that method first. If this one works out, then the API will be public?

          Which API would be public?

          If this (call it "FileSwitchDirectory" for now) works then we would not add any API to IndexWriter (ie it's either or)? But FileSwitchDirectory would be public & "expert".

          One downside to this approach is it's brittle – whenever we change file extensions you'd have to "know" to fix this Directory. Or maybe we make the Directory specialized to only storing the doc stores in the FSDir, then whenever we change file formats we would fix this directory? But in the future, with custom codecs, things could be named whatever... hmmm. Lacking clarity.

          Eks Dev added a comment -

          Maybe FileSwitchDirectory should have the possibility to get the file list/extensions that should be loaded into RAM... making it maintenance free, pushing this decision to the end user... if, and when we decide to support users in it, we could then maintain a static list at a separate place. Kind of separate execution and configuration.

          I think I saw something similar Ning Lee made quite a while ago, from the hadoop camp (indexing on hadoop something...). But I cannot remember what it was.

          Michael McCandless added a comment -

          > Would also further suggest that this Directory implementation would take one or more directories to store documents, along with one or more directories to store the index itself

          You mean an opened IndexOutput would write its output to two (or more) different places? So you could "write through" a RAMDir down to an FSDir? (This way both the RAMDir and FSDir have a copy of the index).

          Michael McCandless added a comment -

          > FileSwitchDirectory should have the possibility to get the file list/extensions that should be loaded into RAM... making it maintenance free, pushing this decision to the end user... if, and when we decide to support users in it, we could then maintain a static list at a separate place. Kind of separate execution and configuration

          +1

          With flexible indexing, presumably one could use their codec to ask it for the "doc store extensions" vs the "postings extensions", etc., and pass to this configurable FileSwitchDirectory.

          Tim Smith added a comment -

          > You mean an opened IndexOutput would write its output to two (or more) different places? So you could "write through" a RAMDir down to an FSDir? (This way both the RAMDir and FSDir have a copy of the index).

          yes, so if you register more than one directory for "index files", then the IndexOutput for the directory would dispatch to an IndexOutput for both sub directories
          then, the IndexInput would only be opened on the "primary" directory (for instance, the RAM directory)

          This will allow extremely fast searches, with the persistence of a backing FSDirectory

          coupled with then having a set of directories for the "Stored Documents", then allows:

          • RAM directory search speed
          • All changes persisted to disk
          • Documents Stored (and retrieved from disk) (or optionally retrieved from RAM)
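
The write-through idea described above can be sketched with plain `java.io` streams. This is a hedged illustration only: `TeeOutputSketch` is a hypothetical name, it does not extend Lucene's IndexOutput, and it makes no attempt to keep the two copies consistent if one of the writes fails.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch using plain java.io streams, not Lucene's IndexOutput.
class TeeOutputSketch extends OutputStream {
    private final OutputStream primary; // the copy that reads would be served from
    private final OutputStream backing; // the persistent copy

    TeeOutputSketch(OutputStream primary, OutputStream backing) {
        this.primary = primary;
        this.backing = backing;
    }

    @Override
    public void write(int b) throws IOException {
        primary.write(b);
        backing.write(b);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        primary.write(buf, off, len);
        backing.write(buf, off, len);
    }

    @Override
    public void flush() throws IOException {
        primary.flush();
        backing.flush();
    }

    @Override
    public void close() throws IOException {
        try {
            primary.close();
        } finally {
            backing.close(); // close the second target even if the first fails
        }
    }

    // Writes data through a tee and returns what each target received.
    static byte[][] writeThrough(byte[] data) {
        ByteArrayOutputStream ram = new ByteArrayOutputStream();
        ByteArrayOutputStream disk = new ByteArrayOutputStream();
        try (TeeOutputSketch out = new TeeOutputSketch(ram, disk)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new byte[][] {ram.toByteArray(), disk.toByteArray()};
    }

    public static void main(String[] args) {
        byte[][] copies = writeThrough(new byte[] {1, 2, 3});
        System.out.println("copies identical: " + Arrays.equals(copies[0], copies[1]));
    }
}
```

In a real directory the two targets would be outputs opened on the RAM and disk directories, and inputs would be opened only on whichever one is marked for reading.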
          Michael McCandless added a comment -

          Neat. This is sounding like one cool Directory...

          Earwin Burrfoot added a comment -

          > You mean an opened IndexOutput would write its output to two (or more) different places?

          Except the best way is to write directly to FSDir.IndexOutput, and when it is closed, read back into memory.
          That way, if FSDir.IO hits an exception while writing, you don't have to jump through the hoops to keep your RAMDir in consistent state (we had real troubles when some files were 'written' to RAMDir, but failed to persist in FSDir).
          Also, when reading the file back you already know its exact size and can allocate appropriate buffer, saving on resizings (my draft impl) / chunking (lucene's current impl) overhead.
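
A rough sketch of this "disk first, then read back" ordering, assuming `java.nio.file` as the backing store. `DiskFirstCacheSketch` and its methods are hypothetical and are not taken from the attached MemoryCachedDirectory.java; the point is only that the in-memory copy is created after a successful close, so a failed write never leaves a file that exists in RAM but not on disk.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: files are written to disk only, then cached in memory on close.
class DiskFirstCacheSketch {
    private final Path dir;
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    DiskFirstCacheSketch(Path dir) {
        this.dir = dir;
    }

    // The returned stream writes straight to the file; nothing is cached until close().
    OutputStream create(String name) throws IOException {
        Path file = dir.resolve(name);
        return new FilterOutputStream(Files.newOutputStream(file)) {
            @Override
            public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len); // bulk write instead of byte-at-a-time
            }

            @Override
            public void close() throws IOException {
                super.close(); // if this throws, the cache is left untouched
                // The file is complete here, so it is read back in one piece.
                cache.put(name, Files.readAllBytes(file));
            }
        };
    }

    byte[] cached(String name) {
        return cache.get(name);
    }

    // Writes data to a temp directory and returns the bytes cached after close.
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempDirectory("disk-first-sketch");
            DiskFirstCacheSketch dir = new DiskFirstCacheSketch(tmp);
            try (OutputStream out = dir.create("_0.fdt")) {
                out.write(data);
            }
            byte[] result = dir.cached("_0.fdt");
            Files.delete(tmp.resolve("_0.fdt"));
            Files.delete(tmp);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cached bytes: " + roundTrip(new byte[] {1, 2, 3}).length);
    }
}
```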

          Earwin Burrfoot made changes -
          Attachment MemoryCachedDirectory.java [ 12406644 ]
          Yonik Seeley added a comment -

          As it relates to near real time, the search speed of the RAM directory in relation to FSDirectory seems unimportant (what is this diff anyway?) - the FSDirectory will be much larger and that is where the bulk of the search time will be.

          It seems like the main benefit of RAMDirectory for NRT is faster creation time (no need to create on-disk files, write them, then sync them), right? Actually the sync is only needed if a new segments file will be written... but there still may be synchronous metadata operations for open-write-close of a file, depending on the FS?

          Earwin Burrfoot added a comment -

          > what is this diff anyway?

          That's not a diff, I gave a sample of write-through ram directory Tim and Mike were speaking about.

          Yonik Seeley added a comment -

          > That's not a diff

          Sorry, by "diff" I meant the difference in search performance on a RAMDirectory vs NIOFSDirectory where the files are all cached by the OS.

          Michael McCandless added a comment -

          > by "diff" I meant the difference in search performance on a RAMDirectory vs NIOFSDirectory where the files are all cached by the OS.

          It's a good question – I haven't tested it directly. I'd love to know too...

          For an NRT writer using RAMDir for recently flushed tiny segments (LUCENE-1313), the gains are more about the speed of reading/writing many tiny files. Probably we should try [somehow] to test this case, to see if LUCENE-1313 is even a worthwhile optimization.

          Earwin Burrfoot added a comment -

          > Sorry, by "diff" I meant the difference in search performance on a RAMDirectory vs NIOFSDirectory where the files are all cached by the OS.

          Ah! It exists. Ranked by speed, slowest to fastest, directories are FSDirectory (native/sys calls), MMapDirectory (native), RAMDirectory (chunked), MemCachedDirectory (raw array access). But for the purposes of searching a small amount of freshly-indexed docs this difference is minuscule at best, me thinks.

          Jason Rutherglen added a comment -

          > For an NRT writer using RAMDir for recently flushed tiny
          > segments (LUCENE-1313), the gains are more about the speed of
          > reading/writing many tiny files. Probably we should try
          > [somehow] to test this case, to see if LUCENE-1313 is even a
          > worthwhile optimization.

          True a test would be good, how many files per second would it
          produce?

          When testing realtime with the .del files (which were created
          in large numbers before LUCENE-1516), the slowdown was quite
          dramatic, as it's not a sequential write, which means the disk
          head can move each time. Couple that with ongoing merges, which
          completely tie up the IO, and I think it's hard for small file
          writes not to slow down with a rapidly updating index.

          An index that is being updated rapidly presumably would be
          performing merges more often to remove deletes.

          Jason Rutherglen added a comment -

          > One downside to this approach is it's brittle - whenever
          > we change file extensions you'd have to "know" to fix this
          > Directory.

          True, I don't think we can expect the user to pass in the
          correct FileSwitchDirectory (with the attendant file
          extensions), we can make the particular implementation of
          Directory we use to solve this problem internal to IW. Meaning
          the writer can pass through the real directory calls to FSD, and
          handle the RAMDir calls on its own.

          Jason Rutherglen added a comment -

          Implementation of the FileSwitchDirectory. It's nice this works
          so elegantly with the existing Lucene APIs.

          The test case makes sure the fdt and fdx files are written to
          the fsdirectory based on the file extensions. I feel that
          LUCENE-1313 will depend on this and I'll implement LUCENE-1313
          with this patch in mind. I'm not sure how we ensure there are no
          file name collisions between the real dir and FSD? Because IW is
          managing the creation of the segment names I don't think we
          need to worry about this.

          Jason Rutherglen made changes -
          Attachment LUCENE-1618.patch [ 12406724 ]
          Michael McCandless added a comment -

          Patch looks good Jason!

          Can you add copyright header & CHANGES.txt entry, and remove some noise (eg TestIndexWriterReader.java)?

          Also: I think you should allow any Directory instance as primary/secondary? (You're hardwiring to RAMDir/FSDir now). I realize NRT's use of this will be a RAMDir/FSDir, but I think this dir can be generic. Can you also implement listAll()?

          Finally: maybe for the "tee" (IndexOutput "writes through" two Dirs, suggested above) functionality, we should create a different Directory impl?

          Jason Rutherglen added a comment -
          • Copyright added
          • CHANGES.txt added
          • Cleaned up
          • RAMDir specific stuff removed from FSD

          > maybe for the "tee" (IndexOutput "writes through" two
          > Dirs, suggested above) functionality, we should create a
          > different Directory impl?

          I think a different directory impl makes sense, the
          functionality of FileSwitchDirectory is fairly specific.

          Jason Rutherglen made changes -
          Attachment LUCENE-1618.patch [ 12406864 ]
          Michael McCandless made changes -
          Assignee Michael McCandless [ mikemccand ]
          Michael McCandless added a comment -

          New patch attached w/ minor fixes: added more detail in CHANGES entry; renamed "real" and "other" dir to "primary" and "secondary" dir; tweaked javadocs. I plan to commit later today.

          Once this is in, Jason can you update LUCENE-1313 to use this class? Thanks.

          Michael McCandless made changes -
          Attachment LUCENE-1618.patch [ 12407003 ]
          Michael McCandless added a comment -

          Thanks Jason!

          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jason Rutherglen added a comment -

          Added getter methods to FSD for the underlying directories

          Jason Rutherglen made changes -
          Attachment LUCENE-1618.patch [ 12407128 ]
          Michael McCandless added a comment -

          OK thanks Jason, I just committed that (w/ small change to listAll to directly allocate the String[]).

          Jason Rutherglen added a comment -

          Added fileExists checking in getDirectory before asking
          regarding the extension. This is useful when IndexFileDeleter
          uses FSD as a way to combine directories in LUCENE-1313.
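
As an illustration of the lookup order described in this comment, here is a minimal sketch, assuming in-memory maps as stand-ins for the two underlying directories and assuming the listed extensions map to the secondary side. `ExistsFirstSketch` and `resolve` are hypothetical names and do not reproduce the patch; the sketch only shows "check existence first, fall back to the extension mapping for new names".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an existence check that runs before the extension rule.
class ExistsFirstSketch {
    final Map<String, byte[]> primary = new HashMap<>();   // stand-in for the primary dir
    final Map<String, byte[]> secondary = new HashMap<>(); // stand-in for the secondary dir
    private final Set<String> secondaryExtensions;         // assumed routing rule

    ExistsFirstSketch(Set<String> secondaryExtensions) {
        this.secondaryExtensions = secondaryExtensions;
    }

    // Text after the last '.', or "" when the name has no dot.
    static String extension(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    // Existence wins; the extension mapping is only the fallback for unknown names.
    Map<String, byte[]> resolve(String name) {
        if (primary.containsKey(name)) {
            return primary;
        }
        if (secondary.containsKey(name)) {
            return secondary;
        }
        return secondaryExtensions.contains(extension(name)) ? secondary : primary;
    }

    boolean fileExists(String name) {
        return resolve(name).containsKey(name);
    }

    public static void main(String[] args) {
        ExistsFirstSketch fsd = new ExistsFirstSketch(Set.of("fdt", "fdx"));
        // A doc store file written directly to the primary side, bypassing the switch.
        fsd.primary.put("_0.fdt", new byte[] {1});
        System.out.println("found _0.fdt: " + fsd.fileExists("_0.fdt"));
    }
}
```

With a pure extension mapping, the same `_0.fdt` lookup would be sent to the secondary side and the file would appear to be missing.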

          Jason Rutherglen made changes -
          Attachment LUCENE-1618.patch [ 12407568 ]
          Michael McCandless added a comment -

          > Added fileExists checking in getDirectory

          Jason, why is this needed? Why is the mapping based on extension insufficient?

          Jason Rutherglen added a comment -

          One example of the use case is when IndexFileDeleter needs to
          access the directory's files as is without extension
          interpretation. A .fdt file that was written directly to the
          primary directory (not through FSD) would fit this case. When
          IFD tries to access the .fdt file (using the current code) FSD
          says it's not there (because it thinks it's in the secondary
          dir).

          Maybe we need a different type of FSD for this case?

          Michael McCandless added a comment -

          I think if one is directly writing a file to the primary directory (not through FSD) then one should/could also delete directly from that directory? I don't think we should be putting the magic inside FSD.

          Jason Rutherglen added a comment -

          Well, it was implemented this way to accommodate not passing two
          directories around (such as to IFD). So that methods such as
          Dir.list would work properly. It seems that we want an
          alternative to FSD that only combines directories?

          Mark Miller made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Workflow jira [ 12461926 ] Default workflow, editable Closed status [ 12563275 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12563275 ] jira [ 12584170 ]

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Jason Rutherglen
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Original Estimate: 336h
                Remaining Estimate: 336h
                Time Spent: Not Specified
