Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently CompoundFileReader could use some improvements, i see the following problems

      • its CSIndexInput extends bufferedindexinput, which is stupid for directories like mmap.
      • it seeks on every readInternal
      • its not possible for a directory to override or improve the handling of compound files.

      for example: it seems if you were impl'ing this thing from scratch, you would just wrap the II directly (not extend BufferedIndexInput,
      and add compound file offset X to seek() calls, and override length(). But of course, then you couldnt throw read past EOF always when you should,
      as a user could read into the next file and be left unaware.

      however, some directories could handle this better. for example MMapDirectory could return an indexinput that simply mmaps the 'slice' of the CFS file.
      its underlying bytebuffer etc naturally does bounds checks already etc, so it wouldnt need to be buffered, not even needing to add any offsets to seek(),
      as its position would just work.

      So I think we should try to refactor this so that a Directory can customize how compound files are handled, the simplest
      case for the least code change would be to add this to Directory.java:

        public Directory openCompoundInput(String filename) {
          return new CompoundFileReader(this, filename);
        }
      

      Because most code depends upon the fact compound files are implemented as a Directory and transparent. at least then a subclass could override...
      but the 'recursion' is a little ugly... we could still label it expert+internal+experimental or whatever.

      1. LUCENE-3201.patch
        58 kB
        Robert Muir
      2. LUCENE-3201.patch
        43 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          +1

          Show
          Michael McCandless added a comment - +1
          Hide
          Robert Muir added a comment -

          I think for this one, I prefer to wait for Uwe's refactoring of MMap on LUCENE-3200.
          Then mmap is simpler, and i think we can even use the same indexinput implementation here.

          This would mean no slowdown when searching CFS.

          Show
          Robert Muir added a comment - I think for this one, I prefer to wait for Uwe's refactoring of MMap on LUCENE-3200 . Then mmap is simpler, and i think we can even use the same indexinput implementation here. This would mean no slowdown when searching CFS.
          Hide
          Robert Muir added a comment -

          Initial patch for review. In this patch I only cut over MMapDirectory to using a special CompoundFileDirectory, all others use the default as before (but i cleaned up some things about it).

          Pretty sure i can easily improve SimpleFS and NIOFS, i'll take a look at that now, but I wanted to get this up for review.

          Show
          Robert Muir added a comment - Initial patch for review. In this patch I only cut over MMapDirectory to using a special CompoundFileDirectory, all others use the default as before (but i cleaned up some things about it). Pretty sure i can easily improve SimpleFS and NIOFS, i'll take a look at that now, but I wanted to get this up for review.
          Hide
          Michael McCandless added a comment -

          Patch looks great! Incredible that this means there's no penalty at all at search time when using CFS, if you use MMapDir.

          I like that CFS reader is now under oal.store not .index.

          Show
          Michael McCandless added a comment - Patch looks great! Incredible that this means there's no penalty at all at search time when using CFS, if you use MMapDir. I like that CFS reader is now under oal.store not .index.
          Hide
          Robert Muir added a comment -

          setting 3.3/4.0 as fix version, as the changes are backwards compatible (compoundfilereader is pkg-private still in 3.x)

          Show
          Robert Muir added a comment - setting 3.3/4.0 as fix version, as the changes are backwards compatible (compoundfilereader is pkg-private still in 3.x)
          Hide
          Uwe Schindler added a comment -

          Hi Robert, great patch, exactly as I would have wished to have it when we discussed about it!

          Patch looks file, small bug:

          • FileSwitchDirectory should also override the openCompoundInput() from Directory and delegate to the correct underlying directory. Now it always uses the default impl, which is double buffering. So if you e.g. put MMapDirectory as a delegate for CFS files, those files would be opened like before your patch. Just copy'n'paste the code from one of the other FileSwitchDirectory methods.

          Some suggestions:
          We currently map the whole compound file into address space, read the header/contents and unmap it again. This may be some overhead especially if unmapping is not supported.

          • We could use SimpleFSIndexInput to read CFS contents (we only need to pass the already open RAF there, alternatively use Dawids new wrapper IndexInput around a standard InputStream, got from RAF -> LUCENE-3202)
          • Only map the header of the CFS file, the problem: we dont know exact size.
          Show
          Uwe Schindler added a comment - Hi Robert, great patch, exactly as I would have wished to have it when we discussed about it! Patch looks file, small bug: FileSwitchDirectory should also override the openCompoundInput() from Directory and delegate to the correct underlying directory. Now it always uses the default impl, which is double buffering. So if you e.g. put MMapDirectory as a delegate for CFS files, those files would be opened like before your patch. Just copy'n'paste the code from one of the other FileSwitchDirectory methods. Some suggestions: We currently map the whole compound file into address space, read the header/contents and unmap it again. This may be some overhead especially if unmapping is not supported. We could use SimpleFSIndexInput to read CFS contents (we only need to pass the already open RAF there, alternatively use Dawids new wrapper IndexInput around a standard InputStream, got from RAF -> LUCENE-3202 ) Only map the header of the CFS file, the problem: we dont know exact size.
          Hide
          Robert Muir added a comment -

          I agree, the fileswitchdirectory should delegate the openCompoundInput.

          As far as mapping small things, I think we should set this aside for another issue.
          as far as this issue goes, I don't mind returning the DefaultCompound impl if unmapping isn't supported, but i'd really rather defer the open the can of worms of 'mapping small things' to some other issue

          Show
          Robert Muir added a comment - I agree, the fileswitchdirectory should delegate the openCompoundInput. As far as mapping small things, I think we should set this aside for another issue. as far as this issue goes, I don't mind returning the DefaultCompound impl if unmapping isn't supported, but i'd really rather defer the open the can of worms of 'mapping small things' to some other issue
          Hide
          Uwe Schindler added a comment -

          We have LUCENE-1743 for the small files can of worms.

          Show
          Uwe Schindler added a comment - We have LUCENE-1743 for the small files can of worms.
          Hide
          Robert Muir added a comment -

          here is an updated patch, including impls for SimpleFS and NIOFS, fixing the FileSwitchDirectory thing uwe mentioned, and also mockdirectorywrapper and NRTCachingDirectory.

          all the tests pass with Simple/NIO/MMap but we need to benchmark. haven't had good luck today with luceneutil

          Show
          Robert Muir added a comment - here is an updated patch, including impls for SimpleFS and NIOFS, fixing the FileSwitchDirectory thing uwe mentioned, and also mockdirectorywrapper and NRTCachingDirectory. all the tests pass with Simple/NIO/MMap but we need to benchmark. haven't had good luck today with luceneutil
          Hide
          Uwe Schindler added a comment - - edited

          Robert: Very nice. Small thing:

          • NIOFSCompoundFileDirectory / SimpleFSCompoundFileDirectory / MMapCompoundFileDirectory are non-static inner classes but still get parent Directory in ctor. This is douplicated as javac also passes the parent around (the special ParentClassName.this one). I would remove the ctor param and use "*FSDirectory.this" as reference to outer class. I nitpick, because at some places it references the parent directory without the ctor param, so its inconsistent.

          That's all for now, thanks for hard work!

          Show
          Uwe Schindler added a comment - - edited Robert: Very nice. Small thing: NIOFSCompoundFileDirectory / SimpleFSCompoundFileDirectory / MMapCompoundFileDirectory are non-static inner classes but still get parent Directory in ctor. This is douplicated as javac also passes the parent around (the special ParentClassName.this one). I would remove the ctor param and use "*FSDirectory.this" as reference to outer class. I nitpick, because at some places it references the parent directory without the ctor param, so its inconsistent. That's all for now, thanks for hard work!
          Hide
          Simon Willnauer added a comment -

          this seems ready to commit... I think we should get that in so I can take it further on LUCENE-3218

          Robert is it ok for you if I commit this or are you gonig to do it?

          simon

          Show
          Simon Willnauer added a comment - this seems ready to commit... I think we should get that in so I can take it further on LUCENE-3218 Robert is it ok for you if I commit this or are you gonig to do it? simon
          Hide
          Robert Muir added a comment -

          I didnt commit because I didn't measure any performance improvements from the patch (this frustrated me).
          Also, I didn't address Uwe's last comment...

          In general, I was thinking that this would be a good performance win, but it isn't. So we should consider it from a refactoring perspective only.

          Show
          Robert Muir added a comment - I didnt commit because I didn't measure any performance improvements from the patch (this frustrated me). Also, I didn't address Uwe's last comment... In general, I was thinking that this would be a good performance win, but it isn't. So we should consider it from a refactoring perspective only.
          Hide
          Simon Willnauer added a comment -

          incorporated in LUCENE-3218 I will track backporting there

          Show
          Simon Willnauer added a comment - incorporated in LUCENE-3218 I will track backporting there
          Hide
          Robert Muir added a comment -

          reopening, like LUCENE-3218, I think we should pull this stuff back and revisit.

          Show
          Robert Muir added a comment - reopening, like LUCENE-3218 , I think we should pull this stuff back and revisit.
          Hide
          Uwe Schindler added a comment -

          My last comment was wrong, the impl was changed before commit so it reuses RAF.

          Show
          Uwe Schindler added a comment - My last comment was wrong, the impl was changed before commit so it reuses RAF.
          Hide
          Robert Muir added a comment -

          thats just not true... but illustrates my point that this stuff is complicated and I think we need to take the safe option here and back it out.

          Show
          Robert Muir added a comment - thats just not true... but illustrates my point that this stuff is complicated and I think we need to take the safe option here and back it out.
          Hide
          Uwe Schindler added a comment -

          I reverted my comment

          Show
          Uwe Schindler added a comment - I reverted my comment
          Hide
          Simon Willnauer added a comment -

          I think we can close this issue unless we plan to backport the CFS changes to 3.x? Opinions?

          Show
          Simon Willnauer added a comment - I think we can close this issue unless we plan to backport the CFS changes to 3.x? Opinions?
          Hide
          Robert Muir added a comment -

          not a blocker, it was pulled from 3.x (and fixed in trunk)

          Show
          Robert Muir added a comment - not a blocker, it was pulled from 3.x (and fixed in trunk)
          Hide
          Simon Willnauer added a comment -

          I am closing this... if we feel like porting we can still reopen

          Show
          Simon Willnauer added a comment - I am closing this... if we feel like porting we can still reopen

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development