Details

    • Lucene Fields:
      New

      Description

      Today DirectIOLinuxDir is tricky/dangerous to use, because you only want to use it for indexWriter and not IndexReader (searching). It's a trap.

      But, once we do LUCENE-2793, we can make it fully general purpose because then a single native Dir impl can be used.

      I'd also like to make it generic to other Unices, if we can, so that it becomes UnixDirectory.

      1. LUCENE-2795.patch
        27 kB
        Michael McCandless
      2. LUCENE-2795.patch
        26 kB
        Varun Thacker
      3. LUCENE-2795.patch
        26 kB
        Varun Thacker
      4. LUCENE-2795.patch
        26 kB
        Varun Thacker
      5. LUCENE-2795.patch
        26 kB
        Varun Thacker
      6. LUCENE-2795.patch
        23 kB
        Simon Willnauer
      7. LUCENE-2795.patch
        23 kB
        Varun Thacker
      8. LUCENE-2795.patch
        10 kB
        Varun Thacker
      9. LUCENE-2795.patch
        21 kB
        Varun Thacker
      10. LUCENE-2795.patch
        15 kB
        Varun Thacker

        Issue Links

          Activity

          Show
          Michael McCandless added a comment - https://issues.apache.org/jira/browse/LUCENE-2500?focusedCommentId=12979588&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12979588 has details on what flags to pass to OS X to bypass its buffer cache...
          Hide
          Michael McCandless added a comment -

          It looks like recent Linux kernels have better behavior with SEQUENTIAL flag: http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html?showComment=1303235497682#c2572106601600642254

          If possible we should test on kernels after that patch was merged to see if passing SEQUENTIAL for merging prevents eviction of hot pages being used for searching...

          Show
          Michael McCandless added a comment - It looks like recent Linux kernels have better behavior with SEQUENTIAL flag: http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html?showComment=1303235497682#c2572106601600642254 If possible we should test on kernels after that patch was merged to see if passing SEQUENTIAL for merging prevents eviction of hot pages being used for searching...
          Hide
          Varun Thacker added a comment -

          This is great for merging . Does IndexReader use the SEQUENTIAL flag for accessing the index ?

          Show
          Varun Thacker added a comment - This is great for merging . Does IndexReader use the SEQUENTIAL flag for accessing the index ?
          Hide
          Michael McCandless added a comment -

          Today, we don't use any of these flags for searching/merging (that's what this issue and LUCENE-2793 will fix).

          Unfortunately, it looks like the best IO flags are going to vary depending on kernel version. The above patch was applied in kernel 2.6.29, and causes SEQUENTIAL to act like NOREUSE, and so we should use SEQUENTIAL for merging but not for searching, I think? We should experiment and see...

          Really, for searching we want the "old" SEQUENTIAL behavior (ie, do aggressive readahead, but use the normal eviction policy since that term may be searched on again).

          Hopefully NativeUnixDir will take care of all these os/kernel-version dependent flags for you.

          Show
          Michael McCandless added a comment - Today, we don't use any of these flags for searching/merging (that's what this issue and LUCENE-2793 will fix). Unfortunately, it looks like the best IO flags are going to vary depending on kernel version. The above patch was applied in kernel 2.6.29, and causes SEQUENTIAL to act like NOREUSE, and so we should use SEQUENTIAL for merging but not for searching, I think? We should experiment and see... Really, for searching we want the "old" SEQUENTIAL behavior (ie, do aggressive readahead, but use the normal eviction policy since that term may be searched on again). Hopefully NativeUnixDir will take care of all these os/kernel-version dependent flags for you.
          Hide
          Varun Thacker added a comment -

          Initial Patch..

          This will apply to the IOContext branch. This patch also contains some uncommitted code from LUCENE-2793.

          In NativePosixUtil.cpp the open_direct method where I use fcntl() for OS X could not be compiled on my machine because of the OS specific F_NOCACHE flag used. Let me know if the usage is correct.

          Show
          Varun Thacker added a comment - Initial Patch.. This will apply to the IOContext branch. This patch also contains some uncommitted code from LUCENE-2793 . In NativePosixUtil.cpp the open_direct method where I use fcntl() for OS X could not be compiled on my machine because of the OS specific F_NOCACHE flag used. Let me know if the usage is correct.
          Hide
          Mark Miller added a comment -

          Hey Varun - saw you asked for someone with a mac to run some code for you in IRC but you popped off before I saw - what do you need? Just apply the patch and run the tests?

          Show
          Mark Miller added a comment - Hey Varun - saw you asked for someone with a mac to run some code for you in IRC but you popped off before I saw - what do you need? Just apply the patch and run the tests?
          Hide
          Varun Thacker added a comment -

          Hey Varun - saw you asked for someone with a mac to run some code for you in IRC but you popped off before I saw - what do you need? Just apply the patch and run the tests?

          This patch will apply to the LUCENE2793 branch. Othewise in file : lucene/contrib/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp after line 117 inside the "if" add this line -

           fcntl(fd, F_NOCACHE, 1); 

          And then by running

          ant build-native-unix

          from the /contrib/misc folder to check if it compiles successfully.

          Thanks.

          Show
          Varun Thacker added a comment - Hey Varun - saw you asked for someone with a mac to run some code for you in IRC but you popped off before I saw - what do you need? Just apply the patch and run the tests? This patch will apply to the LUCENE2793 branch. Othewise in file : lucene/contrib/misc/src/java/org/apache/lucene/store/NativePosixUtil.cpp after line 117 inside the "if" add this line - fcntl(fd, F_NOCACHE, 1); And then by running ant build- native -unix from the /contrib/misc folder to check if it compiles successfully. Thanks.
          Hide
          Varun Thacker added a comment -

          the open_direct, posix_fadvise function in onlylinux.h remains the same.

          In onlybsd.h open_direct, posix_fadvise are the same too except that the O_NOATIME flag is not present.

          In onlyosx the open_direct is implemented in a different way.

          Also I have added a open_normal function to all of the headers which will be used in case the IOContext in not a MERGE.

          Show
          Varun Thacker added a comment - the open_direct, posix_fadvise function in onlylinux.h remains the same. In onlybsd.h open_direct, posix_fadvise are the same too except that the O_NOATIME flag is not present. In onlyosx the open_direct is implemented in a different way. Also I have added a open_normal function to all of the headers which will be used in case the IOContext in not a MERGE.
          Hide
          Varun Thacker added a comment -

          I'm not sure this is the correct/best way to this.

          Show
          Varun Thacker added a comment - I'm not sure this is the correct/best way to this.
          Hide
          Michael McCandless added a comment -

          OK, so now we create a FileDescriptor obj in C and return that, by
          setting the fd field directly? This is because fadvise/posix_fadvise
          on an already opened fd isn't portable enough? Ie we must open the
          file w/ the flags, instead, for some OS's?

          I don't think you need posix_fadvise.h at all? Can't you just
          #include the right headers per OS, inside each OS's ifdef?

          Missing some { } around single-statement if bodies. Can you move the
          "return NULL" in "if (class_ioex == NULL) return NULL;" to the next
          line, with { }?

          Show
          Michael McCandless added a comment - OK, so now we create a FileDescriptor obj in C and return that, by setting the fd field directly? This is because fadvise/posix_fadvise on an already opened fd isn't portable enough? Ie we must open the file w/ the flags, instead, for some OS's? I don't think you need posix_fadvise.h at all? Can't you just #include the right headers per OS, inside each OS's ifdef? Missing some { } around single-statement if bodies. Can you move the "return NULL" in "if (class_ioex == NULL) return NULL;" to the next line, with { }?
          Hide
          Varun Thacker added a comment -

          Changed DirectIOLinuxDirectory to NativeUnixDirectory and also made some other changes.

          Show
          Varun Thacker added a comment - Changed DirectIOLinuxDirectory to NativeUnixDirectory and also made some other changes.
          Hide
          Simon Willnauer added a comment -

          here is an updated patch that works on OSX. I had to copy the old DirectIOLinuxDirectory.java to NativeUnixDirectory.java (svn copy) to apply the patch cleanly.

          I added an additional path to the JDKs header files and a linker argument otherwise I got a linker error. GCC by default doesn't pass -lstdc++ to the linker

          Show
          Simon Willnauer added a comment - here is an updated patch that works on OSX. I had to copy the old DirectIOLinuxDirectory.java to NativeUnixDirectory.java (svn copy) to apply the patch cleanly. I added an additional path to the JDKs header files and a linker argument otherwise I got a linker error. GCC by default doesn't pass -lstdc++ to the linker
          Hide
          Michael McCandless added a comment -

          Can you touch up the javadocs? Ie it's more general now (not just
          linux), and "uses direct io" instead of "Linux's O_DIRECT flag", and
          uses direct io for all merge input & output. Make it clear that this
          is a general purpose dir impl...

          NativeUnixndexOutput should be NativeUnixIndexOutput.

          The createOutput function should also switch between normal vs direct,
          depending on MERGE or READ?

          I think we should have a default forced buffer size? The thing to
          keep in mind is how many open inputs/outputs we will have during
          merging, eg I think something like 5 or 8 files per merged segment
          plus another 5 or 8 for the resulting merged segment, so up to 88 file
          handles open. Maybe 256 KB default?

          We shouldn't by default use the BII's buffer size: it's way too small
          for direct IO. And then we should always use the forced buffer size
          in this dir...

          Tests pass for me on Fedora 13 linux! Next I'll try OS X.

          Show
          Michael McCandless added a comment - Can you touch up the javadocs? Ie it's more general now (not just linux), and "uses direct io" instead of "Linux's O_DIRECT flag", and uses direct io for all merge input & output. Make it clear that this is a general purpose dir impl... NativeUnixndexOutput should be NativeUnixIndexOutput. The createOutput function should also switch between normal vs direct, depending on MERGE or READ? I think we should have a default forced buffer size? The thing to keep in mind is how many open inputs/outputs we will have during merging, eg I think something like 5 or 8 files per merged segment plus another 5 or 8 for the resulting merged segment, so up to 88 file handles open. Maybe 256 KB default? We shouldn't by default use the BII's buffer size: it's way too small for direct IO. And then we should always use the forced buffer size in this dir... Tests pass for me on Fedora 13 linux! Next I'll try OS X.
          Hide
          Michael McCandless added a comment -

          Also, we should run some more tests here, on modern kernels, since the SEQUENTIAL fadvise flag is better implemented in kernels >= 2.6.29; see this comment:

          http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html?showComment=1303235497682#c2572106601600642254

          It could be, if kernel is new enough, we should not do DIRECT and instead do SEQUENTIAL.

          Show
          Michael McCandless added a comment - Also, we should run some more tests here, on modern kernels, since the SEQUENTIAL fadvise flag is better implemented in kernels >= 2.6.29; see this comment: http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html?showComment=1303235497682#c2572106601600642254 It could be, if kernel is new enough, we should not do DIRECT and instead do SEQUENTIAL.
          Hide
          Michael McCandless added a comment -

          OK, passes for me on OS X 10.6.6. I first had to install the Java
          developer package (jni.h was missing) from http://connect.apple.com

          Show
          Michael McCandless added a comment - OK, passes for me on OS X 10.6.6. I first had to install the Java developer package (jni.h was missing) from http://connect.apple.com
          Hide
          Varun Thacker added a comment -

          Updated jdocs.

          createOutput switches between direct and normal.

          Added constructors NativeUnixIndexOutput/NativeUnixIndexInput which does not take buffersize and uses 256kb by default.

          Changed logic of using forcedbuffersize as suggested - It is only used in MERGE context rest of the times BII.buffersize

          I am still trying to figure out how to find out different kernel versions in this case. This is what I found out till now. Use the system("uname -r") call in the c program directly. Use a makefile. Still haven't figured out how to do so using cpptasks though.

          Show
          Varun Thacker added a comment - Updated jdocs. createOutput switches between direct and normal. Added constructors NativeUnixIndexOutput/NativeUnixIndexInput which does not take buffersize and uses 256kb by default. Changed logic of using forcedbuffersize as suggested - It is only used in MERGE context rest of the times BII.buffersize I am still trying to figure out how to find out different kernel versions in this case. This is what I found out till now. Use the system("uname -r") call in the c program directly. Use a makefile. Still haven't figured out how to do so using cpptasks though.
          Hide
          Michael McCandless added a comment -

          I think NativeUnixDir should have a ctor that takes no buffer size, and uses the default? And merging should always use the forcedBufferSize (let's rename it to mergeBufferSize?), ie, never delegate to BII.bufferSize(context).

          Then, NativeUnixIndexInput/Output shouldn't have 2 ctors, only 1 (that takes bufferSize) and NativeUnixDir just passes the buffer size?

          Somehow, but this should be a different issue, this Dir impl should only provide the II/IO impls for use during merging; if it's not a merge it should delegate to another Dir impl. Maybe we pass a delegate to the ctor or something...

          Show
          Michael McCandless added a comment - I think NativeUnixDir should have a ctor that takes no buffer size, and uses the default? And merging should always use the forcedBufferSize (let's rename it to mergeBufferSize?), ie, never delegate to BII.bufferSize(context). Then, NativeUnixIndexInput/Output shouldn't have 2 ctors, only 1 (that takes bufferSize) and NativeUnixDir just passes the buffer size? Somehow, but this should be a different issue, this Dir impl should only provide the II/IO impls for use during merging; if it's not a merge it should delegate to another Dir impl. Maybe we pass a delegate to the ctor or something...
          Hide
          Varun Thacker added a comment -

          Sorry for the delay. I have made the changes suggested.

          Show
          Varun Thacker added a comment - Sorry for the delay. I have made the changes suggested.
          Hide
          Michael McCandless added a comment -

          Getting closer!

          Instead of #define LINUX "linux" you can just do #define LINUX.

          Can you rename forcedBufferSize -> mergeBufferSize in all places?
          (And also the default constant FORCED_...).

          I don't think "0" should mean we fallback to BII's default; ie, just
          remove that logic and always use mergeBufferSize, and fix the jdoc.

          How come you're passing true as the readOnly arg to
          open_direct/_normal in createOutput? Seems like this will fail? (Hmm
          did you run tests w/ these changes?).

          Show
          Michael McCandless added a comment - Getting closer! Instead of #define LINUX "linux" you can just do #define LINUX . Can you rename forcedBufferSize -> mergeBufferSize in all places? (And also the default constant FORCED_...). I don't think "0" should mean we fallback to BII's default; ie, just remove that logic and always use mergeBufferSize, and fix the jdoc. How come you're passing true as the readOnly arg to open_direct/_normal in createOutput? Seems like this will fail? (Hmm did you run tests w/ these changes?).
          Hide
          Varun Thacker added a comment -

          Updated patch.

          http://ant.apache.org/manual/Tasks/conditions.html#os

          With this I can get information on the kernel. But the problem is "Linux" is not a separate family attribute. Any pointers on how to go about it so that we can

          Also, we should run some more tests here, on modern kernels, since the SEQUENTIAL fadvise flag is better implemented in kernels >= 2.6.29; see this comment:

          http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html?showComment=0000000000000#c2572106601600642254

          It could be, if kernel is new enough, we should not do DIRECT and instead do SEQUENTIAL.

          Show
          Varun Thacker added a comment - Updated patch. http://ant.apache.org/manual/Tasks/conditions.html#os With this I can get information on the kernel. But the problem is "Linux" is not a separate family attribute. Any pointers on how to go about it so that we can Also, we should run some more tests here, on modern kernels, since the SEQUENTIAL fadvise flag is better implemented in kernels >= 2.6.29; see this comment: http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html?showComment=0000000000000#c2572106601600642254 It could be, if kernel is new enough, we should not do DIRECT and instead do SEQUENTIAL.
          Hide
          Varun Thacker added a comment -

          Hi,

          I think this patch is almost complete. It will be great if anyone can review it.

          Show
          Varun Thacker added a comment - Hi, I think this patch is almost complete. It will be great if anyone can review it.
          Hide
          Varun Thacker added a comment -

          I updated the patch so that it applies cleanly with the current trunk revision.

          ant build-native-unix

          Comiles successfully on a 64 bit Ubuntu Linux machine running kernel version 3.0.0-14-generic.

          ant-DDirectory=NativeUnixDirectory test-core

          Runs successfully.

          Is more testing required before committing it?

          Show
          Varun Thacker added a comment - I updated the patch so that it applies cleanly with the current trunk revision. ant build-native-unix Comiles successfully on a 64 bit Ubuntu Linux machine running kernel version 3.0.0-14-generic. ant-DDirectory=NativeUnixDirectory test-core Runs successfully. Is more testing required before committing it?
          Hide
          Michael McCandless added a comment -

          Hi Varun!

          Thanks for checking back here...

          I ran this directory on Linux and hit a number of test failures, from
          two bugs in the directory impl. First,
          NativeUnixIndexInput.fileLength() failed to include bytes still in the
          buffer; second, on clone, if the current II was positioned at EOF
          and EOF was on a (512 byte) page boundary then we'd hit
          a false IOE. I fixed those...

          I also changed NativeUnixDir to take a delegate Dir which we fall back
          to when we don't want to use direct IO; this enabled removing of
          open_normal since we just use the "normal" java Dir impls for this.

          To run all tests on Linux I do this:

          export CLASSPATH=/lucene/unixdir/lucene/build/contrib/misc/lucene-misc-4.0-SNAPSHOT.jar
          export LD_LIBRARY_PATH=/lucene/unixdir/lucene/build/native:/usr/local/lib
          ant test-core -Dtests.directory=NativeUnixDirectory
          

          (Change the path to your full path... or maybe use relative path but
          that could be dangerous since tests may change the CWD... not sure).

          It also requires a temporary ctor in the directory taking only File
          and using FSDirectory.open for the delegate directory.

          Finally I added another setting, minBytesDirect: if the file to open
          (in openInput), or expected merge size (in createOutput), is smaller
          than this, then we don't use direct IO.

          To test this... I temporarily fixed the dir to always use direct IO,
          and tests (eventually: ~75 minutes!) passed, except for some Solr tests which look
          like timeout problems. I temporarily dropped the default
          buffer size to 4 KB (from 256 KB) else the JVM ran out of direct
          buffer space.

          I'll test on OS X as well...

          Show
          Michael McCandless added a comment - Hi Varun! Thanks for checking back here... I ran this directory on Linux and hit a number of test failures, from two bugs in the directory impl. First, NativeUnixIndexInput.fileLength() failed to include bytes still in the buffer; second, on clone, if the current II was positioned at EOF and EOF was on a (512 byte) page boundary then we'd hit a false IOE. I fixed those... I also changed NativeUnixDir to take a delegate Dir which we fall back to when we don't want to use direct IO; this enabled removing of open_normal since we just use the "normal" java Dir impls for this. To run all tests on Linux I do this: export CLASSPATH=/lucene/unixdir/lucene/build/contrib/misc/lucene-misc-4.0-SNAPSHOT.jar export LD_LIBRARY_PATH=/lucene/unixdir/lucene/build/native:/usr/local/lib ant test-core -Dtests.directory=NativeUnixDirectory (Change the path to your full path... or maybe use relative path but that could be dangerous since tests may change the CWD... not sure). It also requires a temporary ctor in the directory taking only File and using FSDirectory.open for the delegate directory. Finally I added another setting, minBytesDirect: if the file to open (in openInput), or expected merge size (in createOutput), is smaller than this, then we don't use direct IO. To test this... I temporarily fixed the dir to always use direct IO, and tests (eventually: ~75 minutes!) passed, except for some Solr tests which look like timeout problems. I temporarily dropped the default buffer size to 4 KB (from 256 KB) else the JVM ran out of direct buffer space. I'll test on OS X as well...
          Hide
          Michael McCandless added a comment -

          OK tests pass (slowly: 46 minutes) on OS X as well.

          I think this is ready!

          Show
          Michael McCandless added a comment - OK tests pass (slowly: 46 minutes) on OS X as well. I think this is ready!
          Hide
          Michael McCandless added a comment -

          Thanks Varun!

          Show
          Michael McCandless added a comment - Thanks Varun!
          Hide
          Littlestar added a comment -

          /** Default buffer size before writing to disk (256 MB);

          • larger means less IO load but more RAM and direct
          • buffer storage space consumed during merging. */

          public final static int DEFAULT_MERGE_BUFFER_SIZE = 262144;

          in NativeUnixDirectory.java

          Does 256MB mistake ? 262144===256K

          Show
          Littlestar added a comment - /** Default buffer size before writing to disk (256 MB); larger means less IO load but more RAM and direct buffer storage space consumed during merging. */ public final static int DEFAULT_MERGE_BUFFER_SIZE = 262144; in NativeUnixDirectory.java Does 256MB mistake ? 262144===256K
          Hide
          Michael McCandless added a comment -

          Woops, thanks Littlestar: the comment should say 256 KB. I'll fix.

          Show
          Michael McCandless added a comment - Woops, thanks Littlestar: the comment should say 256 KB. I'll fix.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1430216

          LUCENE-2795: fix wrong comment (KB not MB)

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1430216 LUCENE-2795 : fix wrong comment (KB not MB)
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1430214

          LUCENE-2795: fix wrong comment (KB not MB)

          Show
          Commit Tag Bot added a comment - [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1430214 LUCENE-2795 : fix wrong comment (KB not MB)

            People

            • Assignee:
              Varun Thacker
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development