Lucene - Core
LUCENE-753

Use NIO positional read to avoid synchronization in FSIndexInput

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As suggested by Doug, we could use NIO pread to avoid synchronization on the underlying file.
      This could mitigate any MT performance drop caused by reducing the number of files in the index format.
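
      As a rough illustration of the idea (a sketch, not the attached patch; the class and method names below are made up), the change is from a synchronized seek-and-read on one shared RandomAccessFile to an NIO positional read that passes the file offset explicitly:

      import java.io.IOException;
      import java.io.RandomAccessFile;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;

      class PositionalReadSketch {

        // Current approach: all threads share one RandomAccessFile, so every
        // read must lock around the seek + read pair to protect the shared
        // file pointer.
        static void sharedSeekAndRead(RandomAccessFile file, byte[] buf,
                                      long pos, int len) throws IOException {
          synchronized (file) {
            file.seek(pos);
            file.readFully(buf, 0, len);
          }
        }

        // NIO alternative: FileChannel.read(ByteBuffer, long) takes an explicit
        // position and does not touch the channel's own position, so callers
        // need no explicit lock of their own (the JVM/OS may still synchronize
        // internally on some platforms).
        static void positionalRead(FileChannel channel, byte[] buf,
                                   long pos, int len) throws IOException {
          ByteBuffer bb = ByteBuffer.wrap(buf, 0, len);
          while (bb.hasRemaining()) {
            int n = channel.read(bb, pos + bb.position());
            if (n < 0) {
              throw new IOException("read past EOF");
            }
          }
        }
      }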

      Attachments

      1. LUCENE-753.patch
        8 kB
        Michael McCandless
      2. LUCENE-753.patch
        10 kB
        Michael McCandless
      3. LUCENE-753.patch
        13 kB
        Jason Rutherglen
      4. LUCENE-753.patch
        10 kB
        Michael McCandless
      5. LUCENE-753.patch
        11 kB
        Michael McCandless
      6. lucene-753.patch
        9 kB
        Jason Rutherglen
      7. lucene-753.patch
        8 kB
        Jason Rutherglen
      8. FSIndexInput.patch
        2 kB
        Yonik Seeley
      9. FSIndexInput.patch
        7 kB
        Yonik Seeley
      10. FSDirectoryPool.patch
        6 kB
        Michael McCandless
      11. FileReadTest.java
        5 kB
        Yonik Seeley
      12. FileReadTest.java
        5 kB
        Yonik Seeley
      13. FileReadTest.java
        6 kB
        Yonik Seeley
      14. FileReadTest.java
        6 kB
        Michael McCandless
      15. FileReadTest.java
        8 kB
        Yonik Seeley
      16. FileReadTest.java
        8 kB
        Yonik Seeley
      17. FileReadTest.java
        8 kB
        Brian Pinkerton
      18. FileReadTest.java
        8 kB
        Yonik Seeley

        Activity

        Yonik Seeley added a comment -

        Patch for FSIndexInput to use a positional read call that doesn't use explicit synchronization. Note that the implementation of that read call may still involve some synchronization depending on the JVM and OS (notably Windows which lacks a native pread AFAIK).

        Yonik Seeley added a comment -

        This change should be faster on heavily loaded multi-threaded servers using the non-compound index format.
        Performance tests are needed to see if there is any negative impact on single-threaded performance.

        Compound index format (CSIndexInput) still does synchronization because the base IndexInput is not cloned (and hence shared by all CSIndexInput clones). It's unclear if getting rid of the synchronization is worth the cloning overhead in this case.
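
        For readers unfamiliar with the compound-file code, a schematic of the situation described (a sketch using RandomAccessFile directly, not the real CSIndexInput): every per-file slice shares one underlying stream, so each read locks it around the seek + read pair.

        import java.io.IOException;
        import java.io.RandomAccessFile;

        // Schematic only: many "slices" of one compound file share a single
        // underlying stream, so reads must serialize on it.
        class CompoundSliceSketch {
          private final RandomAccessFile base;   // shared by all slices/clones
          private final long sliceOffset;        // where this slice starts

          CompoundSliceSketch(RandomAccessFile base, long sliceOffset) {
            this.base = base;
            this.sliceOffset = sliceOffset;
          }

          void readBytes(byte[] b, int off, int len, long posInSlice) throws IOException {
            synchronized (base) {                // the synchronization in question
              base.seek(sliceOffset + posInSlice);
              base.readFully(b, off, len);
            }
          }
        }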

        Doug Cutting added a comment -

        This patch continues to use BufferedIndexInput and allocates a new ByteBuffer for each call to read(). I wonder if it might be more efficient to instead directly extend IndexInput and always represent the buffer as a ByteBuffer?

        Yonik Seeley added a comment -

        CSIndexInput synchronization could also be eliminated if a pread were added to IndexInput:

        public abstract void readBytes(byte[] b, int offset, int len, long fileposition)

        Unfortunately, that would break any custom Directory based implementations out there, and we can't provide a suitable default with seek & read because we don't know what object to synchronize on.
        Worth it or not???
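
        To make the trade-off concrete, a sketch of what such an addition could look like and why a non-abstract default is awkward (the names and the commented-out default are illustrative, not a proposed API):

        import java.io.IOException;

        // Illustrative only.
        abstract class PositionAwareInputSketch {

          // Existing primitives (simplified).
          abstract void seek(long pos) throws IOException;
          abstract void readBytes(byte[] b, int offset, int len) throws IOException;

          // The pread-style addition. Declaring it abstract breaks every existing
          // custom Directory/IndexInput implementation out there...
          abstract void readBytes(byte[] b, int offset, int len, long filePosition)
              throws IOException;

          // ...while a default built from seek + readBytes is unsafe: the lock
          // would have to guard the underlying file shared by all clones, and
          // this object cannot know what that shared monitor is.
          // void readBytes(byte[] b, int offset, int len, long filePosition)
          //     throws IOException {
          //   synchronized (this) {   // wrong monitor: clones don't share "this"
          //     seek(filePosition);
          //     readBytes(b, offset, len);
          //   }
          // }
        }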

        Yonik Seeley added a comment -

        Here is a patch that directly extends IndexInput to make things a little easier.
        I started with the code for BufferedIndexInput to avoid any bugs in read().
        They share enough code that a common subclass could be factored out if desired (or changes made in BufferedIndexInput to enable easier sharing).

        ByteBuffer does have offset, length, etc, but I did not use them because BufferedIndexInput currently allocates the byte[] on demand, and thus would add additional checks to readByte(). Also, the NIO Buffer.get() isn't as efficient as our own array access.
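
        A rough sketch of the refill path being described, keeping a plain byte[] for readByte() and wrapping it in a ByteBuffer only for the positional channel read (a simplification, not the attached patch; it skips the allocate-on-demand detail):

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        class NioRefillSketch {
          private final FileChannel channel;
          private final byte[] buffer;   // plain array keeps readByte() a simple index
          private long bufferStart;      // file position of buffer[0]
          private int bufferLength;      // valid bytes currently in buffer
          private int bufferPosition;    // next byte to hand out

          NioRefillSketch(FileChannel channel, int bufferSize) {
            this.channel = channel;
            this.buffer = new byte[bufferSize];
          }

          byte readByte() throws IOException {
            if (bufferPosition >= bufferLength) {
              refill();
            }
            return buffer[bufferPosition++];
          }

          private void refill() throws IOException {
            long filePos = bufferStart + bufferLength;
            // Wrap the heap array just for the positional read: no lock on the
            // channel, and no per-call byte[] allocation.
            ByteBuffer bb = ByteBuffer.wrap(buffer);
            int n = channel.read(bb, filePos);
            if (n <= 0) {
              throw new IOException("read past EOF");
            }
            bufferStart = filePos;
            bufferLength = n;
            bufferPosition = 0;
          }
        }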

        Bogdan Ghidireac added a comment -

        You can find a NIO variation of IndexInput attached to this issue: http://issues.apache.org/jira/browse/LUCENE-519

        I had good results on multiprocessor machines under heavy load.

        Regards,
        Bogdan

        Yonik Seeley added a comment -

        Thanks for the pointer Bogdan, it's interesting you use transferTo instead of read... is there any advantage to this? You still need to create a new object every read(), but at least it looks like a smaller object.

        It's also been pointed out to me that http://issues.apache.org/jira/browse/LUCENE-414 has some more NIO code.

        Bogdan Ghidireac added a comment -

        The Javadoc says that transferTo can be more efficient because the OS can transfer bytes directly from the filesystem cache to the target channel without actually copying them.
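
        For readers unfamiliar with that API: transferTo moves bytes from a FileChannel into a WritableByteChannel, so reading into a byte[] needs a small adapter, roughly like this (a sketch, not the LUCENE-519 code). As noted just below, the zero-copy path only applies when the target is another FileChannel or a direct buffer, so an adapter like this still copies.

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.channels.WritableByteChannel;

        // Adapter that lets transferTo "write" into a plain byte[].
        final class ByteArraySink implements WritableByteChannel {
          private final byte[] dest;
          private int written;

          ByteArraySink(byte[] dest) { this.dest = dest; }

          public int write(ByteBuffer src) {
            int n = Math.min(src.remaining(), dest.length - written);
            src.get(dest, written, n);   // copy out of the source buffer
            written += n;
            return n;
          }

          public boolean isOpen() { return true; }
          public void close() {}
        }

        // Usage sketch: read len bytes starting at pos into buf (len <= buf.length).
        //   ByteArraySink sink = new ByteArraySink(buf);
        //   long done = 0;
        //   while (done < len) {
        //     done += channel.transferTo(pos + done, len - done, sink);
        //   }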

        Yonik Seeley added a comment -

        > The Javadoc says that transferTo can be more efficient because the OS can transfer bytes
        > directly from the filesystem cache to the target channel without actually copying them.

        Unfortunately, only for DirectByteBuffers and other FileChannels, not for HeapByteBuffers.
        Sounds like we just need to do some benchmarking, but I have a bad feeling that all the checking overhead Sun added to NIO will cause it to be slower in the single threaded case.

        Yonik Seeley added a comment -

        Attaching test that reads a file in different ways, either random access or serially, from a number of threads.

        Yonik Seeley added a comment -

        Single-threaded random access performance of a fully cached 64MB file on my home PC (WinXP), Java 6:

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
        answer=81332126, ms=7781, MB/sec=167.5603649916463

        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
        answer=81332126, ms=9203, MB/sec=141.66980332500273

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
        answer=81332126, ms=11672, MB/sec=111.70212474297463

        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
        answer=81332126, ms=17328, MB/sec=75.2416435826408

        Brian Pinkerton added a comment -

        Most of my workloads would benefit by removing the synchronization in FSIndexInput, so I took a closer look at this issue. I found exactly the opposite results that Yonik did on two platforms that I use frequently in production (Solaris and Linux), and by a significant margin. I even get the same behavior on the Mac, though I'm not running Java6 there.

        1. uname -a
          Linux xxx 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:27 EDT 2005 i686 i686 i386 GNU/Linux
        2. java -version
          java version "1.6.0_02"
          Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
          Java HotSpot(TM) Client VM (build 1.6.0_02-b05, mixed mode, sharing)

        config: impl=ChannelPread serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760
        answer=0, ms=88543, MB/sec=236.85124741650947
        config: impl=ClassicFile serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760
        answer=0, ms=150560, MB/sec=139.29011689691816

        1. uname -a
          SunOS xxx 5.10 Generic_118844-26 i86pc i386 i86pc
        2. java -version
          java version "1.6.0"
          Java(TM) SE Runtime Environment (build 1.6.0-b105)
          Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)

        config: impl=ChannelPread serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760
        answer=0, ms=39621, MB/sec=529.3031473208652

        config: impl=ClassicFile serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760
        answer=0, ms=119057, MB/sec=176.14688762525515

        Yonik Seeley added a comment -

        Brad, one possible difference is the number of threads we tested with.
        I tested single-threaded (nThreads=1) to see what kind of slowdown a single query might see.

        A normal production system shouldn't see 200 concurrent running search threads unless it's just about to fall over, or unless it's one of those massive multi-core systems. After you pass a certain amount of parallelism, NIO can help.

        Brian Pinkerton added a comment -

        Whoops; I should have paid more attention to the args. The results in the single-threaded case still favor pread, but by a slimmer margin:

        Linux:

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
        answer=0, ms=9983, MB/sec=210.0723229490133

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
        answer=0, ms=9247, MB/sec=226.7926895209257

        Solaris 10:

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
        answer=0, ms=7381, MB/sec=284.12843788104595

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
        answer=0, ms=6245, MB/sec=335.81297037630105

        Mac OS X:

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
        answer=-914995, ms=19945, MB/sec=105.14675357232389

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
        answer=-914995, ms=26378, MB/sec=79.50382894836606

        Doug Cutting added a comment -

        > Brad, [...]

        That's Brian. And right, the difference in your tests is the number of threads.

        Perhaps this is a case where one size will not fit all. MmapDirectory is fastest on 64-bit platforms with lots of threads, while good-old-FSDirectory has always been fastest for single-threaded access. Perhaps a PreadDirectory would be the Directory of choice for multi-threaded access of large indexes on 32-bit hardware? It would be useful to benchmark this patch against MmapDirectory, since they both remove synchronization.

        Doug Cutting added a comment -

        My prior remarks were posted before I saw Brian's latest benchmarks.

        While it would still be good to throw mmap into the mix, pread now looks like a strong contender for the one that might beat all. It works well on 32-bit hardware, it's unsynchronized, and it's fast. What's not to like?

        Yonik Seeley added a comment -

        Weird... I'm still getting slower results from pread on WinXP.
        Can someone else verify on a windows box?

        Yonik@spidey ~
        $ c:/opt/jdk16/bin/java -server FileReadTest testfile ClassicFile false 1 200
        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
        answer=160759732, ms=14984, MB/sec=128.35024025627337
        
        $ c:/opt/jdk16/bin/java -server FileReadTest testfile ClassicFile false 1 200
        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
        answer=160759732, ms=14640, MB/sec=131.36612021857923
        
        
        $ c:/opt/jdk16/bin/java -server FileReadTest testfile ChannelPread false 1 200
        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
        answer=160759732, ms=21766, MB/sec=88.35798952494717
        
        $ c:/opt/jdk16/bin/java -server FileReadTest testfile ChannelPread false 1 200
        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
        answer=160759732, ms=21718, MB/sec=88.55327378211622
        
        
        $ c:/opt/jdk16/bin/java -version
        java version "1.6.0_02"
        Java(TM) SE Runtime Environment (build 1.6.0_02-b06)
        Java HotSpot(TM) Client VM (build 1.6.0_02-b06, mixed mode)
        
        robert engels added a comment -

        I sent this via email, but probably need to add to the thread...

        I posted a bug on this to Sun a long while back. This is a KNOWN BUG on Windows.

        NIO preads actually sync behind the scenes on some platforms. Using multiple file descriptors is much faster.

        See bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734

        Doug Cutting added a comment -

        So it looks like pread is ~50% slower on Windows, and ~5-25% faster on other platforms. Is that enough of a difference that it might be worth having FSDirectory use different implementations of FSIndexInput based on the value of Constants.WINDOWS (and perhaps JAVA_VERSION)?
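
        A sketch of that kind of platform-conditional default (the enum stands in for the two FSIndexInput variants; FSDirectory itself would consult Constants.WINDOWS rather than the os.name property used here):

        // Sketch only: the concrete FSIndexInput classes are represented by an
        // enum so the selection logic stands alone.
        class PlatformDefaultSketch {
          enum InputFlavor { CLASSIC_SYNCHRONIZED, NIO_PREAD }

          static InputFlavor defaultFlavor() {
            boolean windows = System.getProperty("os.name").startsWith("Windows");
            // From the benchmarks above: pread is ~50% slower on Windows but
            // ~5-25% faster elsewhere, so the out-of-the-box default flips.
            return windows ? InputFlavor.CLASSIC_SYNCHRONIZED : InputFlavor.NIO_PREAD;
          }
        }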

        Michael McCandless added a comment -

        Is that enough of a difference that it might be worth having FSDirectory use different implementations of FSIndexInput based on the value of Constants.WINDOWS (and perhaps JAVA_VERSION)?

        +1

        I think having good out-of-the-box defaults is extremely important (most users won't tune), and given the substantial cross platform differences here I think we should conditionalize the defaults according to the platform.

        robert engels added a comment -

        As an aside, if the Lucene people voted on the Java bug (and or sent emails via the proper channels), hopefully the underlying bug can be fixed in the JVM.

        In my opinion it is a serious one - ruins any performance gains of using NIO on files.

        Yonik Seeley added a comment -

        Updated test that fixes some thread synchronization issues to ensure that the "answer" is the same for all methods.

        Brian, in some of your tests the answer is "0"... is this because your test file consists of zeros (created via /dev/zero or equiv)? UNIX systems treat blocks of zeros differently than normal files (they are stored as holes). It shouldn't make too much of a difference in this case, but just to be sure, could you try with a real file?
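
        For anyone reproducing this, one quick way to build such a file (a sketch; "testfile" is the name used in the command lines shown elsewhere in the thread):

        import java.io.FileOutputStream;
        import java.io.IOException;
        import java.util.Random;

        // Write 64 MB of random bytes so no filesystem hole/compression tricks apply.
        class MakeRandomTestFile {
          public static void main(String[] args) throws IOException {
            byte[] block = new byte[1 << 20];
            Random rnd = new Random(42);
            FileOutputStream out = new FileOutputStream("testfile");
            try {
              for (int i = 0; i < 64; i++) {
                rnd.nextBytes(block);
                out.write(block);
              }
            } finally {
              out.close();
            }
          }
        }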

        Brian Pinkerton added a comment -

        Yeah, the file was full of zeroes. But I created the files w/o holes and was using filesystems that don't compress file contents. Just to be sure, though, I repeated the tests with a file with random contents; the results above still hold.

        Brian Pinkerton added a comment -

        BTW, I think the performance win with Yonik's patch for some workloads could be far greater than what the simple benchmark illustrates. Sure, pread might be marginally faster. But the real win is avoiding synchronized access to the file.

        I did some IO tracing a while back on one particular workload that is characterized by:

        • a small number of large compound indexes
        • short average execution time, particularly compared to disk response time
        • a 99+% FS cache hit rate
        • cache misses that tend to cluster on rare queries

        In this workload where each query hits each compound index, the locking in FSIndexInput means that a single rare query clobbers the response time for all queries. The requests to read cached data are serialized (fairly, even) with those that hit the disk. While we can't get rid of the rare queries, we can allow the common ones to proceed against cached data right away.

        Michael McCandless added a comment -

        I ran Yonik's most recent FileReadTest.java on the platforms below,
        testing single-threaded random access for fully cached 64 MB file.

        I tested two Windows XP Pro machines and got opposite results from
        Yonik. Yonik, is your machine XP Home?

        I'm showing ChannelTransfer to be much faster on all platforms except
        Windows Server 2003 R2 Enterprise x64 where it's about the same as
        ChannelPread and ChannelFile.

        The ChannelTransfer test is giving the wrong checksum, but I think
        that's just a bug in how the checksum is computed (it's using "len",
        which with ChannelTransfer is just the chunk size written on each call
        to write). So I think the MB/sec is still correct.

        Mac OS X 10.4 (Sun java 1.5)
        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=32565, MB/sec=412.15331797942576

        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=19512, MB/sec=687.8727347273473

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=19492, MB/sec=688.5785347835009

        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=147783, ms=16009, MB/sec=838.3892060715847

        Linux 2.6.22.1 (Sun java 1.5)
        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=37879, MB/sec=354.33281765622115

        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=21845, MB/sec=614.4093751430535

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=21902, MB/sec=612.8103734818737

        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=147783, ms=15978, MB/sec=840.015821754913

        Windows Server 2003 R2 Enterprise x64 (Sun java 1.6)

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=32703, MB/sec=410.4141149130049

        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=23344, MB/sec=574.9559972583961

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=23329, MB/sec=575.3256804835183

        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=147783, ms=23422, MB/sec=573.0412774314747

        Windows XP Pro SP2, laptop (Sun Java 1.5)

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=71253, MB/sec=188.36782731955148

        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=57463, MB/sec=233.57243443607192

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=58043, MB/sec=231.23844046655068

        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=147783, ms=20039, MB/sec=669.7825640001995

        Windows XP Pro SP2, older desktop (Sun Java 1.6)

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=53047, MB/sec=253.01662299470283

        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=34047, MB/sec=394.2130819161747

        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=-44611, ms=34078, MB/sec=393.8544750278772

        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
        answer=147783, ms=18781, MB/sec=714.6463340610192

        Michael McCandless added a comment -

        I also just ran a test with 4 threads, random access, on Linux 2.6.22.1:

        config: impl=ClassicFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
        answer=-195110, ms=120856, MB/sec=444.22363142913883

        config: impl=ChannelFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
        answer=-195110, ms=88272, MB/sec=608.2006887801341

        config: impl=ChannelPread serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
        answer=-195110, ms=77672, MB/sec=691.2026367288084

        config: impl=ChannelTransfer serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
        answer=594875, ms=38390, MB/sec=1398.465517061735

        ChannelTransfer got even faster (scales up with added threads better).

        Yonik Seeley added a comment -

        Mike, it looks like you are running with a bufsize of 6.5MB!
        Apologies for my hard-to-use benchmark program

        Yonik Seeley added a comment -

        I'll try fixing the transferTo test before anyone re-runs any tests.

        Michael McCandless added a comment -

        Doh!! Woops I will rerun...

        Yonik Seeley added a comment -

        OK, uploading latest version of the test that should fix ChannelTransfer (it's also slightly optimized to not create a new object per call).

        Well, at least we've learned that printing out all the input params for benchmarking programs is good practice

        Michael McCandless added a comment -

        Thanks! I'll re-run.

        Well, at least we've learned that printing out all the input params for benchmarking programs is good practice

        Yes indeed

        Michael McCandless added a comment -

        OK my results on Win XP now agree with Yonik's.

        On UNIX & OS X, ChannelPread is a bit (2-14%) better, but on Windows
        it's quite a bit (31-34%) slower.

        Win Server 2003 R2 Enterprise x64 (Sun Java 1.6):

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=68094, MB/sec=197.10654095808735
        
        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=72594, MB/sec=184.88818359644048
        
        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=98328, MB/sec=136.5000081360345
        
        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=201563, MB/sec=66.58847506734867
        

        Win XP Pro SP2, laptop (Sun Java 1.5):

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=47449, MB/sec=282.8673481000653
        
        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=54899, MB/sec=244.4811890926975
        
        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=71683, MB/sec=187.237877878995
        
        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=149475, MB/sec=89.79275999330991
        

        Linux 2.6.22.1 (Sun Java 1.5):

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=41162, MB/sec=326.0719304212623
        
        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=53114, MB/sec=252.69745829724744
        
        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=40226, MB/sec=333.65914582608264
        
        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=59163, MB/sec=226.86092321214272
        

        Mac OS X 10.4 (Sun Java 1.5):

        config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=85894, MB/sec=156.25972477705076
        
        config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=109939, MB/sec=122.08381738964336
        
        config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=75517, MB/sec=177.73180608339845
        
        config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
        answer=110480725, ms=130156, MB/sec=103.12066136021389
        
        Nat added a comment -

        I think bufsize has a much bigger impact than the implementation. I found that a 64KB buffer size is at least 5-6 times faster than 1KB. Should we tune this parameter instead for maximum performance?

        Jason Rutherglen added a comment -

        lucene-753.patch

        Made NIOFSDirectory that inherits from FSDirectory and includes the patch.

        Michael McCandless added a comment -

        Carrying forward from this thread:

        http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200806.mbox/%3C85d3c3b60806240501y96d3637r72b2181fa829fa00@mail.gmail.com%3E

        Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

        After thinking more about the pool of RandomAccessFiles I think
        LUCENE-753 is the best solution. I am not sure how much work nor if
        pool of RandomAccessFiles creates more synchronization problems and if
        it is only to benefit windows, does not seem worthwhile.

        It wasn't clear to me that pread would in fact perform better than
        letting each thread use its own private RandomAccessFile.

        So I modified (attached) FileReadTest.java to add a new SeparateFile
        implementation, which opens a private RandomAccessFile per-thread and
        then just does "classic" seeks & reads on that file. Then I ran the
        test on 3 platforms (results below), using 4 threads.

        The results are very interesting – using SeparateFile is always
        faster, especially so on WinXP Pro (115% faster than the next fastest,
        ClassicFile) but also surprisingly so on Linux (44% faster than the
        next fastest, ChannelPread). On Mac OS X it was 5% faster than
        ChannelPread. So on all platforms it's faster, when using multiple
        threads, to use separate files.

        I don't have a Windows server class machine readily accessible so if
        someone could run on such a machine, and run on other machines
        (Solaris) to see if these results are reproducible, that'd be great.

        This is a strong argument for some sort of pooling of
        RandomAccessFiles under FSDirectory, though the counter balance is
        clearly added complexity. I think if we combined the two approaches
        (use separate RandomAccessFile objects per thread as managed by a
        pool, and then use the best mode (classic on Windows & channel pread
        on all others)) we'd likely get the best performance yet.

        Mac OS X 10.5.3, single WD Velociraptor hard drive, Sun JRE 1.6.0_05

        
        config: impl=ClassicFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=151884, MB/sec=176.73715203708093
        
        config: impl=SeparateFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=97820, MB/sec=274.4177632386015
        
        config: impl=ChannelPread serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=103059, MB/sec=260.4677476008888
        
        config: impl=ChannelFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=176250, MB/sec=152.30380482269504
        
        config: impl=ChannelTransfer serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=365904, MB/sec=73.36226332589969
        
        

        Linux 2.6.22.1, 6-drive RAID 5 array, Sun JRE 1.6.0_06

        
        config: impl=ClassicFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=75592, MB/sec=355.1109323737962
        
        config: impl=SeparateFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=35505, MB/sec=756.0497282072947
        
        config: impl=ChannelPread serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=51075, MB/sec=525.5711326480665
        
        config: impl=ChannelFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=95640, MB/sec=280.6727896277708
        
        config: impl=ChannelTransfer serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=93711, MB/sec=286.45031639828835
        
        

        WIN XP PRO, laptop, Sun JRE 1.4.2_15:

        
        config: impl=ClassicFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=135349, MB/sec=198.32836297275932
        
        config: impl=SeparateFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=62970, MB/sec=426.2910211211688
        
        config: impl=ChannelPread serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=174606, MB/sec=153.73781886074937
        
        config: impl=ChannelFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=152171, MB/sec=176.4038193873997
        
        config: impl=ChannelTransfer serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
        answer=-23909200, ms=275603, MB/sec=97.39932293915524
        
        
        Jason Rutherglen added a comment -

        Interesting results. The question would be: what would the algorithm for allocating RandomAccessFiles to files look like? When would a new file be opened, and when would one be closed? If it is based on usage, would it be based on the rate of calls to readInternal? This seems like an OS/filesystem topic for which there may be a standard algorithm. How would the pool avoid the same synchronization issue, given the default small buffer size of 1024? If there are 30 threads executing searches, there will not be 30 RandomAccessFiles per file, so there is still contention over the limited number of RandomAccessFiles allocated.
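
        One possible shape for such a pool, purely as a sketch of the design space these questions describe (pool size, hand-out policy, and names are arbitrary). The caller still has to synchronize the seek + read pair on whichever descriptor it gets back, which is exactly the residual contention mentioned above.

        import java.io.File;
        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.util.ArrayList;
        import java.util.List;

        // Hypothetical per-file pool: a bounded set of RandomAccessFiles handed
        // out round-robin. With more threads than descriptors, contention remains,
        // just spread across maxSize monitors instead of one.
        class RandomAccessFilePoolSketch {
          private final List<RandomAccessFile> files = new ArrayList<RandomAccessFile>();
          private final File path;
          private final int maxSize;
          private int next;

          RandomAccessFilePoolSketch(File path, int maxSize) {
            this.path = path;
            this.maxSize = maxSize;
          }

          synchronized RandomAccessFile acquire() throws IOException {
            if (files.size() < maxSize) {
              files.add(new RandomAccessFile(path, "r"));   // open lazily up to the cap
            }
            next = (next + 1) % files.size();
            return files.get(next);
          }

          synchronized void close() throws IOException {
            for (RandomAccessFile f : files) {
              f.close();
            }
            files.clear();
          }
        }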

        Yonik Seeley added a comment -

        Added a PooledPread impl to FileReadTest, but at least on Windows it always seems slower than non-pooled. I suppose it might be because of the extra synchronization.

        Michael McCandless added a comment -

        I think you have a small bug – minCount is initialized to 0 but should be something effectively infinite instead?
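
        The fix Mike is suggesting is presumably the usual min-scan initialization slip; in rough form (the method name and the useCounts array are stand-ins, not the actual FileReadTest code), choosing the least-used pooled descriptor should start from an effectively infinite count:

        // useCounts is a stand-in for the pool's per-descriptor usage counters.
        static int leastUsed(int[] useCounts) {
            int minCount = Integer.MAX_VALUE;   // starting at 0 means nothing is ever "less used"
            int minIndex = -1;
            for (int i = 0; i < useCounts.length; i++) {
                if (useCounts[i] < minCount) {
                    minCount = useCounts[i];
                    minIndex = i;
                }
            }
            return minIndex;
        }
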

        Yonik Seeley added a comment -

        Thanks Mike, after the bug is fixed, PooledPread is now faster on Windows when more than 1 thread is used.

        Michael McCandless added a comment -

        OK I re-ran only PooledPread, SeparateFile and ChannelPread since they
        are the "leading contenders" on all platforms.

        Also, I changed to serial=false.

        Now the results are very close on all but Windows, but on Windows I'm
        seeing the opposite of what Yonik saw: PooledPread is slowest, and
        SeparateFile is fastest. But this is a laptop (Win XP Pro), and it is
        JRE 1.4. Also I ran with pool size == number of threads == 4.

        Mac OS X 10.5.3, single WD Velociraptor hard drive, Sun JRE 1.6.0_05

        config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=120807, MB/sec=222.20190551871994
        
        config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830326, ms=119641, MB/sec=224.36744594244448
        
        config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=119217, MB/sec=225.1654176837196
        

        Linux 2.6.22.1, 6-drive RAID 5 array, Sun JRE 1.6.0_06

        config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=52613, MB/sec=510.2074696367818
        
        config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=52715, MB/sec=509.22025230010433
        
        config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=53792, MB/sec=499.0248661511005
        

        WIN XP PRO, laptop, Sun JRE 1.4.2_15:

        config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=209956, MB/sec=127.85319590771401
        
        config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=89101, MB/sec=301.27098012367986
        
        config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
        answer=-23830370, ms=184087, MB/sec=145.81988733587923
        
        Brian Pinkerton added a comment - - edited

        I was curious about the discrepancy between the ChannelPread implementation and the SeparateFile implementation that Yonik saw. At least on Mac OS X, the kernel implementation of read is virtually the same as pread, so there shouldn't be any appreciable performance difference unless the VM is doing something funny. Sure enough, the implementations of read() under RandomAccessFile and read() under FileChannel are totally different. The former relies on a buffer allocated either on the stack or by malloc, while the latter allocates a native buffer and copies the results to the original array.

        Switching to a native buffer in the benchmark yields identical results for ChannelPread and SeparateFile on 1.5 and 1.6 on OS X. I'm attaching an implementation of ChannelPreadDirect that uses a native buffer.

        This may be a moot point because any implementation inside Lucene needs to consume a byte[] and not a ByteBuffer, but at least it's informative.
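
        To make the difference concrete: the heap-buffer path copies through the VM's internal native buffer on every call, while the direct-buffer path reads natively and then copies once into the caller's array. A sketch under those assumptions (not the attached benchmark code):

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        class PreadBuffers {
            // ChannelPread-style: wrap the caller's byte[]; the JVM copies the
            // data through a temporary native buffer on every call.
            static int readHeap(FileChannel ch, long pos, byte[] b, int off, int len) throws IOException {
                return ch.read(ByteBuffer.wrap(b, off, len), pos);
            }

            // ChannelPreadDirect-style: read into a pre-allocated direct buffer
            // (assumed per-thread, capacity >= len), then copy once into the byte[].
            static int readDirect(FileChannel ch, ByteBuffer direct, long pos, byte[] b, int off, int len) throws IOException {
                direct.clear();
                direct.limit(len);
                int n = ch.read(direct, pos);
                if (n > 0) {
                    direct.flip();
                    direct.get(b, off, n);
                }
                return n;
            }
        }
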

        Yonik Seeley added a comment -

        Here are some of my results with 4 threads and a pool size of 4 fds per file. Win XP on a Pentium4 w/ Java5_0_11 -server

        config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
        answer=322211190, ms=51891, MB/sec=74.12460735002217
        
        config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
        answer=322211190, ms=71175, MB/sec=54.04144713733755
        
        config: impl=ClassicFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
        answer=322211190, ms=62699, MB/sec=61.34707092617107
        
        config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
        answer=322211410, ms=21324, MB/sec=180.37891577565185
        
        Michael McCandless added a comment -

        OK it's looking like SeparateFile is the best overall choice... it
        matches the best performance on Unix platforms and is very much the
        lead on Windows.

        It's somewhat surprising to me that after all this time, with these
        new IO APIs, the most naive approach (using a separate
        RandomAccessFile per thread) still yields the best performance. In
        fact, opening multiple IndexReaders to gain concurrency is doing just
        this.

        Of course this is a synthetic benchmark. Actual IO with Lucene is
        somewhat different. EG it's a mix of serial (when iterating through a
        term's docs with no skipping) and somewhat random access (when
        retrieving term vectors or stored fields), and presumably a mix of
        hits & misses to the OS's IO cache. So until we try this out with a
        real index and real queries we won't know for sure.

        The question would be, what would the algorithm for allocating
        RandomAccessFiles to which file look like?

        Ideally it would be based roughly on contention. EG a massive CFS
        file in your index should have a separate file per-thread, if there
        are not too many threads, whereas tiny CFS files in the index likely
        could share / synchronize on a single file.

        I think it would have thread affinity (if the same thread wants the
        same file we give back the same RandomAccessFile that thread last
        used, if it's available).

        When would a new file open, when would a file be closed?

        I think this should be reference counting. The first time Lucene
        calls FSDirectory.openInput on a given name, we must for-real open the
        file (Lucene relies on OS protecting open files). Further opens on
        that file incRef it. Closes decRef it and when the reference count
        gets to 0 we close it for real.
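
        A small sketch of the reference-counting half of that idea (names are hypothetical; nothing like this exists in FSDirectory yet):

        import java.io.File;
        import java.io.IOException;
        import java.io.RandomAccessFile;

        // One pool entry per index file name: the first openInput() really opens
        // the descriptor, later opens just incRef it, and the file is only truly
        // closed once the count falls back to zero.
        class PooledFileEntry {
            private final RandomAccessFile file;
            private int refCount = 1;

            PooledFileEntry(File path) throws IOException {
                file = new RandomAccessFile(path, "r");
            }

            synchronized void incRef() {
                refCount++;
            }

            synchronized void decRef() throws IOException {
                if (--refCount == 0) {
                    file.close();
                }
            }
        }
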

        If it is based on usage would it be based on the rate of calls to
        readInternal?

        Fortunately, Lucene tends to call IndexInput.clone() when it wants to
        actively make use of a file.

        So I think the pool could work something like this: FSIndexInput.clone
        would "check out" a file from the pool. The pool decides at that
        point to either return a SharedFile (which has locking per-read, like
        we do now), or a PrivateFile (which has no locking because you are the
        only thread currently using that file), based on some measure of
        contention plus some configuration of the limit of allowed open files.
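
        In rough code, that checkout might have the shape below. The pool, the Handle wrapper, and the contended flag are all hypothetical, and the return/close bookkeeping is omitted, so this is a sketch of the idea rather than a usable pool:

        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.util.HashMap;
        import java.util.Map;

        // FSIndexInput.clone() would call checkOut(): a private handle may be read
        // without locking, while the shared handle keeps the per-read
        // synchronization FSIndexInput uses today.
        class DescriptorPool {
            static final class Handle {
                final RandomAccessFile file;
                final boolean isPrivate;   // true => caller may read without sync
                Handle(RandomAccessFile file, boolean isPrivate) {
                    this.file = file;
                    this.isPrivate = isPrivate;
                }
            }

            private final int maxOpenFiles;   // assumed global open-file budget
            private final Map<String, RandomAccessFile> shared = new HashMap<String, RandomAccessFile>();
            private int openFiles = 0;

            DescriptorPool(int maxOpenFiles) {
                this.maxOpenFiles = maxOpenFiles;
            }

            synchronized Handle checkOut(String name, boolean contended) throws IOException {
                RandomAccessFile base = shared.get(name);
                if (base == null) {
                    base = new RandomAccessFile(name, "r");   // the for-real open
                    shared.put(name, base);
                    openFiles++;
                }
                if (contended && openFiles < maxOpenFiles) {
                    openFiles++;
                    return new Handle(new RandomAccessFile(name, "r"), true);
                }
                return new Handle(base, false);
            }
        }
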

        One problem with this approach is I'm not sure clones are always
        closed, since they are currently very lightweight and can rely on GC
        to reclaim them.

        An alternative approach would be to sync() on every block read (1024
        bytes by default now), find a file to use, and use it, but I fear that
        will have poor performance.

        In fact, if we build this pool, we can again try all these alternative
        IO APIs, maybe even leaving that choice to the Lucene user as
        "advanced tuning".

        robert engels added a comment -

        As I stated quite a while ago, this has been a long-accepted bug in the JDK.

        See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734

        It was filed and accepted over 3 years ago.

        The problem is that the pread takes an unnecessary lock on the file descriptor, instead of using Windows "overlapped" reads.

        robert engels added a comment -

        The point being - please vote for this issue so it can be fixed properly. It is really a trivial fix, but it needs to be done by SUN.

        Yonik Seeley added a comment -

        OK it's looking like SeparateFile is the best overall choice... it matches the best performance on Unix platforms and is very much the lead on Windows.

        The other implementations are fully-featured though (they could be used in Lucene w/ extra synchronization, etc). SeparateFile (opening a new file descriptor per reader) is not a real implementation that could be used... it's more of a theoretical maximum IMO. Also remember that you can't open a new fd on demand since the file might already be deleted. We would need a real PooledClassicFile implementation (like PooledPread).

        On non-Windows it looks like ChannelPread is probably the right choice... near max performance and min fd usage.

        Jason Rutherglen added a comment -

        Core2Duo, Windows XP, JDK 1.5.15. With PooledPread at 4 threads and a pool size of 2, the performance does not compare well to SeparateFile. PooledPread with 30 threads does not improve appreciably over ClassicFile. If there were 30 threads, how many RandomAccessFiles would there need to be to make a noticeable impact? The problem I see with the pooled implementation is setting the global file descriptor limit properly; will the user set this? There would almost need to be a native check to see whether what the user is trying to do is possible.

        The results indicate there is significant contention in the pool code. The previous tests used a pool size equal to the number of threads, which is probably not how most production systems look, at least not the SOLR installations I've worked on. In SOLR the web request thread is the thread that executes the search, so the number of threads is determined by the J2EE server, which can be quite high. Unless the assumption is that the system is set for an unusually high number of file descriptors.

        There should probably be an MMapDirectory test as well.

        config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
        answer=53797223, ms=32715, MB/sec=221.4329573590096
        
        config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
        answer=53797223, ms=18687, MB/sec=387.6587574249478
        
        config: impl=SeparateFile serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
        answer=403087371, ms=137871, MB/sec=394.0737646060448
        
        config: impl=PooledPread serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
        answer=403087487, ms=526504, MB/sec=103.19265190767781
        
        config: impl=ChannelPread serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
        answer=403087487, ms=624291, MB/sec=87.02887595688549
        
        config: impl=ClassicFile serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
        answer=403087487, ms=587430, MB/sec=92.48990347786119
        
        config: impl=PooledPread serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=4 filelen=18110448
        answer=403087487, ms=552971, MB/sec=98.25351419875544
        
        Michael McCandless added a comment -

        SeparateFile (opening a new file descriptor per reader) is not a real implementation that could be used... it's more of a theoretical maximum IMO. Also remember that you can't open a new fd on demand since the file might already be deleted. We would need a real PooledClassicFile implementation (like PooledPread).

        True, we'd have to make a real pool, but I'd think we want the sync() to be on cloning and not on every read. I think Lucene's usage of the open files (clones are made & used up quickly and closed) would work well with that approach. I think at this point we should build out an underlying pool and then test all of these different approaches under the pool.

        And yes we cannot just open a new fd on demand if the file has been deleted. But I'm thinking that may not matter in practice. Ie if the pool wants to open a new fd, it can attempt to do so, and if the file was deleted it must then return a shared access wrapper to the fd it already has open. Large segments are where the contention will be and large segments are not often deleted. Plus people tend to open new readers if such a large change has taken place to the index.

        On non-Windows it looks like ChannelPread is probably the right choice... near max performance and min fd usage.

        But on Linux I saw 44% speedup for serial=true case with 4 threads using SeparateFile vs ChannelPread, which I was very surprised by. But then again it's synthetic so it may not matter in real Lucene searches.

        Jason Rutherglen added a comment -

        lucene-753.patch

        Added javadoc and removed unnecessary NIOFSIndexOutput class.

        Yonik Seeley added a comment -

        (clones are made & used up quickly and closed)

        IIRC, clones are often not closed at all.
        And for term expanding queries, you can get a lot of them all at once.

        And yes we cannot just open a new fd on demand if the file has been deleted. But I'm thinking that may not matter in practice. Ie if the pool wants to open a new fd, it can attempt to do so, and if the file was deleted it must then return a shared access wrapper to the fd it already has open.

        At first blush, sounds a bit too complex for the benefits.

        • one would have to reserve the last fd for synchronized access... can't really hand it out for unsynchronized exclusive access and then go and share it later.
        • the shared access should use pread... not seek+read

        But on Linux I saw 44% speedup for serial=true case with 4 threads using SeparateFile vs ChannelPread, which I was very surprised by.

        In the serial case, there are half the system calls (no seek). When both implementations make a single system call, all the extra code and complexity that Sun threw into FileChannel comes into play. Compare that with RandomAccessFile.read(), which drops down to a native call and presumably does just the read with little overhead. I wish Sun would just add a RandomAccessFile.read with a file position.

        If access will be truly serial sometimes, larger buffers would help with that larger read() setup cost.
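
        Side by side, the system-call difference being described looks roughly like this (a sketch only; the positional variant is essentially what the ChannelPread implementation in FileReadTest does):

        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.nio.ByteBuffer;

        class SeekVsPread {
            // Two calls into the OS, and the descriptor's file position is shared
            // mutable state -- hence the synchronization in FSIndexInput today.
            static int seekAndRead(RandomAccessFile raf, long pos, byte[] buf) throws IOException {
                synchronized (raf) {
                    raf.seek(pos);
                    return raf.read(buf, 0, buf.length);
                }
            }

            // One call, no shared position to protect (though the JVM may still
            // lock internally on Windows, per the Sun bug mentioned above).
            static int pread(RandomAccessFile raf, long pos, byte[] buf) throws IOException {
                return raf.getChannel().read(ByteBuffer.wrap(buf), pos);
            }
        }
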

        Michael McCandless added a comment -

        And for term expanding queries, you can get a lot of them all at once.

        Right, but that'd all be under one thread, right? The pool would always give back the same RandomAccessFile (private or shared) for the same filename and thread combination.

        one would have to reserve the last fd for synchronized access... can't really hand it out for unsynchronized exclusive access and then go and share it later.

        Well, I think you'd hand it out first, as a shared file (so you reserve the right to hand it out again, later). If other threads come along you would open a new one (if you are under the budget) and loan it to them privately (so no syncing during read). I think sync'ing with no contention (the first shared file we hand out) should be OK performance in modern JVMs.

        the shared access should use pread... not seek+read

        But not on Windows...

        At first blush, sounds a bit too complex for the benefits.

        Yeah I'm on the fence too ... but this lack of concurrency hurts our search performance. It's a shame users have to resort to multiple IndexReaders. Though it still remains to be seen how much the pool or pread approaches really improve end-to-end search performance (vs other bottlenecks like IndexReader.isDeleted).

        Windows is an important platform and doing the pool approach, vs leaving Windows with classic if we do pread approach, lets us have good concurrency on Windows too.

        Brian Gardner added a comment -

        This probably doesn't help much, but I implemented a pool and submitted a patch very similar to the SeparateFile approach before being directed to this thread:
        https://issues.apache.org/jira/browse/LUCENE-1337

        In our implementation the synchronization/lack of concurrency has been a big issue for us. On several occasions we've had to remove new features that perform searches from frequently hit pages, because threads build up waiting for synchronized access to the underlying files. It is possible that I would still have issues even with my patch, considering that in my tests I'm only increasing throughput by 300%, but it would be easier for me to tune and scale my application since resource utilization and contention would be visible at the OS level.

        > At first blush, sounds a bit too complex for the benefits.

        My vote is that the benefits outweigh the complexity, especially considering it's an out-of-the-box solution that works well for all platforms and single-threaded as well as multi-threaded environments. If it's helpful, I can spend the time to implement some of the missing feature(s) of the pool that will be needed for it to be an acceptable solution (i.e., shared access once a file has been deleted, and perhaps a time-based closing mechanism).

        Michael McCandless added a comment -

        In our implementation the synchronization/lack of concurrency has been a big issue for us. On several occasions we've had to remove new features that perform searches from frequently hit pages, because threads build up waiting for synchronized access to the underlying files. It is possible that I would still have issues even with my patch, considering that in my tests I'm only increasing throughput by 300%, but it would be easier for me to tune and scale my application since resource utilization and contention would be visible at the OS level.

        Can you describe your test – OS, JRE version, size/type of your index, number of cores, amount of RAM, type of IO system, etc? It's awesome that you see 300% gain in search throughput. Is your index largely cached in the OS's IO cache, or not?

        My vote is that the benefits outweigh the complexity, especially considering it's an out-of-the-box solution that works well for all platforms and single-threaded as well as multi-threaded environments. If it's helpful, I can spend the time to implement some of the missing feature(s) of the pool that will be needed for it to be an acceptable solution (i.e., shared access once a file has been deleted, and perhaps a time-based closing mechanism).

        If we can see sizable concurrency gains, reliably & across platforms, I agree we should pursue this approach. One particular frustration is: if you optimize your index, thinking this gains you better search performance, you're actually making things far worse as far as concurrency is concerned because now you are down to a single immense file. I think we do need to fix this situation.

        On your patch, I think in addition to shared-access on a now-deleted file, we should add a global control on the "budget" of number of open files (right now I think your patch has a fixed cap per-filename). Probably the budget should be expressed as a multiplier off the minimum number of open files, rather than a fixed cap, so that an index with many segments is allowed to use more. Ideally over time the pool works out such that for small files in the index (small segments) since there is very little contention they only hold 1 descriptor open, but for large files many descriptors are opened.

        I created a separate test (will post a patch & details to this issue) to explore using SeparateFile inside FSDirectory, but unfortunately I see mixed results on both the cached & uncached cases. I'll post details separately.

        One issue with your patch is it's using Java 5 only classes (Lucene is still on 1.4); once you downgrade to 1.4 I wonder if the added synchronization will become costly.

        I like how your approach is to pull a RandomAccessFile from the pool only when a read is taking place – this automatically takes care of creating new descriptors when there truly is contention. But one concern I have is that this defeats the OS's IO system's read-ahead optimization since from the OS's perspective the file descriptors are getting shuffled. I'm not sure if this really matters much in Lucene, because many things (reading stored fields & term vectors) are likely not helped much by read-ahead, but for example a simple TermQuery on a large term should in theory benefit from read-ahead. You could gain this back with a simple thread affinity, such that the same thread gets the same file descriptor it got last time, if it's available. But that added complexity may offset any gains.

        Michael McCandless added a comment -

        I attached FSDirectoryPool.patch, which adds
        oal.store.FSDirectoryPool, a Directory that will open a new file for
        every unique thread.

        This is intended only as a test (to see if shows consistent
        improvement in concurrency) – eg it does not close all these files,
        nor make any effort to budget itself if there are too many threads,
        it's not really a pool, etc. But it should give us an upper bound on
        the gains we could hope for.

        I also added a "pool=true|false" config option to contrib/benchmark so
        you can run tests with and without separate files.

        I ran some quick initial tests but didn't see obvious gains. I'll go
        back & re-run more carefully to confirm, and post back.

        Michael McCandless added a comment -

        I created a large index (indexed Wikipedia 4X times over, with stored
        fields & tv offsets/positions = 72 GB). I then randomly sampled 50
        terms > 1 million freq, plus 200 terms > 100,000 freq plus 100 terms >
        10,000 freq plus 100 terms > 1000 freq. Then I warmed the OS so these
        queries are fully cached in the IO cache.

        It's a highly synthetic test. I'd really love to test on real
        queries, instead of single term queries.

        Then I ran this alg:

        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        
        query.maker = org.apache.lucene.benchmark.byTask.feeds.FileBasedQueryMaker
        file.query.maker.file = /lucene/wikiQueries.txt
        
        directory=FSDirectory
        pool=true
        
        work.dir=/lucene/bigwork
        
        OpenReader
        
        { "Warmup" SearchTrav(20) > : 5
        
        { "Rounds"
          [{ "Search" Search > : 500]: 16
          NewRound
        }: 2
        
        CloseReader 
        
        RepSumByPrefRound Search
        

        I ran with 2, 4, 8 and 16 threads, on a Intel quad Mac Pro (2 cpus,
        each dual core) OS X 10.5.4, with 6 GB RAM, Sun JRE 1.6.0_05 and a
        single WD Velociraptor hard drive. To keep the number of searches
        constant I changed the 500 count above to match (ie with 8 threads I
        changed 500 -> 1000, 4 threads I changed it to 2000, etc.).

        Here're the results – each run is best of 2, and all searches are
        fully cached in OS's IO cache:

        Number of Threads | Patch rec/s | Trunk rec/s | Pctg gain
        2                 | 78.7        | 74.9        | 5.1%
        4                 | 74.1        | 68.2        | 8.7%
        8                 | 37.7        | 32.7        | 15.3%
        16                | 19.2        | 16.3        | 17.8%

        I also ran the same alg, replacing Search task with SearchTravRet(10)
        (retrieves the first 10 docs (hits) of each search), first warming so
        it's all fully cached:

        Number of Threads | Patch rec/s | Trunk rec/s | Pctg gain
        2                 | 1589.6      | 1519.8      | 4.6%
        4                 | 1460.9      | 1395.3      | 4.7%
        8                 | 748.9       | 676.0       | 10.8%
        16                | 382.7       | 338.4       | 13.1%

        So there are smallish gains, but remember these are upper bounds on
        the gains because no pooling is happening. I'll test uncached next.

        Michael McCandless added a comment -

        OK I ran the uncached test, using the Search task. JRE & hardware are
        the same as above.

        I generated a larger (6150) set of queries to make sure the threads
        never wrap around and do the same queries again. I also run only 1
        round for the same reason. Between tests I evict the OS's IO cache.

        Number of Threads | Patch rec/s | Trunk rec/s | Pctg gain
        2                 | 32.2        | 23.8        | 35.3%
        4                 | 16.4        | 12.7        | 29.1%
        8                 | 8.5         | 3.5         | 142.9%
        16                | 3.8         | 2.7         | 40.7%

        The gains are better. The 8-thread case I don't quite understand; I
        re-ran it and it still came out much better (135.7%). It could be that
        8 threads is the sweet spot for concurrency on this hardware.

        Matthew Mastracci added a comment - - edited

        I just tried out the latest NIOFSDirectory patch and I'm seeing a bug. If I go back to the regular FSDirectory, everything works fine.

        I can't reproduce it on a smaller testcase. It only happens with the live index.

        Any ideas on where to debug?

         
        Caused by: java.lang.IndexOutOfBoundsException: Index: 24444, Size: 4
        	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        	at java.util.ArrayList.get(ArrayList.java:322)
        	at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
        	at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:249)
        	at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:68)
        	at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:123)
        	at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:154)
        	at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
        	at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
        	at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:678)
        	at org.apache.lucene.index.MultiSegmentReader.docFreq(MultiSegmentReader.java:373)
        	at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:87)
        	at org.apache.lucene.search.Similarity.idf(Similarity.java:457)
        	at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:44)
        	at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:146)
        	at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:187)
        	at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:362)
        	at org.apache.lucene.search.Query.weight(Query.java:95)
        	at org.apache.lucene.search.Searcher.createWeight(Searcher.java:171)
        	at org.apache.lucene.search.Searcher.search(Searcher.java:132)
        

        The index is not using the compound file format:

         
        7731499698 Jul 28 03:46 _6zk.fdt
         232014520 Jul 28 03:50 _6zk.fdx
                32 Jul 28 03:50 _6zk.fnm
        3775713450 Jul 28 04:06 _6zk.frq
          58003634 Jul 28 04:07 _6zk.nrm
        2944298834 Jul 28 04:18 _6zk.prx
            432418 Jul 28 04:18 _6zk.tii
          30784106 Jul 28 04:19 _6zk.tis
         217354711 Jul 28 08:18 _76i.fdt
           6509864 Jul 28 08:18 _76i.fdx
                32 Jul 28 08:18 _76i.fnm
         144348761 Jul 28 08:18 _76i.frq
           1627470 Jul 28 08:18 _76i.nrm
         295528445 Jul 28 08:19 _76i.prx
             52622 Jul 28 08:19 _76i.tii
           3858378 Jul 28 08:19 _76i.tis
         199621206 Jul 29 13:29 _7cm.fdt
           5994720 Jul 29 13:29 _7cm.fdx
                32 Jul 29 13:29 _7cm.fnm
         136445620 Jul 29 13:29 _7cm.frq
           1498684 Jul 29 13:29 _7cm.nrm
         284805312 Jul 29 13:30 _7cm.prx
             48346 Jul 29 13:30 _7cm.tii
           3522117 Jul 29 13:30 _7cm.tis
           3914068 Jul 29 13:30 _7cn.fdt
            119184 Jul 29 13:30 _7cn.fdx
                32 Jul 29 13:30 _7cn.fnm
           2993343 Jul 29 13:30 _7cn.frq
             29800 Jul 29 13:30 _7cn.nrm
           7380878 Jul 29 13:30 _7cn.prx
              5277 Jul 29 13:30 _7cn.tii
            378816 Jul 29 13:30 _7cn.tis
            383147 Jul 29 13:30 _7cq.fdt
             11240 Jul 29 13:30 _7cq.fdx
                32 Jul 29 13:30 _7cq.fnm
            290398 Jul 29 13:30 _7cq.frq
              2814 Jul 29 13:30 _7cq.nrm
            763135 Jul 29 13:30 _7cq.prx
              1581 Jul 29 13:30 _7cq.tii
            115971 Jul 29 13:30 _7cq.tis
                19 Jul 29 13:30 date
                20 Jul 21 01:53 segments.gen
               155 Jul 29 13:30 segments_d61
        
        Michael McCandless added a comment -

        I just tried out the latest NIOFSDirectory patch and I'm seeing a bug. If I go back to the regular FSDirectory, everything works fine.

        Is the index itself corrupt, ie, NIOFSDirectory did something bad when writing the index? Or, is it only in reading the index with NIOFSDirectory that you see this? IE, can you swap in FSDirectory on your existing index and the problem goes away?

        Matthew Mastracci added a comment -

        Is the index itself corrupt, ie, NIOFSDirectory did something bad when writing the index? Or, is it only in reading the index with NIOFSDirectory that you see this? IE, can you swap in FSDirectory on your existing index and the problem goes away?

        I haven't seen any issues with writing the index under NIOFSDirectory. The failures seem to happen only when reading. When I switch to FSDirectory (or MMapDirectory), the same index that fails under NIOFSDirectory works flawlessly (indicating that the index is not corrupt).

        The error with NIOFSDirectory is deterministic and repeatable (same error every time, same location, same query during warmup).

        I couldn't reproduce this on a smaller index, unfortunately.

        Michael McCandless added a comment -

        The error with NIOFSDirectory is deterministic and repeatable (same error every time, same location, same query during warmup).

        Did you see a prior exception, before hitting the AIOOBE? If so, I think this is just LUCENE-1262 all over again. That issue was fixed in BufferedIndexInput, but the NIOFSIndexInput has copied a bunch of code from BufferedIndexInput (something I think we must fix before committing it – I think it should inherit from BufferedIndexInput instead) and so it still has that bug. I'll post a patch with the bug re-fixed so you can at least test it to see if it resolves your exception.

        Jason Rutherglen added a comment -

        I can possibly work on this: just go through and re-edit the BufferedIndexInput portions of the code. Inheriting is difficult because of the ByteBuffer code; it needs to be done line by line.

        Michael McCandless added a comment -

        Attached new rev of NIOFSDirectory.

        Besides re-fixing LUCENE-1262, I also found & fixed a bug in the NIOFSIndexInput.clone() method.

        Matthew, could you give this one a shot to see if it fixes your case? Thanks.

        Michael McCandless added a comment -

        I can possibly work on this, just go through and reedit the BufferedIndexInput portions of the code. Inheriting is difficult because of the ByteBuffer code. Needs to be done line by line.

        That would be awesome, Jason. I think we should then commit NIOFSDirectory to core as at least a way around this bottleneck on all platforms but Windows. Maybe we can do this in time for 2.4?

        Matthew Mastracci added a comment -

        Matthew, could you give this one a shot to see if it fixes your case? Thanks.

        Michael,

        I ran this new patch against our big index and it works very well. If I have time, I'll run some benchmarks to see what our real-life performance improvements are like.

        Note that I'm only running it for our read-only snapshot of the index, however, so this hasn't been tested for writing to a large index.

        Michael McCandless added a comment -

        I ran this new patch against our big index and it works very well. If I have time, I'll run some benchmarks to see what our real-life performance improvements are like.

        Super, thanks! This was the same index that would reliably hit the exception above?

        Matthew Mastracci added a comment -

        Super, thanks! This was the same index that would reliably hit the exception above?

        Correct - it would hit the exception every time at startup.

        I've been running NIOFSDirectory for the last couple of hours with zero exceptions (except for running out of file descriptors after starting it the first time). The previous incarnation was running MMapDirectory.

        Thanks for all the work on this patch.

        Matthew Mastracci added a comment -

        This exception popped up out of the blue a few hours in. No exceptions before it. I'll see if I can figure out whether it was caused by our index snapshotting or if it's a bug elsewhere in NIOFSDirectory.

        I haven't seen any exceptions like this with MMapDirectory, but it's possible there's something that we're doing that isn't correct.

         
        Caused by: java.nio.channels.ClosedChannelException
        	at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:91)
        	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:616)
        	at com.dotspots.analyzer.index.NIOFSDirectory$NIOFSIndexInput.read(NIOFSDirectory.java:186)
        	at com.dotspots.analyzer.index.NIOFSDirectory$NIOFSIndexInput.refill(NIOFSDirectory.java:218)
        	at com.dotspots.analyzer.index.NIOFSDirectory$NIOFSIndexInput.readByte(NIOFSDirectory.java:232)
        	at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
        	at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
        	at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:123)
        	at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:154)
        	at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
        	at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
        	at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:678)
        	at org.apache.lucene.index.MultiSegmentReader.docFreq(MultiSegmentReader.java:373)
        	at org.apache.lucene.index.MultiReader.docFreq(MultiReader.java:310)
        	at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:87)
        	at org.apache.lucene.search.Searcher.docFreqs(Searcher.java:178)
        
        Michael McCandless added a comment -

        Interesting...

        Are you really sure you're not accidentally closing the searcher before calling Searcher.docFreqs? Are you calling docFreqs directly from your app?

        It looks like MMapIndexInput.close() is a no-op, so it would not have detected a call to Searcher.docFreqs after close, whereas NIOFSDirectory (and the normal FSDirectory) will.

        If you try the normal FSDirectory, do you also see an exception like this?

        Incidentally, what sort of performance differences are you noticing between these three different ways of accessing an index in the file system?

        Yonik Seeley added a comment -

        Maybe we can do this in time for 2.4?

        +1

        Latest patch is looking good to me!
        Is there a reason we don't do lazy allocation in clone() like FSIndexInput?

        Also, our finalizers aren't technically thread-safe, which could lead to a double close in the finalizer (although I doubt this particular case would ever happen). If we need to keep them, we could change Descriptor.isOpen to volatile; there should be essentially no cost since it's only checked in close().
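
        For illustration, a minimal sketch of that pattern (assumed code, not the actual patch; only the Descriptor/isOpen names come from the discussion above). Note the check-then-set is still not atomic, which matches the caveat that a true double close is unlikely rather than impossible:

        import java.io.File;
        import java.io.IOException;
        import java.io.RandomAccessFile;

        // Sketch only: a descriptor wrapper whose close() is guarded by a volatile flag,
        // so a close() triggered from a finalizer is very unlikely to run twice.
        class Descriptor extends RandomAccessFile {

          // volatile so the finalizer thread sees a close() done by another thread
          private volatile boolean isOpen;

          Descriptor(File file, String mode) throws IOException {
            super(file, mode);
            isOpen = true;
          }

          public void close() throws IOException {
            if (isOpen) {          // only checked here, so the volatile read costs almost nothing
              isOpen = false;
              super.close();
            }
          }

          protected void finalize() throws Throwable {
            try {
              close();             // harmless if close() already ran
            } finally {
              super.finalize();
            }
          }
        }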

        Michael McCandless added a comment -

        Is there a reason we don't do lazy allocation in clone() like FSIndexInput?

        Yonik, do you mean BufferedIndexInput.clone (not FSIndexInput)?

        I think once we fix NIOFSIndexInput to subclass from BufferedIndexInput, then cloning should be lazy again. Jason are you working on this (subclassing from BufferedIndexInput)? If not I can take it.

        Also, our finalizers aren't technically thread safe which could lead to a double close in the finalizer

        Hmmm... I'll update both FSDirectory's and NIOFSDirectory's isOpen fields to be volatile.

        Jason Rutherglen added a comment -

        Mike, I have not started on the subclassing from BufferedIndexInput yet. I can work on it Monday, though.

        Michael McCandless added a comment -

        Mike, I have not started on the subclassing from BufferedIndexInput yet. I can work on it Monday, though.

        OK, thanks!

        Michael McCandless added a comment -

        Updated patch with Yonik's volatile suggestion – thanks Yonik!

        Also, I removed NIOFSDirectory.createOutput since it was doing the same thing as super().

        Matthew Mastracci added a comment -

        Michael,

        Are you really sure you're not accidentally closing the searcher before calling Searcher.docFreqs? Are you calling docFreqs directly from your app?

        Our IndexReaders are actually managed in a shared pool (currently 8 IndexReaders, shared round-robin style as requests come in). We have some custom reference counting logic that's supposed to keep the readers alive as long as somebody has them open. As new index snapshots come in, the IndexReaders are re-opened and reference counts ensure that any old index readers in use are kept alive until the searchers are done with them. I'm guessing we have an error in our reference counting logic that just doesn't show up under MMapDirectory (as you mentioned, close() is a no-op).
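
        For reference, that kind of bookkeeping looks roughly like the following (a hedged sketch, not the actual pool code; the class and method names are invented). The reader is only closed once the last reference, including the pool's own, has been released:

        import java.io.IOException;
        import java.util.concurrent.atomic.AtomicInteger;

        import org.apache.lucene.index.IndexReader;

        // Illustrative only: one slot of a round-robin pool of shared IndexReaders.
        class RefCountedReader {

          private final IndexReader reader;
          // starts at 1: the pool itself holds a reference until the next snapshot rolls in
          private final AtomicInteger refCount = new AtomicInteger(1);

          RefCountedReader(IndexReader reader) {
            this.reader = reader;
          }

          IndexReader acquire() {
            refCount.incrementAndGet();   // caller must pair this with release()
            return reader;
          }

          void release() throws IOException {
            if (refCount.decrementAndGet() == 0) {
              reader.close();             // last user (or the pool, on snapshot roll) closes it
            }
          }
        }

        A missed release() keeps an old reader (and its file descriptors) alive, while an extra release() closes a reader that is still in use, which is exactly the kind of bug that would surface as the ClosedChannelException above.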

        We're calling docFreqs directly from our app. I'm guessing that it just happens to be the most likely item to be called after we roll to a new index snapshot.

        I don't have hard performance numbers right now, but we were having a hard time saturating I/O or CPU with FSDirectory. The locking was basically killing us. When we switched to MMapDirectory and turned on compound files, our performance jumped at least 2x. The preliminary results I'm seeing with NIOFSDirectory seem to indicate that it's slightly faster than MMapDirectory.

        I'll try setting our app back to using the old FSDirectory and see if the exceptions still occur. I'll also try to fiddle with our unit tests to make sure we're correctly ref-counting all of our index readers.

        BTW, I ran a quick FSDirectory/MMapDirectory/NIOFSDirectory shootout. It uses a parallel benchmark that roughly models what our real-life benchmark is like. I ran the benchmark once through to warm the disk cache, then got the following. The numbers are fairly stable across various runs once the disk caches are warm:

        FS: 33644ms
        MMap: 28616ms
        NIOFS: 33189ms

        I'm a bit surprised at the results myself, but I've spent a bit of time tuning the indexes to maximize concurrency. I'll double-check that the benchmark is correctly running all of the tests.

        The benchmark effectively runs 10-20 queries in parallel at a time, then waits for all queries to complete. It does this end-to-end for a number of different query batches, then totals up the time to complete each batch.
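
        For anyone reproducing this kind of comparison, the batch structure described here looks roughly like the sketch below (assumed names, thread count, and top-N; not the actual benchmark). It relies on the TopDocs-returning Searcher.search(Query, Filter, int):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TopDocs;

        // Illustrative harness only; it times one batch of concurrent queries end-to-end.
        class BatchBenchmark {

          static long timeBatch(final IndexSearcher searcher, List<Query> batch, int threads)
              throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
              List<Callable<TopDocs>> tasks = new ArrayList<Callable<TopDocs>>();
              for (final Query q : batch) {
                tasks.add(new Callable<TopDocs>() {
                  public TopDocs call() throws Exception {
                    return searcher.search(q, null, 10);   // top 10 hits; result is not inspected
                  }
                });
              }
              long start = System.currentTimeMillis();
              pool.invokeAll(tasks);                       // blocks until the whole batch finishes
              return System.currentTimeMillis() - start;
            } finally {
              pool.shutdown();
            }
          }
        }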

        Jason Rutherglen added a comment -

        LUCENE-753.patch

        NIOFSIndexInput now extends BufferedIndexInput. I was unable to test, however, and wanted to just get this up.

        Michael McCandless added a comment -

        FS: 33644ms
        MMap: 28616ms
        NIOFS: 33189ms

        I'm a bit surprised at the results myself, but I've spent a bit of time tuning the indexes to maximize concurrency. I'll double-check that the benchmark is correctly running all of the tests.

        This is surprising – your benchmark is very concurrent, yet FSDir and NIOFSDir are close to the same net throughput, while MMapDir is quite a bit faster. Is this on a non-Windows OS?

        Michael McCandless added a comment -

        New patch attached. Matthew, if you could try this version out on your
        index, that'd be awesome.

        I didn't like how we were still copying the hairy readBytes & refill
        methods from BufferedIndexInput, so I made some small additional mods
        to BufferedIndexInput to notify the subclass when a byte[] buffer gets
        allocated, which then allowed us to fully inherit these methods.

        But then I realized we were duplicating a lot of code from
        FSIndexInput, so I switched to subclassing that instead, which made
        things even simpler.

        Some other things also fixed:

        • We were ignoring bufferSize (e.g. setBufferSize).
        • We weren't closing the FileChannel.
        • clone() now lazily clones the buffer again.

        To test this, I made NIOFSDirectory the default implementation in
        FSDirectory.getDirectory and ran all tests. One test failed at first
        (because we were ignoring setBufferSize calls); with the new patch,
        all tests pass.

        I also indexed the first 150K docs of Wikipedia and ran various
        searches using NIOFSDirectory, and all seems good.

        The class is quite a bit simpler now; however, there's one thing I
        don't like: when you use CFS, the NIOFSIndexInput.readInternal method
        will wrap the CSIndexInput's byte[] (from its parent
        BufferedIndexInput class) on every call (every 1024 bytes read from
        the file). I'd really like to find a clean way to reuse a single
        ByteBuffer. Not yet sure how to do that, though...
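
        To make the concern concrete, here is a rough sketch (assumed code, not the committed patch) of a positional-read readInternal. FileChannel.read(ByteBuffer, long) needs a ByteBuffer, so when the destination byte[] belongs to the caller (as with CSIndexInput) it must either be wrapped on every call or the wrapper cached and reused, e.g.:

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        // Sketch only: names and structure are illustrative, not the committed NIOFSIndexInput.
        class PositionalReader {

          private final FileChannel channel;
          private long filePos;           // this reader's private position in the file
          private ByteBuffer wrapped;     // cached wrapper, reused while the byte[] stays the same

          PositionalReader(FileChannel channel, long startPos) {
            this.channel = channel;
            this.filePos = startPos;
          }

          void readInternal(byte[] b, int offset, int len) throws IOException {
            // Re-wrap only when the destination array changes; otherwise reuse the wrapper.
            if (wrapped == null || wrapped.array() != b) {
              wrapped = ByteBuffer.wrap(b);
            }
            wrapped.limit(offset + len);
            wrapped.position(offset);
            while (wrapped.hasRemaining()) {
              // Positional read: no seek and no lock on a shared file pointer.
              int n = channel.read(wrapped, filePos);
              if (n == -1) {
                throw new IOException("read past EOF");
              }
              filePos += n;
            }
          }
        }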

        Michael McCandless added a comment -

        New version attached. This one re-uses a wrapped byte buffer even when it's CSIndexInput that's calling it.

        I plan to commit in a day or two.

        Michael McCandless added a comment -

        I just committed revision 690539, adding NIOFSDirectory. I will leave this open, but move off of 2.4, until we can get similar performance gains on Windows...

        robert engels added a comment -

        Sun is accepting outside bug fixes to OpenJDK and merging them into the commercial JDK (in most cases).

        If the underlying bug is fixed in the Windows JDK - not too hard - then you can fix this properly in Lucene.

        If you don't fix it in the JDK, you are always going to be trading the 'running out of file handles' synchronization for the "locked position" synchronization - there is no way to fix this in user code...

        Yonik Seeley added a comment -

        Attaching new FileReadTest.java that fixes a concurrency bug in SeparateFile - each reader needed its own file position.
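
        For context, the bug class being fixed looks roughly like this (an illustrative sketch, not the actual FileReadTest code): readers sharing one RandomAccessFile also share its file pointer, so each concurrent reader needs its own descriptor (or a positional read) to get its own position.

        import java.io.File;
        import java.io.IOException;
        import java.io.RandomAccessFile;

        // Illustrative only: each reader thread opens its own RandomAccessFile, so its
        // seek/read pair cannot interleave with another thread's on a shared file pointer.
        class PerReaderFile {

          private final RandomAccessFile raf;   // never shared between threads

          PerReaderFile(File file) throws IOException {
            this.raf = new RandomAccessFile(file, "r");
          }

          int readAt(long pos, byte[] buf) throws IOException {
            raf.seek(pos);                      // safe: only this thread uses this descriptor
            return raf.read(buf, 0, buf.length);
          }

          void close() throws IOException {
            raf.close();
          }
        }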

        Uwe Schindler added a comment -

        This issue was resolved a long time ago but was left open because of the stupid Windows Sun JRE bug, which was never fixed. With Lucene 3.x and trunk we have better defaults (e.g. MMapDirectory on 64-bit Windows).

        Users should default to FSDirectory.open() and use the returned directory for best performance.
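
        For example (a minimal usage sketch against the Lucene 3.x API; the index path is a placeholder):

        import java.io.File;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class OpenIndexExample {
          public static void main(String[] args) throws Exception {
            // FSDirectory.open picks a sensible implementation for the platform
            // (e.g. MMapDirectory on 64-bit Windows, per the comment above), so
            // applications normally should not hard-code a Directory implementation.
            Directory dir = FSDirectory.open(new File("/path/to/index"));  // placeholder path
            IndexReader reader = IndexReader.open(dir, true);              // read-only reader
            IndexSearcher searcher = new IndexSearcher(reader);
            try {
              // ... run queries against searcher ...
            } finally {
              searcher.close();
              reader.close();
              dir.close();
            }
          }
        }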


          People

          • Assignee: Michael McCandless
          • Reporter: Yonik Seeley
          • Votes: 5
          • Watchers: 9
