Lucene - Core
LUCENE-5161

review FSDirectory chunking defaults and test the chunking

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today there is a loop in SimpleFS/NIOFS:

      try {
        do {
          final int readLength;
          if (total + chunkSize > len) {
            readLength = len - total;
          } else {
            // LUCENE-1566 - work around JVM bug by breaking very large reads into chunks
            readLength = chunkSize;
          }
          final int i = file.read(b, offset + total, readLength);
          total += i;
        } while (total < len);
      } catch (OutOfMemoryError e) {
        // ... (snippet truncated in the issue; the real catch rethrows with a
        // hint to call setReadChunkSize with a smaller value)
      }

      I bet if you look at the Clover report it's untested, because it's fixed at 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even good?!).

      Also, if you call the setter on a 64-bit machine to change the size, it just totally ignores it. We should remove that special case; the setter should always work.

      And we should set it to small values in tests so this loop is actually executed.
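
      As a sketch of that testing idea (a hypothetical helper, not the actual LuceneTestCase code): pick a deliberately small random chunk size so the loop above runs several iterations for any non-trivial file.

      import java.util.Random;

      final class ChunkingTestSupport {
        // Returns a small chunk size (1..8192 bytes) so that reading any file
        // larger than a few KB exercises the chunked-read loop repeatedly.
        static int randomReadChunkSize(Random random) {
          return 1 + random.nextInt(8192);
        }
      }

      A directory under test would then be configured with something like dir.setReadChunkSize(randomReadChunkSize(random)) before use.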

      1. LUCENE-5161.patch
        2 kB
        Robert Muir
      2. LUCENE-5161.patch
        2 kB
        Robert Muir
      3. LUCENE-5161.patch
        1 kB
        Robert Muir

          Activity

          Robert Muir added a comment -

          This patch makes the setter always work, and changes LuceneTestCase to use small values for the chunking.

          I didn't adjust any defaults (maybe Uwe can help, he knows about the code in question).
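
          In spirit, the setter part of the patch amounts to something like this (a minimal sketch with assumed names, not the patch itself):

          final class ChunkedReader {
            private int chunkSize;

            // The new value is validated and then always honored; there is no
            // 64-bit branch that silently ignores it.
            void setReadChunkSize(int chunkSize) {
              if (chunkSize <= 0) {
                throw new IllegalArgumentException("chunkSize must be positive: " + chunkSize);
              }
              this.chunkSize = chunkSize;
            }
          }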

          Uwe Schindler added a comment - edited

          Thanks Robert for opening this issue.

          It is too late today, so I will respond tomorrow morning about the NIO stuff. I have now inspected the JVM code, so I can explain why the OOMs occur in SimpleFSDir and NIOFSDir when you read large buffers. More details tomorrow; just one thing up front: it has nothing to do with 32 vs. 64 bits, it is more a limitation of the JVM around direct memory and heap size that leads to the OOM under certain conditions. But the Integer.MAX_VALUE default for 64-bit JVMs is just wrong, too (it could also lead to OOM).

          In general I would not make the buffers too large; the chunk size should be limited to no more than a few megabytes. Making them large brings no performance improvement at all, it just wastes memory in large thread-local direct buffers allocated internally by the JVM's NIO code.

          Robert Muir added a comment -

          Thanks Uwe, I will leave the issue for you tomorrow to fix the defaults.

          I can only say the chunking does not seem buggy (all tests pass with the randomization in the patch), so at least we have that.

          Uwe Schindler added a comment - edited

          Hi,

          I will now explain the problems of SimpleFSDirectory and NIOFSDirectory and why the OOM occurs:

          NIOFSDir uses a FileChannel to read from disk. This is generally a good idea to support lockless transfers (unfortunately not on Windows). The issue here is a limitation in the internal JVM implementation, and the big issue is the garbage collector: it is impossible for native code to read from a file descriptor directly into a Java byte[] (e.g. a heap ByteBuffer from ByteBuffer.allocate(), or the byte[] passed to RandomAccessFile), because those are interruptible operations and are not synchronized. It may happen that the JDK invokes the kernel read() method, passing it the native pointer of the byte[], and suddenly the garbage collector jumps in (in another thread) and moves the byte[] to defragment the heap. As the code is in the kernel, there is nothing to prevent it from writing outside the byte[] once it has been moved. Theoretically the JVM could pin the byte[] somehow to prevent the GC from moving it, but that is not how it is done.

          Because of this problem, FileChannel (and also RandomAccessFile) allocates a temporary DirectBuffer if the buffer passed in is a heap ByteBuffer (see http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/IOUtil.java#211 and http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/Util.java#60). Those direct byte buffers are cached via SoftReferences, so they get garbage collected once memory gets low. But as you can see from the code, the direct buffer is chosen to be at least the size of the requested transfer (if none of the cached buffers is large enough, a new one is allocated with the full transfer size). And this is the big problem leading to the OOM: the maximum amount of direct memory allocated outside the JVM heap is limited by the heap size (I think 2 times the heap).
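
          To make that concrete, here is a hedged sketch (illustrative code, not Lucene's): each FileChannel.read against a heap buffer copies through a temporary direct buffer sized to the requested transfer, so capping each request caps the hidden direct-memory allocation.

          import java.io.IOException;
          import java.nio.ByteBuffer;
          import java.nio.channels.FileChannel;

          final class BoundedReads {
            // Fill 'dst' (a heap buffer) from 'ch' starting at 'pos',
            // requesting at most 'maxChunk' bytes per syscall. Bounding each
            // request bounds the temporary direct buffer the JDK allocates.
            static void readFully(FileChannel ch, ByteBuffer dst, long pos, int maxChunk)
                throws IOException {
              while (dst.hasRemaining()) {
                final int end = dst.limit();                         // remember real end
                dst.limit(Math.min(dst.position() + maxChunk, end)); // cap this request
                final int n = ch.read(dst, pos);
                dst.limit(end);                                      // restore
                if (n < 0) {
                  throw new IOException("read past EOF");
                }
                pos += n;
              }
            }
          }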

          The current chunk sizes are horrible: with 2 gigabytes on 64-bit and many megabytes on 32-bit, you allocate huge direct buffers outside of the JDK heap that consume memory, and it is unlikely that they are ever freed again. So we should really limit the maximum size of those chunks to reasonable values. The chunking code is working (and is tested), so we should limit those read buffers to a sensible value.

          E.g., for Windows everything greater than 64 MB is useless (see some references for transferTo). The only thing we change with the chunk size is the number of syscalls, but for reading 500 MB of index norms it makes no difference whether you issue 2 syscalls or 500 syscalls; the hard disk is the limiting factor.

          For RandomAccessFile the same is done: it internally allocates direct memory (in fact, current JDKs implement RandomAccessFile mostly through NIO using ByteBuffer.wrap()).

          The above also explains why distinguishing between 32-bit and 64-bit is useless. The OOM occurs not because of the bit size, but because direct memory is, like the Java heap, limited by the underlying JVM. So we should not waste all this memory. Also note: to transfer 500 MB you need at least the 500 MB byte[] as the target of the transfer (using ByteBuffer.wrap, as we do in NIOFSDir) on the heap, but also 500 MB in direct memory, so we waste a whole gigabyte! This is horribly inefficient.

          Also note that NIOFSDir always has to copy from the direct buffer to the heap buffer, so this is an overhead. It might be a good idea to implement a second, optimized NIOFSDir that uses DirectBuffers and does not copy everything to the heap. For the direct buffer chunks we could use code similar to ByteBufferIndexInput (which is very effective).
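
          A hedged sketch of that idea (assumed names, not a real Directory implementation): keep one reusable direct buffer as the read window and serve single-byte reads straight from it, so bytes are only copied onto the heap when a caller actually asks for a byte[].

          import java.io.IOException;
          import java.nio.ByteBuffer;
          import java.nio.channels.FileChannel;

          final class DirectWindowInput {
            private final FileChannel channel;
            private final ByteBuffer window = ByteBuffer.allocateDirect(8 << 10); // 8 KB window
            private long filePos;

            DirectWindowInput(FileChannel channel) {
              this.channel = channel;
              window.limit(0); // start with an empty window
            }

            byte readByte() throws IOException {
              if (!window.hasRemaining()) {
                refill();
              }
              return window.get(); // served from direct memory, no heap copy
            }

            private void refill() throws IOException {
              window.clear();
              final int n = channel.read(window, filePos);
              if (n < 0) {
                throw new IOException("read past EOF");
              }
              filePos += n;
              window.flip();
            }
          }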

          I would default the chunk size in NIOFSDir to something around 1 to 32 megabytes, e.g. 2 megabytes on 32-bit and 8 megabytes on 64-bit. The current chunk sizes are definitely way too large and waste physical memory we could use for something else!

          Maybe Michael McCandless can do some perf tests with NIOFSDir with radically lowered buffer sizes. I think it will not show any difference!

          Robert Muir added a comment -

          In addition to the defaults, the javadocs that recommend large values should also be fixed.

          Robert Muir added a comment -

          Updated patch, setting the default size to 2*MERGE_BUFFER_SIZE.

          I don't see any difference in indexing or searching with the 1M doc wikipedia. I fired off the 10M doc one too (it takes hours here).

          In general I think it's obvious it doesn't matter much, because places are not really reading huge byte[]s at init: most of those places (e.g. DocValues) are using things like BlockPackedReader for better compression anyway.

          And for merging and so on, it's totally dominated by merging of terms/postings!

          Robert Muir added a comment -

          I finished the 10M wikipedia, here was the indexing time:

          • patch: 979888 ms
          • trunk: 991839 ms

          So there is no degradation in indexing speed or searching speed.

          Robert Muir added a comment -

          By the way, I also fixed the TODO in this patch, which is unrelated but bad.

          If one of our core directories throws an exception in its ctor, I want the test to fail, not silently fall back to FSDirectory.open. If someone specifies a bogus -Dtests.directory, that should either test their supplied directory or fail!

          Robert Muir added a comment -

          Final patch: I will move forward to remove this trap.

          Uwe Schindler added a comment -

          I would use oal.util.Rethrow.rethrow(e) instead of the RuntimeException in CommandLineUtil! That would make the stack trace easier to read.
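
          For reference, the rethrow idiom Uwe refers to can be sketched like this (my reconstruction of the usual "sneaky throw" trick; Lucene's org.apache.lucene.util.Rethrow may differ in detail). It propagates the original Throwable unchanged, so the stack trace is not buried inside a RuntimeException wrapper.

          final class Rethrow {
            private Rethrow() {}

            // Rethrows 't' as-is, without wrapping. The generics trick makes
            // the compiler treat it as an unchecked throw.
            public static void rethrow(Throwable t) {
              Rethrow.<Error>doThrow(t);
            }

            @SuppressWarnings("unchecked")
            private static <T extends Throwable> void doThrow(Throwable t) throws T {
              throw (T) t;
            }
          }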

          Robert Muir added a comment -

          Thanks Uwe, I'll do that!

          ASF subversion and git services added a comment -

          Commit 1512723 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1512723 ]

          LUCENE-5161: set sane default readChunkSizes, make the setter work, and test the chunking

          ASF subversion and git services added a comment -

          Commit 1512729 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1512729 ]

          LUCENE-5161: set sane default readChunkSizes, make the setter work, and test the chunking

          Robert Muir added a comment -

          Thanks for the thorough investigation Uwe!

          Adrien Grand added a comment -

          4.5 release -> bulk close


            People

            • Assignee: Robert Muir
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 3
