Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7583

Can we improve OutputStreamIndexOutput's byte buffering when writing each BKD leaf block?

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When BKD writes its leaf blocks, it's essentially a lot of tiny writes (vint, int, short, etc.), and I've seen deep thread stacks through our IndexOutput impl (OutputStreamIndexOutput) when pulling hot threads while BKD is writing.

      So I tried a small change, to have BKDWriter do its own buffering, by first writing each leaf block into a RAMOutputStream, and then dumping that (in 1 KB byte[] chunks) to the actual IndexOutput.

      This gives a non-trivial reduction (~6%) in the total time for BKD writing + merging time on the 20M NYC taxis nightly benchmark (2 times each):

      Trunk, sparse:

      • total: 64.691 sec
      • total: 64.702 sec

      Patch, sparse:

      • total: 60.820 sec
      • total: 60.965 sec

      Trunk dense:

      • total: 62.730 sec
      • total: 62.383 sec

      Patch dense:

      • total: 58.805 sec
      • total: 58.742 sec

      The results seem to be consistent and reproducible. I'm using Java 1.8.0_101 on a fast SSD on Ubuntu 16.04.

      It's sort of weird and annoying that this helps so much, because OutputStreamIndexOutput already uses java's BufferedOutputStream (default 8 KB buffer) to buffer writes.

      Uwe Schindler suggested maybe hotspot is failing to inline/optimize the writeByte / the call stack just has too many layers.

      We could commit this patch (it's trivial) but it'd be nice to understand and fix why buffering writes is somehow costly so any other Lucene codec components that write lots of little things can be improved too.

      1. LUCENE-7583.fork-FastOutputStream.patch
        6 kB
        Michael McCandless
      2. LUCENE-7583.patch
        32 kB
        Michael McCandless
      3. LUCENE-7583.patch
        4 kB
        Michael McCandless
      4. LUCENE-7583.private-IndexOutput.patch
        2 kB
        Michael McCandless
      5. LUCENE-7583-hardcode-writeVInt.patch
        2 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Simple patch.

        Show
        mikemccand Michael McCandless added a comment - Simple patch.
        Hide
        thetaphi Uwe Schindler added a comment -

        As discusses yesterday, I'd try to do 2 things:

        • Ensure that synchronization in BufferedOutputStream is not the issue (by replacing with a Solr-Like FastOutputStream). I don't think this will help, because the misdesign in the Input/OutputStream is one of the first thing Hotspot eliminates. In general the whole Input/OutputStream API was synchronized in Java 1.0, which is bogus. But backwards compatibility... Same issue like StringBuilder vs. StringBuffer.
        • Another try would be to hardcode writeVInit in OutputStreamIndexOutput
        Show
        thetaphi Uwe Schindler added a comment - As discusses yesterday, I'd try to do 2 things: Ensure that synchronization in BufferedOutputStream is not the issue (by replacing with a Solr-Like FastOutputStream). I don't think this will help, because the misdesign in the Input/OutputStream is one of the first thing Hotspot eliminates. In general the whole Input/OutputStream API was synchronized in Java 1.0, which is bogus. But backwards compatibility... Same issue like StringBuilder vs. StringBuffer. Another try would be to hardcode writeVInit in OutputStreamIndexOutput
        Hide
        mikemccand Michael McCandless added a comment -

        Another try would be to hardcode writeVInt in OutputStreamIndexOutput

        OK, I tried this, in the attached patch, but it didn't help that much (maybe a bit):

        Sparse:

        • total: 64.387 sec

        Dense:

        • total: 62.134 sec

        I'll try an unsynchronized variant of BufferedOutputStream next.

        Show
        mikemccand Michael McCandless added a comment - Another try would be to hardcode writeVInt in OutputStreamIndexOutput OK, I tried this, in the attached patch, but it didn't help that much (maybe a bit): Sparse: total: 64.387 sec Dense: total: 62.134 sec I'll try an unsynchronized variant of BufferedOutputStream next.
        Hide
        thetaphi Uwe Schindler added a comment -

        Do we call flush at some places? I am sure you checked this, but maybe we missed some place.

        Show
        thetaphi Uwe Schindler added a comment - Do we call flush at some places? I am sure you checked this, but maybe we missed some place.
        Hide
        mikemccand Michael McCandless added a comment -

        Do we call flush at some places? I am sure you checked this, but maybe we missed some place.

        That's a good idea (I hadn't check for it), but the only place we call OutputStream.flush is in OutputStreamIndexOutput.close.

        Show
        mikemccand Michael McCandless added a comment - Do we call flush at some places? I am sure you checked this, but maybe we missed some place. That's a good idea (I hadn't check for it), but the only place we call OutputStream.flush is in OutputStreamIndexOutput.close .
        Hide
        mikemccand Michael McCandless added a comment -

        Hmm, this is interesting:

        I temporarily forked BufferedOutputStream.java into oal.store as a
        package private class, and changed OutputStreamIndexOutput to use
        that version. I then ran the benchmark again, and the times were the
        same.

        Then I remove synchronized from the 3 methods that have it now
        (flush, and the two write methods) and the times improved quite a bit:

        Sparse:

        • total: 61.591 sec

        Dense:

        • total: 59.739 sec

        Not quite as fast as the patch (to use RAMOutputStream to buffer writes) but close (~4.8% faster vs ~5.8%).

        Show
        mikemccand Michael McCandless added a comment - Hmm, this is interesting: I temporarily forked BufferedOutputStream.java into oal.store as a package private class, and changed OutputStreamIndexOutput to use that version. I then ran the benchmark again, and the times were the same. Then I remove synchronized from the 3 methods that have it now (flush, and the two write methods) and the times improved quite a bit: Sparse: total: 61.591 sec Dense: total: 59.739 sec Not quite as fast as the patch (to use RAMOutputStream to buffer writes) but close (~4.8% faster vs ~5.8%).
        Hide
        thetaphi Uwe Schindler added a comment -

        So this looks like a problem of the Hotspot VM that does not fully remove the stupid synchronized on this call stack. This should not happen, because most optimizations in the VM are there to fix IO, because every Input/OutputStream has an internal synchronized lock.... BufferedOutputStream just has an additional one.

        Show
        thetaphi Uwe Schindler added a comment - So this looks like a problem of the Hotspot VM that does not fully remove the stupid synchronized on this call stack. This should not happen, because most optimizations in the VM are there to fix IO, because every Input/OutputStream has an internal synchronized lock.... BufferedOutputStream just has an additional one.
        Hide
        thetaphi Uwe Schindler added a comment -

        Are we sure that we do not open the IndexOutput in one thread and had it over to another one? we should also make all references to the IndexOutput private, so it cannot escape the current thread (to help hotspot). This means: no non-private fields holding the reference to the stream.

        Show
        thetaphi Uwe Schindler added a comment - Are we sure that we do not open the IndexOutput in one thread and had it over to another one? we should also make all references to the IndexOutput private, so it cannot escape the current thread (to help hotspot). This means: no non-private fields holding the reference to the stream.
        Hide
        thetaphi Uwe Schindler added a comment - - edited

        If we are really required to fork the buffered stream, we may use: https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/FastOutputStream.java (but without the DataOutput interface impl).

        Show
        thetaphi Uwe Schindler added a comment - - edited If we are really required to fork the buffered stream, we may use: https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/FastOutputStream.java (but without the DataOutput interface impl).
        Hide
        mikemccand Michael McCandless added a comment -

        Are we sure that we do not open the IndexOutput in one thread and had it over to another one?

        Yeah, the IndexOutput is opened in Lucene60PointsWriter, and then that same thread goes and writes all points via writeField. At IW flush time it's an indexing thread, and at merge time it's a merge thread, but it should only ever be a single thread touching that IndexOutput. The benchmark I'm running only ever uses a single thread anyway ...

        we should also make all references to the IndexOutput private, so it cannot escape the current thread (to help hotspot). This means: no non-private fields holding the reference to the stream.

        I'll try to do this; there's at least one place where it's protected, but that's way high up in the stack (Lucene60PointsWriter).

        If we are really required to fork the buffered stream, we may use: https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/FastOutputStream.java (but without the DataOutput interface impl).

        I'll test that too.

        Thanks Uwe Schindler.

        Show
        mikemccand Michael McCandless added a comment - Are we sure that we do not open the IndexOutput in one thread and had it over to another one? Yeah, the IndexOutput is opened in Lucene60PointsWriter , and then that same thread goes and writes all points via writeField . At IW flush time it's an indexing thread, and at merge time it's a merge thread, but it should only ever be a single thread touching that IndexOutput . The benchmark I'm running only ever uses a single thread anyway ... we should also make all references to the IndexOutput private, so it cannot escape the current thread (to help hotspot). This means: no non-private fields holding the reference to the stream. I'll try to do this; there's at least one place where it's protected, but that's way high up in the stack ( Lucene60PointsWriter ). If we are really required to fork the buffered stream, we may use: https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/FastOutputStream.java (but without the DataOutput interface impl). I'll test that too. Thanks Uwe Schindler .
        Hide
        mikemccand Michael McCandless added a comment -

        I also tried with ByteArrayDataOutput and it gets the fastest result, ~9.6% faster than trunk today:

        Sparse:

        • total: 58.503 sec

        Dense:

        • total: 57.227 sec
        Show
        mikemccand Michael McCandless added a comment - I also tried with ByteArrayDataOutput and it gets the fastest result, ~9.6% faster than trunk today: Sparse: total: 58.503 sec Dense: total: 57.227 sec
        Hide
        thetaphi Uwe Schindler added a comment -

        I think ByteArrayDataOutput is always a good idea to create "small" blobs of structured data. You have full control of the buffer and there is almost no checks and multi-buffer handling involved. It just writes to an byte array that you can reuse later or write to IndexOutput as block.

        Show
        thetaphi Uwe Schindler added a comment - I think ByteArrayDataOutput is always a good idea to create "small" blobs of structured data. You have full control of the buffer and there is almost no checks and multi-buffer handling involved. It just writes to an byte array that you can reuse later or write to IndexOutput as block.
        Hide
        mikemccand Michael McCandless added a comment -

        OK I made the 2 places where we hang onto the IndexOutput instance in a class instance private (see attached patch) but it looks like this didn't really help:

        Sparse:

        • total: 64.457 sec

        Dense:

        • total: 62.412 sec
        Show
        mikemccand Michael McCandless added a comment - OK I made the 2 places where we hang onto the IndexOutput instance in a class instance private (see attached patch) but it looks like this didn't really help: Sparse: total: 64.457 sec Dense: total: 62.412 sec
        Hide
        mikemccand Michael McCandless added a comment -

        OK, I forked Solrj's FastOutputStream.java into oal.store, and it
        gets similar performance to forking BufferedOutputStream and
        removing its synchronized keywords:

        Sparse:

        • total: 61.584 sec

        Dense:

        • total: 59.602 sec
        Show
        mikemccand Michael McCandless added a comment - OK, I forked Solrj's FastOutputStream.java into oal.store, and it gets similar performance to forking BufferedOutputStream and removing its synchronized keywords: Sparse: total: 61.584 sec Dense: total: 59.602 sec
        Hide
        mikemccand Michael McCandless added a comment -

        I think ByteArrayDataOutput is always a good idea to create "small" blobs of structured data.

        Yeah I'm leaning towards just doing this for BKDWriter at this point. I'll clean up that approach and post a patch.

        Show
        mikemccand Michael McCandless added a comment - I think ByteArrayDataOutput is always a good idea to create "small" blobs of structured data. Yeah I'm leaning towards just doing this for BKDWriter at this point. I'll clean up that approach and post a patch.
        Hide
        mikemccand Michael McCandless added a comment -

        Here's a patch, just moving GrowableByteArrayDataOutput from the compressing codecs to oal.store, and tightening it a bit (making fields private) and switching over BKDWriter to use it as well ...

        Show
        mikemccand Michael McCandless added a comment - Here's a patch, just moving GrowableByteArrayDataOutput from the compressing codecs to oal.store, and tightening it a bit (making fields private) and switching over BKDWriter to use it as well ...
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit b97d9d7478f99660c1cfc91ef4461b7405254dea in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b97d9d7 ]

        LUCENE-7583: buffer small leaf-block writes in BKDWriter

        Show
        jira-bot ASF subversion and git services added a comment - Commit b97d9d7478f99660c1cfc91ef4461b7405254dea in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b97d9d7 ] LUCENE-7583 : buffer small leaf-block writes in BKDWriter
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 28a22c0203f8ffb6f71d6a2c0a610eebf9cbfd12 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=28a22c0 ]

        LUCENE-7583: buffer small leaf-block writes in BKDWriter

        Show
        jira-bot ASF subversion and git services added a comment - Commit 28a22c0203f8ffb6f71d6a2c0a610eebf9cbfd12 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=28a22c0 ] LUCENE-7583 : buffer small leaf-block writes in BKDWriter
        Hide
        thetaphi Uwe Schindler added a comment -

        Thanks Mike!
        I was just wondering why you added the "if (newSize > currentSize)" like checks before ArrayUtil.grow. ArrayUtil.grow does this already and exits early, so the check is done twice.

        Show
        thetaphi Uwe Schindler added a comment - Thanks Mike! I was just wondering why you added the "if (newSize > currentSize)" like checks before ArrayUtil.grow. ArrayUtil.grow does this already and exits early, so the check is done twice.
        Hide
        mikemccand Michael McCandless added a comment -

        I was just wondering why you added the "if (newSize > currentSize)" like checks before ArrayUtil.grow.

        Because I wasn't trusting that the JVM would inline this method call.

        Also, I think these methods are poorly named. They should be maybeGrow or growIfNeeded if indeed they are lenient when you call them on an array that is not in fact in need of growing.

        If you feel strongly, I can remove that if, but I think it makes the code look sloppier.

        Show
        mikemccand Michael McCandless added a comment - I was just wondering why you added the "if (newSize > currentSize)" like checks before ArrayUtil.grow. Because I wasn't trusting that the JVM would inline this method call. Also, I think these methods are poorly named. They should be maybeGrow or growIfNeeded if indeed they are lenient when you call them on an array that is not in fact in need of growing. If you feel strongly, I can remove that if, but I think it makes the code look sloppier.
        Hide
        dancollins Daniel Collins added a comment -

        Note: this breaks my eclipse build (on 6x at least, and I presume master) because lucene/core/src/java/org/apache/lucene/util/GrowableByteArrayDataOutput.java claims to be in package org.apache.lucene.store, but is actually in the util dir.

        Ant compile is fine, but I guess Eclipse is more pedantic.

        Show
        dancollins Daniel Collins added a comment - Note: this breaks my eclipse build (on 6x at least, and I presume master) because lucene/core/src/java/org/apache/lucene/util/GrowableByteArrayDataOutput.java claims to be in package org.apache.lucene.store, but is actually in the util dir. Ant compile is fine, but I guess Eclipse is more pedantic.
        Hide
        mikemccand Michael McCandless added a comment -

        Note: this breaks my eclipse build (on 6x at least, and I presume master)

        Gak, I'll fix! Thanks for reporting this.

        Show
        mikemccand Michael McCandless added a comment - Note: this breaks my eclipse build (on 6x at least, and I presume master) Gak, I'll fix! Thanks for reporting this.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit c185617582b4bf3ce2899c9ae67e9eeaf2c21741 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c185617 ]

        LUCENE-7583: move this class to the right package

        Show
        jira-bot ASF subversion and git services added a comment - Commit c185617582b4bf3ce2899c9ae67e9eeaf2c21741 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c185617 ] LUCENE-7583 : move this class to the right package
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit ca428ce2381fd9a8e6f56767ad9d0fb1638ba7dc in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ca428ce ]

        LUCENE-7583: move this class to the right package

        Show
        jira-bot ASF subversion and git services added a comment - Commit ca428ce2381fd9a8e6f56767ad9d0fb1638ba7dc in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ca428ce ] LUCENE-7583 : move this class to the right package

          People

          • Assignee:
            Unassigned
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development