Lucene - Core
LUCENE-888

Improve indexing performance by increasing internal buffer sizes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.2
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In working on LUCENE-843, I noticed that two buffer sizes have a
      substantial impact on overall indexing performance.

      First is BufferedIndexOutput.BUFFER_SIZE (also used by
      BufferedIndexInput). Second is CompoundFileWriter's buffer used to
      actually build the compound file. Both are now 1 KB (1024 bytes).

      I ran the same indexing test I'm using for LUCENE-843. I'm indexing
      ~5,500 byte plain text docs derived from the Europarl corpus
      (English). I index 200,000 docs with compound file enabled and term
      vector positions & offsets stored plus stored fields. I flush
      documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
      not hit LUCENE-845. The resulting index is 1.7 GB. The index is not
      optimized in the end and I left mergeFactor @ 10.

      I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
      system.

      At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
      I increase both buffers to 8 KB it takes 554 sec to build the index,
      which is an 11% overall gain!

      I will run more tests to see if there is a natural knee in the curve
      (buffer size above which we don't really gain much more performance).

      I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
      at 1024, at least for now. During searching there can be quite a few
      of this class instantiated, and likely a larger buffer size for the
      freq/prox streams could actually hurt search performance for those
      searches that use skipping.

      The CompoundFileWriter buffer is created only briefly, so I think we
      can use a fairly large (32 KB?) buffer there. And there should not be
      too many BufferedIndexOutputs alive at once so I think a large-ish
      buffer (16 KB?) should be OK.
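      A toy illustration of why these buffer sizes matter (this is not
      Lucene code; the class below is a hypothetical stand-in): every time
      the buffer fills, a write is issued to the underlying file, so an
      8 KB buffer issues roughly one eighth as many writes as a 1 KB one.

```java
// Hypothetical demo class: counts how many writes a buffered output
// issues to its underlying sink for a given buffer size.
import java.io.ByteArrayOutputStream;

public class BufferSizeDemo {
    static class CountingBufferedOutput {
        private final byte[] buffer;
        private int pos = 0;
        int flushCount = 0;                        // writes to the "file"
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();

        CountingBufferedOutput(int bufferSize) {
            buffer = new byte[bufferSize];
        }

        void writeByte(byte b) {
            if (pos == buffer.length) flush();     // buffer full
            buffer[pos++] = b;
        }

        void flush() {
            sink.write(buffer, 0, pos);
            pos = 0;
            flushCount++;
        }
    }

    public static void main(String[] args) {
        for (int size : new int[] {1024, 8192}) {
            CountingBufferedOutput out = new CountingBufferedOutput(size);
            for (int i = 0; i < 100000; i++) out.writeByte((byte) i);
            out.flush();
            System.out.println(size + " B buffer -> " + out.flushCount + " flushes");
        }
    }
}
```

      The real cost per flush is a syscall (and possibly a disk seek), which
      is why fewer, larger writes help indexing throughput.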

      1. LUCENE-888.take2.patch
        26 kB
        Michael McCandless
      2. LUCENE-888.patch
        26 kB
        Michael McCandless


          Activity

          Michael Busch added a comment -

          > At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
          > I increase both buffers to 8 KB it takes 554 sec to build the index,
          > which is an 11% overall gain!

          Cool!

          > I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
          > at 1024, at least for now. During searching there can be quite a few
          > of this class instantiated, and likely a larger buffer size for the
          > freq/prox streams could actually hurt search performance for those
          > searches that use skipping.

          I submitted a patch for LUCENE-430 which avoids copying the buffer when
          a BufferedIndexInput is cloned. With this patch we could also add a
          method setBufferSize(int) to BufferedIndexInput. This method has to
          be called then right after the input has been created or cloned and
          before the first read is performed (the first read operation allocates
          the buffer). If called later it wouldn't have any effect. This would
          allow us to adjust the buffer size dynamically, e.g. use large buffers
          for segment merges and stored fields, but smaller ones for freq/prox
          streams, maybe dependent on the document frequency.
          What do you think?

          > The CompoundFileWriter buffer is created only briefly, so I think we
          > can use a fairly large (32 KB?) buffer there. And there should not be
          > too many BufferedIndexOutputs alive at once so I think a large-ish
          > buffer (16 KB?) should be OK.

          I'm wondering how much performance benefits from increasing the buffer
          size beyond the file system's page size. Does it make a big difference
          if you use 8 KB, 16 KB or 32 KB? If the answer is yes, then I think
          the numbers you propose are good.

          Michael McCandless added a comment -

          > > I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
          > > at 1024, at least for now. During searching there can be quite a few
          > > of this class instantiated, and likely a larger buffer size for the
          > > freq/prox streams could actually hurt search performance for those
          > > searches that use skipping.
          >
          > I submitted a patch for LUCENE-430 which avoids copying the buffer when
          > a BufferedIndexInput is cloned. With this patch we could also add a
          > method setBufferSize(int) to BufferedIndexInput. This method has to
          > be called then right after the input has been created or cloned and
          > before the first read is performed (the first read operation allocates
          > the buffer). If called later it wouldn't have any effect. This would
          > allow us to adjust the buffer size dynamically, e.g. use large buffers
          > for segment merges and stored fields, but smaller ones for freq/prox
          > streams, maybe dependent on the document frequency.
          > What do you think?

          I like that idea!

          I am actually seeing that increased buffer sizes for
          BufferedIndexInput help performance of indexing as well (up to ~5%
          just changing this buffer), so I think we do want to increase this but
          only for merging.

          I wonder if we should just add a ctor to BufferedIndexInput that takes
          the bufferSize? This would avoid the surprising API caveat you
          describe above. The problem is, then all classes (SegmentTermDocs,
          SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
          would also have to have ctors to change buffer sizes. Even if we do
          setBufferSize instead of new ctor we have some cases (eg at least
          SegmentTermEnum) where bytes are read during construction so it's too
          late for caller to then change buffer size. Hmmm. Not clear how to
          do this cleanly...

          Maybe we do the setBufferSize approach, but, if the buffer already
          exists, rather than throwing an exception we check if the new size is
          greater than the old size and if so we grow the buffer? I can code this
          up.

          > > The CompoundFileWriter buffer is created only briefly, so I think we
          > > can use a fairly large (32 KB?) buffer there. And there should not be
          > > too many BufferedIndexOutputs alive at once so I think a large-ish
          > > buffer (16 KB?) should be OK.
          >
          > I'm wondering how much performance benefits from increasing the buffer
          > size beyond the file system's page size. Does it make a big difference
          > if you use 8 KB, 16 KB or 32 KB? If the answer is yes, then I think
          > the numbers you propose are good.

          I'm testing now different sizes of each of these three buffers
          ... will post the results.

          Michael Busch added a comment -

          > I wonder if we should just add a ctor to BufferedIndexInput that takes
          > the bufferSize? This would avoid the surprising API caveat you
          > describe above. The problem is, then all classes (SegmentTermDocs,
          > SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
          > would also have to have ctors to change buffer sizes. Even if we do
          > setBufferSize instead of new ctor we have some cases (eg at least
          > SegmentTermEnum) where bytes are read during construction so it's too
          > late for caller to then change buffer size. Hmmm. Not clear how to
          > do this cleanly...

          Yeah I was thinking about the ctor approach as well. Actually
          BufferedIndexInput does not have a public ctor so far, it's created by
          using Directory.openInput(String fileName). And to add a new ctor would
          mean an API change, so subclasses wouldn't compile anymore without
          changes.
          What we might want to do instead is to add a new method
          openInput(String fileName, int bufferSize) to Directory which calls
          the existing openInput(String fileName) by default, so subclasses of
          Directory would ignore the bufferSize parameter by default. Then we
          can change FSDirectory to override openInput(String, int):

          public IndexInput openInput(String name, int bufferSize)
              throws IOException {
            FSIndexInput input = new FSIndexInput(new File(directory, name));
            input.setBufferSize(bufferSize);
            return input;
          }

          This should solve the problems you mentioned, like in SegmentTermEnum,
          and we don't have to support setBufferSize() after a read has been
          performed. It also has the advantage that we save an instanceof check
          and a cast from IndexInput to BufferedIndexInput before setBufferSize()
          can be called.

          After a clone however, we would still have to cast to
          BufferedIndexInput before setBufferSize() can be called.
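          The proposal can be sketched with simplified stand-in classes (the
          `*Sketch` names below are hypothetical, not the real Lucene source):
          the base Directory gains an openInput(name, bufferSize) overload
          whose default implementation delegates to the existing
          openInput(name), so existing Directory subclasses keep compiling
          and silently ignore the hint; only FSDirectory overrides it to
          apply the buffer size before the first read.

```java
// Hypothetical simplified sketch of the Directory.openInput(String, int)
// proposal; real Lucene classes have more state and behavior.
import java.io.IOException;

abstract class IndexInputSketch {}

class BufferedIndexInputSketch extends IndexInputSketch {
    private int bufferSize = 1024;          // BUFFER_SIZE default
    void setBufferSize(int size) { bufferSize = size; }
    int getBufferSize() { return bufferSize; }
}

abstract class DirectorySketch {
    abstract IndexInputSketch openInput(String name) throws IOException;

    // New overload: the default implementation ignores the buffer-size
    // hint, so existing subclasses need no changes.
    IndexInputSketch openInput(String name, int bufferSize) throws IOException {
        return openInput(name);
    }
}

class FSDirectorySketch extends DirectorySketch {
    @Override
    IndexInputSketch openInput(String name) throws IOException {
        return new BufferedIndexInputSketch();  // stands in for FSIndexInput
    }

    @Override
    IndexInputSketch openInput(String name, int bufferSize) throws IOException {
        BufferedIndexInputSketch input = (BufferedIndexInputSketch) openInput(name);
        input.setBufferSize(bufferSize);        // before any read happens
        return input;
    }
}
```

          The design choice here is backward compatibility: adding an
          overload with a delegating default costs nothing for Directory
          implementations that don't care about buffer sizes.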

          Michael McCandless added a comment -

          > > I wonder if we should just add a ctor to BufferedIndexInput that takes
          > > the bufferSize? This would avoid the surprising API caveat you
          > > describe above. The problem is, then all classes (SegmentTermDocs,
          > > SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
          > > would also have to have ctors to change buffer sizes. Even if we do
          > > setBufferSize instead of new ctor we have some cases (eg at least
          > > SegmentTermEnum) where bytes are read during construction so it's too
          > > late for caller to then change buffer size. Hmmm. Not clear how to
          > > do this cleanly...
          >
          > Yeah I was thinking about the ctor approach as well. Actually
          > BufferedIndexInput does not have a public ctor so far, it's created by
          > using Directory.openInput(String fileName). And to add a new ctor would
          > mean an API change, so subclasses wouldn't compile anymore without
          > changes.

          Actually, it does have a default public constructor right? Ie if we add

          public BufferedIndexInput()
          public BufferedIndexInput(int bufferSize)

          then I think we don't break API backwards compatibility?

          > After a clone however, we would still have to cast to
          > BufferedIndexInput before setBufferSize() can be called.

          I plan to add "private int bufferSize" to BufferedIndexInput,
          defaulting to BUFFER_SIZE. I think then it would just work w/ your
          LUCENE-430 patch because your patch sets the clone's buffer to null
          and then when the clone allocates its buffer it will be length
          bufferSize. I think?
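          The clone interaction described above can be sketched as follows
          (a hypothetical simplified class, not the actual source): with the
          LUCENE-430 change a clone's buffer is set to null, so if bufferSize
          is a plain instance field, the clone naturally allocates a buffer
          of that length on its first read.

```java
// Hypothetical sketch: buffer is lazily allocated, and clone() drops the
// buffer reference (the LUCENE-430 behavior) instead of copying it.
class CloneableBufferedInput implements Cloneable {
    private byte[] buffer;           // allocated lazily on first read
    int bufferSize = 1024;           // defaults to BUFFER_SIZE

    byte readByte() {
        if (buffer == null) {
            buffer = new byte[bufferSize];   // first read allocates
        }
        return buffer[0];            // (real code tracks position/refills)
    }

    int allocatedLength() { return buffer == null ? 0 : buffer.length; }

    @Override
    public CloneableBufferedInput clone() {
        try {
            CloneableBufferedInput c = (CloneableBufferedInput) super.clone();
            c.buffer = null;         // LUCENE-430: don't copy the buffer
            return c;
        } catch (CloneNotSupportedException e) {
            throw new RuntimeException(e);   // cannot happen: we implement Cloneable
        }
    }
}
```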

          Michael Busch added a comment -

          > Actually, it does have a default public constructor right? Ie if we add

          > public BufferedIndexInput()
          > public BufferedIndexInput(int bufferSize)

          > then I think we don't break API backwards compatibility?

          Oops! Of course, you are right. What was I thinking...

          > I plan to add "private int bufferSize" to BufferedIndexInput,
          > defaulting to BUFFER_SIZE. I think then it would just work w/ your
          > LUCENE-430 patch because your patch sets the clone's buffer to null
          > and then when the clone allocates its buffer it will be length
          > bufferSize. I think?

          True. But it would be nice if it was possible to change the buffer size
          after a clone. For example in SegmentTermDocs we could then adjust the
          buffer size of the cloned freqStream according to the document frequency.
          And in my multi-level skipping patch (LUCENE-866) I could also benefit
          from this functionality.

          Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the
          same instance of SegmentTermDocs is used for different terms, then
          the same clone is used. So actually it would be nice if it were
          possible to change the buffer size after a read has been performed.

          > Maybe we do the setBufferSize approach, but, if the buffer already
          > exists, rather than throwing an exception we check if the new size is
          > greater than the old size and if so we grow the buffer? I can code this
          > up.

          So yes, I think we should implement it this way.

          Michael McCandless added a comment -

          > > I plan to add "private int bufferSize" to BufferedIndexInput,
          > > defaulting to BUFFER_SIZE. I think then it would just work w/ your
          > > LUCENE-430 patch because your patch sets the clone's buffer to null
          > > and then when the clone allocates its buffer it will be length
          > > bufferSize. I think?
          >
          > True. But it would be nice if it was possible to change the buffer size
          > after a clone. For example in SegmentTermDocs we could then adjust the
          > buffer size of the cloned freqStream according to the document frequency.
          > And in my multi-level skipping patch (LUCENE-866) I could also benefit
          > from this functionality.

          OK, I agree: let's add a BufferedIndexInput.setBufferSize() and also
          openInput(path, bufferSize) to Directory base class & to FSDirectory.

          > Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the
          > same instance of SegmentTermDocs is used for different terms, then
          > the same clone is used. So actually it would be nice if it were
          > possible to change the buffer size after a read has been performed.
          >
          > > Maybe we do the setBufferSize approach, but, if the buffer already
          > > exists, rather than throwing an exception we check if the new size is
          > > greater than the old size and if so we grow the buffer? I can code this
          > > up.
          >
          > So yes, I think we should implement it this way.

          OK I will do this. Actually, I think we should also allow making the
          buffer smaller this way. Meaning, I will preserve buffer contents
          (starting from bufferPosition) as much as is allowed by the smaller
          buffer. This way there is no restriction on using this method
          vs. having read bytes already ("principle of least surprise").
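          The behavior agreed on here can be sketched like this (a simplified,
          hypothetical class; field names mirror the discussion but the real
          implementation differs, e.g. it can re-read from the file anything
          that no longer fits, while this sketch just keeps what fits):

```java
// Hypothetical sketch: setBufferSize may be called at any time, growing
// or shrinking the buffer while preserving the unread contents (from
// bufferPosition onward) as far as the new size allows.
class ResizableBufferedInput {
    private byte[] buffer;           // allocated lazily on first read
    private int bufferSize = 1024;   // defaults to BUFFER_SIZE
    private int bufferLength = 0;    // valid bytes in buffer
    private int bufferPosition = 0;  // next byte to read

    void setBufferSize(int newSize) {
        bufferSize = newSize;
        if (buffer == null || newSize == buffer.length) {
            return;                  // nothing allocated yet, or same size
        }
        byte[] newBuffer = new byte[newSize];
        int leftover = bufferLength - bufferPosition;  // unread bytes
        int toCopy = Math.min(leftover, newSize);
        System.arraycopy(buffer, bufferPosition, newBuffer, 0, toCopy);
        buffer = newBuffer;
        bufferPosition = 0;
        bufferLength = toCopy;
    }

    // Minimal hooks so the sketch can be exercised.
    void fill(byte[] data) {
        buffer = data.clone();
        bufferLength = data.length;
        bufferPosition = 0;
    }

    byte readByte() { return buffer[bufferPosition++]; }

    int available() { return bufferLength - bufferPosition; }
}
```

          Because resizing never discards the caller's read position, the
          method is safe to call before or after reads, which is exactly the
          "principle of least surprise" behavior described above.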

          Michael Busch added a comment -

          > OK I will do this. Actually, I think we should also allow making the
          > buffer smaller this way. Meaning, I will preserve buffer contents
          > (starting from bufferPosition) as much as is allowed by the smaller
          > buffer. This way there is no restriction on using this method
          > vs. having read bytes already ("principle of least surprise").

          Yep sounds good. I can code this and commit it with LUCENE-430.
          If you want to do it, then I should commit the existing 430 patch
          soon, so that there are no conflicts in our patches?

          Michael McCandless added a comment -

          > > OK I will do this. Actually, I think we should also allow making the
          > > buffer smaller this way. Meaning, I will preserve buffer contents
          > > (starting from bufferPosition) as much as is allowed by the smaller
          > > buffer. This way there is no restriction on using this method
          > > vs. having read bytes already ("principle of least surprise").
          >
          > Yep sounds good. I can code this and commit it with LUCENE-430.
          > If you want to do it, then I should commit the existing 430 patch
          > soon, so that there are no conflicts in our patches?

          I'm actually coding it up now. Why don't you commit LUCENE-430
          soonish and then I'll update & merge?

          Marvin Humphrey added a comment -

          I would like to know why these gains are appearing, and how specific they are to a particular system. How can the optimum buffer size be deduced? Is it a factor of hard disk sector size? Memory page size? Lucene write behavior pattern? Level X Cache size?

          Michael Busch added a comment -

          > I'm actually coding it up now. Why don't you commit LUCENE-430
          > soonish and then I'll update & merge?

          Done.

          Michael McCandless added a comment -

          OK I ran two sets of tests. First is only on Mac OS X to see how
          performance changes with buffer sizes. Second was also on Debian
          Linux & Windows XP Pro.

          The performance gains are 10-18% faster overall.

          FIRST TEST

          I increased buffer sizes, separately, for each of BufferedIndexInput,
          BufferedIndexOutput and CompoundFileWriter. Each test is run once on
          Mac OS X:

          BufferedIndexInput

          1 K 622 sec (current trunk)
          4 K 607 sec
          8 K 606 sec
          16 K 598 sec
          32 K 606 sec
          64 K 589 sec
          128 K 601 sec

          CompoundFileWriter

          1 K 622 sec (current trunk)
          4 K 599 sec
          8 K 591 sec
          16 K 578 sec
          32 K 583 sec
          64 K 580 sec

          BufferedIndexOutput

          1 K 622 sec (current trunk)
          4 K 588 sec
          8 K 576 sec
          16 K 551 sec
          32 K 566 sec
          64 K 555 sec
          128 K 543 sec
          256 K 534 sec
          512 K 564 sec

          Comments:

          • The results are fairly noisy, but, performance does generally get
            better w/ larger buffers.
          • BufferedIndexOutput seems specifically to like very large output
            buffers; the other two seem to have less but still significant
            effect.

          Given this I picked 16 K buffer for BufferedIndexOutput, 16 K buffer
          for CompoundFileWriter and 4 K buffer for BufferedIndexInput. I think
          we would get faster performance for a larger buffer for
          BufferedIndexInput, but, even when merging there are quite a few of
          these created (mergeFactor * N where N = number of separate index
          files).

          Then, I re-tested the baseline (trunk) & these buffer sizes across
          platforms (below):

          SECOND TEST

          Baseline (trunk) = 1 K buffers for all 3. New = 16 K for
          BufferedIndexOutput, 16 K for CompoundFileWriter and 4 K for
          BufferedIndexInput.

          I ran each test 4 times & took the best time:

          Quad core Mac OS X on 4-drive RAID 0
          baseline 622 sec
          new 527 sec
          -> 15% faster

          Dual core Debian Linux (2.6.18 kernel) on 6-drive RAID 5
          baseline 708 sec
          new 635 sec
          -> 10% faster

          Windows XP Pro laptop, single drive
          baseline 1604 sec
          new 1308 sec
          -> 18% faster

          Net/net it's between 10% and 18% performance gain overall. It is
          interesting that the system with the "weakest" IO system (a single
          drive on Windows XP vs. RAID 0/5 on the others) shows the best gains.

          Michael McCandless added a comment -

          > I would like to know why these gains are appearing, and how specific
          > they are to a particular system. How can the optimum buffer size be
          > deduced? Is it a factor of hard disk sector size? Memory page size?
          > Lucene write behavior pattern? Level X Cache size?

          It looks like the gains are cross platform (at least between OS X,
          Linux, Windows XP) and cross-IO architecture.

          I'm not sure how this depends/correlates to the various cache/page
          sizes through the layers of OS -> disk heads.

          It must be that doing an IO request has a fairly high overhead and so
          the more bytes you can read/write at once the faster it is, since you
          amortize that overhead.
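
          The amortization argument can be sketched as a toy cost model:
          total time ≈ (number of calls × per-call overhead) + (bytes ÷
          bandwidth). The overhead and bandwidth constants below are
          assumptions for illustration only, not measurements from these tests:

```java
// Toy cost model for the amortization argument above. Per-call overhead
// and bandwidth are illustrative assumptions, not measured values.
public class IoCostModel {
    static double seconds(long totalBytes, int bufferSize,
                          double perCallOverheadSec, double bytesPerSec) {
        // one IO call per buffer-sized chunk, rounded up
        long calls = (totalBytes + bufferSize - 1) / bufferSize;
        return calls * perCallOverheadSec + totalBytes / bytesPerSec;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        // assume 5 microseconds of overhead per call, 100 MB/s throughput
        double at1K = seconds(oneGb, 1024, 5e-6, 100e6);
        double at16K = seconds(oneGb, 16 * 1024, 5e-6, 100e6);
        System.out.printf("1K: %.1fs  16K: %.1fs%n", at1K, at16K);
    }
}
```

          Under these assumed constants the per-call overhead dominates at 1 K
          and becomes nearly negligible at 16 K, matching the shape of the
          measured curves above.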

          For merging in particular, with mergeFactor=10, I can see that a
          larger buffer size on the input streams should help reduce insane
          seeks back & forth between the 10 files (and the 1 output file).
          Maybe larger reads on the input streams also cause the OS's IO
          scheduler to do larger read-ahead in anticipation?

          And some good news: these gains seem to be additive to the gains in
          LUCENE-843, at least with my initial testing.

          John Haxby added a comment -

          > Net/net it's between 10-18% performance gain overall. It is
          > interesting that the system with the "weakest" IO system (one drive on
          > Windows XP vs RAID 0/5 on the others) has the best gains.

          Actually, it's not that surprising. Linux and BSD (MacOS) kernels work hard to do good I/O without the user having to do that much to take it into account. The improvement you're seeing in those systems is as much to do with the fact that you're dealing with complete file system block sizes (4x4k) and complete VM page sizes (4x4k). You'd probably see similar gains just going from 1k to 4k though: even "cp" benefits from using a 4k block size rather than 1k. I'd guess that a 4k or 8k buffer would be best on Linux/MacOS and that you wouldn't see much difference going to 16k. In fact, in the MacOS tests the big jump seems to be from 1k to 4k with smaller improvements thereafter.

          I'm not that surprised by the WinXP changes: the I/O subsystem on a laptop is usually dire and anything that will cut down on the I/O is going to be a big help. I would expect that the difference would be more dramatic with a FAT32 file system than it would be with NTFS though.

          Michael McCandless added a comment -

          Attached the patch based on above discussion.

          Michael Busch added a comment -

          Mike,

          I tested and reviewed your patch. It looks good and all tests pass! Two comments:

          • Should we increase the buffer size for CompoundFileReader to 4KB not only for the merge mode but also for the normal read mode?
          • In BufferedIndexInput.setBufferSize() a new buffer should only be allocated if the new size is different from the previous buffer size.
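
          The second suggestion can be sketched as follows. The field and
          method names are illustrative; this is not the actual
          BufferedIndexInput source:

```java
// Minimal sketch of the review suggestion above: setBufferSize should
// only allocate a new array when the size actually changes. Names are
// illustrative, not the real Lucene class.
public class ResizableBuffer {
    private byte[] buffer = new byte[1024];

    public void setBufferSize(int newSize) {
        if (buffer == null || newSize != buffer.length) {
            buffer = new byte[newSize]; // allocate only on a real change
        }
    }

    public int size() {
        return buffer.length;
    }
}
```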
          Michael McCandless added a comment -

          > I tested and reviewed your patch. It looks good and all tests pass!

          Thanks!

          > - Should we increase the buffer size for CompoundFileReader to 4KB
          > not only for the merge mode but also for the normal read mode?

          I'm a little nervous about that: I don't know the impact it will have
          on searching, especially queries that heavily use skipping?

          Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when
          it does a read – its own (since each CSIndexInput subclasses from
          BufferedIndexInput) and then the main stream of the
          CompoundFileReader. It seems like we shouldn't do this? We should
          not do a double copy.

          It almost seems like the double copy would not occur because
          readBytes() has logic to read directly from the underlying stream if
          the size is >= bufferSize. However, I see at least one case during
          merging where this logic doesn't kick in (and we do a double buffer
          copy) because the buffers become "skewed". I will open a separate
          issue for this; I think we should fix it and gain some more
          performance.
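
          The bypass logic being described could look roughly like this over an
          in-memory source. This is a simplified illustration only, not the
          actual BufferedIndexInput/CSIndexInput code:

```java
// Simplified sketch of the readBytes bypass discussed above: a read of
// at least bufferSize bytes goes straight to the underlying source,
// skipping the internal buffer copy. The class and fields are
// illustrative, not Lucene's implementation.
public class BypassingReader {
    private final byte[] source;          // stands in for the file
    private int filePos = 0;
    private final int bufferSize = 8;
    private final byte[] buffer = new byte[bufferSize];
    private int bufferPos = 0, bufferLen = 0;
    public int directReads = 0;           // counts bypassed reads

    public BypassingReader(byte[] source) {
        this.source = source;
    }

    public void readBytes(byte[] b, int off, int len) {
        int avail = bufferLen - bufferPos;
        if (len <= avail) {               // fully served from buffer
            System.arraycopy(buffer, bufferPos, b, off, len);
            bufferPos += len;
        } else if (len >= bufferSize) {   // big read: bypass the buffer
            System.arraycopy(buffer, bufferPos, b, off, avail);
            System.arraycopy(source, filePos, b, off + avail, len - avail);
            filePos += len - avail;
            bufferPos = bufferLen = 0;
            directReads++;
        } else {                          // small read: refill buffer first
            System.arraycopy(buffer, bufferPos, b, off, avail);
            bufferLen = Math.min(bufferSize, source.length - filePos);
            System.arraycopy(source, filePos, buffer, 0, bufferLen);
            filePos += bufferLen;
            System.arraycopy(buffer, 0, b, off + avail, len - avail);
            bufferPos = len - avail;
        }
    }
}
```

          The "skew" problem arises when a large read starts partway through
          the buffer: part of it is still served by a buffer copy before the
          direct read kicks in.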

          > In BufferedIndexInput.setBufferSize() a new buffer should only be
          > allocated if the new size is different from the previous buffer
          > size.

          Ahh, good. Will do.

          Michael Busch added a comment -

          > I'm a little nervous about that: I don't know the impact it will have
          > on searching, especially queries that heavily use skipping?

          Doesn't the OS always read at least a whole block from the disk (usually
          4 KB)? If the answer is yes, then we don't save IO by limiting the buffer
          size to 1 KB, right? Of course it makes sense to limit the size for
          streams that we clone often (like the freq stream) to save memory and
          array copies. But we never clone the base stream in the cfs reader.

          > Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when
          > it does a read – its own (since each CSIndexInput subclasses from
          > BufferedIndexInput) and then the main stream of the
          > CompoundFileReader. It seems like we shouldn't do this? We should
          > not do a double copy.
          >
          > It almost seems like the double copy would not occur because
          > readBytes() has logic to read directly from the underlying stream if
          > the size is >= bufferSize. However, I see at least one case during
          > merging where this logic doesn't kick in (and we do a double buffer
          > copy) because the buffers become "skewed". I will open a separate
          > issue for this; I think we should fix it and gain some more
          > performance.

          Good catch! Reminds me a bit of LUCENE-431 where we also did double
          buffering for the RAMInputStream and RAMOutputStream. Yes, I think
          we should fix this.

          Michael McCandless added a comment -

          > > I'm a little nervous about that: I don't know the impact it will have
          > > on searching, especially queries that heavily use skipping?
          >
          > Doesn't the OS always read at least a whole block from the disk
          > (usually 4 KB)? If the answer is yes, then we don't save IO by
          > limiting the buffer size to 1 KB, right? Of course it makes sense to
          > limit the size for streams that we clone often (like the freq
          > stream) to save memory and array copies. But we never clone the base
          > stream in the cfs reader.

          Yes, I think you're right. But we should test search
          performance on a large index & "typical" queries to be sure? I will
          open a new issue to track this...

          Doug Cutting added a comment -

          > then we don't save IO by limiting the buffer size to 1 KB

          I'm confused by this. My assumption is that, when you make a request to read 1k from a disk file, that the OS reads substantially more than 1k from the disk and places it in the buffer cache. (The cost of randomly reading 1k is nearly the same as randomly reading 100k--both are dominated by seek.) So, if you make another request to read 1k shortly thereafter you'll get it from the buffer cache and the incremental cost will be that of making a system call.

          In general, one should thus rely on the buffer cache and read-ahead, and make input buffers only big enough so that system call overhead is insignificant. An alternate strategy is to not trust the buffer cache and read-ahead, but rather to make your buffers large enough so that transfer time dominates seeks. This can require 1MB or larger buffers, so isn't always practical.
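
          As a worked example of the "transfer time dominates seeks" threshold:
          the break-even buffer size is roughly seekTime × bandwidth. The 10 ms
          seek and 100 MB/s transfer rate below are assumed round numbers, not
          figures from this issue:

```java
// Worked example of the threshold above: the buffer size at which
// transfer time equals seek time is seekTime * bandwidth. The constants
// (10 ms seek, 100 MB/s transfer) are illustrative assumptions.
public class SeekDominance {
    static long breakEvenBytes(double seekSec, double bytesPerSec) {
        return (long) (seekSec * bytesPerSec);
    }

    public static void main(String[] args) {
        // comes out near 1 MB, consistent with the "1MB or larger" figure
        System.out.println(breakEvenBytes(0.010, 100e6) + " bytes");
    }
}
```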

          So, back to your statement, a 1k buffer doesn't save physical i/o, but nor should it incur extra physical i/o. It does incur extra system calls, but uses less memory, which is a tradeoff. Is that what you meant?

          robert engels added a comment -

          I think the important consideration is how expensive is the system call. Since the system call requires JNI, it MIGHT be expensive.

          Another important consideration is buffer utilization. It is my understanding that the OS will normally perform read-ahead only for sequential access, apart from the additional bytes read to optimize the physical read. If Lucene performs indexed reads but the data is actually being accessed sequentially, Lucene managing its own buffers can be far more effective.

          Along these lines, if the server is heavily used for a variety of applications, Lucene can manage its own buffers more efficiently - similar to how a database almost always (every commercial one I know of) has its own buffer cache and does not rely on the OS. In a GC environment, though, it may be even more important if the buffers are managed/reused from a pool, as you avoid the GC overhead.

          Just my thoughts.

          Michael Busch added a comment -

          > So, back to your statement, a 1k buffer doesn't save
          > physical i/o, but nor should it incur extra physical i/o.

          Yes, I agree.

          > It does incur extra system calls, but uses less memory,
          > which is a tradeoff.

          A coworker told me that he ran some experiments with buffer
          sizes in a different application, and he found out that
          increasing the buffer size beyond the disk's block size
          further sped up things. In these tests he read a whole
          file sequentially, which means that the speedup came from
          saving system calls, right?

          So yes, it is a tradeoff between memory usage and amount
          of system calls. That's the reason for my suggestion to
          only increase the buffer size for input streams that we
          don't clone, like the base stream in the compound reader.

          But I'm just sort of guessing here, I think we should run
          some performance experiments. Mike already opened
          LUCENE-893 for that.

          Michael Busch added a comment -

          Mike,

          another thing I just noticed is your new testcase doesn't remove
          the directory "testSetBufferSize" at the end.

          Eks Dev added a comment -

          Some time ago we ran a few tests and simply concluded that it boils down to what Doug said: "It does incur extra system calls, but uses less memory, which is a tradeoff."

          In our setup 4k was a kind of magic number, ca. 5-8% faster. I guess it is actually a compromise between time spent in extra OS calls vs. the probability of reading more than you will really use (wasting time on it). Why 4k often happens to be a good compromise is probably that the difference in speed of a buffer copy for 4k vs. 1k is negligible compared to time spent on system calls.

          The only conclusion we came to is that you have to measure it and find a good compromise. Our case is a bit atypical: short documents (1 GB index, 60 million docs) and queries with a lot of terms (80-200), on Windows 2003 Server with a single disk.

          And I do not remember whether it was before or after we started using MMAP, so no real idea whether this affects an MMAP setup; I guess not.

          Michael McCandless added a comment -

          > another thing I just noticed is your new testcase doesn't remove the
          > directory "testSetBufferSize" at the end.

          Woops, I will fix. I will also fix it to appear under the tempDir.

          Michael McCandless added a comment -

          > Some time ago we ran a few tests and simply concluded that it boils down to what Doug said: "It does incur extra system calls, but uses less memory, which is a tradeoff."
          >
          > In our setup 4k was a kind of magic number, ca. 5-8% faster. I guess it is actually a compromise between time spent in extra OS calls vs. the probability of reading more than you will really use (wasting time on it). Why 4k often happens to be a good compromise is probably that the difference in speed of a buffer copy for 4k vs. 1k is negligible compared to time spent on system calls.
          >
          > The only conclusion we came to is that you have to measure it and find a good compromise. Our case is a bit atypical: short documents (1 GB index, 60 million docs) and queries with a lot of terms (80-200), on Windows 2003 Server with a single disk.
          >
          > And I do not remember whether it was before or after we started using MMAP, so no real idea whether this affects an MMAP setup; I guess not.

          Interesting! Do you remember if your 5-8% gain was for searching or
          indexing? This issue is focusing on buffer size impact on indexing
          performance and LUCENE-893 is focusing on search performance.

          Michael McCandless added a comment -

          New patch with these changes:

          • The unit test now creates its test index under "tempDir" & removes it when done
          • Don't allocate a new buffer in setBufferSize if the newSize == the current size
          Michael Busch added a comment -

          Take2 looks good. +1 for committing.

          Marvin Humphrey added a comment -

          I have some auxiliary data points to report after experimenting with buffer
          size in KS today on three different systems: OS X 10.4.9, FreeBSD 5.3, and an
          old RedHat 9 box.

          The FS i/o classes in KinoSearch use a FILE* and
          fopen/fwrite/fread/fseek/ftell, rather than file descriptors and the POSIX
          family of functions. Theoretically, this is wasteful because FILE* stream i/o
          is buffered, so there's double buffering happening. I've meant to change that
          for some time. However, when I've used setvbuf(self->fhandle, NULL, _IONBF)
          to eliminate the buffer for the underlying FILE* object, performance tanks –
          indexing time doubles. I still don't understand exactly why, but I know a
          little more now.

          • Swapping out the FILE* for a descriptor and switching all the I/O calls to
            POSIX variants has no measurable impact on any of these systems.
          • Changing the KS buffer size from 1024 to 4096 has no measurable impact on
            any of these systems.
          • Using setvbuf to eliminate the buffering at output turns out to have no
            impact on indexing performance. It's only killing off the read mode FILE*
            buffer that causes the problem.

          So, it seems that the only change I can make moves the numbers in the wrong
          direction.

          The results are somewhat puzzling because I would ordinarily have blamed
          sub-optimal flush/refill scheduling in my app for the degraded performance
          with setvbuf() on read mode. However, the POSIX i/o calls are unbuffered, so
          that's not it. My best guess is that disabling buffering for read mode
          disables an fseek/ftell optimization.

          Michael McCandless added a comment -

          Marvin, it's odd that you see no gains when coming straight from C.
          Hmmm.

          I wonder if IO calls from Java somehow have a sizable overhead that
          the corresponding C calls do not. I didn't think JNI was that
          expensive? Though, it looks like reading into byte[] does entail
          extra byte copying. We could explore using ByteBuffers from
          java.nio.* ... ahh, this has been somewhat explored already, at least
          in LUCENE-414 and LUCENE-893.

          Also, how much "merging" is actually done in your test / KS? How many
          segments are merged at once? The fact that KS doesn't re-merge the
          stored fields & term vectors (within one session) is probably also a
          big difference here.

          Marvin Humphrey added a comment -

          > Also, how much "merging" is actually done in your test / KS? How many
          > segments are merged at once? The fact that KS doesn't re-merge the stored
          > fields & term vectors (within one session) is probably also a big difference
          > here.

          In the previous test, I was indexing 1000 Reuters articles all in one session.
          The postings were never flushed to disk prior to the final write, as the
          external sorter never exceeded its memory threshold.

          I reran the test on the FreeBSD box, changing it so that the index was built
          up in 10-doc increments. There was still no measurable change for changing
          either the KS buffer size or POSIX/stream style IO. However, using setvbuf on
          the read mode FILE* stream i/o was utterly disastrous. After several minutes
          (compare with 3.8 seconds) I cut it off. Subsequent testing with a 500-doc
          increment verified that the test actually was working (10.3 secs).

          Michael McCandless added a comment -

          I re-ran the "second test" above, but this time with compound file
          turned off.

          Baseline (trunk) = 1 K buffers for all 3. New = 16 K for
          BufferedIndexOutput, 16 K for CompoundFileWriter and 4 K for
          BufferedIndexInput. I ran each test 4 times & took the best time.

          Quad core Mac OS X on 4-drive RAID 0
          baseline 553 sec
          new 499 sec
          -> 10% faster

          Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
          baseline 590 sec
          new 548 sec
          -> 7% faster

          Windows XP Pro laptop, single drive
          baseline 1078 sec
          new 918 sec
          -> 15% faster

          Quick observations:

          • Increasing the buffer sizes still yields a healthy 7-15%
            overall gain, even with compound file turned off.
          • The overall performance gain on the trunk just by turning off
            compound file ranges from 7-33% (33% gain = the Windows XP
            Laptop).

          OK I plan to commit this soon.


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 2
            • Watchers: 0