Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-892

CompoundFileReader's openInput produces streams that may do an extra buffer copy

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Spinoff of LUCENE-888.

      The class for reading from a compound file (CompoundFileReader) has a
      primary stream which is a BufferedIndexInput when that stream is from
      an FSDirectory (which is the norm). That is one layer of buffering.

      Then, when its openInput is called, a CSIndexInput is created which
      also subclasses from BufferedIndexInput. That's a second layer of
      buffering.

      When a consumer actually uses that CSIndexInput to read, and a call to
      readByte or readBytes runs out of what's in the first buffer, it will
      go to refill its buffer. But that refill calls the first
      BufferedIndexInput which in turn may refill its buffer (a double
      copy) by reading the underlying stream.

      Not sure how to fix it yet but we should change things to not do the
      extra buffer copy.

      1. LUCENE-892.patch
        4 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        It looks like the double copying only happens in certain limited cases
        with Lucene now.

        During indexing, when doing a segment merge of multiple CFS files,
        when the CSIndexInput reaches the end of its sub-file then that final
        buffer fill will be a double copy.

        If the flushing is small (like the default every 10 docs, with small
        docs) then often files are so small that they are less than a buffer
        size and all reads are double-copy.

        During searching the same situation occurs (tail end of file that is
        less than buffer size).

        I had thought I saw a case before where the two buffers get "out of
        sync" such that every buffer refill in CSIndexInput results in a
        double copy even when you are not at the end of the file, but I can't
        reproduce that case now. This would have been much more serious.

        So we are sort of getting "lucky" right now, but, if Lucene changes
        how it uses CFS, it could get the buffers to be "mis-aligned" at which
        point every buffer fill would be a double copy. So I think we should
        still fix it.

        The solution I'm leaning towards is to add a new readBytes() method in
        IndexInput that takes 4th argument "boolean useBuffer". It would have
        a default implementation that just ignores that parameter (and calls
        the current readBytes). BufferedIndexInput would implement that
        method and not refill its buffer if that parameter is false, and
        CSIndexInput would pass false when it calls base.readBytes in its
        readInternal method.

        Show
        mikemccand Michael McCandless added a comment - It looks like the double copying only happens in certain limited cases with Lucene now. During indexing, when doing a segment merge of multiple CFS files, when the CSIndexInput reaches the end of its sub-file then that final buffer fill will be a double copy. If the flushing is small (like the default every 10 docs, with small docs) then often files are so small that they are less than a buffer size and all reads are double-copy. During searching the same situation occurs (tail end of file that is less than buffer size). I had thought I saw a case before where the two buffers get "out of sync" such that every buffer refill in CSIndexInput results in a double copy even when you are not at the end of the file, but I can't reproduce that case now. This would have been much more serious. So we are sort of getting "lucky" right now, but, if Lucene changes how it uses CFS, it could get the buffers to be "mis-aligned" at which point every buffer fill would be a double copy. So I think we should still fix it. The solution I'm leaning towards is to add a new readBytes() method in IndexInput that takes 4th argument "boolean useBuffer". It would have a default implementation that just ignores that parameter (and calls the current readBytes). BufferedIndexInput would implement that method and not refill its buffer if that parameter is false, and CSIndexInput would pass false when it calls base.readBytes in its readInternal method.
        Hide
        mikemccand Michael McCandless added a comment -

        Attached patch that implements the design above (make an
        IndexInput.readBytes that let's you specify whether or not a buffer
        should be used). All tests pass.

        Show
        mikemccand Michael McCandless added a comment - Attached patch that implements the design above (make an IndexInput.readBytes that let's you specify whether or not a buffer should be used). All tests pass.

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development