Lucene - Core / LUCENE-705

CompoundFileWriter should pre-set its file length

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've read that if you are writing a large file, it's best to pre-set
      the size of the file in advance before you write all of its contents.
      This in general minimizes fragmentation and improves IO performance
      against the file in the future.

      I think this makes sense (intuitively) but I haven't done any real
      performance testing to verify.

      Java has the java.io.RandomAccessFile.setLength() method (since 1.2) for this.

      We can easily fix CompoundFileWriter to call setLength() on the file
      it's writing (and add setLength() method to IndexOutput). The
      CompoundFileWriter knows exactly how large its file will be.
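
A minimal sketch of the idea, not the attached patch: it uses a plain java.io.RandomAccessFile rather than Lucene's IndexOutput, and the class and method names (PreSizedWriter, writePreSized) are hypothetical. The total size is computed first, the file length is set once, and the parts are then written back to back.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class PreSizedWriter {
    /**
     * Writes all parts into target after first setting the file length
     * to the known total. Returns the final length of the file.
     */
    public static long writePreSized(File target, byte[][] parts) throws IOException {
        // The writer knows the total size up front: the sum of all parts.
        long total = 0;
        for (byte[] part : parts) {
            total += part.length;
        }
        try (RandomAccessFile raf = new RandomAccessFile(target, "rw")) {
            // Pre-set the length before any data is written.
            raf.setLength(total);
            // setLength does not guarantee where the file pointer is, so rewind.
            raf.seek(0);
            for (byte[] part : parts) {
                raf.write(part);
            }
            return raf.length();
        }
    }
}
```

Whether setLength() actually reserves disk blocks, or only records the new length, depends on the platform and filesystem, so any fragmentation or early-failure benefit should be measured rather than assumed.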

      Another good thing is: if you are going to run out of disk space, then,
      the setLength call should fail up front instead of failing when the
      compound file is actually written. This has two benefits: first, you
      find out sooner that you will run out of disk space, and, second, you
      don't fill up the disk down to 0 bytes left (always a frustrating
      experience!). Instead you leave what space was available
      and throw an IOException.

      My one hesitation here is: what if there is a filesystem out there
      that can't handle this call, and it throws an IOException on that
      platform? But this is balanced against a possible easy-win
      improvement in performance.
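
One possible way to handle the hesitation above is to treat the call as best effort. This is an assumption for illustration, not what the issue or its patch specifies, and the names BestEffortPreSize and tryPreSize are hypothetical. Note the trade-off: swallowing the IOException also gives up the early out-of-disk-space failure described earlier, since the two cases cannot be told apart here.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class BestEffortPreSize {
    /**
     * Tries to pre-set the file length. Returns true if the call succeeded,
     * false if the platform or filesystem rejected it with an IOException.
     */
    public static boolean tryPreSize(RandomAccessFile raf, long length) {
        try {
            raf.setLength(length);
            return true;
        } catch (IOException e) {
            // Assumption: the caller carries on and lets the file grow as it is written.
            return false;
        }
    }
}
```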

      Does anyone have any feedback / thoughts / experience relevant to
      this?

      1. LUCENE-705.patch
        3 kB
        Michael McCandless

        Activity

        gsingers Grant Ingersoll added a comment -

        This seems reasonable, although I am not an expert in low-level file system calls like this. I guess for me the thing would be to find out if the major filesystems support it (Windows, OSX, Linux) and then perhaps we can deal w/ others from there as they arise (i.e. those that don't support it)

        mikemccand Michael McCandless added a comment -

        OK I'll test on the major platforms, and take that approach. I'll tentatively target 2.4.

        mikemccand Michael McCandless added a comment -

        Attached patch. All tests pass on OS X 10.4, Linux 2.6.22, Windows
        Server 2003. I plan to commit in a day or two.

        mikemccand Michael McCandless added a comment -

        I just committed this.

        otis Otis Gospodnetic added a comment -

        Didn't find time to comment on this earlier.
        Does this mean one will no longer be able to tell exactly how large the index really is (because some portion of some data files will actually be empty)?

        mikemccand Michael McCandless added a comment -

        > Does this mean one will no longer be able to tell exactly how large the index really is (because some portion of some data files will actually be empty)?

        Only while the CFS is being built. Once it has been built, it is
        "fully occupied" (no portion is empty).
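
The point about size reporting can be illustrated with a small sketch (plain java.io.RandomAccessFile, hypothetical names PartialBuildDemo and lengthVersusWritten, not Lucene code): while the file is still being filled, its reported length is already the final length, even though fewer bytes have been written.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class PartialBuildDemo {
    /**
     * Pre-sets the file to finalLength, writes only firstChunk, and returns
     * { reported file length, bytes actually written so far }.
     */
    public static long[] lengthVersusWritten(File target, long finalLength, byte[] firstChunk)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(target, "rw")) {
            raf.setLength(finalLength);
            raf.seek(0);
            raf.write(firstChunk);
            // The file pointer tracks how much has been written from offset 0.
            return new long[] { raf.length(), raf.getFilePointer() };
        }
    }
}
```

Once every byte up to finalLength has been written, the two numbers agree, which matches the "fully occupied" state described above.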


          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            1 Vote for this issue