Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-767

maxDoc should be explicitly stored in the index, not derived from file length

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9, 2.0.0, 2.1
    • Fix Version/s: 2.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a spinoff of LUCENE-140

      In general we should rely on "as little as possible" from the file system. Right now, maxDoc is derived by checking the file length of the FieldsReader index file (.fdx) which makes me nervous. I think we should explicitly store it instead.

      Note that there are no known cases where this is actually causing a problem. There was some speculation in the discussion of LUCENE-140 that it could be one of the possible, but in digging / discussion there were no specifically relevant JVM bugs found (yet!). So this would be a defensive fix at this point.

        Activity

        Hide
        manawiz Chuck Williams added a comment -

        Isn't maxDoc always the same as the docCount of the segment, which is stored? I.e., couldn't SegmentReader.maxDoc() be equivalently defined as:

        public int maxDoc()

        { return si.docCount; }

        Since maxDoc==numDocs==docCount for a newly merged segment, and deletion with a reader never changes numDocs or maxDoc, it seems to me these values should always be the same.

        All Lucene tests pass with this definition. I have code that relies on this equivalence and so would appreciate knowledge of any case where this equivalence might not hold.

        Show
        manawiz Chuck Williams added a comment - Isn't maxDoc always the same as the docCount of the segment, which is stored? I.e., couldn't SegmentReader.maxDoc() be equivalently defined as: public int maxDoc() { return si.docCount; } Since maxDoc==numDocs==docCount for a newly merged segment, and deletion with a reader never changes numDocs or maxDoc, it seems to me these values should always be the same. All Lucene tests pass with this definition. I have code that relies on this equivalence and so would appreciate knowledge of any case where this equivalence might not hold.
        Hide
        mikemccand Michael McCandless added a comment -

        Ooh that's great! I think your logic is correct.

        But I do see one unit test failing when I make that change locally (testIndexAndMerge in src/test/org/apache/lucene/index/TestDoc.java). Actually, this unit test only fails with my last commit (yesterday) for LUCENE-140 , because I made the checking for "docs out of order" more strict (catch a previously missing boundary case), and this test seems to hit that boundary case.

        However, that test is buggy because it manually creates SegmentInfos with an incorrect docCount. So I will fix the test, and commit your solution above. Thanks!

        Show
        mikemccand Michael McCandless added a comment - Ooh that's great! I think your logic is correct. But I do see one unit test failing when I make that change locally (testIndexAndMerge in src/test/org/apache/lucene/index/TestDoc.java). Actually, this unit test only fails with my last commit (yesterday) for LUCENE-140 , because I made the checking for "docs out of order" more strict (catch a previously missing boundary case), and this test seems to hit that boundary case. However, that test is buggy because it manually creates SegmentInfos with an incorrect docCount. So I will fix the test, and commit your solution above. Thanks!
        Hide
        mikemccand Michael McCandless added a comment -

        Carrying over from the java-dev list:

        Grant Ingersoll wrote:

        > Can you explain in more detail on this bug why this makes you nervous?

        Well ... the only specific example I have is NFS (always my favorite
        example!).

        As I understand it, the NFS client typically uses a separate cache to
        hold the "attributes" of the file, including file length. This cache
        often has weaker or maybe just "different" guarantees than the "data
        cache" that holds the file contents. So basically you can ask what
        the file length is and get a wrong (stale) answer. EG see
        http://nfs.sourceforge.net, which describes Linux's NFS client
        approach. The NFS client on Apple's OS X seems to be even worse!

        I think very likely Lucene may not trip up on this specifically since
        a reader would only ask for this file's length for the first time once
        the file is done being written (ie the commit of segments_N has
        occurred) and so hopefully it's not in the attribute cache yet?

        I think there may very well be cases of other filesystems where
        "checking file length" is risky (that we all just don't know about
        (yet!)), which is why I favor using explicit values instead of relying
        on file system semantics, whenever possible.

        Maybe I'm just too paranoid

        But for all the places / devices Lucene has gone and will go, relying
        on the bare minimum set of IO operations I think will maximize our
        overall portability. Every filesystem has its quirks.

        Show
        mikemccand Michael McCandless added a comment - Carrying over from the java-dev list: Grant Ingersoll wrote: > Can you explain in more detail on this bug why this makes you nervous? Well ... the only specific example I have is NFS (always my favorite example!). As I understand it, the NFS client typically uses a separate cache to hold the "attributes" of the file, including file length. This cache often has weaker or maybe just "different" guarantees than the "data cache" that holds the file contents. So basically you can ask what the file length is and get a wrong (stale) answer. EG see http://nfs.sourceforge.net , which describes Linux's NFS client approach. The NFS client on Apple's OS X seems to be even worse! I think very likely Lucene may not trip up on this specifically since a reader would only ask for this file's length for the first time once the file is done being written (ie the commit of segments_N has occurred) and so hopefully it's not in the attribute cache yet? I think there may very well be cases of other filesystems where "checking file length" is risky (that we all just don't know about (yet!)), which is why I favor using explicit values instead of relying on file system semantics, whenever possible. Maybe I'm just too paranoid But for all the places / devices Lucene has gone and will go, relying on the bare minimum set of IO operations I think will maximize our overall portability. Every filesystem has its quirks.
        Hide
        mikemccand Michael McCandless added a comment -

        Closing all issues that were resolved for 2.1.

        Show
        mikemccand Michael McCandless added a comment - Closing all issues that were resolved for 2.1.

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development