Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      Non-compound indexes are ~10% faster at indexing, and perform 50% IO activity comparing to compound indexes. But their file descriptors foot print is much higher.

      By maintaining all field norms in a single .nrm file, we can bound the number of files used by non compound indexes, and possibly allow more applications to use this format.

      More details on the motivation for this in: http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html (in particular http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).

      1. index.premergednorms.cfs.zip
        2 kB
        Michael McCandless
      2. index.premergednorms.nocfs.zip
        5 kB
        Michael McCandless
      3. LUCENE-756-Jan16.patch
        16 kB
        Michael McCandless
      4. LUCENE-756-Jan16.Take2.patch
        18 kB
        Michael McCandless
      5. nrm.patch.2.txt
        16 kB
        Doron Cohen
      6. nrm.patch.3.txt
        23 kB
        Doron Cohen
      7. nrm.patch.txt
        12 kB
        Doron Cohen

        Activity

        Hide
        Doron Cohen added a comment -

        Attached patch - nrm.patch.txt - modifies field norms maintenance to a single .nrm file.

        Modification is backwards compatible - existing indexes with norms in a file per norm are read. - the first merge would create a single .nrm file.

        All tests pass.

        No performance degtadations were observed as result of this change, but my tests so far were not very extensive.

        Show
        Doron Cohen added a comment - Attached patch - nrm.patch.txt - modifies field norms maintenance to a single .nrm file. Modification is backwards compatible - existing indexes with norms in a file per norm are read. - the first merge would create a single .nrm file. All tests pass. No performance degtadations were observed as result of this change, but my tests so far were not very extensive.
        Hide
        Doron Cohen added a comment -

        Replacing the patch file (prev file was garbage - "svn stat" instead of "svn diff").

        Few words on how this patch works:

        • <segment>.nrm file was added.
        • addDocument (DocumentWriter) still writes each norm to a separate file - but that's in memory,
        • at merge, all norms are written to a single file.
        • CFS now also maintains all norms in a single file.
        • IndexWriter merge-decision now considers hasSeparateNorms() not only for CFS but also for non compound.
        • SegmentReader.openNorms() still creates ready-to-use/load Norm objects (which would read the norms only when needed). But the Norm object is now assigned a normSeek value, which is nonzero if the norm file is <segment>.nrm.
        • existing indexes, prior to this change, are managed the same way that segments resulted of addDocument are managed.

        Tests:

        • I verified that also the (contrib) tests for FieldNormModifier and LengthNormModofier are working.

        Remaining:

        • I might add a test.
        • more benchmarking?
        • update fileFormat document.
        Show
        Doron Cohen added a comment - Replacing the patch file (prev file was garbage - "svn stat" instead of "svn diff"). Few words on how this patch works: <segment>.nrm file was added. addDocument (DocumentWriter) still writes each norm to a separate file - but that's in memory, at merge, all norms are written to a single file. CFS now also maintains all norms in a single file. IndexWriter merge-decision now considers hasSeparateNorms() not only for CFS but also for non compound. SegmentReader.openNorms() still creates ready-to-use/load Norm objects (which would read the norms only when needed). But the Norm object is now assigned a normSeek value, which is nonzero if the norm file is <segment>.nrm. existing indexes, prior to this change, are managed the same way that segments resulted of addDocument are managed. Tests: I verified that also the (contrib) tests for FieldNormModifier and LengthNormModofier are working. Remaining: I might add a test. more benchmarking? update fileFormat document.
        Hide
        Yonik Seeley added a comment -

        Seems like a good idea... given that norms are read once on-demand, I wouldn't expect anything search related to be slower with this. Opening a new reader should actually be slightly faster due to fewer files to open.

        Show
        Yonik Seeley added a comment - Seems like a good idea... given that norms are read once on-demand, I wouldn't expect anything search related to be slower with this. Opening a new reader should actually be slightly faster due to fewer files to open.
        Hide
        Yonik Seeley added a comment -

        > - CFS now also maintains all norms in a single file.

        Does this mean a separate file outside the final .cfs files?

        Show
        Yonik Seeley added a comment - > - CFS now also maintains all norms in a single file. Does this mean a separate file outside the final .cfs files?
        Hide
        Doron Cohen added a comment -

        > Does this mean a separate file outside the final .cfs files?

        Oh no - there's a single .nrm file in the .cfs file (instead of multiple .fN files in the .cfs file).
        As before, only .sN files (separated norm files) are outside of .cfs file.

        Show
        Doron Cohen added a comment - > Does this mean a separate file outside the final .cfs files? Oh no - there's a single .nrm file in the .cfs file (instead of multiple .fN files in the .cfs file). As before, only .sN files (separated norm files) are outside of .cfs file.
        Hide
        Doug Cutting added a comment -

        Since we're adding a new file, shouldn't we give it a header, so that it's format can be revised? Something like:
        new byte[]

        {'N','R','M',VERSION}

        as the first four bytes. We might someday decide to change the representation used, e.g., a different one-byte-float format, or permit higher resolution, or compression, or somesuch.

        Also, should we use a constant for ".nrm" extension, so that it's checked at compile-time?

        Show
        Doug Cutting added a comment - Since we're adding a new file, shouldn't we give it a header, so that it's format can be revised? Something like: new byte[] {'N','R','M',VERSION} as the first four bytes. We might someday decide to change the representation used, e.g., a different one-byte-float format, or permit higher resolution, or compression, or somesuch. Also, should we use a constant for ".nrm" extension, so that it's checked at compile-time?
        Hide
        Doron Cohen added a comment -

        Thanks for the comments, Doug.
        You're right of course, I will add both the header and the constant.
        (that would be either today or only in a week from now.)

        Show
        Doron Cohen added a comment - Thanks for the comments, Doug. You're right of course, I will add both the header and the constant. (that would be either today or only in a week from now.)
        Hide
        Doron Cohen added a comment -

        nrm.patch.2.txt:

        Updated as Doug suggested:

        • ".nrm" extension now maintained in a constant .
        • .nrm file now has a 4 bytes header.

        And, fileFormat document is updated.

        Also, I checked again that the seeks for the various field norms are lazy - performed only when bytes are actually read with refill().

        Show
        Doron Cohen added a comment - nrm.patch.2.txt: Updated as Doug suggested: ".nrm" extension now maintained in a constant . .nrm file now has a 4 bytes header. And, fileFormat document is updated. Also, I checked again that the seeks for the various field norms are lazy - performed only when bytes are actually read with refill().
        Hide
        Doron Cohen added a comment -

        I am updating the patch (nrm.patch.3.txt):

        • using a single constant for the norms file extension:
          static final String NORMS_EXTENSION = "nrm";
          (This is more in line with existing extension constants in the code.)
          (As a side comment, there are various extension names (e.g. ".cfs") in the code that are also candidate for factoring as a constant, but this is a separate issue.)
        • adding a test - TestNorms
          This test verifies that norm values assigned with field.setBoost() are preserved during the life cycle of an index, including adding documents, updating norms values (separate norms), addIndexes(), and optimize.

        All tests pass.
        On my side this is ready to go in.

        Show
        Doron Cohen added a comment - I am updating the patch (nrm.patch.3.txt): using a single constant for the norms file extension: static final String NORMS_EXTENSION = "nrm"; (This is more in line with existing extension constants in the code.) (As a side comment, there are various extension names (e.g. ".cfs") in the code that are also candidate for factoring as a constant, but this is a separate issue.) adding a test - TestNorms This test verifies that norm values assigned with field.setBoost() are preserved during the life cycle of an index, including adding documents, updating norms values (separate norms), addIndexes(), and optimize. All tests pass. On my side this is ready to go in.
        Hide
        Yonik Seeley added a comment -

        Committed. Thanks Doron!

        Show
        Yonik Seeley added a comment - Committed. Thanks Doron!
        Hide
        Doron Cohen added a comment -

        Thanks for commiting this Yonik!

        Seems the added test TestNorms was not commited..?

        Show
        Doron Cohen added a comment - Thanks for commiting this Yonik! Seems the added test TestNorms was not commited..?
        Hide
        Yonik Seeley added a comment -

        Hmmm, I actually did an "svn status" to see if there was anything to add too.
        Problem is, my current tree is too messy and I missed it.
        Thanks for the double-check.

        Show
        Yonik Seeley added a comment - Hmmm, I actually did an "svn status" to see if there was anything to add too. Problem is, my current tree is too messy and I missed it. Thanks for the double-check.
        Hide
        Michael McCandless added a comment -

        I would like to propose some small improvements to this nice feature.

        I've worked out a patch (will attach shortly). Doron if you agree /
        or we can iterate then I'll commit it! Thanks.

        Proposed changes:

        • Renamed "withNrm()" to "getHasMergedNorms" to be more
          descriptive. Also changed the field to "hasMergedNorms".
        • Explicitly store "hasMergedNorms" in the segments_N file.

        I think in general we should favor storing things like this
        explicitly instead of relying on IO operations (fileExists).
        We've made great progress lately in reducing such IO operations so
        I'd like to keep that up when possible

        I created a new FORMAT_MERGED_NORMS in SegmentInfos for this. The
        change is fully backwards compatible (old indices work fine). I
        extended TestBackwardsCompatibility to test this.

        This then has the nice side effect of not having to create the
        fleeting CompoundFileReader in "SegmentInfo.getHasMergedNorms"
        (which was somewhat spooky to me) for indices written to after
        this is committed. For indices written to before this gets
        committed but after the first version was committed (10 days ago),
        the check is still needed so I've left it in there with a comment.

        • Fixed the TestDoc unit test to actually create & return
          SegmentInfo's vs recreating a new SegmentInfo every time (which
          causes problems whenever we add something to SegmentInfo). This
          is still a correct test but more scalable with time as we make
          changes to SegmentInfo.
        Show
        Michael McCandless added a comment - I would like to propose some small improvements to this nice feature. I've worked out a patch (will attach shortly). Doron if you agree / or we can iterate then I'll commit it! Thanks. Proposed changes: Renamed "withNrm()" to "getHasMergedNorms" to be more descriptive. Also changed the field to "hasMergedNorms". Explicitly store "hasMergedNorms" in the segments_N file. I think in general we should favor storing things like this explicitly instead of relying on IO operations (fileExists). We've made great progress lately in reducing such IO operations so I'd like to keep that up when possible I created a new FORMAT_MERGED_NORMS in SegmentInfos for this. The change is fully backwards compatible (old indices work fine). I extended TestBackwardsCompatibility to test this. This then has the nice side effect of not having to create the fleeting CompoundFileReader in "SegmentInfo.getHasMergedNorms" (which was somewhat spooky to me) for indices written to after this is committed. For indices written to before this gets committed but after the first version was committed (10 days ago), the check is still needed so I've left it in there with a comment. Fixed the TestDoc unit test to actually create & return SegmentInfo's vs recreating a new SegmentInfo every time (which causes problems whenever we add something to SegmentInfo). This is still a correct test but more scalable with time as we make changes to SegmentInfo.
        Hide
        Yonik Seeley added a comment -

        I agree that reducing the IO operations on an index open is a good thing.

        > For indices written to before this gets
        > committed but after the first version was committed (10 days ago),

        No hard rule on this, but IMO that may be a small enough window that compatibility is not needed.

        Show
        Yonik Seeley added a comment - I agree that reducing the IO operations on an index open is a good thing. > For indices written to before this gets > committed but after the first version was committed (10 days ago), No hard rule on this, but IMO that may be a small enough window that compatibility is not needed.
        Hide
        Michael McCandless added a comment -

        > No hard rule on this, but IMO that may be a small enough window that compatibility is not needed.

        This is a good question. I had flip/flop'd on it. It would be nice
        to not have to support reading indices that were written to based on
        the past 10 days of Lucene trunk builds. Then we could greatly
        simplify the "SegmentInfo.getHasMergedNorms" to not create then
        destroy the CompoundFileReader.

        Any objections to this?

        If not I will re-work the patch (it makes things a fair bit cleaner).

        Show
        Michael McCandless added a comment - > No hard rule on this, but IMO that may be a small enough window that compatibility is not needed. This is a good question. I had flip/flop'd on it. It would be nice to not have to support reading indices that were written to based on the past 10 days of Lucene trunk builds. Then we could greatly simplify the "SegmentInfo.getHasMergedNorms" to not create then destroy the CompoundFileReader. Any objections to this? If not I will re-work the patch (it makes things a fair bit cleaner).
        Hide
        Yonik Seeley added a comment -

        As an aside, I think we need to start making more frequent releases... then "trunk" could be designated as a work-in-progress and unstable, and hence compatibility concerns could be limited to those releases.

        Show
        Yonik Seeley added a comment - As an aside, I think we need to start making more frequent releases... then "trunk" could be designated as a work-in-progress and unstable, and hence compatibility concerns could be limited to those releases.
        Hide
        Michael McCandless added a comment -

        OK, take two! I attached LUCENE-756-Jan16.Take2.patch

        I removed backwards compatibility for the past 10 days of Lucene
        nightly trunk builds. I also fixed fileformats.xml to describe the
        new "HasMergedNorms" entry in the segments_N file.

        Show
        Michael McCandless added a comment - OK, take two! I attached LUCENE-756 -Jan16.Take2.patch I removed backwards compatibility for the past 10 days of Lucene nightly trunk builds. I also fixed fileformats.xml to describe the new "HasMergedNorms" entry in the segments_N file.
        Hide
        Chuck Williams added a comment -

        I may have the only app that will be broken by the 10-day backwards incompatibility, but the change seems worth it. I need to create some large indexes to take on the road for demos. Is the index format in the latest patch final?

        Show
        Chuck Williams added a comment - I may have the only app that will be broken by the 10-day backwards incompatibility, but the change seems worth it. I need to create some large indexes to take on the road for demos. Is the index format in the latest patch final?
        Hide
        Michael McCandless added a comment -

        Actually, if you apply my first change above, regen your index, then the format will be readable to the 2nd patch.

        Chuck, I think this latest patch would likely be the "final" index file format for this issue, pending any more feedback on it though!

        Show
        Michael McCandless added a comment - Actually, if you apply my first change above, regen your index, then the format will be readable to the 2nd patch. Chuck, I think this latest patch would likely be the "final" index file format for this issue, pending any more feedback on it though!
        Hide
        Doron Cohen added a comment -

        Michael, I like this improvement!

        (At first I considered adding such FORMAT level but decided that it is not worth it, - aiming backwards compatibility with pre-lockless indexes. Then I had to add that file check - wrong trade-off indeed.)

        Two minor comments:

        • getHasMergedNorms() is private and now the method has no logic - I would remove that method and refer to hasMergedNorms instead.
        • the term "merged" (in hasMergedNorms) is a little overloaded with other semantics (in Lucene), though I cannot think of other matching descriptive (short) term.

        Thanks for improving this,
        Doron

        Show
        Doron Cohen added a comment - Michael, I like this improvement! (At first I considered adding such FORMAT level but decided that it is not worth it, - aiming backwards compatibility with pre-lockless indexes. Then I had to add that file check - wrong trade-off indeed.) Two minor comments: getHasMergedNorms() is private and now the method has no logic - I would remove that method and refer to hasMergedNorms instead. the term "merged" (in hasMergedNorms) is a little overloaded with other semantics (in Lucene), though I cannot think of other matching descriptive (short) term. Thanks for improving this, Doron
        Hide
        Doug Cutting added a comment -

        > the term "merged" (in hasMergedNorms) is a little overloaded with other semantics (in Lucene)

        Unified? Single?

        Show
        Doug Cutting added a comment - > the term "merged" (in hasMergedNorms) is a little overloaded with other semantics (in Lucene) Unified? Single?
        Hide
        Doron Cohen added a comment -

        Catenated?

        Show
        Doron Cohen added a comment - Catenated?
        Hide
        Doron Cohen added a comment -

        Just to let you know - I checked this with recent patch for Lucene-741 (Field norm modifier) --> working as is with this improvement.

        Show
        Doron Cohen added a comment - Just to let you know - I checked this with recent patch for Lucene-741 (Field norm modifier) --> working as is with this improvement.
        Hide
        Michael McCandless added a comment -

        OK thanks Doron. I will make the fixes you suggested!

        I like "single" – I will redo the "non backwards compatible for past 10 days" patch with these fixes!

        Show
        Michael McCandless added a comment - OK thanks Doron. I will make the fixes you suggested! I like "single" – I will redo the "non backwards compatible for past 10 days" patch with these fixes!
        Hide
        Michael McCandless added a comment -

        OK I committed the fix (changed the name to "singleNormFile"). Thanks
        everyone!

        Show
        Michael McCandless added a comment - OK I committed the fix (changed the name to "singleNormFile"). Thanks everyone!
        Hide
        Michael McCandless added a comment -

        Closing all issues that were resolved for 2.1.

        Show
        Michael McCandless added a comment - Closing all issues that were resolved for 2.1.

          People

          • Assignee:
            Doron Cohen
            Reporter:
            Doron Cohen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development