Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7391

MemoryIndexReader.fields() performance regression

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 6.2
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      While upgrading our codebase from Lucene 4 to Lucene 6 we found a significant performance regression - a 5x slowdown

      On profiling the code, the method MemoryIndexReader.fields() shows up as one of the hottest methods

      Looking at the method, it just creates a copy of the inner fields Map before passing it to MemoryFields. It does this so that it can filter out fields with numTokens <= 0.

      The simplest "fix" would be to just remove the copying of the map completely, and pass fields directly to MemoryFields. It's simple and removes any slowdown caused by this method. It does potentially change behaviour though, but none of the unit tests seem to test that behaviour so I wonder whether it's necessary (I looked at the original ticket LUCENE-7091 that introduced this code, I can't find much in way of an explanation). I'm going to attach a patch to this effect anyway and we can take things from there

      1. LUCENE-7391.patch
        3 kB
        Steve Mason
      2. LUCENE-7391.patch
        1 kB
        Steve Mason
      3. LUCENE-7391-test.patch
        2 kB
        Steve Mason

        Activity

        Hide
        spmason Steve Mason added a comment -

        Patch attached. Note that all unit tests pass. I've also run it through our integration test suite (matching 1000 queries against 45000 documents) and verified that they pass as well

        It would be good to know why the original code was like this, I wonder if Martijn van Groningen remembers - it seems to be tied to this comment: https://issues.apache.org/jira/browse/LUCENE-7091?focusedCommentId=15189525&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15189525

        If the existing behaviour needs to be preserved then that's fine - if someone can provide me with a test case (or explain one to me) then I'll add it to the patch and formulate an alternative solution

        Show
        spmason Steve Mason added a comment - Patch attached. Note that all unit tests pass. I've also run it through our integration test suite (matching 1000 queries against 45000 documents) and verified that they pass as well It would be good to know why the original code was like this, I wonder if Martijn van Groningen remembers - it seems to be tied to this comment: https://issues.apache.org/jira/browse/LUCENE-7091?focusedCommentId=15189525&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15189525 If the existing behaviour needs to be preserved then that's fine - if someone can provide me with a test case (or explain one to me) then I'll add it to the patch and formulate an alternative solution
        Hide
        martijn.v.groningen Martijn van Groningen added a comment -

        The reason it filters out field with numTokens <= 0 is that it would otherwise include non indexed fields (fields with just doc values or point values). However this slowdown is unintended. Maybe instead we could build `filteredFields` in the constructor of `MemoryIndexReader` and reuse it between `#fields()` invocations?

        Show
        martijn.v.groningen Martijn van Groningen added a comment - The reason it filters out field with numTokens <= 0 is that it would otherwise include non indexed fields (fields with just doc values or point values). However this slowdown is unintended. Maybe instead we could build `filteredFields` in the constructor of `MemoryIndexReader` and reuse it between `#fields()` invocations?
        Hide
        dsmiley David Smiley added a comment -

        A simple solution is for fields() to loop and count the number of fields that have numTerms > 0 (don't need to record which), and pass this to MemoryFields so that MemoryFields.size() is easy. Then MemoryFields.terms() can simply check numTerms <= 0 and return null.

        Show
        dsmiley David Smiley added a comment - A simple solution is for fields() to loop and count the number of fields that have numTerms > 0 (don't need to record which), and pass this to MemoryFields so that MemoryFields.size() is easy. Then MemoryFields.terms() can simply check numTerms <= 0 and return null.
        Hide
        martijn.v.groningen Martijn van Groningen added a comment -

        +1 to count the number of fields with `numTerms > 0` and filter out fields with `numTerms <= 0`

        Show
        martijn.v.groningen Martijn van Groningen added a comment - +1 to count the number of fields with `numTerms > 0` and filter out fields with `numTerms <= 0`
        Hide
        spmason Steve Mason added a comment -

        OK, thanks Martijn - is it part of the contract that fields() should only return indexed fields then?

        There are a couple of options we can think of:
        1. As you suggest - cache the filtered Map (or the whole MemoryIndexReader object?). In that case, we'd need to make sure that the cache was invalidated after every mutation
        2. Make MemoryFields a "view" on the original that filters them out when each of its methods are called (there are only 3)
        3. Maybe maintain another Map of just the indexed fields?

        I'll have a go at some of these and see how I get on

        Show
        spmason Steve Mason added a comment - OK, thanks Martijn - is it part of the contract that fields() should only return indexed fields then? There are a couple of options we can think of: 1. As you suggest - cache the filtered Map (or the whole MemoryIndexReader object?). In that case, we'd need to make sure that the cache was invalidated after every mutation 2. Make MemoryFields a "view" on the original that filters them out when each of its methods are called (there are only 3) 3. Maybe maintain another Map of just the indexed fields? I'll have a go at some of these and see how I get on
        Hide
        martijn.v.groningen Martijn van Groningen added a comment -

        > is it part of the contract that fields() should only return indexed fields then?

        Yes.

        I think David's fix is the easiest here. Computing this count each time fields is invoked is less of an overhead compared what happens now when building MemoryFields. Since that count is computed each time, I think you shouldn't worry about caching or cache invalidation.

        The concurrency aspect of the MemoryIndex is in my opinion a bit of a mess. It allows fields to be added to be made after a reader has been created, except when the freeze method is invoked (and then it should be able to be used from many threads). I think the MemoryIndex class itself should be kind of a builder that just returns an IndexReader and shouldn't be able to be used after an IndexReader instance has been made.

        Show
        martijn.v.groningen Martijn van Groningen added a comment - > is it part of the contract that fields() should only return indexed fields then? Yes. I think David's fix is the easiest here. Computing this count each time fields is invoked is less of an overhead compared what happens now when building MemoryFields . Since that count is computed each time, I think you shouldn't worry about caching or cache invalidation. The concurrency aspect of the MemoryIndex is in my opinion a bit of a mess. It allows fields to be added to be made after a reader has been created, except when the freeze method is invoked (and then it should be able to be used from many threads). I think the MemoryIndex class itself should be kind of a builder that just returns an IndexReader and shouldn't be able to be used after an IndexReader instance has been made.
        Hide
        spmason Steve Mason added a comment -

        Sorry our posts must have crossed

        OK I'll try that as well - it's very much like the "view" option I think

        Show
        spmason Steve Mason added a comment - Sorry our posts must have crossed OK I'll try that as well - it's very much like the "view" option I think
        Hide
        spmason Steve Mason added a comment -

        Attaching a patch with a test for the current behaviour - my original patch causes this to fail when applied

        Suggestions for improvements appreciated (I'm very new to this codebase)

        Show
        spmason Steve Mason added a comment - Attaching a patch with a test for the current behaviour - my original patch causes this to fail when applied Suggestions for improvements appreciated (I'm very new to this codebase)
        Hide
        dsmiley David Smiley added a comment -

        The test is fine; I thought the perf fix would also be in the patch. Let me know when you have that and I'll review and commit later. Also... your .patch appears to not be a normal patch file.... see https://wiki.apache.org/lucene-java/HowToContribute#Creating_a_patch

        Show
        dsmiley David Smiley added a comment - The test is fine; I thought the perf fix would also be in the patch. Let me know when you have that and I'll review and commit later. Also... your .patch appears to not be a normal patch file.... see https://wiki.apache.org/lucene-java/HowToContribute#Creating_a_patch
        Hide
        spmason Steve Mason added a comment -

        Yeah it's still a work-in-progress, I'm still working on the full patch. I'm actually on annual leave next week so the final patch might be a little while (though I'll add it if I get time)

        I'll follow those instructions for generating the patch - it's just the output of git format-patch right now - thanks for the advice

        Show
        spmason Steve Mason added a comment - Yeah it's still a work-in-progress, I'm still working on the full patch. I'm actually on annual leave next week so the final patch might be a little while (though I'll add it if I get time) I'll follow those instructions for generating the patch - it's just the output of git format-patch right now - thanks for the advice
        Hide
        romseygeek Alan Woodward added a comment -

        The concurrency aspect of the MemoryIndex is in my opinion a bit of a mess

        +1 - freeze() was a hack, and I've been meaning to open an issue to make things properly immutable for ages.

        Show
        romseygeek Alan Woodward added a comment - The concurrency aspect of the MemoryIndex is in my opinion a bit of a mess +1 - freeze() was a hack, and I've been meaning to open an issue to make things properly immutable for ages.
        Hide
        spmason Steve Mason added a comment -

        I got some time and put together another patch

        This one does the filtering for numTokens > 0 in the MemoryFields object itself when any of its methods are called. I couldn't simply pass the size as suggested because of the iterator() method. As the MemoryField object is pretty much stateless I also cache it on the reader.

        This patch also includes the unit tests from before

        Performance-wise this is the same performance improvement as my original patch

        Show
        spmason Steve Mason added a comment - I got some time and put together another patch This one does the filtering for numTokens > 0 in the MemoryFields object itself when any of its methods are called. I couldn't simply pass the size as suggested because of the iterator() method. As the MemoryField object is pretty much stateless I also cache it on the reader. This patch also includes the unit tests from before Performance-wise this is the same performance improvement as my original patch
        Hide
        martijn.v.groningen Martijn van Groningen added a comment -

        +1 The filtering cost is deferred to when someone really needs to know the size or the actual fields and I think that this is better then what happens now.

        Small nit: Maybe rename the variable `ignored` to `field` in the `size()` method as it is actually not ignored?

        I'll let David commit this as he assigned himself to this issue.

        Show
        martijn.v.groningen Martijn van Groningen added a comment - +1 The filtering cost is deferred to when someone really needs to know the size or the actual fields and I think that this is better then what happens now. Small nit: Maybe rename the variable `ignored` to `field` in the `size()` method as it is actually not ignored? I'll let David commit this as he assigned himself to this issue.
        Hide
        martijn.v.groningen Martijn van Groningen added a comment -

        +1 - freeze() was a hack, and I've been meaning to open an issue to make things properly immutable for ages.

        Alan Woodward Then lets try to fix this in master I've opened LUCENE-7394 to track this.

        Show
        martijn.v.groningen Martijn van Groningen added a comment - +1 - freeze() was a hack, and I've been meaning to open an issue to make things properly immutable for ages. Alan Woodward Then lets try to fix this in master I've opened LUCENE-7394 to track this.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit dcc9a4b9a411a0436e5cb21e7d6251691640f3db in lucene-solr's branch refs/heads/master from David Smiley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dcc9a4b ]

        LUCENE-7391: Fix MemoryIndex fields() perf regression

        Show
        jira-bot ASF subversion and git services added a comment - Commit dcc9a4b9a411a0436e5cb21e7d6251691640f3db in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dcc9a4b ] LUCENE-7391 : Fix MemoryIndex fields() perf regression
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit d85a9718564abf3dd33f66d057f8cfc835a56715 in lucene-solr's branch refs/heads/branch_6x from David Smiley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d85a971 ]

        LUCENE-7391: Fix MemoryIndex fields() perf regression
        (cherry picked from commit dcc9a4b)

        Show
        jira-bot ASF subversion and git services added a comment - Commit d85a9718564abf3dd33f66d057f8cfc835a56715 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d85a971 ] LUCENE-7391 : Fix MemoryIndex fields() perf regression (cherry picked from commit dcc9a4b)
        Hide
        dsmiley David Smiley added a comment -

        Thanks for the patch Steve! Committed and closing now.

        Show
        dsmiley David Smiley added a comment - Thanks for the patch Steve! Committed and closing now.
        Hide
        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        Show
        mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

          People

          • Assignee:
            dsmiley David Smiley
            Reporter:
            spmason Steve Mason
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development