Lucene - Core
  1. Lucene - Core
  2. LUCENE-769

[PATCH] Performance improvement for some cases of sorted search

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.

      The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC. Their fields values are then fetched from the cache and compared while sorting.

      1. selfContained.patch
        12 kB
        Artem Vasiliev

        Activity

        Hide
        Erick Erickson added a comment -

        SPRING_CLEANING_2013 We can reopen if necessary. Think this code has been extensively re-worked anyway.

        Show
        Erick Erickson added a comment - SPRING_CLEANING_2013 We can reopen if necessary. Think this code has been extensively re-worked anyway.
        Hide
        Artem Vasiliev added a comment -

        Removed several inner classes and documents cache from StoredFieldSortFactory. Now the whole class is pretty clean and simple. Checked the timings in the test - they remain pretty much the same.

        Checked it in Sharehound with more large index (~1mln documents), it sorts resultset with 4000 docs in ~0,2s now taking 70M RAM, that's fine to me.

        With standart new Sort(field, false) it takes (for the first search on a field) about 30-40s and quite a lot of memory (after several sorted searches different fields it took about 500M).

        Show
        Artem Vasiliev added a comment - Removed several inner classes and documents cache from StoredFieldSortFactory. Now the whole class is pretty clean and simple. Checked the timings in the test - they remain pretty much the same. Checked it in Sharehound with more large index (~1mln documents), it sorts resultset with 4000 docs in ~0,2s now taking 70M RAM, that's fine to me. With standart new Sort(field, false) it takes (for the first search on a field) about 30-40s and quite a lot of memory (after several sorted searches different fields it took about 500M).
        Hide
        Artem Vasiliev added a comment -

        Btw I've integrated the modified fix into sharhound, 4000 documented sorted search improved from 0,4s to 0,1s. That's great, thanks guys for you time and consideration!

        Show
        Artem Vasiliev added a comment - Btw I've integrated the modified fix into sharhound, 4000 documented sorted search improved from 0,4s to 0,1s. That's great, thanks guys for you time and consideration!
        Hide
        Artem Vasiliev added a comment -

        Refactored the fix according to Hoss's recomendations. Now only StoredFieldSortFactory class is left public; FieldSelector is always used to fetch the only field from documents. Guys please remove all the attachments except #7 - things get messy..

        Show
        Artem Vasiliev added a comment - Refactored the fix according to Hoss's recomendations. Now only StoredFieldSortFactory class is left public; FieldSelector is always used to fetch the only field from documents. Guys please remove all the attachments except #7 - things get messy..
        Hide
        robert engels added a comment -

        The IndexReaderUtils I posted is not compilable - there are a few more classes needed. These are unnecessary to understand the technique.

        It was written this way to minimize the dependencies with Lucene, and not have to apply patches for my local codebase.

        Show
        robert engels added a comment - The IndexReaderUtils I posted is not compilable - there are a few more classes needed. These are unnecessary to understand the technique. It was written this way to minimize the dependencies with Lucene, and not have to apply patches for my local codebase.
        Hide
        Chuck Williams added a comment -

        Robert,

        Could you attach your current implementation of reopen() as well? The attachment did not come through in your java-dev message today, or the one from 12/11. I'd like to look at an incremental implementation of reopen() for FieldCache.

        Thanks

        Show
        Chuck Williams added a comment - Robert, Could you attach your current implementation of reopen() as well? The attachment did not come through in your java-dev message today, or the one from 12/11. I'd like to look at an incremental implementation of reopen() for FieldCache. Thanks
        Hide
        Hoss Man added a comment -

        Artem: while i agree with Yonik/Chuck's comments about your performance tests probably not being realistic in the general case, what i really like about your patch is that it makes no attempt to change the default behavior of sorting in a way that would hurt users by default – users would only get this behavior if they choose to use it, and while the "typical" case may not bnefit from it, i'm sure there are plenty of situations where people know their index is big, and know that they are doing a search that should have a small number of results. adding something like this doesn't proclude future work on making sorting using FieldCache's less prohibitive (ie: an IndexReader.reopen approach)

        what does concern me about this patch is that without better javadocs explaining exactly what it does and when it's usefull, it could easily be missued by people who stumble upon it.

        I also don't understand why in your updated version of the patch, you aren't making an attempt to use the FieldSelector version of IndexReader.document(), since it should allways be faster in this use case, and would result in your memory cache talking up less space.

        I also don't understand your "IndexReader members access rights problems raised" ... a subclass of IndexReader should be able to live freely in any package – including as a private static class inside of another class. perhaps you ran into problems because you are attempting to subclass methods you don't really need to worry about subclassing? ... yet another reason to subclass FilterIndexReader and save yourself some headaches.

        Show
        Hoss Man added a comment - Artem: while i agree with Yonik/Chuck's comments about your performance tests probably not being realistic in the general case, what i really like about your patch is that it makes no attempt to change the default behavior of sorting in a way that would hurt users by default – users would only get this behavior if they choose to use it, and while the "typical" case may not bnefit from it, i'm sure there are plenty of situations where people know their index is big, and know that they are doing a search that should have a small number of results. adding something like this doesn't proclude future work on making sorting using FieldCache's less prohibitive (ie: an IndexReader.reopen approach) what does concern me about this patch is that without better javadocs explaining exactly what it does and when it's usefull, it could easily be missued by people who stumble upon it. I also don't understand why in your updated version of the patch, you aren't making an attempt to use the FieldSelector version of IndexReader.document(), since it should allways be faster in this use case, and would result in your memory cache talking up less space. I also don't understand your "IndexReader members access rights problems raised" ... a subclass of IndexReader should be able to live freely in any package – including as a private static class inside of another class. perhaps you ran into problems because you are attempting to subclass methods you don't really need to worry about subclassing? ... yet another reason to subclass FilterIndexReader and save yourself some headaches.
        Hide
        Artem Vasiliev added a comment -

        Ok guys, I think I'm finished on this. Feel free to include it in Lucene or not.

        I'm quite happy already using it in my app (Sharehound), it does solve the problem for me.

        Show
        Artem Vasiliev added a comment - Ok guys, I think I'm finished on this. Feel free to include it in Lucene or not. I'm quite happy already using it in my app (Sharehound), it does solve the problem for me.
        Hide
        Chuck Williams added a comment -

        I have this same issue with a constantly changing large index where users needs a current view. The frist search after each frequent IndexReader reopen is slow due primarily to the requirement to rebuild the FieldCache for sort fields.

        I don't believe this patch, or any continuation along these lines, will help my issue. Documents are lage and queries frequently return large results sets, say 20% of the entire multi-million document index or more. Hundreds of thousands of document() retrievals, even with a fast LOAD_AND_BREAK FieldSelector finding sort fields at the beginning of each Document, is not going to beat FieldCache's single traversal of the postings for the sort fieds.

        Another approach I've looked at is Robert Engel's IndexReader.reopen(). I think this direction is more promising. Artem, you might want to look at this. At least the version I've seen is not integrated with FieldCache, but it seems this would be feasible. Segments to the left of the first changed segment maintain their doc-ids, so an improved FieldCache could iterate just the postings in the first changed segment and those to the right. Unless somebody else does this first, it's on my list to improve IndexReader.reopen() with this optimization and to make other enhancements my app needs (e.g., support for ParallelReader – the current implementation fails in this case).

        A specific comment on the new patch: the introduction of FieldSelectors is too restrictive. The same doc-id may be retrieved using multiple FieldSelectors in different calls to IndexReader.document(). Any implementation of the cache needs to support this.

        Show
        Chuck Williams added a comment - I have this same issue with a constantly changing large index where users needs a current view. The frist search after each frequent IndexReader reopen is slow due primarily to the requirement to rebuild the FieldCache for sort fields. I don't believe this patch, or any continuation along these lines, will help my issue. Documents are lage and queries frequently return large results sets, say 20% of the entire multi-million document index or more. Hundreds of thousands of document() retrievals, even with a fast LOAD_AND_BREAK FieldSelector finding sort fields at the beginning of each Document, is not going to beat FieldCache's single traversal of the postings for the sort fieds. Another approach I've looked at is Robert Engel's IndexReader.reopen(). I think this direction is more promising. Artem, you might want to look at this. At least the version I've seen is not integrated with FieldCache, but it seems this would be feasible. Segments to the left of the first changed segment maintain their doc-ids, so an improved FieldCache could iterate just the postings in the first changed segment and those to the right. Unless somebody else does this first, it's on my list to improve IndexReader.reopen() with this optimization and to make other enhancements my app needs (e.g., support for ParallelReader – the current implementation fails in this case). A specific comment on the new patch: the introduction of FieldSelectors is too restrictive. The same doc-id may be retrieved using multiple FieldSelectors in different calls to IndexReader.document(). Any implementation of the cache needs to support this.
        Hide
        Artem Vasiliev added a comment -

        Renamed classes as Hoss proposed. Tried to hide DocFieldCachingIndexReader, no luck - IndexReader members access rights problems raised. FieldSelector is now verified to be the same and is used by StoredFieldComparatorSource for DocFieldCachingIndexReader creation.
        Timings didn't change much - they probably would if documents in index were larger.

        Show
        Artem Vasiliev added a comment - Renamed classes as Hoss proposed. Tried to hide DocFieldCachingIndexReader, no luck - IndexReader members access rights problems raised. FieldSelector is now verified to be the same and is used by StoredFieldComparatorSource for DocFieldCachingIndexReader creation. Timings didn't change much - they probably would if documents in index were larger.
        Hide
        Artem Vasiliev added a comment -

        Hi guys!

        Thanks for value comments. What a feedback!

        I'd like to stress the point of my fix - to avoid costly FieldCache population with field values from the whole index.

        Your point that it will be slower for cases when filtered sets be nearly as large as the whole index is valid. But is it a practical point? Lucene shines on big indexes and queries resulting with full index are not very useful I guess.

        I think it's good idea to hide the caching reader class and utilize FieldSelector mechanism to make the fix more effective. However do you think this improvement worth doing? You are strong opposition and I'm not feeling up to an endless fight I'm serious, let me know what you think. This fix will have its limitations by no means but I think the above OutOfMemory scenario with current sorting mechanism alone makes this fix legitimate.

        Show
        Artem Vasiliev added a comment - Hi guys! Thanks for value comments. What a feedback! I'd like to stress the point of my fix - to avoid costly FieldCache population with field values from the whole index. Your point that it will be slower for cases when filtered sets be nearly as large as the whole index is valid. But is it a practical point? Lucene shines on big indexes and queries resulting with full index are not very useful I guess. I think it's good idea to hide the caching reader class and utilize FieldSelector mechanism to make the fix more effective. However do you think this improvement worth doing? You are strong opposition and I'm not feeling up to an endless fight I'm serious, let me know what you think. This fix will have its limitations by no means but I think the above OutOfMemory scenario with current sorting mechanism alone makes this fix legitimate.
        Hide
        Yonik Seeley added a comment -

        Sorry for some of the redundant comments... Chucks comment wasn't visible to me for some strange reason when I left mine.

        Show
        Yonik Seeley added a comment - Sorry for some of the redundant comments... Chucks comment wasn't visible to me for some strange reason when I left mine.
        Hide
        Yonik Seeley added a comment -

        Those performance numbers don't make sense to me.
        Why would DocCaching sort be so much faster than FieldCache sort the second time on the same IndexReader?
        Using a cached FieldCache entry for sorting involves an array lookup... how do you improve on that?
        Or do you open a new reader for each test?

        Also, you specify the size of the index, but not the size of the number of documents to be sorted (that match the query). DocCacheSorting should use much more memory than the FieldCache (and be slower) if the number of documents to be sorted is large, right?

        Show
        Yonik Seeley added a comment - Those performance numbers don't make sense to me. Why would DocCaching sort be so much faster than FieldCache sort the second time on the same IndexReader? Using a cached FieldCache entry for sorting involves an array lookup... how do you improve on that? Or do you open a new reader for each test? Also, you specify the size of the index, but not the size of the number of documents to be sorted (that match the query). DocCacheSorting should use much more memory than the FieldCache (and be slower) if the number of documents to be sorted is large, right?
        Hide
        Chuck Williams added a comment -

        The test case uses only tiny documents, and the reported timings for multiple searches with FieldCache make it appear that the version of lucene used contains the bug that caused FieldCaches to be frequently recomputed unnecessarily.

        I suggest trying the test with much larger documents, of realistic size, and using current Lucene source. I'm sure the patch will make things much slower with the current implementation. As Hoss suggests, performance would be improved considerably by using a FieldSelector to obtain just the sort field, but even so will be slow unless the sort field is arranged to be early on the documents, ideally the first field, and a LOAD_AND_BREAK FieldSelector is used.

        Another important performance variable will be the number of documents retrieved in the test query. If the number of documents satisfying the query is a sizable percentage of the total collection size, I'm pretty sure the patch will be much slower than using FieldCache.

        Show
        Chuck Williams added a comment - The test case uses only tiny documents, and the reported timings for multiple searches with FieldCache make it appear that the version of lucene used contains the bug that caused FieldCaches to be frequently recomputed unnecessarily. I suggest trying the test with much larger documents, of realistic size, and using current Lucene source. I'm sure the patch will make things much slower with the current implementation. As Hoss suggests, performance would be improved considerably by using a FieldSelector to obtain just the sort field, but even so will be slow unless the sort field is arranged to be early on the documents, ideally the first field, and a LOAD_AND_BREAK FieldSelector is used. Another important performance variable will be the number of documents retrieved in the test query. If the number of documents satisfying the query is a sizable percentage of the total collection size, I'm pretty sure the patch will be much slower than using FieldCache.
        Hide
        Hoss Man added a comment -

        Artem: I've only skimmed your patch breifly, but i have a few comments:

        1) since Lucene sorting has historicly been based on indexed fields, and this new patch results in fields being sorted on the stored values, you should definitely point this out in the javadocs of your DocCachingSortFactory and DocCachingFieldComparatorSource classes .. in big bold letters .. i would go even so far as to name the classes StoredFieldComparatorSrouce and StoredFieldSortFactory instead of the names you currently use.

        2) your use of WeakHashMap with keys that are never refrenced elsewhere to ensure the cache is purged on every GC is not something i've ever seen before...

        +public class WeakDocumentsCache {
        + private Map cache = Collections.synchronizedMap(new WeakHashMap());
        +
        + public Document getById(int docId)

        { + return (Document) cache.get(new Integer(docId)); + }

        +
        + public void put(int docId, Document document)

        { + cache.put(new Integer(docId), document); + }

        ...have you tested out the performacne of this approach with various GC implimentations?

        skimming the javadocs for WeakReference/WeakHashMap it seems like perhaps SoftRefrence would be better suited for your purposes.

        3) making your new DocCachingIndexReader and WeakDocumentsCache clases part of the Lucene public API seems to be a little outside of the scope of this change ... perhapsthey should be left as a private static inner class es inside of the individual classes they are used by (DocCachingFieldComparatorSource and DocCachingIndexReader respectively) .... even if these classes are left public, DocCachingIndexReader should probably subclass FilterIndexReader to reduce the amount of duplication.

        4) your check for FieldSelector usage in DocCachingIndexReader doesn't check if the FieldSelector used is the same every time – which means you can't trust your cache ... fixing this could be complicated, and serves as another reason why it would be easier if DocCachingIndexReader was made a private inner class of DocCachingFieldComparatorSource where you know exactly how it's going to be used.

        5) speaking of FieldSelector, your use case seems like a perfect example of when a FieldSelector would make sense to only read the field(s) that are needed for sorting.

        Show
        Hoss Man added a comment - Artem: I've only skimmed your patch breifly, but i have a few comments: 1) since Lucene sorting has historicly been based on indexed fields, and this new patch results in fields being sorted on the stored values, you should definitely point this out in the javadocs of your DocCachingSortFactory and DocCachingFieldComparatorSource classes .. in big bold letters .. i would go even so far as to name the classes StoredFieldComparatorSrouce and StoredFieldSortFactory instead of the names you currently use. 2) your use of WeakHashMap with keys that are never refrenced elsewhere to ensure the cache is purged on every GC is not something i've ever seen before... +public class WeakDocumentsCache { + private Map cache = Collections.synchronizedMap(new WeakHashMap()); + + public Document getById(int docId) { + return (Document) cache.get(new Integer(docId)); + } + + public void put(int docId, Document document) { + cache.put(new Integer(docId), document); + } ...have you tested out the performacne of this approach with various GC implimentations? skimming the javadocs for WeakReference/WeakHashMap it seems like perhaps SoftRefrence would be better suited for your purposes. 3) making your new DocCachingIndexReader and WeakDocumentsCache clases part of the Lucene public API seems to be a little outside of the scope of this change ... perhapsthey should be left as a private static inner class es inside of the individual classes they are used by (DocCachingFieldComparatorSource and DocCachingIndexReader respectively) .... even if these classes are left public, DocCachingIndexReader should probably subclass FilterIndexReader to reduce the amount of duplication. 4) your check for FieldSelector usage in DocCachingIndexReader doesn't check if the FieldSelector used is the same every time – which means you can't trust your cache ... fixing this could be complicated, and serves as another reason why it would be easier if DocCachingIndexReader was made a private inner class of DocCachingFieldComparatorSource where you know exactly how it's going to be used. 5) speaking of FieldSelector, your use case seems like a perfect example of when a FieldSelector would make sense to only read the field(s) that are needed for sorting.
        Hide
        Artem Vasiliev added a comment -

        fork="true" maxmemory="105m" attributes need to be added to <junit> task for the test to be runnable with DOCS_NUM == 1,000,000

        Show
        Artem Vasiliev added a comment - fork="true" maxmemory="105m" attributes need to be added to <junit> task for the test to be runnable with DOCS_NUM == 1,000,000
        Hide
        Artem Vasiliev added a comment -

        Also tried this with DOCS_NUM == 1,000,000. Right after index creation it was
        [junit] 1687ms elapsed for DocCaching sort
        [junit] 1922ms elapsed for FieldCache'd sort

        But next 2 time I ran it (without index set up) the timings were near this:
        [junit] 94ms elapsed for DocCaching sort
        [junit] 1797ms elapsed for FieldCache'd sort

        Show
        Artem Vasiliev added a comment - Also tried this with DOCS_NUM == 1,000,000. Right after index creation it was [junit] 1687ms elapsed for DocCaching sort [junit] 1922ms elapsed for FieldCache'd sort But next 2 time I ran it (without index set up) the timings were near this: [junit] 94ms elapsed for DocCaching sort [junit] 1797ms elapsed for FieldCache'd sort
        Hide
        Artem Vasiliev added a comment -

        I've tried the test with DOCS_NUM == 10,000,000. DocCaching sort took about 1s with standart amount of memory (-Xmx80m) while FieldCache'd trapped with OutOfMemoryError even with 1G (-Xmx1000m). Resulting index size was 640M, its creation took at least 7hrs.

        Show
        Artem Vasiliev added a comment - I've tried the test with DOCS_NUM == 10,000,000. DocCaching sort took about 1s with standart amount of memory (-Xmx80m) while FieldCache'd trapped with OutOfMemoryError even with 1G (-Xmx1000m). Resulting index size was 640M, its creation took at least 7hrs.

          People

          • Assignee:
            Unassigned
            Reporter:
            Artem Vasiliev
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development