Lucene - Core
  LUCENE-1690

MoreLikeThis queries are very slow compared to other search types

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: modules/other
    • Labels: None
    • Lucene Fields: New

      Description

      The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches.

      For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them.

      I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10.
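      For illustration, here is a minimal sketch of the kind of opt-in cache described above; it follows the spirit of the proposal rather than the attached patch, and the class and method names are hypothetical.

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Hypothetical illustration of the proposal: cache docFreq() results for the
// lifetime of the owning MoreLikeThis-style object, so repeated MLT queries
// over the same vocabulary avoid hitting the term dictionary every time.
class CachingTermFreqLookup {
  private final IndexReader reader;
  private final Map<Term, Integer> docFreqCache = new HashMap<Term, Integer>();

  CachingTermFreqLookup(IndexReader reader) {
    this.reader = reader;
  }

  int docFreq(Term term) throws IOException {
    Integer cached = docFreqCache.get(term);
    if (cached == null) {
      cached = Integer.valueOf(reader.docFreq(term));
      docFreqCache.put(term, cached);
    }
    return cached.intValue();
  }

  // The cache lives and dies with this object; callers can clear it whenever
  // their term statistics are allowed to go stale.
  void clear() {
    docFreqCache.clear();
  }
}
{code}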

      1. LruCache.patch
        5 kB
        Richard Marr
      2. LUCENE-1690.patch
        13 kB
        Richard Marr
      3. LUCENE-1690.patch
        3 kB
        Richard Marr

        Activity

        Richard Marr added a comment -

        This patch implements a basic HashMap-based term frequency cache. It shouldn't affect any applications that don't opt in to using it, and applications that do should see an order-of-magnitude performance improvement for MLT queries.

        This cache implementation is tied to the MLT object but can be cleared on demand.

        Michael McCandless added a comment -

        This sounds good!

        Could we include the IndexReader in the cache key? Then it'd be functionally equivalent and we could enable it by default?
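        For what it's worth, a rough sketch of what keying on the reader might look like; this is an illustration only, and the class and field names are made up rather than taken from the patch.

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Illustration only: with the IndexReader as part of the key, frequencies
// cached against one reader can never be returned for a different (e.g.
// reopened) reader, which is what would make enabling the cache by default safe.
class ReaderKeyedTermFreqCache {

  private static final class Key {
    final IndexReader reader;
    final Term term;

    Key(IndexReader reader, Term term) {
      this.reader = reader;
      this.term = term;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof Key)) {
        return false;
      }
      Key o = (Key) other;
      return reader == o.reader && term.equals(o.term);
    }

    @Override
    public int hashCode() {
      return 31 * System.identityHashCode(reader) + term.hashCode();
    }
  }

  private final Map<Key, Integer> cache = new HashMap<Key, Integer>();

  int docFreq(IndexReader reader, Term term) throws IOException {
    Key key = new Key(reader, term);
    Integer cached = cache.get(key);
    if (cached == null) {
      cached = Integer.valueOf(reader.docFreq(term));
      cache.put(key, cached);
    }
    return cached.intValue();
  }
}
{code}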

        Richard Marr added a comment -

        Sounds reasonable although that'll take a little longer for me to do. I'll have a think about it.

        Carl Austin added a comment -

        The cache used for this is a HashMap, which is unbounded. Perhaps this should be an LRU cache with a settable maximum number of entries, to stop it growing forever if you do a lot of more-like-this queries on large indexes with many unique terms.
        Otherwise a nice addition; it has sped up my more-like-this queries a bit.
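        A bounded LRU map is easy to get in plain Java with a LinkedHashMap in access order; a minimal sketch along those lines follows. The class name echoes the LruCache.patch attachment, but the code here is illustrative rather than the patch itself.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal illustrative LRU map: a LinkedHashMap in access order evicts its
// least-recently-used entry once the configured capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  LruCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}
{code}

        For example, new LruCache<Term, Integer>(10000) would cap a term-frequency cache at roughly ten thousand entries; the limit is arbitrary here and would need tuning per application.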

        Richard Marr added a comment -

        Okay, so the ideal solution is an LRU cache bound to a specific IndexReader instance. I think I can handle that.

        Carl, do you have any data on how this has changed performance in your system? My use case is a limited vocabulary so the performance gain was large.

        Carl Austin added a comment -

        I wasn't all that scientific, I'm afraid; I just noted that it improved performance enough once warmed up to keep on using it. Sorry.
        However, after just 3 or 4 more-like-this queries I am seeing a definite improvement, as the majority of the free text is standard vocabulary, and the unique terms only make up a small amount of the rest of the text.

        Richard Marr added a comment -

        Attached is a draft of an implementation that uses a WeakHashMap to bind the cache to the IndexReader instance, and a LinkedHashMap to provide LRU functionality.

        Disclaimer: I'm not fluent in Java or OSS contribution so there may be holes or bad style in this implementation. I also need to check it meets the project coding standards.

        Anybody up for giving me some feedback in the meantime?
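        For reviewers, here is a rough sketch of that combination. It is illustrative only; the names are not taken from the attached patch.

{code:java}
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Illustrative combination: the outer WeakHashMap is keyed on the IndexReader,
// so a reader's cache becomes collectable once the reader itself is no longer
// referenced; each inner map is an access-order LinkedHashMap acting as a
// bounded LRU of term -> docFreq.
class PerReaderTermFreqCache {
  private final Map<IndexReader, Map<Term, Integer>> perReader =
      new WeakHashMap<IndexReader, Map<Term, Integer>>();
  private final int maxEntriesPerReader;

  PerReaderTermFreqCache(int maxEntriesPerReader) {
    this.maxEntriesPerReader = maxEntriesPerReader;
  }

  int docFreq(IndexReader reader, Term term) throws IOException {
    Map<Term, Integer> cache = perReader.get(reader);
    if (cache == null) {
      // Access-order LinkedHashMap that evicts its eldest entry past the cap.
      cache = new LinkedHashMap<Term, Integer>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Term, Integer> eldest) {
          return size() > maxEntriesPerReader;
        }
      };
      perReader.put(reader, cache);
    }
    Integer cached = cache.get(term);
    if (cached == null) {
      cached = Integer.valueOf(reader.docFreq(term));
      cache.put(term, cached);
    }
    return cached.intValue();
  }
}
{code}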

        Michael McCandless added a comment -

        The getTermFrequency method looks like it'll incorrectly put 0 into the cache when the field was in the top-level cache but the term text wasn't in the second-level cache?
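        To make the failure mode concrete, here is an illustrative reconstruction (not the patch's code) of a two-level lookup, field -> (term text -> frequency), where an inner-map miss falls through to the reader instead of being cached as 0.

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Hypothetical two-level cache, field -> (term text -> frequency), showing the
// corrected lookup for the bug described above.
class TwoLevelTermFreqCache {
  private final Map<String, Map<String, Integer>> fieldCache =
      new HashMap<String, Map<String, Integer>>();

  int getTermFrequency(IndexReader reader, String field, String text) throws IOException {
    Map<String, Integer> byText = fieldCache.get(field);
    if (byText == null) {
      byText = new HashMap<String, Integer>();
      fieldCache.put(field, byText);
    }
    Integer freq = byText.get(text);
    if (freq == null) {
      // Inner-map miss: consult the index rather than assuming a frequency of 0.
      freq = Integer.valueOf(reader.docFreq(new Term(field, text)));
      byText.put(text, freq);
    }
    return freq.intValue();
  }
}
{code}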

        Richard Marr added a comment -

        There's also another problem I've just noticed. Please ignore the latest patch.

        Richard Marr added a comment -

        This is the latest version. I wasn't working on it at quite such a ridiculous hour this time, so it should be better.

        It includes fixed cache logic, a few comments, the LRU object applied in the right place, and some test cases demonstrating that things behave as expected. I'll do some more testing when I have a free evening.

        I have some questions:

        a) org.apache.lucene.search.similar doesn't seem like the right place for a generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead?

        b) Having the cache dependent on both the MLT object and the IndexReader object seems a bit... odd. I suspect the right place for this cache is in the IndexReader, but suspect that would be a can of worms. Comments?

        Michael McCandless added a comment -

        OK now I feel silly – this cache is in fact very similar to the caching that Lucene already does, internally! Sorry I didn't catch this overlap sooner.

        In oal.index.TermInfosReader.java there's an LRU cache, default size 1024, that holds recently retrieved terms and their TermInfo. It uses oal.util.cache.SimpleLRUCache.

        There are some important differences from this new cache in MLT. EG, it holds the entire TermInfo, not just the docFreq. Plus, it's a central cache for any & all term lookups that go through the SegmentReader. Also, it's stored in thread-private storage, so each thread has its own cache.

        But, now I'm confused: how come you are not already seeing the benefits of this cache? You ought to see MLT queries going faster. This core cache was first added in 2.4.x; it looks like you were testing against 2.4.1 (from the "Affects Version" on this issue).

        Carl Austin added a comment -

        The cache in terminfosreader is for everything as you say. I do a lot of stuff with terms, and those terms will get pushed out of this LRU cache very quickly.
        I have a separate cache on my version of the MLT. This has the advantage of those terms only being pushed out by other MLT queries, and not by everything else I am doing that is not MLT related.
        A lot of MLTs use the same terms, and I have a good size cache for it, meaning most terms I use in MLT can be retrieved from there. Seeing as MLT in my circumstance is one of the slower bits, this can give me a good advantage.

        Erick Erickson added a comment -

        SPRING_CLEANING_2013 JIRAS: Can anyone say if this is still valid?

        Richard Marr added a comment -

        Erick Erickson, as far as I know this is still an issue. I'm not aware of any change, although I haven't re-run the benchmark recently. If I get around to re-running the benchmark I'll post the results.


          People

          • Assignee: Unassigned
          • Reporter: Richard Marr
          • Votes: 0
          • Watchers: 4


              Time Tracking

              Estimated: 2h
              Remaining: 2h
              Logged: Not Specified
