Lucene - Core / LUCENE-2391

Spellchecker uses default IW mergefactor/ramMB settings of 300/10

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      These settings seem odd - I'd like to investigate what makes most sense here.

        Activity

        Robert Muir added a comment -

        Here's a patch to speed up the spellchecker build.

        • I wired the default RamMB to IWConfig's default.
        • I didn't mess with the mergefactor for now (because the default is still to optimize),
        • but I added an additional 'optimize' parameter so you can update your spellcheck index without re-optimizing.
        • When updating, I changed exists() to work per-segment, so it performs reasonably even if the index isn't optimized.
        • The exists() check now bypasses the term dictionary cache, which is stupid for this kind of lookup and just slows it down.
        • We don't do any of the exists() logic if the index is empty (I think this is the case for Solr, which completely rebuilds
          and doesn't do an incremental update).
        • The startXXX, endXXX, and word fields can only contain one term per document. I turned off norms, positions,
          and tf for these.
        • The gramXXX field is unchanged; I didn't want to change spellchecker scoring in any way. But we could
          reasonably omit norms here too in the future, since I think it's going to be very short.
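
The update-path changes in the list above can be illustrated with a toy model. This is a minimal sketch, not the code from the patch: the class `SpellIndexSketch`, its fields, and the idea of representing each segment as a plain `Set<String>` are all hypothetical stand-ins, and no Lucene classes are used. It only shows the two ideas: skip the exists() lookup entirely when the index is empty, and otherwise answer it segment by segment so the index does not need to be optimized first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (hypothetical, no Lucene): each "segment" is a set of already-indexed words. */
public class SpellIndexSketch {
    // Stand-in for the per-segment term dictionaries of a real index.
    private final List<Set<String>> segments = new ArrayList<>();
    // Counts lookups so the empty-index shortcut is observable.
    public int existsCalls = 0;

    /** Looks the word up one segment at a time; no merged (optimized) view is required. */
    public boolean exists(String word) {
        existsCalls++;
        for (Set<String> segment : segments) {
            if (segment.contains(word)) {
                return true;
            }
        }
        return false;
    }

    /** Adds only words that are not indexed yet; returns how many were added. */
    public int indexDictionary(List<String> words) {
        boolean indexIsEmpty = segments.isEmpty();
        Set<String> fresh = new HashSet<>();
        for (String word : words) {
            // On an empty index nothing can exist, so the lookup is skipped.
            if (!indexIsEmpty && exists(word)) {
                continue;
            }
            fresh.add(word);
        }
        if (!fresh.isEmpty()) {
            segments.add(fresh); // new words land in a new segment; nothing is merged
        }
        return fresh.size();
    }

    public int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        SpellIndexSketch index = new SpellIndexSketch();
        System.out.println(index.indexDictionary(Arrays.asList("lucene", "solr")));  // 2 (scratch build)
        System.out.println(index.indexDictionary(Arrays.asList("solr", "spell")));   // 1 (incremental update)
        System.out.println(index.indexDictionary(Arrays.asList("lucene", "spell"))); // 0 (no-op update)
        System.out.println(index.segmentCount() + " segments, " + index.existsCalls + " exists() calls");
    }
}
```

In this model the scratch build performs no exists() calls at all, while each later update pays one lookup per word across every segment. That matches the trade-off in the comment: leaving the index unoptimized makes builds cheaper but makes the no-op update check somewhat heavier.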
        trunk:
        scratch build time: 229,803ms
        index size: 214,322,200 bytes
        no-op update time (updating when there are no new terms to add): 4,619ms
        
        patch:
        scratch build time: 99,214ms
        index size: 177,781,273 bytes
        no-op update time: 2,504ms
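
For reference, the relative improvements implied by the trunk and patch figures can be worked out directly. The sketch below only restates the numbers from this comment; the class and method names are illustrative. It comes out to roughly 57% less scratch build time, 17% less index size, and 46% less no-op update time.

```java
public class SpellcheckBenchmarkDelta {
    /** Percentage reduction going from 'before' (trunk) to 'after' (patch). */
    public static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Figures copied from the trunk/patch measurements in the comment.
        System.out.printf("scratch build time: %.1f%% lower%n", reductionPercent(229_803L, 99_214L));
        System.out.printf("index size:         %.1f%% lower%n", reductionPercent(214_322_200L, 177_781_273L));
        System.out.printf("no-op update time:  %.1f%% lower%n", reductionPercent(4_619L, 2_504L));
    }
}
```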
        

        I still left the optimize default on, but really I think most users (e.g. Solr) should set
        mergefactor to something a bit more reasonable and set optimize to false. The scratch build
        is then much faster (60,000ms), but the no-op update time is heavier (e.g. 16,000ms). Still,
        if you are rebuilding on every commit for smallish updates, something like 20-30 seconds
        is a lot better than 100 seconds. For now I kept the defaults as is (optimizing every time).

        Michael McCandless added a comment -

        Patch looks great Robert!

        Do we really need to handle subclasses that override exists?

        Robert Muir added a comment -

        "Do we really need to handle subclasses that override exists?"

        The only reason I did this is because I want to backport this to branch_3x too, since it
        significantly speeds up spellchecker rebuilds.

        But a simpler option would be to just mark the spellchecker final. Does anyone actually
        subclass this thing? It seems like a scary thing to subclass (the synchronization etc. inside of it).

        Robert Muir added a comment -

        Committed revision 1055285, 1055289 (3x).

        It would be good to make a follow-on issue to allow Solr users to control optimize-on-build,
        and also to control the clearIndex(), so they can reasonably use incremental update rather
        than fully rebuilding the entire spellcheck index every time.

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee: Robert Muir
          • Reporter: Mark Miller