Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Our default terms index today breaks terms into blocks of fixed size
      (ie, every 32 terms is a new block), and then we build an index on top
      of that (holding the start term for each block).

      But it should be better to instead break terms according to how they
      share prefixes. This results in variable-sized blocks, but means
      within each block we maximize the shared prefix and minimize the
      resulting terms index. It should also be a speedup for terms-dict
      intensive queries, because the terms index becomes a "true" prefix
      trie and can be used to fast-fail on term lookup (ie, returning
      NOT_FOUND without having to seek/scan a terms block).

      Having a true prefix trie should also enable much faster intersection
      with automaton (but this will be a new issue).
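      For illustration only (this is not the patch's code, and the class
      and method names here are made up), a minimal Java sketch of the
      idea: group a sorted term list into variable-sized blocks keyed by
      a shared prefix, so a lookup can fast-fail when no block's prefix
      matches the target term:

```java
import java.util.*;

// Hypothetical sketch: group a sorted term list into variable-sized
// blocks by a shared leading prefix, instead of fixed every-32-terms
// blocks. Each block's key is the prefix all of its terms share.
public class PrefixBlocks {
  public static Map<String, List<String>> build(List<String> sortedTerms, int prefixLen) {
    Map<String, List<String>> blocks = new LinkedHashMap<>();
    for (String term : sortedTerms) {
      String key = term.substring(0, Math.min(prefixLen, term.length()));
      blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(term);
    }
    return blocks;
  }

  // A lookup can fast-fail: if no block key is a prefix of the target,
  // the term cannot exist, so we answer without scanning any block.
  public static boolean mayContain(Map<String, List<String>> blocks, String term, int prefixLen) {
    String key = term.substring(0, Math.min(prefixLen, term.length()));
    return blocks.containsKey(key);
  }
}
```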

      I've made an initial impl for this (called
      BlockTreeTermsWriter/Reader). It's still a work in progress... lots
      of nocommits, and hairy code, but tests pass (at least once!).

      I made two new codecs, temporarily called StandardTree, PulsingTree,
      that are just like their counterparts but use this new terms dict.

      I added a new "exactOnly" boolean to TermsEnum.seek. If that's true
      and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
      enum is unpositioned (ie you should not call next(), docs(), etc.).
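      A toy model of that seek contract (this is not Lucene's TermsEnum
      API, just an illustrative class with hypothetical names): with
      exactOnly true, a miss returns NOT_FOUND immediately and leaves the
      enum unpositioned, while a non-exact seek positions at the next
      term:

```java
import java.util.*;

// Toy model (not Lucene's API) of the exactOnly seek semantics:
// exactOnly misses fast-fail and leave the enum unpositioned.
class ToyTermsEnum {
  enum SeekStatus { FOUND, NOT_FOUND }

  private final TreeSet<String> terms;
  private String current; // null means unpositioned

  ToyTermsEnum(Collection<String> terms) { this.terms = new TreeSet<>(terms); }

  SeekStatus seek(String term, boolean exactOnly) {
    if (terms.contains(term)) {
      current = term;
      return SeekStatus.FOUND;
    }
    if (exactOnly) {
      current = null; // unpositioned: caller must not use next()/docs()
      return SeekStatus.NOT_FOUND;
    }
    current = terms.ceiling(term); // non-exact seek positions at the next term
    return SeekStatus.NOT_FOUND;
  }

  String term() { return current; }
}
```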

      In this approach the index and dict are tightly connected, so it does
      not support a pluggable index impl like BlockTermsWriter/Reader.
      Blocks are stored on certain nodes of the prefix trie, and can contain
      both terms and pointers to sub-blocks (ie, if the block is not a leaf
      block). So there are two trees, tied to one another – the index
      trie, and the blocks. Only certain nodes in the trie map to a block
      in the block tree.

      I think this algorithm is similar to burst tries
      (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
      except it allows terms to be stored on inner blocks (not just leaf
      blocks). This is important for Lucene because an [accidental]
      "adversary" could produce a terms dict with way too many blocks (way
      too much RAM used by the terms index). Still, with my current patch,
      an adversary can produce too-big blocks... which we may need to fix,
      by letting the terms index not be a true prefix trie on its leaf
      edges.

      Exactly how the blocks are picked can be factored out as its own
      policy (but I haven't done that yet). Then the burst trie is one
      policy, my current approach is another, etc. The policy can be tuned
      to the terms' expected distribution, eg if it's a primary-key field
      and you only use base 10 for each character then you want blocks of
      size 10. This can make a sizable difference on lookup cost.
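      A hedged sketch of what such a pluggable policy might look like
      (the interface and class names are invented for illustration, not
      taken from the patch):

```java
// Hypothetical policy interface for deciding when the pending terms
// under a trie node should be flushed into a block.
interface BlockPolicy {
  boolean shouldFlush(int pendingTermCount);
}

// For a base-10 primary-key field, a fanout of 10 matches the expected
// branching, so blocks line up with digit boundaries.
class FixedFanoutPolicy implements BlockPolicy {
  private final int fanout;

  FixedFanoutPolicy(int fanout) { this.fanout = fanout; }

  public boolean shouldFlush(int pendingTermCount) {
    return pendingTermCount >= fanout;
  }
}
```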

      I modified the FST Builder to allow for a "plugin" that freezes the
      "tail" (changed suffix) of each added term, because I use this to find
      the blocks.

      1. LUCENE-3030.patch
        248 kB
        Michael McCandless
      2. LUCENE-3030.patch
        253 kB
        Michael McCandless
      3. LUCENE-3030.patch
        285 kB
        Michael McCandless
      4. LUCENE-3030.patch
        379 kB
        Michael McCandless
      5. LUCENE-3030.patch
        26 kB
        Michael McCandless
      6. LUCENE-3030.patch
        27 kB
        Michael McCandless
      7. LUCENE-3030.patch
        447 kB
        Michael McCandless
      8. LUCENE-3030.patch
        441 kB
        Michael McCandless
      9. LUCENE-3030.patch
        437 kB
        Robert Muir
      10. BlockTree.png
        21 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch.

        Michael McCandless added a comment -

        Current patch...

        Michael McCandless added a comment -

        Another rev of the patch... tons of nocommits, but I think it's close (though some tests still fail w/ StandardTree codec).

        Michael McCandless added a comment -

        Checkpointing my current state here – the big change is I added a Terms.intersect(CompiledAutomaton) method, which returns a TermsEnum, but there's still something wrong with it – it seems to give the right results but makes LEV2 FuzzyQ slower.

        Michael McCandless added a comment -

        I created a branch https://svn.apache.org/repos/asf/lucene/dev/branches/blocktree_3030 for iterating on this.
        Robert Muir added a comment -

        This is awesome, i really like adding the intersect() hook!

        Thanks for making a branch, I will check it out and try to dive in to help with some of this

        One trivial thing we might want to do is to add the logic currently in AQ's ctor to CA, so that you ask CA for its termsenum.
        This way, if it can be accomplished with a simpler enum like just terms.iterator() or prefixtermsenum etc, we get that optimization always.

        Robert Muir added a comment -

        Also, we should measure if a "prefix automaton" with intersect() is faster than PrefixTermsEnum (I suspect it could be!)

        If this is true, we might want to not rewrite to prefixtermsenum anymore, instead changing PrefixQuery to extend AutomatonQuery too.

        Michael McCandless added a comment -

        One trivial thing we might want to do is to add the logic currently in AQ's ctor to CA, so that you ask CA for its termsenum.

        +1 – I think CA should serve up a TermsEnum when provided a Terms?

        Michael McCandless added a comment -

        The block tree terms dict seems to be working... all tests pass w/
        StandardTree codec. There's still more to do (many nocommits), but, I
        think the perf results should be close to what I finally commit:

        Task            QPS base  StdDev base  QPS blocktree  StdDev blocktree   Pct diff
        IntNRQ             11.58         1.37          10.11              1.77    35%-16%
        Term              106.65         3.24          98.84              4.97     14%-0%
        Prefix3            30.83         1.36          28.64              2.42     18%-5%
        OrHighHigh          5.85         0.15           5.44              0.28     14%-0%
        OrHighMed          19.25         0.62          17.91              0.86     14%-0%
        Phrase              9.37         0.42           8.87              0.10     10%-0%
        TermBGroup1M       44.02         0.90          42.76              1.08      7%-1%
        TermGroup1M        37.68         0.65          36.95              0.74      5%-1%
        TermBGroup1M1P     47.16         2.94          46.36              0.16      7%-5%
        SpanNear            5.60         0.35           5.55              0.29    11%-11%
        SloppyPhrase        3.36         0.16           3.34              0.04      6%-5%
        Wildcard           35.15         1.30          35.05              2.42    10%-10%
        AndHighHigh        10.71         0.22          10.99              0.22      1%-6%
        AndHighMed         51.15         1.44          54.31              1.84     0%-12%
        Fuzzy1             31.63         0.55          66.15              1.35  101%-117%
        PKLookup           40.00         0.75          84.93              5.49   94%-130%
        Fuzzy2             33.78         0.82          89.59              2.46  151%-179%
        Respell            23.56         1.15          70.89              1.77  179%-224%

        This is for a multi-segment index, 10 M wikipedia docs, using luceneutil.

        These are huge speedups for the terms-dict intensive queries!

        The two FuzzyQuerys and Respell get the speedup from the directly
        implemented intersect method, and PKLookup gains because it can
        often avoid seeking, since block tree's terms index can sometimes
        rule out terms by their prefix (though this relies on the PK terms
        being "predictable" – I use "%09d" w/ a counter now; if you instead
        used something more random looking (GUIDs) I don't think we'd see
        gains).
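        To illustrate the "%09d" pattern mentioned above (the class and
        helper name here are hypothetical): zero-padded, fixed-width keys
        from a counter share long prefixes, which is exactly what the
        block-tree terms index can rule out on; random GUID-like keys
        would not share such prefixes.

```java
public class PkFormat {
  // Zero-padded counter keys: consecutive IDs share long prefixes,
  // which the block-tree terms index can use to fast-fail lookups.
  public static String pk(long id) {
    return String.format("%09d", id);
  }

  public static void main(String[] args) {
    System.out.println(pk(42)); // prints 000000042
    System.out.println(pk(43)); // prints 000000043 -- shares the "0000000" prefix
  }
}
```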

        Michael McCandless added a comment -

        Here's the graph of the results:

        Simon Willnauer added a comment -

        These are huge speedups for the terms-dict intensive queries!

        oh boy! This is awesome!

        Michael McCandless added a comment -

        Patch, removing all nocommits!

        Michael McCandless added a comment -

        Some small cleanups over the last patch. I'll commit shortly.

        Michael McCandless added a comment -

        There is a small hit to indexing perf here, somewhere between 0 - 5% or so depending on the run. Given the gains for MTQs I think this is an OK tradeoff. We can further optimize the BlockTreeTermsWriter later....

        Michael McCandless added a comment -

        Final patch, against current trunk. I think it's ready to commit! I'll wait a day or so...

        Simon Willnauer added a comment -

        There is a small hit to indexing perf here, somewhere between 0 - 5% or so depending on the run. Given the gains for MTQs I think this is an OK tradeoff. We can further optimize the BlockTreeTermsWriter later....

        I think 0 - 5% is totally fine. If somebody is totally obsessed with indexing throughput they should overclock

        awesome work mike!

        Dawid Weiss added a comment -

        awesome work mike!

        I couldn't agree more, this is great stuff.

        Show
        Dawid Weiss added a comment - awesome work mike! I couldn't agree more, this is great stuff.
        Hide
        Michael McCandless added a comment -

        New patch, just cleaning up small stuff, commenting out the DEBUGs, adding some TODOs.

        Robert Muir added a comment -

        Just some trivial cleanups!

        Michael McCandless added a comment -

        Thanks Robert – looks good! I'll commit shortly.

        Uwe Schindler added a comment -

        Awesome!

        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1382141

        LUCENE-3030: add missing CHANGES entry


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes: 0
          • Watchers: 1
