Lucene - Core
LUCENE-4682

Reduce wasted bytes in FST due to array arcs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. binary search on lookup.

      The problem is that N is set to max(numBytesPerArc), so a single outlier arc (e.g. one with a big output) wastes bytes in every other arc that didn't need that many.

      I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted.

      It would be nice to reduce this.

      One thing we could do without packing: in addNode, if we detect that the number of wasted bytes is above some threshold, don't do the expansion.

      Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (pack) we could set the threshold based on the top X% of nodes that waste ...

      Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ...
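      To make the waste concrete, here is a minimal sketch (not Lucene's actual code; bytesPerArc is a hypothetical array holding each arc's encoded size) of how the padding adds up for one expanded node:

          // Every arc in the expanded node is padded out to the widest arc.
          int maxBytesPerArc = 0;
          int actualBytes = 0;
          for (int arcBytes : bytesPerArc) {
            maxBytesPerArc = Math.max(maxBytesPerArc, arcBytes);
            actualBytes += arcBytes;
          }
          int expandedBytes = maxBytesPerArc * bytesPerArc.length;
          int wastedBytes = expandedBytes - actualBytes;  // pure padding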

      1. fstByteStats.txt
        16 kB
        Michael McCandless
      2. LUCENE-4682.patch
        5 kB
        Robert Muir
      3. kuromoji.wasted.bytes.txt
        256 kB
        Michael McCandless

        Activity

        Robert Muir added a comment -

        Not sure exactly what to conclude since a byte's frequency is FST dependent ...

        So we can just add it as another param to the massive FST.Builder ctor!

        Michael McCandless added a comment -

        I was curious about the number of times each byte value occurs in an FST ... because we could pick an uncommon one and use it as a periodic marker (e.g. every 5 arcs or something) to enable binary searching.

        So I ran this on two FSTs ... first one is all Wikipedia terms and second one is FreeDB suggester (has some non-ascii song titles...).

        Not sure exactly what to conclude since a byte's frequency is FST dependent ...
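        The counting itself is simple; a sketch (fstBytes stands in for the FST's serialized byte[], an assumption about how you'd get at the data):

          // Tally how often each byte value 0..255 occurs in the FST bytes.
          long[] counts = new long[256];
          for (byte b : fstBytes) {
            counts[b & 0xFF]++;
          }
          // A value with a very small count is a candidate periodic marker.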

        Dawid Weiss added a comment -

        Yep, it's similar to what I was suggesting – instead of expanding the subnodes into a full array encode a tree-like structure that would allow binary-searching right away. They still use a binary search over this packed representation; I'd just encode the binary tree directly somehow. I guess it's a matter of checking which will be faster/more efficient. (can't devote any time for it now, sorry).
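        For what it's worth, one concrete shape for "encode the binary tree directly" is an implicit heap-ordered (Eytzinger) layout of the arc labels; a sketch of just the label search (not Lucene code, and the variable-length arc bodies would still need per-arc offsets):

          // labels is laid out in Eytzinger order: children of i sit at 2i+1 / 2i+2.
          int search(int[] labels, int label) {
            int i = 0;
            while (i < labels.length) {
              if (labels[i] == label) {
                return i;
              }
              i = 2 * i + (label < labels[i] ? 1 : 2);  // go left if smaller
            }
            return -1;  // not found
          }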

        Robert Muir added a comment -

        There's a pretty interesting approach here (http://www.aclweb.org/anthology/W/W09/W09-1505.pdf)
        that might be a good tradeoff, so we aren't stuck with a black and white situation (array arcs or not).

        Instead we could (in theory) waste a few bits/marker/escape bytes in a similar way so we can do
        an "approximate binary search"...

        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1435141

        LUCENE-4677, LUCENE-4682, LUCENE-4678, LUCENE-3298: Merged /lucene/dev/trunk:r1432459,1432466,1432472,1432474,1432522,1432646,1433026,1433109

        Michael McCandless added a comment -

        I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers

        LOL

        But yeah I agree.

        It's a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway...

        Right, but it's not good if the bus factor is 1 ... it's effectively dead code when that happens.

        It would be really, really, really good for FSTs long term to do things like remove reversing, remove packed (fold these optimizations, or at least most of them, in by default), etc.

        +1, except that NEXT buys us a much smaller FST now. We can't just drop it ... we need some way to simplify it (e.g. somehow stop reversing).

        Robert Muir added a comment -

        Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today.

        I know reversing held back delta-coding of the node target, but that looks like it won't gain much. What else is it holding back?

        I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers

        It's a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway...

        It would be really, really, really good for FSTs long term to do things like remove reversing, remove packed (fold these optimizations, or at least most of them, in by default), etc.

        Michael McCandless added a comment -

        So FST would be ~39% larger if we remove NEXT

        But according to your notes above, we have 28% waste for this (with a long output).
        Are we making the right tradeoff?

        Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today.

        Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes?

        Or, not do it at all if we can't figure it out? The reversing holds back other improvements, so benchmarking it by itself could be misleading.

        I don't think we should drop NEXT unless we have some alternative? A 39% increase in size is non-trivial!

        I know reversing held back delta-coding of the node target, but that looks like it won't gain much. What else is it holding back?

        Dawid Weiss added a comment -

        I also "tested" delta-coding the arc target instead of the abs vInt we have today ...

        I did such experiments when I was working on that paper. Remember – you don't publish negative results, unfortunately. Obviously I didn't try everything but: 1) NEXT was important, 2) the structure of the FST doesn't yield to easy local deltas; it's not easily separable and pointers typically jump all over.

        Which is surprising ... I guess we don't see much "locality" for the nodes ... or, e.g., the common suffixes freeze early on and then lots of future nodes refer to them.

        Not really that surprising – you encode common suffixes after all so most of them will appear in a properly sized sample. This is actually why the strategy of moving nodes around works too – you move those that are super frequent but they'll most likely be reordered at the "top" suffix frequencies of the automaton anyway.

        Robert Muir added a comment -

        So FST would be ~39% larger if we remove NEXT

        But according to your notes above, we have 28% waste for this (with a long output).
        Are we making the right tradeoff?

        Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes?

        Or, not do it at all if we can't figure it out? The reversing holds back other improvements, so benchmarking it by itself could be misleading.

        Michael McCandless added a comment -

        I tried removing the NEXT optimization when building the all-English-Wikipedia-terms FST and it was a big hit:

        • With NEXT: 59267794 bytes
        • Without NEXT: 82543993 bytes

        So FST would be ~39% larger if we remove NEXT ... however lookup sped up from 726 ns per lookup to 636 ns. But: we could get this speedup today, if we just fixed setting of a NEXT arc's target to be lazy instead. Today it's very costly for non-array arcs because we scan to the end of all nodes to set the target, even if the caller isn't going to use it, which is really ridiculous.
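        Roughly what "lazy" could look like (a sketch only; FstReader and scanToEndOfNode are assumed stand-ins, not the current API):

          interface FstReader {
            // Hypothetical: skip over the node's remaining arcs, return its end.
            long scanToEndOfNode(long pos);
          }

          class LazyArc {
            long posAfterArc;   // where this arc's bytes end
            long target = -1;   // NEXT target: unresolved until actually followed

            long resolveTarget(FstReader in) {
              if (target == -1) {
                target = in.scanToEndOfNode(posAfterArc);  // pay the scan only on use
              }
              return target;
            }
          }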

        I also "tested" delta-coding the arc target instead of the abs vInt we have today ... it wasn't a real test, instead I just measured how many bytes the vInt delta would be vs how many bytes the vInt abs it today, and the results were disappointing:

        • Abs vInt (what we do today): 26681349 bytes
        • Delta vInt: 25479316 bytes

        Which is surprising ... I guess we don't see much "locality" for the nodes ... or, e.g., the common suffixes freeze early on and then lots of future nodes refer to them.
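        The measurement can be approximated without changing the writer at all; vIntLen below is a stand-in helper, not a Lucene API:

          // Number of bytes a value occupies as a vInt (7 payload bits per byte).
          static int vIntLen(long v) {
            int n = 1;
            while ((v >>>= 7) != 0) {
              n++;
            }
            return n;
          }
          // absBytes   += vIntLen(target);                // what we write today
          // deltaBytes += vIntLen(arcAddress - target);   // targets point backwards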

        Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes?

        Dawid Weiss added a comment -

        Yeah, there are many ideas layered on top of each other and it's gotten beyond the point of being easy to comprehend.

        As for the "next" bit – in any implementation I've seen this leads to significant reduction in automaton size. But I'm not saying it's the optimal way to do it, perhaps there are other encoding options that would reach similar compression levels without the added complexity.

        Robert Muir added a comment -

        I agree about the code complication: it's hard to try different things currently.

        I think this is due to the reversing and BIT_TARGET_NEXT? Does this really save us more than the bloat of not being able to delta-encode 'target'?

        The packed case dereferences to nodeID and looks that up from packed ints, right? But I wonder about that too: is the cost of this additional indirection in space worth it versus just delta-encoding with vInt?

        Dawid Weiss added a comment -

        I think if it would be too wasteful (e.g. exceed 10%), we should just insert some skip points or something

        I had the same thought. We only go with the extremes – either a linked list (essentially) or an array. A tree structure would be in between and would give logN searches and optimal space utilization. Obviously it'd also complicate the code.

        Robert Muir added a comment -

        Well, I actually think we shouldn't allow 25% waste; maybe only something like 10%.

        In these waste-cases, doing array-arcs is a really inefficient/overkill way to just gain binary search... I think if it would be too wasteful (e.g. exceed 10%), we should just insert some skip points or something.
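        A sketch of that skip-point idea (the layout and SKIP_INTERVAL are assumptions, not a worked-out format): record (firstLabel, byteOffset) every few arcs, binary search that small table, then scan at most a handful of arcs:

          static final int SKIP_INTERVAL = 5;

          // skipLabels[i] = label of arc i*SKIP_INTERVAL; skipOffsets[i] = its offset.
          static long seek(int[] skipLabels, long[] skipOffsets, int targetLabel) {
            int lo = 0, hi = skipLabels.length - 1, best = 0;
            while (lo <= hi) {
              int mid = (lo + hi) >>> 1;
              if (skipLabels[mid] <= targetLabel) {
                best = mid;      // rightmost skip point at or before the target
                lo = mid + 1;
              } else {
                hi = mid - 1;
              }
            }
            return skipOffsets[best];  // caller scans <= SKIP_INTERVAL-1 arcs from here
          }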

        Michael McCandless added a comment -

        OK I ran several performance tests, comparing trunk (lots of waste),
        the "up to 25% waste", and 0 waste (no array arcs).

        First, I tested luceneutil on all of English Wikipedia, on an optimized index:

        Base = trunk (lots of waste), comp = up to 25% waste:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         Respell      167.37      (3.3%)      128.36      (3.2%)  -23.3% ( -28% -  -17%)
                          Fuzzy2       83.54      (8.0%)       74.17      (7.2%)  -11.2% ( -24% -    4%)
                        PKLookup      359.62      (2.9%)      346.88      (4.5%)   -3.5% ( -10% -    4%)
                        Wildcard       28.48      (4.3%)       28.25      (3.9%)   -0.8% (  -8% -    7%)
                          Fuzzy1       82.77      (7.1%)       82.36      (7.7%)   -0.5% ( -14% -   15%)
                HighSloppyPhrase        0.78      (4.6%)        0.78      (4.5%)   -0.4% (  -9% -    9%)
                          IntNRQ        3.38      (8.1%)        3.37      (8.1%)   -0.2% ( -15% -   17%)
                 MedSloppyPhrase       28.13      (2.4%)       28.13      (2.4%)   -0.0% (  -4% -    4%)
                 LowSloppyPhrase       25.10      (1.8%)       25.11      (1.8%)    0.0% (  -3% -    3%)
                         Prefix3       18.40      (4.8%)       18.41      (5.3%)    0.1% (  -9% -   10%)
                         MedTerm       65.42     (18.2%)       65.46     (17.7%)    0.1% ( -30% -   43%)
                         LowTerm      316.34      (6.9%)      317.45      (7.1%)    0.4% ( -12% -   15%)
                       LowPhrase       23.30      (2.1%)       23.39      (1.8%)    0.4% (  -3% -    4%)
                        HighTerm       52.11     (18.6%)       52.33     (18.5%)    0.4% ( -30% -   46%)
                       OrHighMed       25.65      (6.7%)       25.76      (7.2%)    0.4% ( -12% -   15%)
                     AndHighHigh       19.42      (1.8%)       19.52      (2.0%)    0.5% (  -3% -    4%)
                       OrHighLow       24.19      (7.1%)       24.36      (7.4%)    0.7% ( -12% -   16%)
                      OrHighHigh       14.32      (6.6%)       14.45      (7.3%)    0.9% ( -12% -   15%)
                    HighSpanNear        3.30      (3.9%)        3.33      (3.1%)    0.9% (  -5% -    8%)
                     LowSpanNear        8.94      (4.3%)        9.04      (3.4%)    1.1% (  -6% -    9%)
                      AndHighMed       78.66      (1.6%)       79.82      (1.4%)    1.5% (  -1% -    4%)
                     MedSpanNear       30.45      (3.1%)       30.91      (2.8%)    1.5% (  -4% -    7%)
                       MedPhrase      130.74      (5.8%)      133.03      (5.9%)    1.8% (  -9% -   14%)
                      AndHighLow      571.07      (3.5%)      582.83      (2.8%)    2.1% (  -4% -    8%)
                      HighPhrase       11.62      (6.1%)       11.88      (6.4%)    2.2% (  -9% -   15%)
        

        trunk .tip file was 23928963 and comp was 17125654 bytes (~28% smaller!).

        Then, base=trunk, comp=0 waste (no array arcs):

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                        PKLookup      359.26      (2.3%)      152.91      (5.3%)  -57.4% ( -63% -  -51%)
                         Respell      168.82      (3.4%)      128.85      (2.2%)  -23.7% ( -28% -  -18%)
                          Fuzzy2       85.59      (8.2%)       75.03      (6.6%)  -12.3% ( -25% -    2%)
                         LowTerm      331.01      (7.9%)      324.30      (7.2%)   -2.0% ( -15% -   14%)
                        HighTerm       54.75     (19.2%)       54.00     (19.2%)   -1.4% ( -33% -   45%)
                         MedTerm       68.68     (18.5%)       68.04     (19.0%)   -0.9% ( -32% -   44%)
                      AndHighMed       80.50      (1.4%)       79.78      (1.4%)   -0.9% (  -3% -    1%)
                        Wildcard       29.13      (4.5%)       28.89      (4.0%)   -0.8% (  -8% -    8%)
                       LowPhrase       23.88      (2.3%)       23.69      (1.5%)   -0.8% (  -4% -    3%)
                      AndHighLow      584.30      (2.9%)      582.17      (2.9%)   -0.4% (  -6% -    5%)
                          Fuzzy1       83.84      (6.9%)       83.62      (6.7%)   -0.3% ( -12% -   14%)
                     AndHighHigh       19.88      (1.8%)       19.84      (1.5%)   -0.2% (  -3% -    3%)
                 LowSloppyPhrase       25.61      (1.9%)       25.56      (2.0%)   -0.2% (  -4% -    3%)
                     LowSpanNear        9.20      (3.6%)        9.20      (3.9%)    0.0% (  -7% -    7%)
                     MedSpanNear       31.32      (2.7%)       31.36      (2.9%)    0.1% (  -5% -    5%)
                 MedSloppyPhrase       28.67      (2.3%)       28.73      (2.2%)    0.2% (  -4% -    4%)
                         Prefix3       18.83      (4.9%)       18.88      (5.4%)    0.3% (  -9% -   11%)
                      OrHighHigh       14.71      (7.5%)       14.76      (8.2%)    0.3% ( -14% -   17%)
                    HighSpanNear        3.39      (3.0%)        3.40      (2.9%)    0.5% (  -5% -    6%)
                       OrHighMed       26.28      (7.3%)       26.47      (8.1%)    0.7% ( -13% -   17%)
                          IntNRQ        3.44      (8.6%)        3.47      (9.3%)    0.9% ( -15% -   20%)
                       OrHighLow       24.69      (7.3%)       24.97      (8.3%)    1.1% ( -13% -   18%)
                HighSloppyPhrase        0.80      (4.8%)        0.81      (4.7%)    1.3% (  -7% -   11%)
                      HighPhrase       11.89      (6.1%)       12.20      (7.4%)    2.6% ( -10% -   17%)
                       MedPhrase      134.04      (5.5%)      138.07      (6.4%)    3.0% (  -8% -   15%)
        

        comp .tip file was 16806594 (just a bit smaller than "up to 25% waste").

        Curious how PK was barely affected by the "up to 25%", but heavily
        affected by the "0 waste", and how Respell/Fuzzy2 lost perf going to
        "up to 25% waste" and then didn't lose much going to "0 waste". I
        suspect PK will be sensitive to the key structure (luceneutil uses
        base 36 keys)...

        Next I tested Kuromoji, just tokenizing first 100K docs from jawiki.

        • trunk (lots of waste): TokenInfoFST was 1535643 bytes, tokenizing 100K docs took 163.3 sec
        • up to 25% waste: TokenInfoFST was 1309108 bytes, tokenizing 100K docs took 215.3 sec
        • 0 waste: TokenInfoFST was 1228847 bytes, tokenizing 100K docs took 218.0 sec

        Looks like Kuromoji sees good gains from the > 25% waste arcs... but
        to keep this in perspective, in a "real" app, you index after
        tokenizing and that indexing time will dwarf the tokenization time.
        But then on the flip side we are "only" talking about ~221 KB ...

        Finally, I tested building a no-outputs FST for all Wikipedia English
        terms (9.8 M terms):

        • trunk (lots of waste): 59267794 bytes, 566 nsec per lookup
        • up to 25% waste: 58509011 bytes, 567 nsec per lookup
        • 0 waste: 56413148 bytes, 1808 nsec per lookup

        For this case it looks like you keep all of the lookup-speed benefit allowing only 25% waste (though most of the extra bytes as well?).

        Robert Muir added a comment -

        Another simple idea: instead of boolean allowArrayArcs we just make this a float: acceptableArrayArcOverhead (or maybe a better name).

        You would pass 0 to disable array arcs completely (and internally we'd still keep our boolean allowArrayArcs and not waste time computing stuff if this is actually <= 0).
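        A sketch of what that could look like (names assumed, not the actual Builder API):

          class ArcArrayPolicy {
            final float acceptableOverhead;   // e.g. 0.25f allows up to 25% padding
            final boolean allowArrayArcs;     // derived: <= 0 disables array arcs

            ArcArrayPolicy(float acceptableOverhead) {
              this.acceptableOverhead = acceptableOverhead;
              this.allowArrayArcs = acceptableOverhead > 0;
            }

            // Expand to an array only when the padding stays within budget.
            boolean shouldExpand(int actualBytes, int expandedBytes) {
              return allowArrayArcs
                  && expandedBytes <= actualBytes * (1 + acceptableOverhead);
            }
          }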

        Commit Tag Bot added a comment -

        [trunk commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1432522

        LUCENE-4682: vInt-encode maxBytesPerArc

        Robert Muir added a comment -

        OK, I committed the vInt for maxBytesPerArc, but left out the heuristic (so we still have the waste!!!)

        Here's the comment I added:

            // TODO: try to avoid wasteful cases: disable doFixedArray in that case
            /* 
             * 
             * LUCENE-4682: what is a fair heuristic here?
             * It could involve some of these:
             * 1. how "busy" the node is: nodeIn.inputCount relative to frontier[0].inputCount?
             * 2. how much binSearch saves over scan: nodeIn.numArcs
             * 3. waste: numBytes vs numBytesExpanded
             * 
             * the one below just looks at #3
            if (doFixedArray) {
              // rough heuristic: make this 1.25 "waste factor" a parameter to the phd ctor????
              int numBytes = lastArcStart - startAddress;
              int numBytesExpanded = maxBytesPerArc * nodeIn.numArcs;
              if (numBytesExpanded > numBytes*1.25) {
                doFixedArray = false;
              }
            }
            */
        

        I think it would just be best to do some performance benchmarks and figure this out.
        I know all the Kuromoji waste is at node.depth=1 exactly.

        Also I indexed all of geonames with this heuristic and it barely changed the FST size:

        trunk:                          FST: 45296685   packedFST: 39083451
        vint maxBytesPerArc:            FST: 45052386   packedFST: 39083451
        vint maxBytesPerArc+heuristic:  FST: 44988400   packedFST: 39029108

        So the waste and heuristic doesn't affect all FSTs, only certain ones.

        Uwe Schindler added a comment -

        +1

        Dawid Weiss added a comment -

        +1. Nice.

        Michael McCandless added a comment -

        +1

        Robert Muir added a comment -

        I can clean up and commit the patch with the heuristic commented out (so we still get the cutover to vInt, which I think is an obvious win?).

        This way we can benchmark and make sure the heuristic is set appropriately and doesn't hurt performance?

        Michael McCandless added a comment -

        +1

        This is much cleaner (writing the header at the end).

        I built the AnalyzingSuggester for FreeDB: trunk is 1.046 GB and with the patch it's 0.917 GB = ~12% smaller!

        Robert Muir added a comment -

        Mike can you try this patch on your corpus?

        It cuts us over to vint for the maxBytesPerArc (saving 3 bytes for the unpacked case), and adds an "acceptable overhead" for array arcs (currently 1.25).

        For the kuromoji packed case, this seems to solve the waste:

        [java] 53645 nodes, 253185 arcs, 1309077 bytes... done

        Michael McCandless added a comment -

        Another datapoint: the FreeDB suggester (tool in luceneutil to create/test it) is 1.05 GB FST, and has 87.5 MB wasted bytes (~8%).

        Dawid Weiss added a comment -

        Even more than the 271,187 I measured (20% smaller FST), I think because the FST is now smaller we use fewer bytes writing the delta-coded node addresses

        Yes, these things are all tightly coupled.

        Dawid

        Robert Muir added a comment -

        In the fixedArray case:

        // write a "false" first arc:
        writer.writeByte(ARCS_AS_FIXED_ARRAY);
        writer.writeVInt(nodeIn.numArcs);
        // placeholder -- we'll come back and write the number
        // of bytes per arc (int) here:
        // TODO: we could make this a vInt instead
        writer.writeInt(0);
        fixedArrayStart = writer.getPosition();
        

        I think we should actually make that TODO line a writeByte.

        If it turns out the max arcSize is > 255, I think we should just not encode as array arcs (just save our position before we write ARCS_AS_FIXED_ARRAY, rewind to that, and encode normally).

        This would reduce the overhead of array-arcs, but also maybe prevent some worst cases causing waste as a side effect.
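        A sketch of the one-byte header plus the fallback (hypothetical helper; in the real writer maxBytesPerArc is only known after the arcs are written, which is exactly why the rewind would be needed):

          // DataOutput stands in for the writer; ARCS_AS_FIXED_ARRAY as above.
          static boolean writeFixedArrayHeader(DataOutput out, int numArcs,
                                               int maxBytesPerArc) throws IOException {
            if (maxBytesPerArc > 255) {
              return false;  // caller rewinds and encodes the arcs the normal way
            }
            out.writeByte(ARCS_AS_FIXED_ARRAY);
            out.writeVInt(numArcs);
            out.writeByte((byte) maxBytesPerArc);  // 1 byte instead of writeInt's 4
            return true;
          }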

        Michael McCandless added a comment -

        Even more than the 271,187 I measured (20% smaller FST), I think because the FST is now smaller we use fewer bytes writing the delta-coded node addresses ...

        Robert Muir added a comment -

        As an experiment I turned off array arcs for Kuromoji in my trunk checkout:

        FST
        before: [java] 53645 nodes, 253185 arcs, 1535612 bytes... done
        after: [java] 53645 nodes, 253185 arcs, 1228816 bytes... done

        JAR
        before: -rw-rw-r-- 1 rmuir rmuir 4581420 Jan 12 09:56 lucene-analyzers-kuromoji-4.1-SNAPSHOT.jar
        after:  -rw-rw-r-- 1 rmuir rmuir 4306792 Jan 12 09:56 lucene-analyzers-kuromoji-5.0-SNAPSHOT.jar

        Michael McCandless added a comment -

        Shows the wasted bytes ... one line per node whose arcs were turned into an array, sorted by net bytes wasted.

        Michael McCandless added a comment -

        Maybe we should just tighten up the FST thresholds for when we make an array arc:

          /**
           * @see #shouldExpand(UnCompiledNode)
           */
          final static int FIXED_ARRAY_SHALLOW_DISTANCE = 3; // 0 => only root node.
        
          /**
           * @see #shouldExpand(UnCompiledNode)
           */
          final static int FIXED_ARRAY_NUM_ARCS_SHALLOW = 5;
        
          /**
           * @see #shouldExpand(UnCompiledNode)
           */
          final static int FIXED_ARRAY_NUM_ARCS_DEEP = 10;
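
          Roughly how those constants gate expansion (a sketch; the real shouldExpand also checks the builder's allowArrayArcs flag):

            boolean shouldExpand(int depth, int numArcs) {
              return (depth <= FIXED_ARRAY_SHALLOW_DISTANCE
                      && numArcs >= FIXED_ARRAY_NUM_ARCS_SHALLOW)
                  || numArcs >= FIXED_ARRAY_NUM_ARCS_DEEP;
            }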
        

        When I print out the waste, it's generally the smaller nodes that have higher proportional waste:

             [java] waste: 44 numArcs=16 perArc=2.75
             [java] waste: 20 numArcs=11 perArc=1.8181819
             [java] waste: 13 numArcs=5 perArc=2.6
             [java] waste: 20 numArcs=12 perArc=1.6666666
             [java] waste: 60 numArcs=20 perArc=3.0
             [java] waste: 0 numArcs=5 perArc=0.0
             [java] waste: 48 numArcs=15 perArc=3.2
             [java] waste: 16 numArcs=5 perArc=3.2
             [java] waste: 20 numArcs=6 perArc=3.3333333
             [java] waste: 8 numArcs=6 perArc=1.3333334
             [java] waste: 24 numArcs=8 perArc=3.0
             [java] waste: 32 numArcs=9 perArc=3.5555556
             [java] waste: 17 numArcs=7 perArc=2.4285715
             [java] waste: 13 numArcs=5 perArc=2.6
             [java] waste: 17 numArcs=6 perArc=2.8333333
             [java] waste: 28 numArcs=8 perArc=3.5
             [java] waste: 20 numArcs=16 perArc=1.25
             [java] waste: 44 numArcs=15 perArc=2.9333334
             [java] waste: 28 numArcs=13 perArc=2.1538463
             [java] waste: 28 numArcs=15 perArc=1.8666667
        
        Michael McCandless added a comment -

        A couple more ideas:

        • Since the root arc is [usually?] cached ... we [usually] shouldn't make the root node into an array?
        • The building process sometimes has freedom in where the outputs are pushed ... so in theory we could push the outputs forwards if it would mean fewer wasted bytes on the prior node ... this would be a tricky optimization problem I think.

          People

          • Assignee: Unassigned
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 3

            Dates

            • Created:
            • Updated:
