Lucene - Core
  1. Lucene - Core
  2. LUCENE-3801

Generify FST shortestPaths() to take a comparator

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Not sure we should do this, it costs 5-10% performance for WFSTSuggester.
      But maybe we can optimize something here, or maybe its just no big deal to us.

      Because in general, this could be pretty powerful, e.g. if you needed to store
      some custom stuff in the suggester, you could use pairoutputs, or whatever.

      And the possibility we might need shortestPaths for other cool things... at the
      least I just wanted to have the patch up here.

      I haven't tested this on pairoutputs... but i've tested it with e.g. FloatOutputs
      and other things and it works fine.

      I tried to minimize the generics violations, there is only 1 (cannot create generic array).

      1. LUCENE-3801.patch
        15 kB
        Robert Muir
      2. LUCENE-3801.patch
        22 kB
        Robert Muir
      3. LUCENE-3801.patch
        23 kB
        Robert Muir
      4. LUCENE-3801.patch
        23 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        here's my patch, again for what its worth.

        Show
        Robert Muir added a comment - here's my patch, again for what its worth.
        Hide
        Robert Muir added a comment -

        looking at the benchmarks again: I think there is an "off-by-a-factor-of-one-thousand" on the benchmark.

        I could not for the life of me figure out why these suggesters would have e.g. 220QPS, but I think instead
        that the benchmark is wrong. So 200,000-210,000QPS compared to 220,000QPS really cannot matter!

        If the benchmark can really satisfy 50000 queries in 220 ms, thats no 226QPS!

        I'll come back with a PairOutputs test in a bit to confirm the functionality but I'm not going to worry
        about 5%/10% performance here, thats crazy.

        Show
        Robert Muir added a comment - looking at the benchmarks again: I think there is an "off-by-a-factor-of-one-thousand" on the benchmark. I could not for the life of me figure out why these suggesters would have e.g. 220QPS, but I think instead that the benchmark is wrong. So 200,000-210,000QPS compared to 220,000QPS really cannot matter! If the benchmark can really satisfy 50000 queries in 220 ms, thats no 226QPS! I'll come back with a PairOutputs test in a bit to confirm the functionality but I'm not going to worry about 5%/10% performance here, thats crazy.
        Hide
        Robert Muir added a comment -

        updated patch with tests (which exposes a real problem!).

        I created "paired" versions (Pair<Long,Long> where its weight,output) of the two existing shortestPath tests: the simple one, and the random one. The simple one passes (somehow), but the random one triggers an assertion in shortestPaths

        Basically the issue is that the original impl has an optimization: when you pick some minimum path (say 5), you know there is sequence of all zeros leading to some final state, otherwise its not actually the minimum

        The code takes advantage of this and (fortunately) asserts that it finds this path of all NO_OUTPUTs.

        The problem is if you have a PairOutputs(weight,output), its only NO_OUTPUT if both weight and output are also NO_OUTPUT, but this only holds true for the weight... the output has its own separately pushed minimums that don't necessarily correspond...

        Show
        Robert Muir added a comment - updated patch with tests (which exposes a real problem!). I created "paired" versions (Pair<Long,Long> where its weight,output) of the two existing shortestPath tests: the simple one, and the random one. The simple one passes (somehow), but the random one triggers an assertion in shortestPaths Basically the issue is that the original impl has an optimization: when you pick some minimum path (say 5), you know there is sequence of all zeros leading to some final state, otherwise its not actually the minimum The code takes advantage of this and (fortunately) asserts that it finds this path of all NO_OUTPUTs. The problem is if you have a PairOutputs(weight,output), its only NO_OUTPUT if both weight and output are also NO_OUTPUT, but this only holds true for the weight... the output has its own separately pushed minimums that don't necessarily correspond...
        Hide
        Robert Muir added a comment - - edited

        I think this might not be hard to fix, consider my comparator:

        // compares just the weight side of the pair
        static final Comparator<Pair<Long,Long>> minPairWeightComparator = new Comparator<Pair<Long,Long>> () {
          public int compare(Pair<Long,Long> left, Pair<Long,Long> right) {
            return left.output1.compareTo(right.output1);
          }  
        };
        

        Instead of looking for a NO_OUTPUT path, I think we should instead express "output == NO_OUTPUT" as "comparator.compare(previousOutput, outputs.add(previousOutput, output)) == 0",
        Ill rerun benchmarks but i don't think this will hurt.

        Show
        Robert Muir added a comment - - edited I think this might not be hard to fix, consider my comparator: // compares just the weight side of the pair static final Comparator<Pair< Long , Long >> minPairWeightComparator = new Comparator<Pair< Long , Long >> () { public int compare(Pair< Long , Long > left, Pair< Long , Long > right) { return left.output1.compareTo(right.output1); } }; Instead of looking for a NO_OUTPUT path, I think we should instead express "output == NO_OUTPUT" as "comparator.compare(previousOutput, outputs.add(previousOutput, output)) == 0", Ill rerun benchmarks but i don't think this will hurt.
        Hide
        Robert Muir added a comment -

        updated patch, instead of looking for 'output == 0' using the 'previousOutput + output == previousOutput' via the comparator.

        random test passes now

        Show
        Robert Muir added a comment - updated patch, instead of looking for 'output == 0' using the 'previousOutput + output == previousOutput' via the comparator. random test passes now
        Hide
        Robert Muir added a comment -

        I think this patch is ready to commit, but the tricky FST math there (compare + add) does add some additional cost.

        Still, I think 180,000QPS versus 210,000QPS or whatever, who cares. Being able to separately have weights and outputs and
        do shortest path operations on just the weight side (with any Outputs representation) is really powerful and I think we can
        use it to improve our suggesters.

        Benchmarks are all in QPS with a top-N of 7 suggestions (50,000 inputs)

        impl prefixes 2-4 prefixes 6-9 prefixes 100-200
        Jaspell 129,000 330,000 436,000
        TST 31,000 258,000 820,000
        FST 330,000 263,000 269,000
        WFST 209,000 606,000 781,000
        WFST-Generic 179,000 521,000 708,000

        I'll wait a bit in case someone wants to review or knows of ways to speed up the patch

        Show
        Robert Muir added a comment - I think this patch is ready to commit, but the tricky FST math there (compare + add) does add some additional cost. Still, I think 180,000QPS versus 210,000QPS or whatever, who cares. Being able to separately have weights and outputs and do shortest path operations on just the weight side (with any Outputs representation) is really powerful and I think we can use it to improve our suggesters. Benchmarks are all in QPS with a top-N of 7 suggestions (50,000 inputs) impl prefixes 2-4 prefixes 6-9 prefixes 100-200 Jaspell 129,000 330,000 436,000 TST 31,000 258,000 820,000 FST 330,000 263,000 269,000 WFST 209,000 606,000 781,000 WFST-Generic 179,000 521,000 708,000 I'll wait a bit in case someone wants to review or knows of ways to speed up the patch
        Hide
        Robert Muir added a comment -

        Hmm actually, i think we can speed this up a little bit if we re-arrange stuff in the addIfCompetitive?
        Currently this patch causes us to 'add twice' (once for the A + B = A stuff, and then again in addIfCompetitive!)

        Show
        Robert Muir added a comment - Hmm actually, i think we can speed this up a little bit if we re-arrange stuff in the addIfCompetitive? Currently this patch causes us to 'add twice' (once for the A + B = A stuff, and then again in addIfCompetitive!)
        Hide
        Christian Moen added a comment -

        If it's possible to speed things up, that's great. Regardless, I think the flexibility vs. speed trade-off you are proposing here is a very good idea. Powerful stuff, indeed. +1.

        Show
        Christian Moen added a comment - If it's possible to speed things up, that's great. Regardless, I think the flexibility vs. speed trade-off you are proposing here is a very good idea. Powerful stuff, indeed. +1.
        Hide
        Robert Muir added a comment -

        yeah as an example use case we could use Pair<weight,originalText> as an 'output' but
        store the kuromoji-generated readings as the input side for japanese text, we would
        just analyze the japanese input to a reading at query-time too and I think it would work...

        Show
        Robert Muir added a comment - yeah as an example use case we could use Pair<weight,originalText> as an 'output' but store the kuromoji-generated readings as the input side for japanese text, we would just analyze the japanese input to a reading at query-time too and I think it would work...
        Hide
        Christian Moen added a comment -

        Exactly. If we match on normalized forms that also deal with the combined scripts users would use when they type in Japanese queries, i.e. ほっk to get 北海道 (Hokkaido) suggested, I think we have a good starting point to build a Japanese query suggester/autocompleter.

        Show
        Christian Moen added a comment - Exactly. If we match on normalized forms that also deal with the combined scripts users would use when they type in Japanese queries, i.e. ほっk to get 北海道 (Hokkaido) suggested, I think we have a good starting point to build a Japanese query suggester/autocompleter.
        Hide
        Michael McCandless added a comment -

        I think instead of per-arc add (to find the NO_OUTPUT arc) you can just do comparator.compare(arc.output, NO_OUTPUT) == 0.

        Otherwise patch looks great! I think the perf loss if acceptable...

        Very exciting!

        Show
        Michael McCandless added a comment - I think instead of per-arc add (to find the NO_OUTPUT arc) you can just do comparator.compare(arc.output, NO_OUTPUT) == 0. Otherwise patch looks great! I think the perf loss if acceptable... Very exciting!
        Hide
        Robert Muir added a comment -

        I think instead of per-arc add (to find the NO_OUTPUT arc) you can just do comparator.compare(arc.output, NO_OUTPUT) == 0.

        I have no idea what I was thinking... this is why we have code reviews i guess

        Ill upload a new patch with new benchmarks.

        Show
        Robert Muir added a comment - I think instead of per-arc add (to find the NO_OUTPUT arc) you can just do comparator.compare(arc.output, NO_OUTPUT) == 0. I have no idea what I was thinking... this is why we have code reviews i guess Ill upload a new patch with new benchmarks.
        Hide
        Robert Muir added a comment -

        updated patch with mike's idea.

        this wins some perf back:

        impl prefixes 2-4 prefixes 6-9 prefixes 100-200
        Jaspell 129,000 330,000 436,000
        TST 31,000 258,000 820,000
        FST 330,000 263,000 269,000
        WFST 209,000 606,000 781,000
        WFST-Generic 190,000 550,000 765,000

        I think this is ok... I'll commit soon.

        Show
        Robert Muir added a comment - updated patch with mike's idea. this wins some perf back: impl prefixes 2-4 prefixes 6-9 prefixes 100-200 Jaspell 129,000 330,000 436,000 TST 31,000 258,000 820,000 FST 330,000 263,000 269,000 WFST 209,000 606,000 781,000 WFST-Generic 190,000 550,000 765,000 I think this is ok... I'll commit soon.

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development