Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.1, 5.0
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801,
      I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.

      In particular I think the most flexible approach is to integrate with Analyzer at both build and query time,
      such that we build a wFST with:
      input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
      output: surface form such as "the ghost of christmas past"
      weight: the weight of the suggestion

      We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation on the weight side (like
      the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.
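      The pair-outputs scheme can be sketched with plain collections. This is a toy model, not the Lucene API: a TreeMap stands in for the sorted wFST, and a linear scan over the prefix range plays the role of the shortest-path search on the weight side.

      ```java
      import java.util.*;

      // Toy sketch (not the Lucene API): map each analyzed key (tokens joined by
      // byte 0, the optional separator) to a weighted surface form, then answer a
      // prefix query by scanning matching keys and keeping the lowest weight,
      // which is the role the shortest-path search plays on the real wFST.
      public class SuggesterSketch {
          static final String SEP = "\u0000";

          static String key(String... tokens) { return String.join(SEP, tokens); }

          // fst: analyzed key -> {weight (Long), surface (String)}.
          static String suggest(TreeMap<String, Object[]> fst, String prefix) {
              long bestWeight = Long.MAX_VALUE;
              String bestSurface = null;
              for (Map.Entry<String, Object[]> e : fst.tailMap(prefix).entrySet()) {
                  if (!e.getKey().startsWith(prefix)) break; // left the prefix range
                  long w = (Long) e.getValue()[0];
                  if (w < bestWeight) { bestWeight = w; bestSurface = (String) e.getValue()[1]; }
              }
              return bestSurface;
          }

          static String demo() {
              TreeMap<String, Object[]> fst = new TreeMap<>();
              // Stopwords dropped on the analyzed side; surface form kept whole.
              fst.put(key("ghost", "christmas", "past"), new Object[]{10L, "the ghost of christmas past"});
              fst.put(key("ghost", "rider"), new Object[]{20L, "ghost rider"});
              return suggest(fst, key("ghost", "chr"));
          }

          public static void main(String[] args) { System.out.println(demo()); }
      }
      ```

      Typing the analyzed prefix "ghost chr" finds "the ghost of christmas past" even though the stored surface form contains the stopwords.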

      This allows a lot of flexibility:

      • Using even StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...",
        it will suggest "the ghost of christmas past"
      • we can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
      • this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading,
        so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
      • other general things like offering suggestions that are more "fuzzy" like using a plural stemmer or ignoring accents or whatever.

      According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not
      explode (it's just short of twice that of a regular wFST, but this is still far smaller than a TST or JaSpell, etc.).

      1. LUCENE-3842.patch
        118 kB
        Michael McCandless
      2. LUCENE-3842.patch
        117 kB
        Michael McCandless
      3. LUCENE-3842.patch
        116 kB
        Michael McCandless
      4. LUCENE-3842.patch
        46 kB
        Michael McCandless
      5. LUCENE-3842.patch
        22 kB
        Michael McCandless
      6. LUCENE-3842.patch
        25 kB
        Michael McCandless
      7. LUCENE-3842.patch
        17 kB
        Michael McCandless
      8. LUCENE-3842.patch
        20 kB
        Michael McCandless
      9. LUCENE-3842.patch
        72 kB
        Michael McCandless
      10. LUCENE-3842.patch
        74 kB
        Sudarshan Gaikaiwari
      11. LUCENE-3842.patch
        65 kB
        Robert Muir
      12. LUCENE-3842.patch
        64 kB
        Michael McCandless
      13. LUCENE-3842.patch
        58 kB
        Michael McCandless
      14. LUCENE-3842.patch
        53 kB
        Michael McCandless
      15. LUCENE-3842.patch
        34 kB
        Robert Muir
      16. LUCENE-3842.patch
        34 kB
        Robert Muir
      17. LUCENE-3842.patch
        44 kB
        Robert Muir
      18. LUCENE-3842.patch
        26 kB
        Robert Muir
      19. LUCENE-3842-TokenStream_to_Automaton.patch
        17 kB
        Michael McCandless

        Issue Links

          Activity

          Robert Muir added a comment -

          Messy, dirty, drinking-beer-on-a-Saturday-afternoon style patch with tons of nocommits.

          particularly:

          • We should make the separator optional; I don't think the Japanese case cares so much about segmentation (only readings), but the English case does.
          • Both index-time and query-time should respect position increments (if you don't want that for e.g. StopFilter, set your analyzer accordingly). But I don't even want to go this route until after LUCENE-3767 has landed: it adds PositionLengthAttribute, and there is no sense in dealing with a "lossy" tokenstream when we want the real graph, not one that's been sausaged into the existing tokenstream model.
          • Duplicates on the input side don't matter; of course two suggestions could analyze to the same thing. Though this is not particularly desirable (as they will likely return the same query results), I don't think we should just drop duplicates in favor of the highest cost? Instead we can add bogus bytes to the end of the FST input side to uniquify them, since it won't matter anyway.
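          The uniquify-with-bogus-bytes idea in the last bullet can be sketched like this (the \u0001 marker and counter scheme are illustrative choices, not what the patch does):

          ```java
          import java.util.*;

          // Sketch of the dedup idea: when two different surface forms analyze to
          // the same key, append "bogus" disambiguating bytes (here \u0001 plus a
          // counter) so both entries survive on the FST input side, instead of
          // dropping all but the best-weighted duplicate.
          public class Uniquify {
              static List<String> uniquify(List<String> analyzedKeys) {
                  Map<String, Integer> seen = new HashMap<>();
                  List<String> out = new ArrayList<>();
                  for (String k : analyzedKeys) {
                      int dup = seen.merge(k, 1, Integer::sum) - 1;
                      // First occurrence keeps the raw key; later ones get a unique suffix.
                      out.add(dup == 0 ? k : k + '\u0001' + dup);
                  }
                  return out;
              }
          }
          ```

          Since the suffix sits past the analyzed bytes, prefix queries still reach every duplicate, and each key in the FST input stays distinct.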
          Robert Muir added a comment -

          Also, just so I don't forget: we should allow separate specification of "index-time" and "query-time" analyzers,
          because in cases where you are adding synonyms/WDF/etc. there is a tradeoff (bigger FST versus slower query-time performance).

          Robert Muir added a comment -

          Linking to SOLR-2479, since I think this is the most obvious way to support infix suggestions... we already have the analyzer splitting the terms into "words", so it's just a matter of indexing tokenstream suffixes if the user wants.

          However, I think anything that supports infixing should look at this as word-level edit distance... we should apply some cost penalty to these since they are somewhat of a "reach" for a query suggestion.

          I think we want to allow several parameters for that though:

          1. maximum infix length (we should only infix suffixes up to some N)
          2. infix penalty (I think when indexing infixed suffixes we should simply bake a penalty into the FST)
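          Those two knobs might look like the following sketch. The names (maxInfix, penalty) are made up for illustration, and weights are assumed lower-is-better here to match the shortest-path framing:

          ```java
          import java.util.*;

          // Sketch of indexing penalized token-suffix entries alongside the normal
          // prefix entry: suffixes starting up to maxInfix tokens in get the fixed
          // penalty baked into their weight, so infix matches rank below true
          // prefix matches (assuming lower weight = better suggestion).
          public class InfixSuffixes {
              // Returns analyzed-key -> weight entries for one suggestion.
              static LinkedHashMap<String, Long> entries(List<String> tokens, long weight,
                                                         int maxInfix, long penalty) {
                  LinkedHashMap<String, Long> out = new LinkedHashMap<>();
                  out.put(String.join("\u0000", tokens), weight); // normal prefix entry
                  for (int i = 1; i <= Math.min(maxInfix, tokens.size() - 1); i++) {
                      // Penalized entry for the suffix starting at token i.
                      out.put(String.join("\u0000", tokens.subList(i, tokens.size())),
                              weight + penalty);
                  }
                  return out;
              }
          }
          ```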
          Michael McCandless added a comment -

          VERY cool

          Michael McCandless added a comment -

          Once posLength is in, I think a very simple way to handle multiple paths at query time is to open up the TopNSearcher class in oal.fst.Util.

          Currently the API only allows you to pass in a single starting FST node, but we can easily improve this by adding e.g. an addStartNode(FST.Arc<T>, int startingCost) instead. This way the app could create a TopNSearcher, add any number of start nodes with their initial path costs, then call .search() to get the best completions.

          The only limitation of this is that all differences must be pre-computed as an initial path cost that's "consistent" with how the path costs are accumulated (with the Outputs.add) during searching; I'm not sure if that'd be overly restrictive here?
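          The multiple-start-node idea can be sketched over a plain weighted graph. The addStartNode name and the int-node graph are illustrative only; the real oal.fst.Util API works on FST arcs and generic outputs:

          ```java
          import java.util.*;

          // Sketch of the proposal: a top-N searcher whose queue can be seeded with
          // several start nodes, each carrying its own starting cost, before
          // search() pops paths cheapest-first.
          public class MultiStartTopN {
              static class Path {
                  final int node; final long cost;
                  Path(int node, long cost) { this.node = node; this.cost = cost; }
              }

              final PriorityQueue<Path> queue =
                  new PriorityQueue<>(Comparator.comparingLong((Path p) -> p.cost));
              final Map<Integer, List<int[]>> arcs; // node -> list of {target, arcCost}

              MultiStartTopN(Map<Integer, List<int[]>> arcs) { this.arcs = arcs; }

              // The proposed addition: any number of seeded start nodes.
              void addStartNode(int node, long startingCost) { queue.add(new Path(node, startingCost)); }

              // A node with no outgoing arcs is a completed path; return the first n costs.
              List<Long> search(int n) {
                  List<Long> costs = new ArrayList<>();
                  while (!queue.isEmpty() && costs.size() < n) {
                      Path p = queue.poll();
                      List<int[]> out = arcs.getOrDefault(p.node, List.of());
                      if (out.isEmpty()) { costs.add(p.cost); continue; }
                      for (int[] a : out) queue.add(new Path(a[0], p.cost + a[1]));
                  }
                  return costs;
              }

              static List<Long> demo() {
                  Map<Integer, List<int[]>> g = Map.of(
                      0, List.of(new int[]{2, 3}),   // start A -> leaf 2, arc cost 3
                      1, List.of(new int[]{3, 1}));  // start B -> leaf 3, arc cost 1
                  MultiStartTopN s = new MultiStartTopN(g);
                  s.addStartNode(0, 5); // completes with cost 5 + 3 = 8
                  s.addStartNode(1, 2); // completes with cost 2 + 1 = 3
                  return s.search(2);
              }
          }
          ```

          The starting cost must be "consistent" with how arc costs accumulate, which is exactly the limitation noted above.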

          Michael McCandless added a comment -

          Patch with a static utility method to translate a TokenStream to a byte-by-byte automaton.

          It respects posInc and posLength but there are lots of nocommits still!

          I think we can use this at index time and maybe query time to handle graph analyzers... but we still need the "enumerate all paths from this automaton" method...

          Robert Muir added a comment -

          Wow, that's awesome Mike... it's so small too!

          I think we can integrate these two patches into one, with your patch and your addStartNode idea
          the AnalyzingSuggester should be able to support both index-time and query-time synonyms/wdf/whatever
          crazy stuff we throw at it

          but we still need the "enumerate all paths from this automaton" method...

          Brics has some code for this (it puts all the accepted strings into a set). We should be able to do
          something similar, to create a set of BytesRef?

          But really, I'm not sure we need a 'general' method for this. I think we should just have an enumerator
          for finite automata (e.g. from a tokenstream), as we can probably make this a 'real' enum rather than creating
          a massive list/set; we don't need the set deduplication at all either, because it's finite.
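          Such an enumerator might look like the following sketch over a toy acyclic automaton (plain maps and int states, not the Brics or Lucene automaton classes). A depth-first walk emits every accepted string with no set needed for deduplication; a cycle would never terminate, which is why this only makes sense for finite languages:

          ```java
          import java.util.*;

          // Sketch of a getFiniteStrings-style enumerator: depth-first walk of an
          // acyclic automaton, collecting every accepted string along the way.
          public class FiniteStrings {
              // arcs: state -> list of {label, target}; accept: set of final states.
              static List<String> enumerate(Map<Integer, List<int[]>> arcs, Set<Integer> accept) {
                  List<String> out = new ArrayList<>();
                  walk(arcs, accept, 0, new StringBuilder(), out);
                  return out;
              }

              static void walk(Map<Integer, List<int[]>> arcs, Set<Integer> accept,
                               int state, StringBuilder path, List<String> out) {
                  if (accept.contains(state)) out.add(path.toString());
                  for (int[] arc : arcs.getOrDefault(state, List.of())) {
                      path.append((char) arc[0]);
                      walk(arcs, accept, arc[1], path, out);
                      path.setLength(path.length() - 1); // backtrack
                  }
              }

              // "dog" with an optional plural: a character-level dog(s?) automaton.
              static List<String> demo() {
                  Map<Integer, List<int[]>> arcs = Map.of(
                      0, List.of(new int[]{'d', 1}),
                      1, List.of(new int[]{'o', 2}),
                      2, List.of(new int[]{'g', 3}),
                      3, List.of(new int[]{'s', 4}));
                  return enumerate(arcs, Set.of(3, 4));
              }
          }
          ```

          Turning walk() into a lazy iterator instead of filling a list would give the 'real' enum suggested above.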

          Dawid Weiss added a comment -

          Patch with a static utility method to translate a TokenStream to a byte-by-byte automaton.

          I looked at the patch but I don't fully get what it does. Looks like a combination of state sequence unions, am I right?

          Brics has some code for this (puts all the accepted strings into a set).

          It's probably a naive walk with an acceptor. I've always wanted to see what Brics returns from that method for an automaton equivalent to .*

          Robert Muir added a comment -

          I've always wanted to see what Brics returns from that method for an automaton equivalent to .*

          Oh, it only works for finite automata, so it returns null in that case:

          http://www.brics.dk/automaton/doc/dk/brics/automaton/SpecialOperations.html#getFiniteStrings%28dk.brics.automaton.Automaton%29

          Robert Muir added a comment -

          I looked at the patch but I don't fully get what it does. Looks like a combination of state sequence unions, am I right?

          Well, the conversion should ultimately be more useful for the suggester to intersect with the FST than a tokenstream, because a tokenstream is like a word-level automaton; if dogs is a synonym for dog, then we have:
          smelly dog|dogs(positionIncrement=0).

          So for our intersection, we would prefer it to be deterministic at the 'character' (byte) level instead. So the conversion should produce an automaton of: smelly dog(s?) in regex notation... this is easier to work with.

          At index time it's useful too, because in the FST we only care about all the possible byte strings, so this should be easier to enumerate than a tokenstream (especially if you consider multiword synonyms, decompounded terms, etc., where some span across many positions).
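          The expansion from a word-level graph to the flat set of strings can be sketched like this. It only models same-position synonyms (each position is a set of alternatives); posLength spans for multiword synonyms or decompounds are deliberately left out:

          ```java
          import java.util.*;

          // Sketch: treat the token stream as a word-level graph, a list of
          // positions each holding one or more alternatives, and expand it into
          // the flat set of strings that the byte-level automaton would accept.
          public class ExpandGraph {
              static List<String> expand(List<List<String>> positions) {
                  List<String> paths = new ArrayList<>(List.of(""));
                  for (List<String> alts : positions) {
                      List<String> next = new ArrayList<>();
                      for (String prefix : paths)
                          for (String alt : alts)
                              next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                      paths = next;
                  }
                  return paths;
              }

              static List<String> demo() {
                  // smelly dog|dogs(positionIncrement=0)  ->  smelly dog(s?)
                  return expand(List.of(List.of("smelly"), List.of("dog", "dogs")));
              }
          }
          ```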

          Dawid Weiss added a comment -

          OK, got it, thanks. I wonder if it's always possible, but I bet you can write a random test to ensure that.

          Dawid Weiss added a comment -

          It also occurred to me that it would be interesting to have a naive minimization technique for Brics automata which would (for automata with a finite language):

          • enumerate the language
          • sort
          • build the minimal automaton

          While it may seem like an idea with crazy overhead, it may actually be a viable alternative to the minimization algorithms in Brics for very large automata. Interesting.

          Robert Muir added a comment -

          Dawid: hmm, well, the transitions in Brics are ranges in sorted order, so, if it's finite, couldn't we just enumerate the language in sorted order while building the minimal automaton incrementally in parallel?

          Or am I missing something... it's Sunday...

          Dawid Weiss added a comment -

          Yep, sure you could (I admit I haven't looked at Brics in a long time so I don't remember the details, but I do remember the overhead was significant on optimization; this was a while ago - maybe it's improved).

          Robert Muir added a comment -

          Updated patch, tying in Mike's patch too.

          Currently my silly test fails because it trips Mike's assert: it starts with a stopword.

          Robert Muir added a comment -

          I also don't think we really need this generic getFiniteStrings; it's just to get it off the ground.
          We can just write the possibilities on the fly, I think, and it will be simpler...

          Robert Muir added a comment -

          Merged the patch up to trunk. But it still trips the assert in tokenStreamToAutomaton because the test begins with a stopword.

          Robert Muir added a comment -

          I see the problem. It actually happens on the second term (we have ghost/2 christmas/2).
          The problem is it tries to find the last state to connect the new node to, but it uses
          a hashmap keyed on position for that... so if there are holes, this returns null.

          I think for this code we should add nodes for holes (text=POS_SEP) to simplify the logic?

          Robert Muir added a comment -

          Updated patch: I fixed the bug in tokenStreamToAutomaton (just use lastEndPos instead).

          Michael McCandless added a comment -

          Patch, fixing TS2A to insert holes ... this is causing the AnalyzingCompletionTest.testStandard to fail... we have to fix its query-time to insert holes too...

          Robert Muir added a comment -

          testStandard is also bogus: it has 2 asserts. The first one should pass, but the second one should
          really only work if you disable position increments in the (mock) StopFilter.

          Michael McCandless added a comment -

          New patch, getting us closer ... I opened up Util.shortestPaths so that you can make a TopNSearcher and seed multiple start nodes into its queue... and I created an intersectPaths method to intersect an automaton with an FST, gathering the end nodes and the accumulated outputs. Then I fixed lookup to use these two to enumerate and complete the paths.

          The first test in testStandard now passes, but not the 2nd one (I haven't tried disabling posincs in the StopFilter yet).

          Michael McCandless added a comment -

          OK, when I pass false for enablePositionIncrements to MockAnalyzer in testStandard then both cases pass... and I added a 3rd case testing for "ghost chris" and it also passes... cool!

          Michael McCandless added a comment -

          New patch, fixing some nocommits, and a separate bug where WFSTSuggester can return a dup suggestion. Still more to do...

          Robert Muir added a comment -

          Nice catch on the exactFirst dup problem!

          Robert Muir added a comment -

          When running the benchmark (LookupBenchmarkTest) I noticed the FST size has increased since the original patch. I wonder why this is? The benchmark uses KeywordAnalyzer... it could be (likely, even) that the original patch had a bug and now it's correct, but maybe it's worth investigating...

          Sudarshan Gaikaiwari added a comment -

          I was not able to apply the latest patch cleanly

          smg@dev21:~/lucene_trunk$ patch -p0 < ~/LUCENE-3842.patch
          patching file lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java
          patching file lucene/core/src/java/org/apache/lucene/util/automaton/SpecialOperations.java
          patching file lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java
          Hunk #1 FAILED at 109.
          1 out of 1 hunk FAILED – saving rejects to file lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java.rej
          patching file lucene/core/src/java/org/apache/lucene/util/fst/Util.java
          patching file lucene/core/src/java/org/apache/lucene/analysis/TokenStreamToAutomaton.java
          patching file lucene/core/src/test/org/apache/lucene/analysis/TestGraphTokenizers.java
          patching file lucene/core/src/test/org/apache/lucene/util/automaton/TestSpecialOperations.java
          patching file lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionTest.java
          patching file lucene/suggest/src/test/org/apache/lucene/search/suggest/LookupBenchmarkTest.java
          patching file lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionLookup.java
          patching file lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/FSTUtil.java

          I needed to copy RollingBuffer.java from test-framework to core for the patch to apply cleanly.

          cp lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java lucene/core/src/java/org/apache/lucene/util/

          Michael McCandless added a comment -

          Hi Sudarshan, sorry that was my bad: I had svn mv'd RollingBuffer but when I created the patch I failed to pass --show-copies-as-adds to svn ... so you have to do that mv yourself before applying the patch...

          Sudarshan Gaikaiwari added a comment -

          Hi Michael,

          Thanks a lot for opening up Util.shortestPaths; now that I can seed the queue with the initial nodes using addStartPaths, the performance of the GeoSpatialSuggest that I presented at Lucene Revolution has improved by 2x.

          While migrating my code to use this patch, I noticed that I would hit the following assertion in addIfCompetitive.

          
                    path.input.length--;
                    assert cmp != 0;
                    if (cmp < 0) {
          

          This assert fires when it is not possible to differentiate between the path that we are trying to add to the queue and the bottom. This happens because the different paths that lead to FST nodes during the automaton-FST intersection are not stored, so the input path used to differentiate paths contains only the characters that have been consumed from one of the initial FST nodes.

          From your comments for the addStartPaths method I think that you have foreseen this problem.

              // nocommit this should also take the starting
              // weight...?
          
              /** Adds all leaving arcs, including 'finished' arc, if
               *  the node is final, from this node into the queue.  */
              public void addStartPaths(FST.Arc<T> node, T startOutput, boolean allowEmptyString) throws IOException {
          

          Here is a unit test that causes the assert to be triggered.

            public void testInputPathRequired() throws Exception {
              TermFreq keys[] = new TermFreq[] {
                  new TermFreq("fast ghost", 50),
                  new TermFreq("quick gazelle", 50),
                  new TermFreq("fast ghoul", 50),
                  new TermFreq("fast gizzard", 50),
              };
          
              SynonymMap.Builder b = new SynonymMap.Builder(false);
              b.add(new CharsRef("fast"), new CharsRef("quick"), true);
              final SynonymMap map = b.build();
          
              final Analyzer analyzer = new Analyzer() {
                @Override
                protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                  Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.SIMPLE, true);
                  TokenStream stream = new SynonymFilter(tokenizer, map, true);
                  return new TokenStreamComponents(tokenizer, new RemoveDuplicatesTokenFilter(stream));
                }
              };
              AnalyzingCompletionLookup suggester = new AnalyzingCompletionLookup(analyzer);
              suggester.build(new TermFreqArrayIterator(keys));
              List<LookupResult> results = suggester.lookup("fast g", false, 2);
            }
          

          Please let me know if the above analysis looks correct to you, and I will start trying to fix this by storing paths during the FST-automaton intersection.

          Michael McCandless added a comment -

          Hi Sudarshan, thanks for raising this ... I'll have a look...

          Michael McCandless added a comment -

          Please let me know if the above analysis looks correct to you

          That sounds right to me: when we do the initial intersection of the suggest FST with A (from the analyzer), it looks like we must keep the full input path (BytesRef) and carry that over when we pass the node to addStartPaths.

          Either that, or ... we change how we break ties in the scores? EG maybe we can tie-break instead by the surface form?
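          To make the requirement concrete, here is a hedged, self-contained sketch (plain Java; the trie and all names below stand in for Lucene's FST and are illustrative, not the real API) of why each seed node from the prefix intersection must carry the input consumed so far — otherwise the completion step only sees suffixes and cannot reconstruct or compare full suggestions:

```java
import java.util.*;

public class CarriedInputPath {
  static class Node {
    Map<Character, Node> arcs = new TreeMap<>();
    boolean isFinal;
  }

  static void add(Node root, String key) {
    Node n = root;
    for (char c : key.toCharArray()) n = n.arcs.computeIfAbsent(c, x -> new Node());
    n.isFinal = true;
  }

  // Step 1: intersect the prefix with the trie, remembering the consumed input.
  // Step 2: enumerate completions from the seed node, emitting prefix + suffix.
  static List<String> suggest(Node root, String prefix) {
    Node n = root;
    StringBuilder consumed = new StringBuilder(); // the carried input path
    for (char c : prefix.toCharArray()) {
      n = n.arcs.get(c);
      if (n == null) return List.of(); // prefix not in the trie
      consumed.append(c);
    }
    List<String> out = new ArrayList<>();
    complete(n, new StringBuilder(consumed), out);
    return out;
  }

  static void complete(Node n, StringBuilder path, List<String> out) {
    if (n.isFinal) out.add(path.toString());
    for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
      path.append(e.getKey());
      complete(e.getValue(), path, out);
      path.setLength(path.length() - 1); // backtrack
    }
  }

  public static void main(String[] args) {
    Node root = new Node();
    add(root, "fast ghost");
    add(root, "fast ghoul");
    add(root, "quick gazelle");
    System.out.println(suggest(root, "fast g")); // [fast ghost, fast ghoul]
  }
}
```

          In the real suggester the analyzer can map the query to several seed nodes (synonyms, graphs), so each seed keeps its own consumed path; equal-weight paths then stay distinguishable by input bytes.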

          Robert Muir added a comment -

          no changes: just syncing up with trunk

          Sudarshan Gaikaiwari added a comment -

          maybe we can tie-break instead by the surface form?

          The FST construction guarantees that the input paths leading to different nodes are unique, while I don't think we have such a guarantee about the surface form.

          I have attached a patch that modifies the intersectPrefixPaths method to keep track of the input paths as the automaton and FST are intersected. Please let me know if that looks OK to you.

          Michael McCandless added a comment -

          Sudarshan those changes look great! You now also record the input for every Path coming back from intersectPrefixPaths, and pass that to addStartPaths. Sorry it took so long to have a look!

          I started from your patch, got up to trunk again (there was one compilation error I think), added some comments, downgraded a nocommit.

          I think we are close here ... I'll make a branch so we can iterate.

          Michael McCandless added a comment -

          OK I created branch: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3842
          Robert Muir added a comment -

          Thanks for resurrecting this from the dead! I had forgotten just how fun this issue was

          Michael McCandless added a comment -

          Patch, addressing a few nocommits and getting PRESERVE_HOLES and PRESERVE_SEPS working. I did this by adding options to AnalyzingCompletionLookup, and then post-processing the returned automaton from TokenStreamToAutomaton.

          I also added a couple new nocommits ....

          Michael McCandless added a comment -

          New patch, removing "preserve holes" option from AnalyzingCompletionLookup: you can simply tell your StopFilter whether or not holes are meaningful.

          Robert Muir added a comment -

          +1 for that. Let's keep this as simple as possible and leave the responsibility to the analyzer as much as possible.

          My main concern for the PRESERVE_SEPS was the Japanese use case: we don't much care what the actual tokenization
          of the Japanese words was, only the concatenated reading string. If the tokenization is a little off but the concatenation
          of all the readings is still correct, then we are OK. So it makes us more robust against tokenization differences,
          especially considering it's partial inputs going into this thing (not whole words).

          Michael McCandless added a comment -

          Patch, getting "duplicates" (different surface forms analyzing to same analyzed bytes) fully working, with exactFirst.

          I did this by having the FST handle multiple outputs for a single input (this is something it can do in general but we don't often use it)...

          But ... I don't really like the approach: it makes the path/results handling more complex because now a single "path" can correspond to multiple results.

          I think it would be better/cleaner to append unique (disambiguating) bytes to the end of the analyzed bytes (this was Robert's original idea): then each path is a single result. The only downside I can think of is we will have to reserve a byte (0xFF?), i.e. we'd append 0xFF 0x00, then 0xFF 0x01 to the next duplicate, ... but since these input BytesRefs are "typically" UTF-8 ... this seems not so bad? They can of course in general be arbitrary bytes since they are produced by the analysis process...
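          A minimal sketch of that dedup idea (plain Java maps stand in for the FST; the reserved byte 0xFF and the ordinal scheme are the assumptions under discussion, not a finished design): appending a separator plus an ordinal makes every input key unique, so each path maps to exactly one surface form.

```java
import java.util.*;

public class DedupKeys {
  static final char SEP = '\uu00FF' == '\u00FF' ? '\u00FF' : '\u00FF'; // reserved separator (assumption)

  // analyzed form -> surface forms, in; unique key -> single surface form, out
  static Map<String, String> buildKeys(Map<String, List<String>> analyzedToSurface) {
    Map<String, String> keyToSurface = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> e : analyzedToSurface.entrySet()) {
      int ord = 0;
      for (String surface : e.getValue()) {
        // key = analyzed bytes + SEP + ordinal, so duplicates no longer collide
        String key = e.getKey() + '\u00FF' + (char) ord++;
        keyToSurface.put(key, surface);
      }
    }
    return keyToSurface;
  }

  public static void main(String[] args) {
    Map<String, List<String>> in = new LinkedHashMap<>();
    in.put("ghost", Arrays.asList("Ghost", "ghost")); // two surface forms, same analyzed form
    in.put("ghoul", Arrays.asList("ghoul"));
    System.out.println(buildKeys(in).size()); // 3: every key is unique
  }
}
```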

          Bill Bell added a comment -

          A common suggester use case is to boost the results by distance (auto-suggest across the whole USA but boost the results in the suggester by geo-distance). Would love to get a faster response with that. At Lucene Revolution 2012 in Boston a speaker discussed using a WFST to do this, but I have yet to figure out how to do it.

          Robert Muir added a comment -

          Bill, can you start a different thread for that? It's unrelated to this issue.

          Robert Muir added a comment -

          I think it would be better/cleaner to append unique (disambiguating) bytes to the end of the analyzed bytes (this was Robert's original idea): then each path is a single result. The only downside I can think of is we will have to reserve a byte (0xFF?), ie we'd append 0xFF 0x00, then 0xFF 0x01 to the next duplicate, ... but since these input BytesRefs are "typically" UTF-8 ... this seems not so bad? Then can of course in general be arbitrary bytes since they are produced by the analysis process...

          I don't understand why we have to reserve any bytes. We can append arbitrary bytes of any sort to the end of the input side, this will have no effect on the actual surface form that we suggest.

          Robert Muir added a comment -

          As far as exactFirst goes: let's just keep it simple and have it be a surface-form comparison?

          This is really what I think most people will expect anyway: in the case of "duplicates" and exactFirst,
          nobody really cares nor sees the underlying analyzed form.

          So I don't think we should have multiple outputs for the FST here.

          Robert Muir added a comment -

          Sorry Mike, I'm still catching up and sadly just creating a lot of noise and thinking out loud.

          I think you actually have a point with reserving a byte when there are duplicates.

          But at the same time I still think the surface form is valuable for this option too... when we reserve a byte, can we also sort the
          duplicate outputs up front, in such a way that we can start traversing the output side to look for an exactly-matching surface form?

          So it's like within that 'subtree' of the FST (the duplicates for the exact input), we can binary search?

          Otherwise exactFirst would be inefficient in some cases (as we have to do a special 'walk' here). Christian Moen
          showed me some scary stuff on his Japanese phone as far as readings->kanji forms ... I think if we can, it might be
          good to be cautious and keep this fast...

          Michael McCandless added a comment -

          New patch, going back to deduping on the input side... it's not done yet: I think we need to escape the bytes we steal.
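          The escaping step can be sketched as follows (a hedged illustration, not Lucene's actual encoding: 0xFF as the stolen separator and 0xFE as the escape byte are assumptions). Since analyzed bytes are arbitrary, any occurrence of the stolen byte must be rewritten so the escaped output never contains it, and decoding must round-trip exactly:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class ByteEscape {
  static final int SEP = 0xFF, ESC = 0xFE; // illustrative reserved bytes

  // SEP -> ESC 0x00, ESC -> ESC 0x01; the escaped form never contains a raw SEP.
  static byte[] escape(byte[] in) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte b : in) {
      int u = b & 0xFF;
      if (u == SEP) { out.write(ESC); out.write(0x00); }
      else if (u == ESC) { out.write(ESC); out.write(0x01); }
      else out.write(u);
    }
    return out.toByteArray();
  }

  static byte[] unescape(byte[] in) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < in.length; i++) {
      int u = in[i] & 0xFF;
      if (u == ESC) u = (in[++i] == 0x00) ? SEP : ESC; // next byte selects the literal
      out.write(u);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] data = {0x41, (byte) 0xFF, (byte) 0xFE, 0x42};
    System.out.println(Arrays.equals(data, unescape(escape(data)))); // true: lossless round-trip
    boolean sepFree = true;
    for (byte b : escape(data)) sepFree &= (b & 0xFF) != SEP;
    System.out.println(sepFree); // true: SEP is now free to use as the dedup separator
  }
}
```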

          Michael McCandless added a comment -

          New patch, with escaping working, and exactFirst now means exact surface-form. I also made ctor parameters maxSurfaceFormsPerAnalyzedForm and maxGraphExpansions.

          I did the simple linear scan for now ... I think that's fine to commit ("N" is bounded). We may be able to optimize later with something similar to getByOutput...

          I think this patch is ready to commit to the branch ... and I think we are close overall to landing ... still a number of nocommits to resolve though.

          Michael McCandless added a comment -

          Applyable patch from the branch (you just have to remove the 0-byte RollingBuffer.java in test-framework after applying it).

          I think it's ready ...

          Robert Muir added a comment -

          in TStoA:

          if (pos == -1 && posInc == 0) {
            // TODO: hmm are TS's still allowed to do this...?
            posInc = 1;
          }
          

          NO they are not!

          As far as the limitations, I feel like if the last token's endOffset != length of input,
          that might be pretty safe in general (e.g. StandardTokenizer) because of how Unicode
          works... I have to think about it.

          Strange that the FST size increased so much? If I run the benchmark:

          [junit4:junit4]   2> -- RAM consumption
          [junit4:junit4]   2> JaspellLookup   size[B]:    9,815,152
          [junit4:junit4]   2> TSTLookup       size[B]:    9,858,792
          [junit4:junit4]   2> FSTCompletionLookup size[B]:      466,520
          [junit4:junit4]   2> WFSTCompletionLookup size[B]:      507,640
          [junit4:junit4]   2> AnalyzingCompletionLookup size[B]:    1,832,952
          

          I don't know if we should worry about that, but it seems somewhat large
          for just using KeywordTokenizer.

           * <b>NOTE</b>: Although the {@link TermFreqIterator} API specifies
           * floating point weights
          

          That's obsolete. See WFSTSuggester in trunk, where I fixed this already.

          Michael McCandless added a comment -

          Thanks Rob, good feedback ... I'll post new patch changing that posInc check to an assert, and removing that obsolete NOTE.

          As far as the limitations, i feel like if the last token's endOffset != length of input
          that might be pretty safe in general (e.g. standardtokenizer) because of how unicode
          works... i have to think about it.

          I think we should try that! This way the suggester can "guess" whether the input text is still inside the last token.

          But this won't help the StopFilter case, ie if user types 'a' then StopFilter will still delete it even though the token isn't "done" (ie maybe user intends to type 'apple').

          Still it's progress so I think we should try it ...
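          The end-offset heuristic being discussed can be sketched like this (a hedged stand-in: whitespace splitting plays the role of a real Tokenizer, and the method name is illustrative). If the last token's end offset equals the raw input length, the user is probably still typing that token, so the suggester should treat it as a prefix; a trailing separator means the token is complete:

```java
public class PartialTokenGuess {
  // true if the last token reaches the end of the input (no trailing separator)
  static boolean lastTokenIsPartial(String input) {
    int end = 0; // end offset of the last non-whitespace run
    for (int i = 0; i < input.length(); i++) {
      if (!Character.isWhitespace(input.charAt(i))) end = i + 1;
    }
    return !input.isEmpty() && end == input.length();
  }

  public static void main(String[] args) {
    System.out.println(lastTokenIsPartial("fast g"));  // true: user may still be typing "ghost"
    System.out.println(lastTokenIsPartial("fast g ")); // false: the token is finished
  }
}
```

          As noted above, this cannot help the StopFilter case, since the filter deletes the token before the suggester ever sees it.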

          I'm not sure why FST is so much larger ... the outputs should share very well with KeywordTokenizer ... hmm what weights do we use for the benchmark?

          Michael McCandless added a comment -

          New patch, folding in Rob's suggestions above (thanks!).

          OK the super-large FST size was a false alarm: we were using
          RAMUsageEstimator, which then went and included the RAM usage of
          MockAnalyzer; I changed LookupBenchmarkTest to use FST.sizeInBytes
          instead:

          -- RAM consumption
          JaspellLookup   size[B]:    9,815,152
          TSTLookup       size[B]:    9,858,792
          FSTCompletionLookup size[B]:      466,520
          WFSTCompletionLookup size[B]:      507,640
          AnalyzingCompletionLookup size[B]:      889,138
          

          So we are still larger ... but not insanely so. I do think we could
          shrink the FST if we didn't add 2 bytes in the non-dup case ... I put
          a TODO to do this, but it'd make the exactFirst logic even hairier ...

          I also put a TODO to use the end offset as a heuristic to "guess"
          whether final token was a partial token or not ...

          Robert Muir added a comment -

          Can we split Analyzer into indexAnalyzer and queryAnalyzer?

          Can we also add 1 or 2 sugar ctors that use default values?

          I'm thinking:

          ctor(Analyzer analyzer) {
            this(analyzer, analyzer);
          }
          
          ctor(Analyzer index, Analyzer query) {
            this(index, query, default, default, default);
          }
          
          ctor(Analyzer index, Analyzer query, int option, int option, int option) {
            // this is the full ctor!
          }
          
          Michael McCandless added a comment -

          Good ideas! New patch, separating indexAnalyzer and queryAnalyzer, w/ the sugar ctors.

          I also renamed to AnalyzingSuggester.

          Robert Muir added a comment -

          +1, thanks Mike. Let's get it in!

          Sudarshan Gaikaiwari added a comment -

          +1. This is awesome. It would be great to get this in trunk.

          Michael McCandless added a comment -

          Thanks Sudarshan! It's actually already committed (will be in 4.1) ... I just forgot to resolve ...

          David Smiley added a comment -

          That TokenStreamToAutomaton is cool Mike! I can put that to use in my FST text tagger work.

          Commit Tag Bot added a comment -

          [branch_4x commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1391704

          LUCENE-3842: refactor: don't make spooky State methods public

          Commit Tag Bot added a comment -

          [branch_4x commit] Michael McCandless
          http://svn.apache.org/viewvc?view=revision&revision=1391686

          LUCENE-3842: add AnalyzingSuggester

          Alexey Kudinov added a comment -

          I tried building the analyzing suggester model from an external file containing 1M short phrases taken from Wikipedia titles.
          2 GB of memory seems not to be enough; it runs forever and dies with an OOM.
          What is the expected dictionary size? What is the benchmark behavior?

          Thanks!

          Michael McCandless added a comment -

          The building process is unfortunately RAM intensive, but there are settings/knobs in the FST Builder API to tradeoff RAM required during building vs how small the resulting FST is. Maybe we need to expose control for these in AnalyzingSuggester ...

          Can you share those 1M short phrases? What is the total number of characters across them?

          Alexey Kudinov added a comment -

          Setting maxGraphExpansions to some value > 0 (say, 30) ends with a NullPointerException; paths is null here:
          maxAnalyzedPathsForOneInput = Math.max(maxAnalyzedPathsForOneInput, paths.size());
          Fixing this, the model loads after a while.
          With maxGraphExpansions < 0 it doesn't load regardless of the dictionary size.
          I'm using the WordNet synonyms, so I guess this causes a lot of paths; I suspect loops.
          The total dictionary file size is about 20 MB, but this doesn't really matter, as I get similar behavior for an even smaller one (2 MB).
          The dataset is from here: http://wiki.dbpedia.org/Downloads32, Titles in English. I took the values only and tried different sizes (10M / 1M / 0.1M).

          Michael McCandless added a comment -

          I'm using the wordnet synonyms, so I guess this causes a lot of paths, I suspect loops.

          Ahhhh Yes this will cause lots of expansions / RAM used.

          But NPE because paths is null sounds like a real bug.

          OK I see why it's happening ... when we try to enumerate all finite strings from the expanded graph, if it exceeds the limit (maxGraphExpansions), SpecialOperations.getFiniteStrings returns null but the code assumes it will return the N finite strings it had found "so far". Can you open a new issue for this? We should separately fix it.
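          The fix being described can be sketched in isolation (hedged: a plain adjacency-map graph stands in for Lucene's Automaton, and the method name mirrors getFiniteStrings only loosely). The enumerator below returns the strings found so far when it hits its limit, instead of null, so callers like the Math.max line above stay safe:

```java
import java.util.*;

public class FiniteStrings {
  // state -> (label -> next state); a state is final if it is in `finals`
  static List<String> getFiniteStrings(Map<Integer, Map<Character, Integer>> arcs,
                                       Set<Integer> finals, int start, int limit) {
    List<String> out = new ArrayList<>();
    dfs(arcs, finals, start, new StringBuilder(), out, limit);
    return out; // never null: on hitting the limit, return what we have
  }

  static boolean dfs(Map<Integer, Map<Character, Integer>> arcs, Set<Integer> finals,
                     int state, StringBuilder path, List<String> out, int limit) {
    if (finals.contains(state)) {
      out.add(path.toString());
      if (out.size() >= limit) return false; // limit hit: stop expanding
    }
    for (Map.Entry<Character, Integer> arc : arcs.getOrDefault(state, Map.of()).entrySet()) {
      path.append(arc.getKey());
      boolean keepGoing = dfs(arcs, finals, arc.getValue(), path, out, limit);
      path.setLength(path.length() - 1); // backtrack
      if (!keepGoing) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // tiny DAG: 0 -a/b-> 1 -c-> 2(final), i.e. the strings "ac" and "bc"
    Map<Integer, Map<Character, Integer>> arcs = new HashMap<>();
    arcs.put(0, new TreeMap<>(Map.of('a', 1, 'b', 1)));
    arcs.put(1, new TreeMap<>(Map.of('c', 2)));
    System.out.println(getFiniteStrings(arcs, Set.of(2), 0, 1)); // [ac] — truncated, not null
  }
}
```

          Note this sketch assumes a loop-free graph; the WordNet-synonym case above is suspected to contain loops, which is exactly when a hard limit (and a non-null partial result) matters most.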

          Alexey Kudinov added a comment -

          I opened an issue for the NPE: LUCENE-4971

          Michael McCandless added a comment -

          Thank you Alexey!

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Robert Muir
            • Votes:
              2
              Watchers:
              7

              Dates

              • Created:
                Updated:
                Resolved:

                Development