Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I implemented the algo described at
      http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698 for
      incrementally building a finite state transducer (FST) from sorted
      inputs.

      This is not a fully general FST impl – it's only able to build up an
      FST incrementally from input/output pairs that are pre-sorted.

      Currently the inputs are BytesRefs, and the outputs are pluggable –
      NoOutputs gets you a simple FSA, PositiveIntOutputs maps to a long,
      ByteSequenceOutput maps to a BytesRef.

      The implementation has a low memory overhead, so that it can handle a
      fairly large set of terms. For example, it can build the FSA for the
      9.8M terms from a 10M document wikipedia index in ~8 seconds (on
      beast), using ~256 MB peak RAM, resulting in an FSA that's ~60 MB.

      It packs the FST as-it-builds into a compact byte[], and then exposes
      the API to read nodes/arcs directly from the byte[]. The FST can be
      quickly saved/loaded to/from a Directory since it's just a big byte[].

      The format is similar to what Morfologik uses
      (http://sourceforge.net/projects/morfologik/).

      I think there are a number of possible places we can use this in
      Lucene. For example, I think many apps could hold the entire terms
      dict in RAM, either at the multi-reader level or maybe per-segment
      (mapping to file offset or to something else custom to the app), which
      may possibly be a good speedup for certain MTQs (though, because the
      format is packed into a byte[], there is a decode cost when visiting
      arcs).

      The builder can also prune as it goes, so you get a prefix trie pruned
      according to how many terms run through the nodes, which makes it
      faster and even less memory consuming. This may be useful as a
      replacement for our current binary search terms index since it can
      achieve higher term density for the same RAM consumption of our
      current index.

      As an initial usage to make sure this is exercised, I cutover the
      SimpleText codec, which currently fully loads all terms into a
      TreeMap (and has caused intermittent OOME in some tests), to use an FST
      instead. SimpleText uses a PairOutputs which is able to "pair up" any
      two other outputs, since it needs to map each input term to an int
      docFreq and long filePosition.

      All tests pass w/ SimpleText forced codec, and I think this is
      committable except I'd love to get some help w/ the generics
      (confession to the policeman: I had to add
      @SuppressWarnings(

      {"unchecked"}

      )) all over!! Ideally an FST is
      parameterized by its output type (Integer, BytesRef, etc.).

      I even added a new @nightly test that makes a largeish set of random
      terms and tests the resulting FST on different outputs

      I think it would also be easy to make a variant that uses char[]
      instead of byte[] as its inputs, so we could eg use this during analysis
      (Robert's idea). It's already be easy to have a CharSequence
      output type since the outputs are pluggable.

      Dawid Weiss (author of HPPC – http://labs.carrotsearch.com/hppc.html – and
      Morfologik – http://sourceforge.net/projects/morfologik/)
      was very helpful iterating with me on this (thank you!).

      1. FSTExample.png
        69 kB
        Michael McCandless
      2. LUCENE-2792.patch
        157 kB
        Uwe Schindler
      3. LUCENE-2792.patch
        157 kB
        Michael McCandless
      4. LUCENE-2792.patch
        145 kB
        Uwe Schindler
      5. LUCENE-2792.patch
        145 kB
        Robert Muir
      6. LUCENE-2792.patch
        141 kB
        Michael McCandless
      7. LUCENE-2792.patch
        99 kB
        Michael McCandless
      8. LUCENE-2792-ArrayUtil-grow.patch
        11 kB
        Uwe Schindler
      9. SimpleBench.java
        2 kB
        Robert Muir

        Activity

        Hide
        Michael McCandless added a comment -

        Patch.

        Show
        Michael McCandless added a comment - Patch.
        Hide
        Michael McCandless added a comment -

        I'm attaching an example FST (PNG image) so we can visualize what an FST does.

        Really an FST is like a SortedMap<BytesRef,X>, where X (the "outputs") is pluggable.

        This particular FST maps a canned set of words to the int ord (0, 1, 2, ...) of the word.

        Arcs that are "final" end in a T instead of an arrow, which means if you stop at that arc the input you've traversed is "accepted". (The blue arcs just indicate arcs that were compressed with a NEXT flag when stored in the byte[].)

        To get the total output you sum the outputs you encounter. For example, the input "stop" hits output 4 on the 's' and output 1 on the o so its ord is 5.

        Show
        Michael McCandless added a comment - I'm attaching an example FST (PNG image) so we can visualize what an FST does. Really an FST is like a SortedMap<BytesRef,X>, where X (the "outputs") is pluggable. This particular FST maps a canned set of words to the int ord (0, 1, 2, ...) of the word. Arcs that are "final" end in a T instead of an arrow, which means if you stop at that arc the input you've traversed is "accepted". (The blue arcs just indicate arcs that were compressed with a NEXT flag when stored in the byte[].) To get the total output you sum the outputs you encounter. For example, the input "stop" hits output 4 on the 's' and output 1 on the o so its ord is 5.
        Hide
        Robert Muir added a comment -

        This sounds awesome!

        It would be neat to write intersect(FST, Automaton), and somehow optionally use it if available.
        Such a thing might be cleaner now that MTQ.getTermsEnum takes a Terms structure

        Show
        Robert Muir added a comment - This sounds awesome! It would be neat to write intersect(FST, Automaton), and somehow optionally use it if available. Such a thing might be cleaner now that MTQ.getTermsEnum takes a Terms structure
        Hide
        Jason Rutherglen added a comment -

        Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation?

        Show
        Jason Rutherglen added a comment - Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation?
        Hide
        Michael McCandless added a comment -

        Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation?

        Maybe? Ie the write ops are things like "append a new arc to this node", so eg if we use a concurrent linked list per node?

        Also, the impl here isn't helpful for the RT case, since it requires you add the terms in sorted order. But, there is a non-sorted-order version too, however, they will be substantially more RAM consuming since they'd use a objects for node/arcs (this impl doesn't).

        Show
        Michael McCandless added a comment - Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation? Maybe? Ie the write ops are things like "append a new arc to this node", so eg if we use a concurrent linked list per node? Also, the impl here isn't helpful for the RT case, since it requires you add the terms in sorted order. But, there is a non-sorted-order version too, however, they will be substantially more RAM consuming since they'd use a objects for node/arcs (this impl doesn't).
        Hide
        Robert Muir added a comment -

        Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation?

        Just an idea, maybe you could use something like ConcurrentSkipListMap for this
        until mike makes "fst 2.0" or something?

        its only in java 6 but you can poach from apache harmony...

        Show
        Robert Muir added a comment - Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation? Just an idea, maybe you could use something like ConcurrentSkipListMap for this until mike makes "fst 2.0" or something? its only in java 6 but you can poach from apache harmony...
        Hide
        Michael Busch added a comment -

        Cool stuff, Mike!

        Could we use this for more efficient wildcard search? E.g. could we add posting lists for inner nodes to the index?

        Show
        Michael Busch added a comment - Cool stuff, Mike! Could we use this for more efficient wildcard search? E.g. could we add posting lists for inner nodes to the index?
        Hide
        Michael McCandless added a comment -

        Could we use this for more efficient wildcard search?

        Possibly – we need to explore this. It remains to be seen if the CPU cost of traversing the FST is offset by the fact that the FST can keep more (possibly, all) terms in memory.

        Also we'd need to figure out whether this'd work at the top-level reader (eg as an initial filter to enumerate all terms matching the MTQ), or, per-segment as the terms index (which'd be much more RAM costly).

        E.g. could we add posting lists for inner nodes to the index?

        Hmm... what are the "inner nodes"?

        I guess we could do something like Pulsing codec, where postings for low doc-freq terms are simply stored in an FST, and then only the higher freq terms remain on disk (optionally of course since whether this is feasible depends on the app). Then eg lookups against a primary key field would be entirely in RAM.

        Show
        Michael McCandless added a comment - Could we use this for more efficient wildcard search? Possibly – we need to explore this. It remains to be seen if the CPU cost of traversing the FST is offset by the fact that the FST can keep more (possibly, all) terms in memory. Also we'd need to figure out whether this'd work at the top-level reader (eg as an initial filter to enumerate all terms matching the MTQ), or, per-segment as the terms index (which'd be much more RAM costly). E.g. could we add posting lists for inner nodes to the index? Hmm... what are the "inner nodes"? I guess we could do something like Pulsing codec, where postings for low doc-freq terms are simply stored in an FST, and then only the higher freq terms remain on disk (optionally of course since whether this is feasible depends on the app). Then eg lookups against a primary key field would be entirely in RAM.
        Hide
        Jason Rutherglen added a comment -

        Just an idea, maybe you could use something like ConcurrentSkipListMap for this
        its only in java 6 but you can poach from apache harmony...

        Right, I'm planning on using CSLM and sorted int[]s. Aren't we on Java 6 for trunk? However, yes, it'd be nice to support JDK 1.5, I'll look into if Harmony's CSLM can be pulled out.

        Show
        Jason Rutherglen added a comment - Just an idea, maybe you could use something like ConcurrentSkipListMap for this its only in java 6 but you can poach from apache harmony... Right, I'm planning on using CSLM and sorted int[]s. Aren't we on Java 6 for trunk? However, yes, it'd be nice to support JDK 1.5, I'll look into if Harmony's CSLM can be pulled out.
        Hide
        Michael McCandless added a comment -

        New patch! I think it's ready to commit.

        I switched the core FST input from byte to int so that the FST can now map any IntsRef to any output. I also added sugar for adding eg CharSequence, char[] as utf32 to the FST, and also running (looking up the output for a given input). I made separate BytesRefFSTEnum and IntsRefFSTEnum. I changed the "nocommit switch to generics" to TODOs , but it'd be nice to somehow cutover to generics for the FST's output type.

        I also added a test case that indexes docs from the new test line docs file, builds an FST, then verifies the resulting TermsEnum vs BytesRefFSTEnum behave the same.

        Show
        Michael McCandless added a comment - New patch! I think it's ready to commit. I switched the core FST input from byte to int so that the FST can now map any IntsRef to any output. I also added sugar for adding eg CharSequence, char[] as utf32 to the FST, and also running (looking up the output for a given input). I made separate BytesRefFSTEnum and IntsRefFSTEnum. I changed the "nocommit switch to generics" to TODOs , but it'd be nice to somehow cutover to generics for the FST's output type. I also added a test case that indexes docs from the new test line docs file, builds an FST, then verifies the resulting TermsEnum vs BytesRefFSTEnum behave the same.
        Hide
        Uwe Schindler added a comment -

        Hi Mike,
        my problem is that I have no time to closely look into it and I don't understand the whole thing (you cannot always understand everything g). So I have no idea how the generics should look like. I don't even understand all the SuppressWarnings currently in the code, because in my opinion, generics warnings cannot occur at all those places. From a first look it seems that a few places using untyped Output, but for that the methods need to be generified (a generic <T> before the return type like in AttributeSource). The "copy" in one of the TODOs cannot be avoided by generics, because the generics are type erasure, so it would not help in that method (only that you may not need to add so many casts, which on the other hand the compiler will add).

        Show
        Uwe Schindler added a comment - Hi Mike, my problem is that I have no time to closely look into it and I don't understand the whole thing (you cannot always understand everything g ). So I have no idea how the generics should look like. I don't even understand all the SuppressWarnings currently in the code, because in my opinion, generics warnings cannot occur at all those places. From a first look it seems that a few places using untyped Output, but for that the methods need to be generified (a generic <T> before the return type like in AttributeSource). The "copy" in one of the TODOs cannot be avoided by generics, because the generics are type erasure, so it would not help in that method (only that you may not need to add so many casts, which on the other hand the compiler will add).
        Hide
        Robert Muir added a comment -

        As a start, i generified the PairOutputs.. will try to help with the other classes later today.

        Show
        Robert Muir added a comment - As a start, i generified the PairOutputs.. will try to help with the other classes later today.
        Hide
        Uwe Schindler added a comment -

        Here a first generifid patch. There are several problems:

        • FST<T> as parameter to Outputs.write() and Outputs.read() breaks generics in PairOutput (I added nocommit, this is really why we have the generics, without that this bug would never be visible). Mike said, that there should be really used IndexInput and IndexOutput
        • The testcase is broken, as it dynamically casts types depending on int constants. The test should be rewritten to use typed inner classes (and some code duplication)

        I may have missed more generics violations, but it now compiles correctly. Javac does not detect all missing parameters, so I have to review again, but for now I want to post the patch.

        Show
        Uwe Schindler added a comment - Here a first generifid patch. There are several problems: FST<T> as parameter to Outputs.write() and Outputs.read() breaks generics in PairOutput (I added nocommit, this is really why we have the generics, without that this bug would never be visible). Mike said, that there should be really used IndexInput and IndexOutput The testcase is broken, as it dynamically casts types depending on int constants. The test should be rewritten to use typed inner classes (and some code duplication) I may have missed more generics violations, but it now compiles correctly. Javac does not detect all missing parameters, so I have to review again, but for now I want to post the patch.
        Hide
        Uwe Schindler added a comment -

        Last patch missed some generics in NoOutputs and PositiveInts

        Show
        Uwe Schindler added a comment - Last patch missed some generics in NoOutputs and PositiveInts
        Hide
        Michael McCandless added a comment -

        This looks great Uwe, thanks!

        Show
        Michael McCandless added a comment - This looks great Uwe, thanks!
        Hide
        Michael McCandless added a comment -

        New patch, after substantial iterations w/ the Generics Policeman (thanks!) using Mercurial.

        It's much cleaner now – the classes are parameterized by the output type of the FST.

        I will commit soon...

        Show
        Michael McCandless added a comment - New patch, after substantial iterations w/ the Generics Policeman (thanks!) using Mercurial. It's much cleaner now – the classes are parameterized by the output type of the FST. I will commit soon...
        Hide
        Uwe Schindler added a comment -

        I will merge the ArrayUtils imporvements to 3.x after the commit (its minor, but very useful).

        Show
        Uwe Schindler added a comment - I will merge the ArrayUtils imporvements to 3.x after the commit (its minor, but very useful).
        Hide
        Uwe Schindler added a comment -

        Mike: Here some further, small cleanups in the grow() parts.

        Show
        Uwe Schindler added a comment - Mike: Here some further, small cleanups in the grow() parts.
        Hide
        Michael McCandless added a comment -

        Thanks Uwe!

        Show
        Michael McCandless added a comment - Thanks Uwe!
        Hide
        Robert Muir added a comment -

        Reopening: the generic ArrayUtil.grow here (with getComponentType etc) is more than 10x slower than the old code on -client.
        Yes, its the same performance on -server, but not everyone runs with that.

        Show
        Robert Muir added a comment - Reopening: the generic ArrayUtil.grow here (with getComponentType etc) is more than 10x slower than the old code on -client. Yes, its the same performance on -server, but not everyone runs with that.
        Hide
        Robert Muir added a comment -

        In revision 1045012, i reverted the generic grow() for the classes that are unrelated to this issue.

        Its still being used in 3 places by the FST stuff in this patch, and this should be reviewed separately if these areas are performance sensitive.

        Show
        Robert Muir added a comment - In revision 1045012, i reverted the generic grow() for the classes that are unrelated to this issue. Its still being used in 3 places by the FST stuff in this patch, and this should be reviewed separately if these areas are performance sensitive.
        Hide
        Robert Muir added a comment -

        Here's the link for reference: http://bugs.sun.com/view_bug.do?bug_id=6525802

        This getComponentType is intrinsic only on -server, on -client it is native (slow).

        Personally I think we should remove this generic grow() completely to avoid performance issues (I kept it, but its only used by FST)

        Show
        Robert Muir added a comment - Here's the link for reference: http://bugs.sun.com/view_bug.do?bug_id=6525802 This getComponentType is intrinsic only on -server, on -client it is native (slow). Personally I think we should remove this generic grow() completely to avoid performance issues (I kept it, but its only used by FST)
        Hide
        Uwe Schindler added a comment -

        Thanks Robert for the test with -client. We all use -server and did not see a difference (I made tests yesterday with Java 6).

        This may also be a reason for the slow performance of Collections.sort() without our improvements, because AbstractCollection.toArray(T[]) exactly uses this code to create the array instance (in fact, I copied the resize code from Harmony).

        Here's the link for reference: http://bugs.sun.com/view_bug.do?bug_id=6525802

        As far as I see is the slow part a.getClass().getComponentType(). What do you think about the following method signature, that would make it also 100% type safe and does not have the negative performance impact?:

        public static <T> T[] grow(T[] array, Class<T> componentType, int size)
        

        In code you would use:

        String[] sa=...
        ArrayUtils.grow(sa, String.class, 20);
        

        I can supply a patch for testing. Array.newInstance is one of the most used methods in lots of code, and the actual growing is done seldom (because of oversize()).

        Show
        Uwe Schindler added a comment - Thanks Robert for the test with -client. We all use -server and did not see a difference (I made tests yesterday with Java 6). This may also be a reason for the slow performance of Collections.sort() without our improvements, because AbstractCollection.toArray(T[]) exactly uses this code to create the array instance (in fact, I copied the resize code from Harmony). Here's the link for reference: http://bugs.sun.com/view_bug.do?bug_id=6525802 As far as I see is the slow part a.getClass().getComponentType(). What do you think about the following method signature, that would make it also 100% type safe and does not have the negative performance impact?: public static <T> T[] grow(T[] array, Class <T> componentType, int size) In code you would use: String [] sa=... ArrayUtils.grow(sa, String .class, 20); I can supply a patch for testing. Array.newInstance is one of the most used methods in lots of code, and the actual growing is done seldom (because of oversize()).
        Hide
        Uwe Schindler added a comment -

        Here the patch, just to test performance. Robert?

        Silly: It again creates stupid unchecked warnings in FST classes (this was the main treason for the grow method). If this patch helps I will try to improve, maybe by relaxing generics in ArrayUtil.

        Show
        Uwe Schindler added a comment - Here the patch, just to test performance. Robert? Silly: It again creates stupid unchecked warnings in FST classes (this was the main treason for the grow method). If this patch helps I will try to improve, maybe by relaxing generics in ArrayUtil.
        Hide
        Robert Muir added a comment -

        Uwe, unfortunately even without the getComponentType its still much worse.
        I'm attaching my benchmark (times in milliseconds):
        method1: new+arraycopy
        method2: original Array.grow
        method3: Array.grow without getComponentType

        -client
        method1: 221
        method2: 2672
        method3: 1731
        method1: 225
        method2: 2674
        method3: 1727
        method1: 212
        method2: 2697
        method3: 1698

        -server
        method1: 160
        method2: 166
        method3: 167
        method1: 159
        method2: 168
        method3: 170
        method1: 162
        method2: 172
        method3: 171

        Show
        Robert Muir added a comment - Uwe, unfortunately even without the getComponentType its still much worse. I'm attaching my benchmark (times in milliseconds): method1: new+arraycopy method2: original Array.grow method3: Array.grow without getComponentType -client method1: 221 method2: 2672 method3: 1731 method1: 225 method2: 2674 method3: 1727 method1: 212 method2: 2697 method3: 1698 -server method1: 160 method2: 166 method3: 167 method1: 159 method2: 168 method3: 170 method1: 162 method2: 172 method3: 171
        Hide
        Michael McCandless added a comment -

        I think we should remove this generic ArrayUtil.grow? It's too dangerous that we'll accidentally use it in a hotspot some time going forward...?

        Show
        Michael McCandless added a comment - I think we should remove this generic ArrayUtil.grow? It's too dangerous that we'll accidentally use it in a hotspot some time going forward...?
        Hide
        Uwe Schindler added a comment -

        Give me few days to think about it. We still use it in FST, but mostly because the generics are ugly otherwise. Maybe I have a good idea. But we can revert in 3.x? I will do that, if you also thinks so.

        Show
        Uwe Schindler added a comment - Give me few days to think about it. We still use it in FST, but mostly because the generics are ugly otherwise. Maybe I have a good idea. But we can revert in 3.x? I will do that, if you also thinks so.
        Hide
        Michael McCandless added a comment -

        It would be best if we could keep the API but w/ no trap for -client java users.

        But, I think we should at least revert from FST's Builder.UnCompiledNode arc reallocation; that's a hot spot...

        Show
        Michael McCandless added a comment - It would be best if we could keep the API but w/ no trap for -client java users. But, I think we should at least revert from FST's Builder.UnCompiledNode arc reallocation; that's a hot spot...
        Hide
        Uwe Schindler added a comment -

        Reverted generic ArrayUtil.grow(T[]) in trunk and 3.x (rev. 1045319).

        Show
        Uwe Schindler added a comment - Reverted generic ArrayUtil.grow(T[]) in trunk and 3.x (rev. 1045319).

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development