Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The FSTs produced by Builder can be further shrunk if you are willing
      to spend highish transient RAM to do so... our Builder today tries
      hard not to use much RAM (and has options to tweak down the RAM usage,
      in exchange for a somewhat larger FST), even when building immense FSTs.

      But for apps that can afford highish transient RAM to get a smaller
      net FST, I think we should offer packing.

      1. Perf.java
        2 kB
        Michael McCandless
      2. LUCENE-3725.patch
        107 kB
        Michael McCandless
      3. LUCENE-3725.patch
        121 kB
        Michael McCandless
      4. LUCENE-3725.patch
        122 kB
        Michael McCandless

        Activity

        Dawid Weiss added a comment -

        If we didn't do this then the packer would have to use even less RAM-efficient data structures (eg Map<Int,X>) I think?

        Yes, this is exactly what I used (although I used primitive-backed hash maps from HPPC), but the overhead will be there, sure.

        Second, the format written by the packer is tightly coupled with the FST reading, ie there are sizable differences when reading packed vs unpacked FST.

        Right. I have a different design in which the FSA is an abstract superclass and the implementation provides methods to walk the edges/nodes. The writers simply walk that structure when serializing. Reading is delegated to a reader that can understand a particular format (and then provides a traversal implementation over raw bytes).

        I do have major simplifications over Lucene's version so this wouldn't be easy to do in Lucene's case without sacrificing performance.
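        The separation described above can be sketched roughly like this. All names here (AbstractFsa, FsaSerializer, TinyFsa) are hypothetical illustrations, not Morfologik's or Lucene's actual API: the automaton is an abstract type that only exposes node/arc traversal, any serializer depends solely on that contract, and each on-disk format would supply its own reader re-implementing the same traversal over raw bytes.

```java
// Hypothetical sketch of the "abstract FSA superclass" design: serializers
// walk a traversal contract, independent of how the automaton is stored.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

abstract class AbstractFsa {
    abstract int root();
    abstract List<int[]> arcs(int node);       // each arc: {label, target}
}

final class FsaSerializer {
    // A writer only needs the traversal contract, not a concrete format.
    static List<String> linearize(AbstractFsa fsa) {
        List<String> out = new ArrayList<>();
        walk(fsa, fsa.root(), out);
        return out;
    }
    private static void walk(AbstractFsa fsa, int node, List<String> out) {
        for (int[] arc : fsa.arcs(node)) {
            out.add(node + "-" + (char) arc[0] + "->" + arc[1]);
            walk(fsa, arc[1], out);
        }
    }
}

// Minimal in-memory implementation for illustration: root 0 -a-> 1 -b-> 2.
final class TinyFsa extends AbstractFsa {
    int root() { return 0; }
    List<int[]> arcs(int node) {
        if (node == 0) return Arrays.asList(new int[]{'a', 1});
        if (node == 1) return Arrays.asList(new int[]{'b', 2});
        return new ArrayList<>();
    }
}
```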

        Michael McCandless added a comment -

        Thanks Dawid.

        Yes, this is pretty much the top-n nodes reordering that I did, albeit without any optimization of how many n to take (the hardcoded magic constants should probably be justified somehow? Or replaced by a default in FST somewhere?).

        Deciding how many nodes to reorder is I think hard – I failed to provide any sensible fast heuristic for that and I simply do a simulated annealing to find a local optimum.

        Yeah the algo is simplistic now... and it does force the caller to pick the params (though minInCountDeref=3 and maxDerefNodes=Inf are probably pretty good)... we can and should make it more sophisticated over time. We have at least one spare bit still to use in the arc flags.

        One thing I was wondering is why you decided to integrate the packer with the fst – wouldn't it be cleaner to separate packing from construction? Granted, this requires a double pass over the fst nodes and more intermediate memory but it wouldn't add any more complexity to the builder which is already, ehm, a bit complex . You can compare this design in Morfologik:

        Well... it's tricky. First, I decided to change node targets to ords so that pack could use an array to (relatively!) efficiently hold node data, eg inCount, newAddress, etc. That required making the first-pass FST "modal" (deref'd or not). If we didn't do this then the packer would have to use even less RAM-efficient data structures (eg Map<Int,X>) I think?
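        A toy illustration (not Lucene code) of why ord-based targets matter to the packer: with dense ords, per-node metadata such as inCount or newAddress fits in flat primitive arrays, while sparse byte addresses force a boxed map with sizable per-entry overhead.

```java
// Hypothetical comparison: dense ords allow primitive arrays for per-node
// metadata; raw (sparse) addresses would need a Map<Integer,Integer>.
import java.util.HashMap;
import java.util.Map;

final class NodeStats {
    // Dense ords: one primitive slot per node, no boxing, no entry objects.
    static int[] inCountByOrd(int nodeCount, int[] targetOrds) {
        int[] inCount = new int[nodeCount];
        for (int ord : targetOrds) inCount[ord]++;
        return inCount;
    }

    // Sparse addresses: forced into a boxed map, with entry and boxing
    // overhead per node (this is what HPPC-style primitive maps reduce).
    static Map<Integer, Integer> inCountByAddress(int[] targetAddresses) {
        Map<Integer, Integer> inCount = new HashMap<>();
        for (int addr : targetAddresses) inCount.merge(addr, 1, Integer::sum);
        return inCount;
    }
}
```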

        Second, the format written by the packer is tightly coupled with the FST reading, ie there are sizable differences when reading packed vs unpacked FST.

        But I agree it'd be cleaner if we could move packing out (eg Util.pack), and more strongly decouple packing from FST format...

        One thing I'd really like to somehow do is create different classes for FST writing vs reading, and different classes for each format. We now have write-ord, write-non-ord, read-packed, read-unpacked (and even the two readers have 3 different modes depending on INPUT_TYPE).

        Dawid Weiss added a comment -

        I had the time to look at the patch, finally. Yes, this is pretty much the top-n nodes reordering that I did, albeit without any optimization of how many n to take (the hardcoded magic constants should probably be justified somehow? Or replaced by a default in FST somewhere?). Deciding how many nodes to reorder is I think hard – I failed to provide any sensible fast heuristic for that and I simply do a simulated annealing to find a local optimum.

        One thing I was wondering is why you decided to integrate the packer with the fst – wouldn't it be cleaner to separate packing from construction? Granted, this requires a double pass over the fst nodes and more intermediate memory, but it wouldn't add any more complexity to the builder, which is already, ehm, a bit complex. You can compare this design in Morfologik:

        Builder:
        http://morfologik.svn.sourceforge.net/viewvc/morfologik/morfologik-stemming/trunk/morfologik-fsa/src/main/java/morfologik/fsa/FSABuilder.java?revision=343&view=markup

        Serialization (optimized or not, takes any FSA on input) (method #linearize):
        http://morfologik.svn.sourceforge.net/viewvc/morfologik/morfologik-stemming/trunk/morfologik-fsa/src/main/java/morfologik/fsa/CFSA2Serializer.java?revision=343&view=markup

        In any case, the patch looks good to me.

        Robert Muir added a comment -

        Just some numbers with another (CJK) fst I have been playing with, this one uses BYTE2 + "SingleByteOutput"
        Before:
        Finished: 326915 words, 77222 nodes, 358677 arcs, 2617255 bytes...
        Zipped: 1812629 bytes
        Packed:
        Finished: 326915 words, 77222 nodes, 358677 arcs, 2027763 bytes...
        Zipped: 1735486 bytes

        Uwe Schindler added a comment -

        Yes, I had the same problem at another place in Viterbi, too. It's not easy to understand. In my case I wanted to reuse a scratch IntsRef for the targetMap retrieval, but that failed horribly.

        Michael McCandless added a comment -

        OK I added the FST.BytesReader to TokenInfoFST.findTargetArc (matching the FST API itself)... and discovered Viterbi isn't thread private (the instance is shared across threads)... so I just pull and reuse the reader in Viterbi.build.

        Uwe Schindler added a comment -

        Ok, should be easy to implement. And it's an internal class, so it's not risky. We should just not make public APIs that way.

        Michael McCandless added a comment -

        Right, but eg Viterbi is a thread-private instance, right? So, eg, Viterbi could hold an FST.BytesReader that it passes into TokenInfoFST.findTargetArc... really TokenInfoFST is just a "thin wrapper" around an FST.

        Uwe Schindler added a comment -

        About that one:

        // TODO: could require caller to pass in the
        // FSTReader... since a tokenStream is thread private
        // anyway...
        return fst.findTargetArc(ch, follow, arc, fst.getBytesReader(0));
        

        The problem is that the TokenInfoFST is a singleton, right? I am not sure that the caller should pass that in; it makes it too complicated. The change a few lines before is fine (getting the reader before loop execution), but a class that's used by multiple threads should not reuse across method calls.
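        The thread-safety point can be sketched in isolation (SharedLookup and Reader below are hypothetical stand-ins for TokenInfoFST and FST.BytesReader, not the real classes): a stateful scratch object may be hoisted out of a loop within one call, but must never be cached in a field of an instance shared across threads.

```java
// Sketch: per-call reuse of a stateful scratch reader is safe; caching it
// in a field of a shared singleton would be a data race.
final class SharedLookup {
    // WRONG for a shared singleton: a cached scratch field is a data race.
    // private final Reader cached = newReader();

    Reader newReader() { return new Reader(); }

    // OK: the reader is created once per call, reused across the loop, and
    // never escapes, so the shared instance itself stays stateless.
    long sumTargets(int[] labels) {
        Reader r = newReader();        // hoisted out of the loop, but call-local
        long sum = 0;
        for (int label : labels) sum += r.target(label);
        return sum;
    }

    static final class Reader {        // stateful scratch, like FST.BytesReader
        private int pos;
        int target(int label) { pos += label; return pos; }
    }
}
```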

        Uwe Schindler added a comment -

        +1 to enable packing!

        Michael McCandless added a comment -

        OK I turned packing back on for Kuromoji's TokenInfoDict... this
        reduces size from 1954846 to 1498215 bytes (23% smaller = 445.9 KB).

        And... now the JAR is a bit smaller: 4533833 vs 4570053 bytes (~35
        KB). What changed was... I tweaked the params to pack (there are 2
        int params) and got better packing than before.

        I wrote a simple perf test (Perf.java attached)... and as far as I can
        tell the perf change is within hotspot noise... with trunk I get this:

        1366 msec; 688.1405563689605 tokens/msec
        1006 msec; 934.3936381709741 tokens/msec
        1020 msec; 921.5686274509804 tokens/msec
        938 msec; 1002.1321961620469 tokens/msec
        937 msec; 1003.2017075773746 tokens/msec
        942 msec; 997.8768577494692 tokens/msec
        938 msec; 1002.1321961620469 tokens/msec
        940 msec; 1000.0 tokens/msec
        939 msec; 1001.0649627263045 tokens/msec
        939 msec; 1001.0649627263045 tokens/msec
        

        And with the packed FST I get this:

        1366 msec; 688.1405563689605 tokens/msec
        1003 msec; 937.1884346959123 tokens/msec
        1014 msec; 927.0216962524655 tokens/msec
        934 msec; 1006.423982869379 tokens/msec
        935 msec; 1005.3475935828877 tokens/msec
        936 msec; 1004.2735042735043 tokens/msec
        935 msec; 1005.3475935828877 tokens/msec
        938 msec; 1002.1321961620469 tokens/msec
        936 msec; 1004.2735042735043 tokens/msec
        935 msec; 1005.3475935828877 tokens/msec
        

        But (annoyingly, as usual!) the results can differ quite a bit
        depending on how hotspot flips coins on startup...

        Michael McCandless added a comment -

        OK, I agree... I'll run perf tests of Kuromoji and post back...

        I'll re-check the JAR size impact; I think it was only a bit larger.

        Robert Muir added a comment -

        Well it would be good to test kuromoji performance too before cutting over.

        if the performance is not impacted, I tend to agree with Uwe that decreasing memory usage is a good thing,
        even if the jar is slightly larger (how much?), especially in this case as it costs no code complexity to the analyzer.

        Uwe Schindler added a comment -

        Thanks @ Dawid and Mike. I expected something like that, but when I read the issue, this was just funny! Thanks for the clarification; it seems that there was background information (maybe only discussed on IRC) missing.

        About Kuromoji: In my opinion, the JAR size is not the major thing. The reason Robert and I were compacting structures was to a) reduce the SVN checkout size (this should improve with packing) and b) reduce the memory footprint of the Analyzer (also an improvement). As the FST is only built once, when rebuilding the binary FST from the dict, and not at runtime, should we not pack for memory efficiency?

        Michael McCandless added a comment -

        P.S.: Does this change FST file format again?

        Oh sorry forgot to answer: yes!

        Michael McCandless added a comment -

        That's exactly right – the FST got smaller, but harder for zip to compress, so the JAR got bigger.

        I haven't turned it on anywhere in Lucene yet... I think it needs to be an explicit choice since it requires extra RAM after building.

        Dawid Weiss added a comment -

        A smaller automaton may result in a larger JAR. A smaller automaton will usually have a denser representation and larger entropy. I've observed a similar phenomenon when I was working on automata compression for morfologik – the ZIP file didn't grow there but it remained pretty much constant as the automaton was (progressively) made smaller and smaller.
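        The effect is easy to reproduce with java.util.zip: high-entropy (dense) bytes deflate far worse than sparse, repetitive ones, so a smaller-but-denser payload can end up costing more inside a JAR. This is a generic demonstration of the phenomenon, not a measurement of the actual FST bytes.

```java
// Demonstrates why a smaller, denser automaton can zip worse: DEFLATE
// thrives on redundancy, and packing removes exactly that redundancy.
import java.util.zip.Deflater;

final class EntropyDemo {
    static int deflatedSize(byte[] data) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length * 2 + 64];  // ample output buffer
        int n = d.deflate(buf);                        // bytes actually written
        d.end();
        return n;
    }
}
```

        Compressing 100 KB of zeros yields a few hundred bytes, while 80 KB of random bytes stays near 80 KB after DEFLATE: the "smaller" input produces the larger archive member.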

        Uwe Schindler added a comment - edited

        Mike: This is somehow the opposite of the last comment:

        I only turned packing on for one thing: the Kuromoji FST (shrank by 14%, 272 KB).

        vs.:

        I turned off packing for Kuromoji, since the JAR actually got larger (despite FST getting smaller)... strange.

        What's going on? Does this mean the FST and the TokenInfoDict$fst.dat file were smaller, but the resulting JAR file bigger – could this be a gzip/deflate compression effect? I can only repeat Simon: https://www.destroyallsoftware.com/talks/wat

        As you wanted to enable this because of Kuromoji and now don't use it at all, is there any use case in Lucene for the compaction?

        I am just confused!

        P.S.: Does this change FST file format again?

        Michael McCandless added a comment -

        New patch with all nocommits removed. I also optimized Util.get and
        FST.findTargetArc a bit. I turned off packing for Kuromoji, since the
        JAR actually got larger (despite FST getting smaller)... strange.

        I think it's ready!

        Michael McCandless added a comment -

        Initial patch.... it has tons of nocommits but I think it's basically
        working correctly.

        The packing is fairly simplistic now, but we can improve it with time
        (I know Dawid has done all sorts of cool things!): it chooses the top
        N nodes (sorted by incoming arc count) and saves them dereferenced so
        that nodes w/ high in-count get a "low" address. It also saves the
        pointer as delta vs current position, if that would take fewer bytes.
        The bytes are then in "forward" order.
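        The "delta vs current position" trick can be sketched as follows (PointerEncoding is a hypothetical helper, not the patch's actual code); vIntLen mirrors Lucene-style variable-length int sizing, 7 payload bits per byte:

```java
// Sketch: store a node target either as an absolute address or as a delta
// from the current write position, whichever needs fewer vInt bytes.
final class PointerEncoding {
    // Number of bytes a variable-length int encoding of v would take.
    static int vIntLen(long v) {
        int n = 1;
        while (v >= 0x80) { v >>>= 7; n++; }
        return n;
    }

    // Byte cost of the cheaper of the two encodings for this target.
    static int cheaperCost(long currentPos, long targetAddress) {
        long delta = Math.abs(currentPos - targetAddress);
        return Math.min(vIntLen(targetAddress), vIntLen(delta));
    }
}
```

        Nearby targets win big with deltas (1 byte instead of 3+), while far-away targets fall back to the absolute address; the real patch additionally needs a flag bit to record which encoding was used.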

        The size savings varies by FST... eg, for the all-Wikipedia-terms FSA
        (no outputs) it reduces byte size by 21%. If I map to ords (FST) then
        it's only 13% (I don't do anything to pack the outputs now, so the
        bytes required for them are unchanged).

        While the resulting FST is smaller, there is some hit to lookup (~8%
        for the Wikipedia ord FST), because we have to deref some nodes.

        I only turned packing on for one thing: the Kuromoji FST (shrank by
        14%, 272 KB).


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development