Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (7.0)
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This seems to be only used for the kuromoji dictionaries, but we could easily rebuild those dictionaries with packing disabled.

      1. LUCENE-7531.patch
        58 kB
        Adrien Grand

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch.

        Show
        jpountz Adrien Grand added a comment - Here is a patch.
        Hide
        mikemccand Michael McCandless added a comment -

        +1

        Show
        mikemccand Michael McCandless added a comment - +1
        Hide
        dsmiley David Smiley added a comment -

        I like the reduction in complexity. What's the motivation? Does the packing feature just seem not worth it?

        Show
        dsmiley David Smiley added a comment - I like the reduction in complexity. What's the motivation? Does the packing feature just seem not worth it?
        Hide
        dweiss Dawid Weiss added a comment -

        Well, it's sad to see the stuff I came up with (and Mike implemented) go away... But more seriously – this does seem to impact large automata. Can you recode the existing automata and see how much we lose by removing packing? Looking at the patch target addresses are still vint-encoded; if I recall right the compression ratio gained by packing was significant (compared to baseline fst), but a small fraction of overall input size. So a fst gain of a few megabytes for data size that is several hundred megabytes is indeed worth cutting the additional complexity of fst construction.

        +1 to remove it, but some stats on dictionary sizes before/after would be nice.

        Show
        dweiss Dawid Weiss added a comment - Well, it's sad to see the stuff I came up with (and Mike implemented) go away... But more seriously – this does seem to impact large automata. Can you recode the existing automata and see how much we lose by removing packing? Looking at the patch target addresses are still vint-encoded; if I recall right the compression ratio gained by packing was significant (compared to baseline fst), but a small fraction of overall input size. So a fst gain of a few megabytes for data size that is several hundred megabytes is indeed worth cutting the additional complexity of fst construction. +1 to remove it, but some stats on dictionary sizes before/after would be nice.
        Hide
        jpountz Adrien Grand added a comment -

        The thing that made me open that issue is not really that I think packing is not a good idea: actually I have no idea. But I'd like to clean up PackedInts a bit, and FST packing is one user of this API. And I noticed there is nothing in the code base that enables packing on FSTs, except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled.

        Show
        jpountz Adrien Grand added a comment - The thing that made me open that issue is not really that I think packing is not a good idea: actually I have no idea. But I'd like to clean up PackedInts a bit, and FST packing is one user of this API. And I noticed there is nothing in the code base that enables packing on FSTs, except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled.
        Hide
        dweiss Dawid Weiss added a comment -

        The code may not call for packed dictionaries, but there may be existing dictionaries that are already packed. Anyway, I don't think this is that crucial, really.

        except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled.

        This should never happen, something is wrong. The way automata compression was implemented in Morfologik would nearly always decrease the size of the automaton, especially in the first few optimization/ reshuffling steps. [1].

        [1] http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf

        Show
        dweiss Dawid Weiss added a comment - The code may not call for packed dictionaries, but there may be existing dictionaries that are already packed. Anyway, I don't think this is that crucial, really. except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled. This should never happen, something is wrong. The way automata compression was implemented in Morfologik would nearly always decrease the size of the automaton, especially in the first few optimization/ reshuffling steps. [1] . [1] http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf
        Hide
        mikemccand Michael McCandless added a comment -

        This should never happen, something is wrong.

        I also confirmed that with packing enabled, the Kuromoji FSTs got bigger, which I agree is messed up! Net/net I must not have implemented it properly

        I even tested having the terms index for the default postings format do packing, indexing Wikipedia English docs, and it also increased the FST size!

        Net/net it's hairy complex code with apparently dubious value at this point.

        Show
        mikemccand Michael McCandless added a comment - This should never happen, something is wrong. I also confirmed that with packing enabled, the Kuromoji FSTs got bigger, which I agree is messed up! Net/net I must not have implemented it properly I even tested having the terms index for the default postings format do packing, indexing Wikipedia English docs, and it also increased the FST size! Net/net it's hairy complex code with apparently dubious value at this point.
        Hide
        dweiss Dawid Weiss added a comment -

        Fair enough. +1 for simpler implementation.

        Show
        dweiss Dawid Weiss added a comment - Fair enough. +1 for simpler implementation.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit c4c5e868d2304148c146d52833ac2c80b0541dc3 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c4c5e86 ]

        LUCENE-7531: Removed packing support from FST.

        Show
        jira-bot ASF subversion and git services added a comment - Commit c4c5e868d2304148c146d52833ac2c80b0541dc3 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c4c5e86 ] LUCENE-7531 : Removed packing support from FST.
        Hide
        jpountz Adrien Grand added a comment -

        Thanks all for having a look!

        Show
        jpountz Adrien Grand added a comment - Thanks all for having a look!

          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development