Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This seems to be only used for the kuromoji dictionaries, but we could easily rebuild those dictionaries with packing disabled.

      1. LUCENE-7531.patch
        58 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          jpountz Adrien Grand added a comment -

          Here is a patch.

          Show
          jpountz Adrien Grand added a comment - Here is a patch.
          Hide
          mikemccand Michael McCandless added a comment -

          +1

          Show
          mikemccand Michael McCandless added a comment - +1
          Hide
          dsmiley David Smiley added a comment -

          I like the reduction in complexity. What's the motivation? Does the packing feature just seem not worth it?

          Show
          dsmiley David Smiley added a comment - I like the reduction in complexity. What's the motivation? Does the packing feature just seem not worth it?
          Hide
          dweiss Dawid Weiss added a comment -

          Well, it's sad to see the stuff I came up with (and Mike implemented) go away... But more seriously – this does seem to impact large automata. Can you recode the existing automata and see how much we lose by removing packing? Looking at the patch target addresses are still vint-encoded; if I recall right the compression ratio gained by packing was significant (compared to baseline fst), but a small fraction of overall input size. So a fst gain of a few megabytes for data size that is several hundred megabytes is indeed worth cutting the additional complexity of fst construction.

          +1 to remove it, but some stats on dictionary sizes before/after would be nice.

          Show
          dweiss Dawid Weiss added a comment - Well, it's sad to see the stuff I came up with (and Mike implemented) go away... But more seriously – this does seem to impact large automata. Can you recode the existing automata and see how much we lose by removing packing? Looking at the patch target addresses are still vint-encoded; if I recall right the compression ratio gained by packing was significant (compared to baseline fst), but a small fraction of overall input size. So a fst gain of a few megabytes for data size that is several hundred megabytes is indeed worth cutting the additional complexity of fst construction. +1 to remove it, but some stats on dictionary sizes before/after would be nice.
          Hide
          jpountz Adrien Grand added a comment -

          The thing that made me open that issue is not really that I think packing is not a good idea: actually I have no idea. But I'd like to clean up PackedInts a bit, and FST packing is one user of this API. And I noticed there is nothing in the code base that enables packing on FSTs, except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled.

          Show
          jpountz Adrien Grand added a comment - The thing that made me open that issue is not really that I think packing is not a good idea: actually I have no idea. But I'd like to clean up PackedInts a bit, and FST packing is one user of this API. And I noticed there is nothing in the code base that enables packing on FSTs, except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled.
          Hide
          dweiss Dawid Weiss added a comment -

          The code may not call for packed dictionaries, but there may be existing dictionaries that are already packed. Anyway, I don't think this is that crucial, really.

          except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled.

          This should never happen, something is wrong. The way automata compression was implemented in Morfologik would nearly always decrease the size of the automaton, especially in the first few optimization/ reshuffling steps. [1].

          [1] http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf

          Show
          dweiss Dawid Weiss added a comment - The code may not call for packed dictionaries, but there may be existing dictionaries that are already packed. Anyway, I don't think this is that crucial, really. except the kuromoji dictionaries, which are smaller with packing disabled than with packing enabled. This should never happen, something is wrong. The way automata compression was implemented in Morfologik would nearly always decrease the size of the automaton, especially in the first few optimization/ reshuffling steps. [1] . [1] http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf
          Hide
          mikemccand Michael McCandless added a comment -

          This should never happen, something is wrong.

          I also confirmed that with packing enabled, the Kuromoji FSTs got bigger, which I agree is messed up! Net/net I must not have implemented it properly

          I even tested having the terms index for the default postings format do packing, indexing Wikipedia English docs, and it also increased the FST size!

          Net/net it's hairy complex code with apparently dubious value at this point.

          Show
          mikemccand Michael McCandless added a comment - This should never happen, something is wrong. I also confirmed that with packing enabled, the Kuromoji FSTs got bigger, which I agree is messed up! Net/net I must not have implemented it properly I even tested having the terms index for the default postings format do packing, indexing Wikipedia English docs, and it also increased the FST size! Net/net it's hairy complex code with apparently dubious value at this point.
          Hide
          dweiss Dawid Weiss added a comment -

          Fair enough. +1 for simpler implementation.

          Show
          dweiss Dawid Weiss added a comment - Fair enough. +1 for simpler implementation.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit c4c5e868d2304148c146d52833ac2c80b0541dc3 in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c4c5e86 ]

          LUCENE-7531: Removed packing support from FST.

          Show
          jira-bot ASF subversion and git services added a comment - Commit c4c5e868d2304148c146d52833ac2c80b0541dc3 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c4c5e86 ] LUCENE-7531 : Removed packing support from FST.
          Hide
          jpountz Adrien Grand added a comment -

          Thanks all for having a look!

          Show
          jpountz Adrien Grand added a comment - Thanks all for having a look!

            People

            • Assignee:
              Unassigned
              Reporter:
              jpountz Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development