LUCENE-3699 (Lucene - Core)

kuromoji dictionary could be more compact

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Reading through the IPADIC documentation, I realized we are storing a lot of redundant
      information: for example, the connection costs for bigram weights are based on POS+inflection
      data, so it is redundant to also encode the POS and inflection data separately for each entry.

      With the patch, dictionary access is also faster and simpler, and the TokenInfoDictionary is 1.5MB smaller.
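
      To make the redundancy concrete, here is a minimal sketch of the general deduplication
      technique (all names are invented for illustration, not Kuromoji's actual classes): intern
      each distinct POS+inflection combination once, and store only a short id per dictionary entry.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      class PosInterner {
        private final Map<String, Short> ids = new HashMap<>();
        private final List<String> table = new ArrayList<>();

        // Returns a compact id for this POS+inflection combination,
        // assigning a new one only the first time it is seen.
        short intern(String posPlusInflection) {
          Short id = ids.get(posPlusInflection);
          if (id == null) {
            id = (short) table.size();
            ids.put(posPlusInflection, id);
            table.add(posPlusInflection);
          }
          return id;
        }

        // At analysis time, recovering the POS string is a single array lookup.
        String lookup(short id) {
          return table.get(id);
        }
      }

      Because there are far fewer distinct POS+inflection combinations than entries, each entry
      pays two bytes instead of repeating the full string.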

      Attachments

      1. LUCENE-3699.patch (21 kB, Robert Muir)
      2. LUCENE-3699_more.patch (16 kB, Robert Muir)

        Activity

        Robert Muir added a comment -

        Uwe improved the memory usage a lot too (e.g. parallel arrays)... thanks for this!

        Our uncompressed size is 9MB now, which I think is good for this dataset.

        My motivation for improving the size isn't that Kuromoji was ever really wasteful;
        rather, I think size comes with the territory for CJK (see your JVM/ICU if you don't believe me).

        On the server it doesn't really matter, but a smaller size can make Kuromoji more attractive
        for other use cases, like integration into desktop applications. It also makes retrieval of
        attributes efficient for the analysis chain (e.g. getting the part of speech just means
        reading a short).

        Finally, languages aren't static, and we can only expect dictionaries to grow in the future,
        perhaps with even more attributes (e.g. naist-jdic is already 25% larger).

        Uwe Schindler added a comment -

        > Uwe, the initial Kuromoji JAR was 11MB. I believe Robert also has improved runtime memory usage.

        I referred to the uncompressed size!

        Christian Moen added a comment -

        Good work, Robert!

        Uwe, the initial Kuromoji JAR was 11MB. I believe Robert also has improved runtime memory usage.

        Dawid Weiss added a comment -

        Oh, ok, sorry – I had the impression everything was encoded into a single automaton.
        Compression strategies vary and there is hardly ever a single winner. What you did
        (extracting the lookup data into separate data structures) looks pretty compact to me.

        Robert Muir added a comment -

        Dawid, currently the FST is not really the biggest culprit:

        -rw-r--r--   1 rmuir  staff    65568 Jan 16 16:35 CharacterDefinition.dat
        -rw-r--r--   1 rmuir  staff  2624540 Jan 16 16:35 ConnectionCosts.dat
        -rw-r--r--   1 rmuir  staff  4337216 Jan 17 03:22 TokenInfoDictionary$buffer.dat
        -rw-r--r--   1 rmuir  staff  1954846 Jan 16 16:35 TokenInfoDictionary$fst.dat
        -rw-r--r--   1 rmuir  staff    54870 Jan 16 16:35 TokenInfoDictionary$posDict.dat
        -rw-r--r--   1 rmuir  staff   392165 Jan 17 03:22 TokenInfoDictionary$targetMap.dat
        -rw-r--r--   1 rmuir  staff      311 Jan 17 03:22 UnknownDictionary$buffer.dat
        -rw-r--r--   1 rmuir  staff     4111 Jan 16 16:35 UnknownDictionary$posDict.dat
        -rw-r--r--   1 rmuir  staff       69 Jan 16 16:35 UnknownDictionary$targetMap.dat
        

        As for the FST, our output is just an increasing ord (according to term sort order),
        so I think it should be pretty good? Is there something more efficient than this?

        Basically there are about 330k headwords and 390k words, so some words have different
        parts of speech, readings, etc. for the same surface form.

        The $fst.dat is currently FST<int>, where the int is just an ord into $targetMap.dat,
        which is really an int[][] (it maps the output ord from the FST to an int[] containing
        the offsets of all word entries for that surface form).

        But the 'meat' describing the entries is in $buffer.dat: for each word this holds its cost,
        part of speech, base form (stem), reading, pronunciation, etc. As you can see, we are down
        to about 11 bytes per lemma on average, but this 'metadata' is still the biggest piece;
        that's what I was working on shrinking in this issue.
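
        For readers unfamiliar with that layout, here is a minimal sketch of the lookup chain with
        invented names (the real code lives in the Kuromoji module; the buffer layout shown is an
        assumption for illustration):

        import java.nio.ByteBuffer;

        class DictionaryLookupSketch {
          // $targetMap.dat: conceptually an int[][]; row = FST output ord,
          // values = offsets of the word entries sharing that surface form
          private final int[][] targetMap;
          // $buffer.dat: packed per-word metadata (cost, POS, base form, reading, ...)
          private final ByteBuffer buffer;

          DictionaryLookupSketch(int[][] targetMap, ByteBuffer buffer) {
            this.targetMap = targetMap;
            this.buffer = buffer;
          }

          // Given the ord the FST returns for a surface form, fetch the cost of each
          // word entry for it (several entries per ord when one surface form has
          // multiple parts of speech/readings).
          int[] costsFor(int fstOrd) {
            int[] offsets = targetMap[fstOrd];
            int[] costs = new int[offsets.length];
            for (int i = 0; i < offsets.length; i++) {
              // assume the cost is a short at the start of each entry (illustrative layout)
              costs[i] = buffer.getShort(offsets[i]);
            }
            return costs;
          }
        }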

        Dawid Weiss added a comment -

        If it's something that is statically compiled (in batch mode), then one could reorder states
        (nodes) to minimize the variable-length encoding of arc pointers globally. This is something
        I did for fst5 automata and it worked very nicely (because the distribution of in-node
        degrees is exponential-like, moving a few nodes with many in-links decreases the global
        automaton size significantly).

        I don't think there is any fast algorithm to do this optimally. I used a simple heuristic:
        calculate the in-link degree of each state, sort in descending order, then re-order the N
        top-most nodes so that they're at the front of the serialized automaton. Pick N using any
        heuristic you like (a constant, an in-link cutoff; I used a sort of simulated annealing
        approach and probed around).
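
        A rough sketch of that heuristic (invented types, not code from the paper or from Lucene):
        count in-links per node, sort descending, and serialize the top N first so the
        most-referenced nodes get the shortest variable-length addresses.

        import java.util.Arrays;
        import java.util.Comparator;

        class NodeReorderSketch {
          // arcs[node] holds the target node of each outgoing arc; n is how many
          // high in-degree nodes to pull to the front of the serialization order.
          static int[] serializationOrder(int[][] arcs, int n) {
            int numNodes = arcs.length;
            int[] inDegree = new int[numNodes];
            for (int[] nodeArcs : arcs) {
              for (int target : nodeArcs) {
                inDegree[target]++; // count incoming links per node
              }
            }
            Integer[] byDegree = new Integer[numNodes];
            for (int i = 0; i < numNodes; i++) byDegree[i] = i;
            // most-referenced nodes first: their addresses are written most often
            Arrays.sort(byDegree, Comparator.comparingInt(v -> -inDegree[v]));
            boolean[] frontLoaded = new boolean[numNodes];
            int[] order = new int[numNodes];
            int pos = 0;
            for (int i = 0; i < Math.min(n, numNodes); i++) {
              order[pos++] = byDegree[i]; // top N by in-degree go first
              frontLoaded[byDegree[i]] = true;
            }
            for (int i = 0; i < numNodes; i++) {
              if (!frontLoaded[i]) order[pos++] = i; // the rest keep their original order
            }
            return order;
          }
        }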

        The presentation about the paper in question is here:
        http://ciaa-fsmnlp-2011.univ-tours.fr/ciaa/upload/files/Weiss-Daciuk.pdf

        I can't publish the PDF of the paper publicly (Springer link below), but I can send a copy
        if somebody is interested. The concept should be clear without the paper anyway:
        http://www.springerlink.com/content/60r47952k610l822/

        Robert Muir added a comment -

        Thanks for taking a look, Uwe: I committed this one too.

        Uwe Schindler added a comment -

        +1

        Robert Muir added a comment -

        Here is a second patch removing ~1.3MB from the dictionary (the jar file is reduced to 4.5MB).

        There are two changes (both sketched below):

        • Base forms (stems) typically share a prefix with the surface form, so write the shared
          prefix length and only the suffix.
        • For many entries (e.g. ones with no kanji), the reading is the same as the surface form
          once you map any hiragana to katakana, so just flag that case with a bit.

        There is a TODO in the patch where we could save another 150KB, but I think it would
        complicate the reader at this time, so I didn't do it.
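
        A minimal sketch of those two encoding tricks, assuming an invented entry writer (this is
        not the actual patch code, just an illustration of the idea):

        import java.io.DataOutput;
        import java.io.IOException;

        class EntryEncoderSketch {
          static final int READING_SAME_AS_SURFACE = 1; // illustrative flag bit

          // Trick 1: store the shared prefix length, then only the suffix of the
          // base form that differs from the surface form.
          static void writeBaseForm(DataOutput out, String surface, String baseForm) throws IOException {
            int shared = 0;
            int max = Math.min(surface.length(), baseForm.length());
            while (shared < max && surface.charAt(shared) == baseForm.charAt(shared)) {
              shared++;
            }
            out.writeByte(shared);             // length of the common prefix
            String suffix = baseForm.substring(shared);
            out.writeByte(suffix.length());
            for (int i = 0; i < suffix.length(); i++) {
              out.writeChar(suffix.charAt(i)); // only the differing suffix is stored
            }
          }

          // Trick 2: if the reading equals the surface form after mapping hiragana to
          // katakana, set a flag bit instead of storing the reading again.
          static int readingFlag(String surface, String reading) {
            return toKatakana(surface).equals(reading) ? READING_SAME_AS_SURFACE : 0;
          }

          // Hiragana (U+3041..U+3096) maps to katakana by adding 0x60.
          static String toKatakana(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (char c : s.toCharArray()) {
              sb.append(c >= 0x3041 && c <= 0x3096 ? (char) (c + 0x60) : c);
            }
            return sb.toString();
          }
        }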

        Robert Muir added a comment -

        Thanks Uwe, the jar is now slightly under 5MB.

        Uwe Schindler added a comment -

        I forgot: +1 to commit!

        Uwe Schindler added a comment -

        Hey, another nice simplification, cool! You will see, next week we will have reduced the
        dict by another few megabytes. There are still possibilities to compress ConnectionCosts. *g*

        The dict size started at 56 MB, right?

        Robert Muir added a comment -

        Here's a patch: all tests pass.


          People

          • Assignee: Unassigned
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 2
