Dawid, currently the FST is not really the biggest culprit:
-rw-r--r-- 1 rmuir staff 65568 Jan 16 16:35 CharacterDefinition.dat
-rw-r--r-- 1 rmuir staff 2624540 Jan 16 16:35 ConnectionCosts.dat
-rw-r--r-- 1 rmuir staff 4337216 Jan 17 03:22 TokenInfoDictionary$buffer.dat
-rw-r--r-- 1 rmuir staff 1954846 Jan 16 16:35 TokenInfoDictionary$fst.dat
-rw-r--r-- 1 rmuir staff 54870 Jan 16 16:35 TokenInfoDictionary$posDict.dat
-rw-r--r-- 1 rmuir staff 392165 Jan 17 03:22 TokenInfoDictionary$targetMap.dat
-rw-r--r-- 1 rmuir staff 311 Jan 17 03:22 UnknownDictionary$buffer.dat
-rw-r--r-- 1 rmuir staff 4111 Jan 16 16:35 UnknownDictionary$posDict.dat
-rw-r--r-- 1 rmuir staff 69 Jan 16 16:35 UnknownDictionary$targetMap.dat
As far as the FST goes, our output is just an increasing ord (in term sort order),
so I think it should compress pretty well? Is there something more efficient than this?
Basically there are about 330k headwords and 390k word entries, so some surface forms
have multiple entries with different parts of speech, readings, etc.
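Just to illustrate the headword/entry distinction (a plain-Java sketch with made-up data, not the actual dictionary builder, which uses Lucene's FST API): each unique surface form gets one increasing ord in term sort order, even when several dictionary rows share it.

```java
// Sketch: assign each unique surface form (headword) an increasing ord in
// term sort order. Multiple dictionary rows can share one surface form.
// Data and class name are illustrative only.
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

public class OrdAssignSketch {
    public static void main(String[] args) {
        String[] rows = {"走る", "走る", "行く", "来る"}; // 4 word entries, 3 headwords
        TreeSet<String> headwords = new TreeSet<>(Arrays.asList(rows));
        Map<String, Integer> ord = new LinkedHashMap<>();
        int next = 0;
        for (String w : headwords) {
            ord.put(w, next++); // ords are strictly increasing in term order
        }
        System.out.println(headwords.size() + " headwords, " + rows.length + " entries");
    }
}
```

Monotonically increasing outputs like this are a friendly case for FST output sharing, which is why the $fst.dat stays comparatively small.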
The $fst.dat is currently FST&lt;int&gt;, where the int is just an ord into $targetMap.dat,
which is really an int[] (it maps the output ord from the FST to the offsets of all
word entries for that surface form).
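The two-level indirection can be sketched roughly like this (plain Java with hypothetical names and layout, not the actual Kuromoji classes): the FST gives you an ord, and the target map turns that ord into the buffer offsets of every entry sharing that surface form.

```java
// Sketch of the two-level lookup: FST output ord -> targetMap -> buffer offsets.
// Class name, field names, and layout are illustrative, not the real format.
import java.util.Arrays;

public class TargetMapSketch {
    // For each surface-form ord, where its run of word entries begins;
    // entries for ord i are offsets[start[i] .. start[i+1]).
    private final int[] start;    // length = numSurfaceForms + 1
    private final int[] offsets;  // offsets into $buffer.dat, one per word entry

    public TargetMapSketch(int[] start, int[] offsets) {
        this.start = start;
        this.offsets = offsets;
    }

    /** Returns the buffer offsets of all word entries for this FST ord. */
    public int[] lookup(int ord) {
        return Arrays.copyOfRange(offsets, start[ord], start[ord + 1]);
    }

    public static void main(String[] args) {
        // 3 surface forms, 5 word entries total: ord 1 has two entries
        // (same surface form, different part of speech / reading).
        int[] start = {0, 1, 3, 5};
        int[] offsets = {0, 11, 22, 33, 44};
        TargetMapSketch map = new TargetMapSketch(start, offsets);
        System.out.println(Arrays.toString(map.lookup(1))); // [11, 22]
    }
}
```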
But the 'meat' describing the entries is in $buffer.dat: for each word this holds its cost,
part of speech, base form (stem), reading, pronunciation, etc. As you can see we are
down to about 11 bytes per lemma on average, but this 'metadata' is still the biggest piece;
that's what I was working on shrinking in this issue.
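To make the ~11 bytes/lemma figure concrete, here is a rough sketch of what a packed entry could look like (an assumed layout for illustration, not the actual Kuromoji format): a couple of fixed-width fields plus a length-prefixed variable part.

```java
// Illustrative sketch of one packed word entry in a metadata buffer
// (assumed layout, not the real $buffer.dat format).
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferSketch {
    /** Appends one entry; returns its starting offset in the buffer. */
    static int writeEntry(ByteBuffer buf, short cost, byte posId, String reading) {
        int offset = buf.position();
        buf.putShort(cost);                      // 2 bytes: word cost
        buf.put(posId);                          // 1 byte: index into a posDict
        byte[] r = reading.getBytes(StandardCharsets.UTF_8);
        buf.put((byte) r.length);                // 1 byte: reading length
        buf.put(r);                              // variable: the reading itself
        return offset;
    }

    static short readCost(ByteBuffer buf, int offset) {
        return buf.getShort(offset);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(1 << 10);
        int off = writeEntry(buf, (short) 3456, (byte) 7, "カナ");
        System.out.println(off + " " + readCost(buf, off)); // 0 3456
    }
}
```

With fixed fields this small, it's really the variable-length strings (readings, pronunciations, base forms) that dominate the average, which is why shrinking them pays off.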