Apache OpenOffice (AOO) Bugzilla – Issue 71449
hunspell: contains large utf_lst table
Last modified: 2013-02-24 20:42:33 UTC
hunspell for spellchecking contains a huge utf_lst table for uppercasing and lowercasing characters. It apparently covers all of Unicode, and each entry holds the Unicode code point and its matching upper/lower points. That's a pretty big damn table. We have ICU in OOo, and uchar.h provides u_tolower and u_toupper. Can we rejig hunspell to use those at runtime to determine the uppercase and lowercase of a Unicode character, and drop this table?
reassigning
Created attachment 40518 [details] how about this...
Would that patch fit your needs? It uses an ifdef: inside OOo it calls the ICU toupper/tolower, and a standalone hunspell still includes and uses the table.

before:
du ../../../unxlngi6.pro/lib/libhunspell.so
212 ../../../unxlngi6.pro/lib/libhunspell.so

after:
du ../../../unxlngi6.pro/lib/libhunspell.so
164 ../../../unxlngi6.pro/lib/libhunspell.so
Created attachment 40531 [details] actually, this instead I think, bubble the language down always
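For readers without the attachment, here is a minimal sketch of the ifdef approach described above. It is not the attached patch: the macro name HUNSPELL_USE_ICU, the function unicode_to_upper and the two-entry fallback table are all hypothetical stand-ins. Only u_toupper from ICU's unicode/uchar.h is a real API.

```cpp
#include <cassert>
#include <cstddef>

#ifdef HUNSPELL_USE_ICU             // hypothetical macro name, not taken from the patch
#include <unicode/uchar.h>          // ICU: UChar32 u_toupper(UChar32)
#else
// Standalone build: a deliberately tiny stand-in for the full casing table.
struct CaseEntry { unsigned short c, upper, lower; };
static const CaseEntry kCaseTable[] = {
    {0x0150, 0x0150, 0x0151},       // U+0150, Hungarian capital O with double acute
    {0x0151, 0x0150, 0x0151},       // U+0151, its lowercase form
};
static const CaseEntry* find_entry(unsigned short c) {
    for (std::size_t i = 0; i < sizeof(kCaseTable) / sizeof(kCaseTable[0]); ++i)
        if (kCaseTable[i].c == c) return &kCaseTable[i];
    return nullptr;
}
#endif

// Uppercase one BMP code point, via ICU when available, else via the table.
unsigned short unicode_to_upper(unsigned short c) {
#ifdef HUNSPELL_USE_ICU
    return static_cast<unsigned short>(u_toupper(static_cast<UChar32>(c)));
#else
    if (c >= 'a' && c <= 'z') return static_cast<unsigned short>(c - ('a' - 'A'));
    const CaseEntry* e = find_entry(c);
    return e ? e->upper : c;        // unknown characters are returned unchanged
#endif
}

// Small self-check that holds in both build configurations.
void casing_self_check() {
    assert(unicode_to_upper(0x0151) == 0x0150);
    assert(unicode_to_upper('a') == 'A');
}
```

Building with -DHUNSPELL_USE_ICU (and linking ICU) drops the table from the binary, and that removed table is the size saving discussed in this issue.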
Target: 2.2

Caolan: I'm very glad of your nice patch. I will put it into Hunspell 1.5 and make a CWS. Thank you very much! Laci
Hi Caolan, nice work :-) OTOH - the huge memory chew we see from loading the dictionaries is prolly more significant. For myspell we had a nice patch: i#50842# that mmapped the spelling dictionaries, and saved nearly 3Mb for an en-US locale. It mostly involved some changes to the various string routines to terminate on newline/special-character instead of '\0' - and well, we've never got around to porting it to hunspell sadly.
Unfortunately, I couldn't use Michael's patch for Hunspell. I plan build-time dictionary pre-compression for OpenOffice.org. For example, using the alias compression of the integrated Hunspell saves nearly 3/4 MB of RAM for en_US (~5.5 MB -> 4.8), 3 MB for hu_HU (17->14), and 9 MB for Arabic (18->9). Thomas: OOo doesn't share dictionaries when I run different OOo processes on my Linux machine. Do I need a network installation or some special parameter to share the dictionaries between the processes? I believe you mentioned dictionary sharing on Lingu-dev.
Hi there,

> I plan a build-time dictionary pre-compression for OpenOffice.org.
> For example, using alias compression of the integrated Hunspell,
> nearly 3/4 MB RAM saved for en_US (~5.5 MB -> 4.8), 3 MB for
> hu_HU (17->14), and 9 MB for Arabic (18->9).

So - the main memory win for us came, not from shrinking the size of the dictionary on disk, but from not duplicating all those strings into malloc'd memory [ which has a substantial malloc overhead per string ]. Also - of course for thin-clients, the mmapped memory is shared, where heap allocated memory cannot be, so we win yet more.
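A rough sketch of the idea, for illustration only: map the dictionary read-only and compare words directly against the mapped bytes, stopping at a newline (or at '/', the flag separator in .dic files) instead of at '\0', so no per-word copy is malloc'd. This is not the i#50842# patch. The names map_readonly, entry_equals and buffer_contains are hypothetical, and the linear scan stands in for a real lookup structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedFile { const char* data; std::size_t size; };

// Map a file read-only (POSIX). The pages are backed by the page cache, so
// several processes mapping the same dictionary can share them.
MappedFile map_readonly(const char* path) {
    MappedFile m = {nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) { m.data = static_cast<const char*>(p); m.size = len; }
    }
    close(fd);                      // the mapping stays valid after close
    return m;
}

// True if the entry starting at p equals word. The entry ends at '\n', '/'
// or the end of the buffer; word is an ordinary NUL-terminated string.
bool entry_equals(const char* p, const char* end, const char* word) {
    assert(p && end && word);
    while (p < end && *p != '\n' && *p != '/' && *word != '\0') {
        if (*p != *word) return false;
        ++p;
        ++word;
    }
    bool entry_done = (p == end || *p == '\n' || *p == '/');
    return entry_done && *word == '\0';
}

// Linear scan over newline-separated entries in [begin, end).
bool buffer_contains(const char* begin, const char* end, const char* word) {
    const char* p = begin;
    while (p < end) {
        if (entry_equals(p, end, word)) return true;
        const void* nl = std::memchr(p, '\n', static_cast<std::size_t>(end - p));
        if (!nl) break;
        p = static_cast<const char*>(nl) + 1;
    }
    return false;
}
```

With a mapped file m, a lookup would be buffer_contains(m.data, m.data + m.size, "word"). Nothing is copied to the heap, which is where the per-string malloc overhead mentioned above is saved.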
I attempted a similar memory-mapping approach for hunspell, as in the earlier myspell code, but unfortunately it was much uglier. I could check whether the attempt is still on disk somewhere, if people are interested.
I believe the most efficient and flexible method is to generate build-time memory footprints (in fact, special binary data files) from the OOo dictionaries and use them at run time via mmap, similar to Python byte-code compilation and usage (py->pyc).
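To make the py->pyc analogy concrete, here is a toy sketch of such a compile step. The binary layout (a count, an offset table, then sorted NUL-terminated words) and the names compile_words and blob_contains are invented for this illustration and are not a Hunspell format. The blob uses native byte order and is assumed to be trusted, so there is no bounds validation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// "Compile" step: sort the words and emit one relocatable blob:
// [count][offset 0]...[offset count-1][word 0 '\0'][word 1 '\0']...
std::string compile_words(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    const std::uint32_t count = static_cast<std::uint32_t>(words.size());
    std::string blob;
    auto put_u32 = [&blob](std::uint32_t v) {
        blob.append(reinterpret_cast<const char*>(&v), sizeof v);
    };
    put_u32(count);
    std::uint32_t offset = static_cast<std::uint32_t>(sizeof(std::uint32_t) * (count + 1));
    for (const std::string& w : words) {
        put_u32(offset);
        offset += static_cast<std::uint32_t>(w.size() + 1);
    }
    for (const std::string& w : words) blob.append(w.c_str(), w.size() + 1);
    return blob;
}

// Read a 32-bit value without assuming the pointer is aligned.
static std::uint32_t get_u32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Run-time step: binary search directly in the blob, which could just as
// well be an mmapped file. No strings are copied or allocated.
bool blob_contains(const char* blob, std::size_t size, const char* word) {
    assert(blob && word);
    if (size < sizeof(std::uint32_t)) return false;
    std::uint32_t lo = 0, hi = get_u32(blob);
    while (lo < hi) {
        std::uint32_t mid = lo + (hi - lo) / 2;
        const char* entry = blob + get_u32(blob + sizeof(std::uint32_t) * (mid + 1));
        int cmp = std::strcmp(entry, word);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return false;
}
```

The build would write the result of compile_words to a file once, and every process would then map that file and call blob_contains on it, much like loading a .pyc instead of re-parsing the .py source.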
any update on status?
Fixed. (I will put it in CWS hunspell2 today.)
Test: the size of libhunspell.so is ~133 kB instead of 180 kB (the Unicode casing table was removed), but spell checking still works with Unicode dictionaries and data.

(Attachment: Hungarian Unicode test data. Test environment: the Hungarian aff and dic files from OpenOffice.org CVS (dictionaries/hu_HU/hu_HU*), or a simple

====hu.aff====
SET UTF-8
==============

and

====hu.dic====
1
ős
==============

and add DICT hu HU hu to the dictionary.lst.)
Created attachment 43882 [details] Unicode test data (to check ős->Ős casing without Hunspell's conversion table)
SBA: Thanks for your help in advance, Laci.
I will reopen this issue after the Hunspell integration, because the Windows build doesn't work with this patch, so I have switched it off for Windows in CWS hunspell2. According to the OpenOffice.org Wiki (ICU), Windows needs special configuration (http://wiki.services.openoffice.org/wiki/ICU), but using ICU is not recommended. For future development, in the comments of CWS hunspell2 Thomas has suggested using the OOo internal Unicode functions:

> TL->Laci: The usual way to make uppercase/lowercase conversion or isAlpha test
> would be to make use of CharClass and SysLocale.
> See unotools/charclass.hxx and svtools/syslocale.hxx
> It is used like
> GetSysLocale().GetCharClass()....
> CharClass has all the functions you like, though usually for strings...
> ER also recommended to use those functions.
new target: 2.4
SBA: Verified in CWS hunspell2.
This issue was closed automatically and wasn't rechecked in a current version of OOo. The fix should have been integrated in OOo for more than half a year. If you think this issue isn't fixed in a current version (OOo 3.1), please reopen it and change the field 'Target Milestone' accordingly.
If you want to download a current version of OOo => http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues => http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues