71449 – hunspell: contains large utf_lst table

Status:	CLOSED FIXED

Alias:	None

Product:	General
Classification:	Code
Component:	spell checking
Version:	3.3.0 or older (OOo)
Hardware:	PC Linux, all

Importance:	P3 Trivial
Target Milestone:	---
Assignee:	stefan.baltzer
QA Contact:	issues@lingucomponent

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-11-11 12:30 UTC by caolanm
Modified:	2013-02-24 20:42 UTC
CC List:	5 users

See Also:
Issue Type:	PATCH
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
how about this... (17.31 KB, patch) 2006-11-11 15:11 UTC, caolanm	no flags
actually, this instead I think, bubble the language down always (17.35 KB, patch) 2006-11-12 14:54 UTC, caolanm	no flags
Unicode test data (to check ős->Ős casing without Hunspell's conversion table) (6.48 KB, application/vnd.sun.xml.writer) 2007-03-22 22:58 UTC, nemeth.lacko	no flags

Description caolanm 2006-11-11 12:30:20 UTC

hunspell for spellchecking contains a huge utf_lst table for uppercasing and
lowercasing characters. It apparently covers all of Unicode, and each entry
holds the Unicode code point plus the matching upper/lower points. That's a
pretty big damn table.

We have ICU in OOo, and its uchar.h provides u_tolower and u_toupper. Can we
rejig hunspell to use those at runtime to determine the uppercase and lowercase
of a Unicode character, and drop this table?

Comment 1 caolanm 2006-11-11 12:31:52 UTC

reassigning

Comment 2 caolanm 2006-11-11 15:11:21 UTC

Created attachment 40518 [details]
how about this...

Comment 3 caolanm 2006-11-11 15:13:42 UTC

Would that patch fit your needs? It uses an ifdef: inside OOo it calls ICU's
toupper/tolower, and a standalone hunspell still includes and uses the table.

before: du ../../../unxlngi6.pro/lib/libhunspell.so
212     ../../../unxlngi6.pro/lib/libhunspell.so

after:  du ../../../unxlngi6.pro/lib/libhunspell.so
164     ../../../unxlngi6.pro/lib/libhunspell.so
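
The ifdef approach described in this comment can be sketched roughly as follows. This is not the attached patch, only a minimal illustration under stated assumptions: the macro name HYPOTHETICAL_USE_ICU, the helper names upper_of/lower_of, and the two-entry stand-in table are all invented here. Only u_toupper and u_tolower from ICU's unicode/uchar.h are real API.

```cpp
// Hypothetical macro name; the real patch may use a different switch.
#ifdef HYPOTHETICAL_USE_ICU
#include <unicode/uchar.h>  // ICU: u_toupper / u_tolower

// Inside OOo: ask ICU at runtime, no table needed.
inline char32_t upper_of(char32_t c) {
    return static_cast<char32_t>(u_toupper(static_cast<UChar32>(c)));
}
inline char32_t lower_of(char32_t c) {
    return static_cast<char32_t>(u_tolower(static_cast<UChar32>(c)));
}
#else
// Standalone build: keep a lookup table. This two-entry table is only a
// stand-in for the full utf_lst table (code point, upper, lower).
struct CaseEntry { char32_t c, upper, lower; };
static const CaseEntry kCaseTable[] = {
    {0x0150, 0x0150, 0x0151},  // LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
    {0x0151, 0x0150, 0x0151},  // LATIN SMALL LETTER O WITH DOUBLE ACUTE
};

inline char32_t upper_of(char32_t c) {
    if (c >= U'a' && c <= U'z') return c - (U'a' - U'A');  // ASCII fast path
    for (const CaseEntry& e : kCaseTable)
        if (e.c == c) return e.upper;
    return c;  // no mapping known: leave unchanged
}
inline char32_t lower_of(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + (U'a' - U'A');  // ASCII fast path
    for (const CaseEntry& e : kCaseTable)
        if (e.c == c) return e.lower;
    return c;  // no mapping known: leave unchanged
}
#endif
```

Callers only see upper_of and lower_of, so the table can be dropped from the OOo build without touching the rest of the code.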

Comment 4 caolanm 2006-11-12 14:54:45 UTC

Created attachment 40531 [details]
actually, this instead I think, bubble the language down always

Comment 5 nemeth.lacko 2006-11-13 10:30:51 UTC

Target: 2.2

Caolan: I'm very glad of your nice patch. I will put it into Hunspell 1.5 and
make a CWS. Thank you very much! Laci

Comment 6 mmeeks 2006-11-13 13:53:04 UTC

Hi Caolan, nice work :-)

OTOH - the huge memory chew we see from loading the dictionaries is prolly more
significant.

For myspell we had a nice patch: i#50842# that mmapped the spelling
dictionaries, and saved nearly 3Mb for an en-US locale.

It mostly involved some changes to the various string routines to terminate on
newline/special-character instead of '\0' - and well, we've never got around to
porting it to hunspell sadly.
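
To illustrate what "terminate on newline/special-character instead of '\0'" could mean, here is a rough sketch. It is not the code from i#50842; the function names are hypothetical, and it assumes the usual .dic layout of one entry per line with optional flags after a '/'.

```cpp
#include <cstddef>
#include <string_view>

// Length of the entry starting at p. The entry ends at a newline, at '/'
// (start of the flags), or at the end of the buffer, not at '\0'.
inline std::size_t entry_length(const char* p, const char* end) {
    const char* q = p;
    while (q != end && *q != '\n' && *q != '/') ++q;
    return static_cast<std::size_t>(q - p);
}

// Linear scan over a raw dictionary buffer without copying any entry
// into its own NUL-terminated string.
inline bool contains_word(std::string_view buffer, std::string_view word) {
    const char* p = buffer.data();
    const char* end = p + buffer.size();
    while (p != end) {
        std::string_view entry(p, entry_length(p, end));
        if (entry == word) return true;
        while (p != end && *p != '\n') ++p;  // skip flags and rest of line
        if (p != end) ++p;                   // step over the newline
    }
    return false;
}
```

Because entries are compared in place, the buffer can be a read-only view of the file contents, and no per-word allocation is needed.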

Comment 7 nemeth.lacko 2006-11-14 01:45:30 UTC

Unfortunately, I couldn't apply Michael's patch to Hunspell.

I plan a build-time dictionary pre-compression for OpenOffice.org.
For example, using alias compression of the integrated Hunspell, nearly 3/4 MB
RAM saved for en_US (~5.5 MB -> 4.8), 3 MB for hu_HU (17->14), and 9 MB for
Arabic (18->9).

Thomas: OOo doesn't share dictionaries when I run different OOo processes on
my Linux machine. Do I need a network installation or some special parameter
to share the dictionaries between the processes? I believe you mentioned
dictionary sharing on Lingu-dev.

Comment 8 mmeeks 2006-11-14 09:36:34 UTC

Hi there,

> I plan a build-time dictionary pre-compression for OpenOffice.org.
> For example, using alias compression of the integrated Hunspell,
> nearly 3/4 MB RAM saved for en_US (~5.5 MB -> 4.8), 3 MB for
> hu_HU (17->14), and 9 MB for Arabic (18->9).

So - the main memory win for us came, not from shrinking the size of the
dictionary on disk, but from not duplicating all those strings into malloc'd
memory [ which has a substantial malloc overhead per string ].

Also - of course for thin-clients, the mmapped memory is shared, where heap
allocated memory cannot be, so we win yet more.

Comment 9 tml 2006-11-14 09:53:27 UTC

I worked on the attempt to use a similar memory-mapping approach for hunspell,
as for the earlier code, but unfortunately it was much uglier. I could check if
I can find the attempt still on disk somewhere, if people are interested.

Comment 10 nemeth.lacko 2006-11-14 10:25:17 UTC

I believe the most efficient and flexible method is to generate build-time
memory footprints (in fact, special binary data files) from OOo dictionaries and
use them at run time via mmap, similar to Python byte code compilation and usage
(py->pyc).

Comment 11 pavel 2007-01-15 19:24:09 UTC

any update on status?

Comment 12 nemeth.lacko 2007-01-18 09:15:20 UTC

Fixed. (I will put it in CWS hunspell2 today.)

Comment 13 nemeth.lacko 2007-03-22 22:56:01 UTC

Test: size of libhunspell.so is ~133 kB instead of 180 kB (removed Unicode casing
table), and spell checking still works with Unicode dictionaries and data.

(Attachment: Hungarian Unicode test data
Test environment: Hungarian aff and dic file from OpenOffice.org CVS
(dictionaries/hu_HU/hu_HU*) or a simple
====hu.aff====
SET UTF-8
==============

and

====hu.dic====
1
ős
==============

and add

DICT hu HU hu

to the dictionary.lst.)

Comment 14 nemeth.lacko 2007-03-22 22:58:35 UTC

Created attachment 43882 [details]
Unicode test data (to check ős->Ős casing without Hunspell's conversion table)

Comment 15 nemeth.lacko 2007-03-22 22:59:47 UTC

SBA: Thanks for your help in advance, Laci.

Comment 16 nemeth.lacko 2007-08-02 16:51:06 UTC

I will reopen this issue after the Hunspell integration, because the Windows
build doesn't work with this patch, so I have switched it off for Windows in CWS
hunspell2. According to the OpenOffice.org Wiki
(http://wiki.services.openoffice.org/wiki/ICU), Windows needs special
configuration for ICU, but using ICU is not recommended.

For future development, Thomas suggested in the comments of CWS hunspell2
using OOo's internal Unicode functions:

> TL->Laci: The usual way to make uppercase/lowercase conversion or isAlpha test
> would be to make use of CharClass and SysLocale.
> See unotools/charclass.hxx and svtools/syslocale.hxx
> It is used like 
>   GetSysLocale().GetCharClass()....
> CharClass has all the functions you like, though usually for strings...
> ER also recommended to use those functions.

Comment 17 nemeth.lacko 2007-08-06 15:00:37 UTC

new target: 2.4

Comment 18 stefan.baltzer 2007-12-11 15:25:32 UTC

SBA: Verified in CWS hunspell2.

Comment 19 thorsten.ziehm 2009-07-20 14:52:21 UTC

This issue was closed automatically and was not rechecked in a current version
of OOo. The fix should have been integrated into OOo more than half a year ago.
If you think this issue isn't fixed in a current version (OOo 3.1), please
reopen it and change the field 'Target Milestone' accordingly.

If you want to download a current version of OOo =>
http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues =>
http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues