60698 – user-dict format re-work ...

Summary: user-dict format re-work ...

Status:	CLOSED FIXED

Alias:	None

Product:	Writer
Classification:	Application
Component:	code
Version:	680m148
Hardware:	All All

Importance:	P3 Trivial
Target Milestone:	---
Assignee:	stefan.baltzer
QA Contact:	issues@sw

URL:
Keywords:

Depends on:
Blocks:	106032

Reported:	2006-01-17 12:35 UTC by mmeeks
Modified:	2013-08-07 14:40 UTC
CC List:	4 users

See Also:
Issue Type:	PATCH
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
patch (15.46 KB, patch) 2006-01-17 13:22 UTC, mmeeks
Archive of small sample dictionaries using different user-dictionary file formats. (903 bytes, application/octet-stream) 2006-03-08 09:41 UTC, thomas.lange

Description mmeeks 2006-01-17 12:35:01 UTC

So - the user-dictionary format looks (to me) like a text file written in a
random binary file format for no good reason :-)

So - wanting to be able to manipulate these files much more easily -
I re-wrote the read/write pass to use a simple but extensible tagged file
header, followed by the same string-list body as now, i.e. my dicts look like this:

OOoDICT1
lang: <none>
type: positive
---
furtivelyhutch
sjkdhkjshfddksj

Instead of this:

michael@linux:/opt/OOInstall/share/wordbook/en-US> hexdump -C oldsun.dic 
00000000  06 00 57 42 53 57 47 32  ff 00 00 07 00 41 64 61  |..WBSWG2.....Ada|
00000010  62 61 73 3d 0b 00 43 6f  6d 70 61 63 74 50 43 49  |bas=..CompactPCI|
00000020  3d 08 00 48 6f 74 4a 61  76 61 3d 0a 00 4a 61 76  |=..HotJava=..Jav|
00000030  61 42 65 61 6e 73 3d 08  00 4a 61 76 61 43 68 69  |aBeans=..JavaChi|
00000040  70 07 00 4a 61 76 61 4f  53 3d 06 00 4a 61 76 61  |p..JavaOS=..Java|
00000050  53 4d 0b 00 4a 61 76 61  53 74 61 74 69 6f 6e 05  |SM..JavaStation.|
...

Which - incidentally - manages to share a 'magic' file number with:

$ file oldsun.dic 
oldsun.dic: DBase 3 index file

Presumably that's a mistake ? or is this really a DBase 3 index file ? :-) cf.

$ file ~/.ooo-2.0/user/wordbook/standard.dic 
/home/michael/.ooo-2.0/user/wordbook/standard.dic: ASCII text

Anyhow - I hope you'll concur that this is a pleasant improvement. Of course,
the code is still backwards compatible etc.
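
For illustration only (this is not part of the attached patch): the hexdump above suggests that the old format stores every string as a little-endian 16-bit length followed by that many bytes, with a 16-bit language id and a one-byte flag after the magic. A minimal Python sketch under that assumption follows; the function name `read_legacy_dict`, the meaning of the flag byte (negative dictionary) and the Latin-1 decoding are assumptions, not something taken from the OOo sources.

```python
import struct


def read_legacy_dict(data: bytes):
    """Parse a WBSWG-style user dictionary from raw bytes.

    Assumed layout (inferred from the hexdump, not from the OOo code):
    length-prefixed magic, 16-bit language id, one flag byte, then a
    sequence of length-prefixed entries until end of file.
    """
    pos = 0

    def read_string() -> bytes:
        nonlocal pos
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        raw = data[pos:pos + length]
        if len(raw) != length:
            raise ValueError("truncated entry")
        pos += length
        return raw

    magic = read_string().decode("ascii")
    lang, flag = struct.unpack_from("<HB", data, pos)
    pos += 3

    words = []
    while pos < len(data):
        # Entries are returned verbatim, including any '=' characters.
        words.append(read_string().decode("latin-1"))
    return magic, lang, bool(flag), words
```

Fed the first three complete entries of the dump, this sketch yields the magic `WBSWG2`, language id 0x00ff and the entries `Adabas=`, `CompactPCI=` and `HotJava=`.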

Comment 1 michael.ruess 2006-01-17 13:19:39 UTC

Reassigned to SBA.

Comment 2 mmeeks 2006-01-17 13:22:49 UTC

Created attachment 33302 [details]
patch

Comment 3 thomas.lange 2006-02-02 13:19:44 UTC

TL: taking ownership.

Comment 4 thomas.lange 2006-02-02 13:22:44 UTC

Adding Lacy (owner of hunspell, OOo 2.0.2 spellchecking engine) to CC.

Comment 5 thomas.lange 2006-02-02 14:15:15 UTC

Adding maccy as well to CC list.

TL->mmeeks: Thanks for providing a patch!

But I need to add some thoughts here:
- The magic header lines removed in the patch are to distinguish between
  different file format versions used for user-dictionaries since StarWriter 2.0.
  Even though it is quite unlikely to still have dictionaries this old, there were
  changes to the file format at least two times later on and we should still be
  able to read those old formats.
- Also, applying a modified version of the patch right now would mean that we
  need to take care of another new file format for those files.
  Thus it seems sth like this should better be done for a major release like
  OOo 3.0 when proper migration tools would be required anyway.
  (Migration tools are not available in minor releases.)


TL->mmeeks, Lacy, maccy:

And if it is all about having the user-dictionaries more easily editable I would
suggest a completely different way to go that would have other benefits as well:

If a new file format is to be used we should use the very same file format
hunspell uses for the spellcheck dictionaries!
There are obviously the following advantages:
- The hunspell dictionary format is easily readable and editable
- If user-dictionaries use the same format it will be easy to send those
  to the maintainer of the main dictionary to incorporate them. 
  A bonus that would be very handy!

Of course because of the UI that allows editing the user-dictionaries we would
need some API extension that allows modifying the files.
That is, we need a component (or extension to hunspell) that implements the
XDictionary and maybe XDictionaryList interface for OOo spellchecker dictionaries.

This probably requires to beef up the OOo dictionary format a bit because AFAIK
there is currently no dictionary for "Language ALL" and no support for negative
dictionaries (i.e. a list of words that should be reported as incorrect for a
defined language, optionally with one or more suggestions).
Also it would be required to allow the user to switch on/off usage of those
dictionaries at runtime (but not for the main dictionaries that get
pre-installed or downloaded) as is currently possible in the UI for
user-dictionaries.

AFAIR Kevin Hendricks once mentioned that he wanted to allow "add-ons" to the
main dictionary e.g. en_US_medical, en_US_financial, ... as positive
dictionaries and sth like en_US_bad_words as negative dictionary.
Thus the finally accepted words would consist of the words of the main
dictionary plus those from en_US_medical and en_US_financial, minus all those
from en_US_bad_words (i.e. flagging them as incorrect).
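
The combination rule described here is plain set arithmetic. A minimal Python sketch, with the hypothetical helper name `effective_words` and made-up word lists (nothing here reflects an existing hunspell or OOo API):

```python
def effective_words(main, positive_addons, negative_addons):
    """Combine a main dictionary with add-on dictionaries.

    Words from positive add-ons are accepted in addition to the main
    dictionary; words from negative add-ons are removed afterwards, so a
    negative entry wins over any positive one.
    """
    accepted = set(main)
    for addon in positive_addons:
        accepted |= set(addon)
    for addon in negative_addons:
        accepted -= set(addon)
    return accepted


# Hypothetical sample data standing in for en_US, en_US_medical, ...
en_US = {"house", "darn", "tree"}
en_US_medical = {"stent", "angioplasty"}
en_US_financial = {"amortization"}
en_US_bad_words = {"darn"}

words = effective_words(en_US, [en_US_medical, en_US_financial], [en_US_bad_words])
```

Applying the negative dictionaries last is a design choice of this sketch; the comment above does not say which list should win when a word appears in both.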

If that kind of feature were available and could be configured at
runtime by the user we could simply replace the current format of
user-dictionaries with the one that OOo dictionaries are using.

But I don't know how far Kevin got with this, not even if he had started with this.

TL->all: What do you think about the above?
Would it be possible? And is it something we like to go for?


Maybe the above idea could be extended even a little further by defining a
main dictionary for all languages and a "changes to be applied" dictionary for
the country variants.
For example:
We may have a "de" dic and aff file and de_AT, de_CH and de_DE dic and aff files
that define the changes (entries to be added/removed) to the main dictionary.
If the main dictionary is to hold only entries common to all variants then there
would also be no need for "entries to be removed" in the variant dictionaries.
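
To illustrate the last point: if the base dictionary is defined as the intersection of all country variants, each variant only needs a list of additions. A minimal Python sketch, where the helper name `split_common_base` and the tiny word lists are purely illustrative:

```python
def split_common_base(variants):
    """Split variant word lists into a common base plus per-variant additions.

    `variants` maps a locale name to its full word list and must contain
    at least one entry. The base holds only words common to all variants,
    so no variant ever needs "entries to be removed".
    """
    word_sets = {name: set(words) for name, words in variants.items()}
    base = set.intersection(*word_sets.values())
    additions = {name: words - base for name, words in word_sets.items()}
    return base, additions


variants = {
    "de_DE": ["Haus", "Straße", "Januar"],
    "de_AT": ["Haus", "Straße", "Jänner"],
    "de_CH": ["Haus", "Strasse", "Januar"],
}
base, additions = split_common_base(variants)
```

Each variant can then be rebuilt as `base | additions[name]`.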


Note: One other thing that just came to my mind is that the current format for
user-dictionaries allows specifying hyphenation points (no alternative spellings
though). Could we have this working somehow as well?

Comment 6 nemeth.lacko 2006-02-02 14:31:51 UTC

nemeth->mmeeks, tl: Hi, I'm very fond of this patch. I would like to improve it
with the following morphological extension:

OOoDICT1
lang: <none>
type: positive
---
furtivelyhutch
sjkdhkjshfddksj/word

Where "word" is a dictionary item in the "lang" spell checking (and
morphological) dictionary. For example:

Tesco/John
xexexe/change

means, that Tesco can combine with affixes of John: Tesco's and
xexexe can be suffixed, as change: xexexes, xexexed etc.

Or even better, we can specify the right homonym with Hunspell's morphological item:

xexexe/change VERB
yeyeye/change NOUN

I have made an issue for this:

http://lingucomponent.openoffice.org/issues/show_bug.cgi?id=61525

Thomas, Michael? What do you think, how can we solve the user dictionary and
spell checker integration?

For example: the spell checker loads the relevant user dictionaries, and we
need a new function in the spell checker API: reload_user_dictionary(),
which is called when the user adds new words to the user dictionary.

Comment 7 mmeeks 2006-02-02 14:47:39 UTC

Hi tl:

> - The magic header lines removed in the patch are to distinguish between

   So - if you read the patch carefully, you'll notice that this shouldn't
affect backwards compatibility at all. Yes I move some code around, so a quick
glance looks like it breaks stuff - but it does not. ie. this is an incremental
change :-)

> - Also, applying a modified version of the patch right now would mean
> that we need to take care of another new file format for those files.

   Take care ? as in maintain ? sure - of course the only really controversial
bit in here is the format used for re-writing the files in :-) ideally we'd use
the same format we loaded them in rather than silently upgrading them; that
would prolly be a better move.

>  Thus it seems sth like this should better be done for a major release like
>  OOo 3.0 when proper migration tools would be required anyway.
>  (Migration tools are not available in minor releases)

   ie. no migration tools necessary - this just adds support for a cleaner format.

> - The hunspell dictionary format is easily readable and editable
> - If user-dictionaries use the same format it will be easy to send those
>  to the maintainer of the main dictionary to incorporate them. 
>  A bonus that would be very handy!

Sounds sensible to me. Of course - getting this data into hunspell where it can
be interpreted sensibly is more difficult.
Also - I'd quite like to see this 1st cut go up-stream.

What I suggest is we leave enough syntactic room to compatibly add this stuff
later; ie. break on '/' and ignore after that etc. ?
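
A reader that leaves this syntactic room can be very small. The Python sketch below (the function name `entry_word` is hypothetical) keeps only the part of an entry before the first '/', so files carrying the proposed "/word" extension would still load in a reader that knows nothing about it:

```python
def entry_word(line: str) -> str:
    """Return the plain word of a dictionary entry.

    Everything from the first '/' onwards is ignored, which leaves room
    for later extensions such as "Tesco/John" or "xexexe/change VERB".
    """
    word, _sep, _extension = line.partition("/")
    return word.strip()
```

With this, `entry_word("Tesco/John")` gives `Tesco`, and an entry without a '/' is returned unchanged.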

> This probably requires to beef up the OOo dictionary format a bit because

And a huge amount of work which I personally am not that interested in. This was
a quick hack to make an ugly file format less ugly quickly, while not really
changing what it does in some structural way - so as to let us manage user-dicts
sensibly.

> But I don't know how far Kevin got with this, not even if he had
> started with this.

Kevin seems inactive these days.

> TL->all: What do you think about the above?
> Would it be possible? And is it something we like to go for?

My desire is to shrink my outstanding patch set; so can we not conflate some
nice feature / wish-list stuff with the simple format re-work :-) of course, now
people can see the format no doubt they'll want that too but ...

Either way - thanks for the summary of potential places to hack here :-)

Comment 8 Mathias_Bauer 2006-02-06 11:03:30 UTC

Two questions concerning compatibility:

(1) User has "old" dictionary and loads it with the patched OOo version, makes
some changes - in which format (old or new) will the dictionary be written?

(2) User has a dictionary in "new" format (however he got it), but accesses the
profile with an older, unpatched version of OOo (because e.g. he has two
versions of OOo2 installed, but uses the same profile for both). What will happen?

I think that file format changes should be done in major releases, where user
data migration is possible in the installation/configuration step. Of course if
a change in a file format fixes a major user problem we should do it even in a
micro or minor release, but I fail to see a major user problem in this case.
Yes, the file format is not useful for direct access by a text editor - but
please explain why you think that it is necessary to be able to edit the user
dictionary with an external program. That's still unclear to us. 
There are a lot of other options to change the dictionary: use the API (e.g. by
a Basic Macro) or change the myspell/hunspell dictionary.

Comment 9 Mathias_Bauer 2006-02-06 11:09:06 UTC

adding me to cc

Comment 10 mmeeks 2006-02-06 11:22:52 UTC

Hi Mathias:

> (1) User has "old" dictionary and loads it with the patched OOo version, makes
> some changes - in which format (old or new) will the dictionary be written?

   Trivial to ensure it's written in the same format - as I say; the only really
controversial piece here is that it's silently 'upgraded' to the new version:
trivial to fix.

> (2) User has a dictionary in "new" format (however he got it), but accesses
> the profile with an older, unpatched version of OOo (because e.g. he has
> two versions of OOo2 installed, but uses the same profile for both). What
> will happen?

   Magic numbers will not match & he'll silently lose those user-dict words.

   However - argument 2) is the same argument that is always used against almost
*any* incremental change. Sacrificing 100% forward compatibility at all times is
part and parcel of incremental feature addition; the same argument can be
deployed against adding any feature that serializes any state. This of course
has really minimal impact since a) we can make it not the default until the next
'major' release; and b) it doesn't need to 'break' existing dicts either.

> I fail to see a major user problem in this case. Yes, the file format is not 
> useful for direct access by a text editor - but please explain why you think 
> that it is necessary to be able to edit the user dictionary with an external 
> program. That's still unclear to us. 

Lock-down. I want a small perl-script to be able to edit this file; without
having to resort to strange and nasty binary bit bashing. Just such a perl
script can be found here (not perfect yet but ;-):
     http://go-oo.org/ooo-build/bin/ootool.in

> There are a lot of other options to change the dictionary: use the API
> (e.g. by a Basic Macro) or change the myspell/hunspell dictionary. 

A basic macro cannot be run by the super-user at package install time, or (if
you want any degree of security) at all by the super-user :-) wrt. hacking up
the system myspell/hunspell dictionary - sure; that could work - wrt. appending
strings to the global system dictionary; it's possibly even a better solution -
but the problem is that these are not managed as config files & when you upgrade
the underlying dictionary your changes will be gone.

Perhaps there is a better way of doing this with myspell dicts ? didn't look
into it; ideas appreciated.

Either way - the change is clearly an improvement over the existing twisted
binary file format :-) allows some degree of extensibility [ yes is missing a
spec. - and this is one of the rare cases where IMHO writing a spec. is
worthwhile ], and yet you don't want it :-) [ and indeed, AFAICS if it doesn't
get in in its current state - I'm optimistic that it will never do so - it will
sit here in the issue, ignored for months, slowly gathering other requests for
enhancement & desires to extend the capabilities beyond what is
necessary/interesting for me ;-) to the point that no-one will either want to,
nor be able to implement all the suggestions & it will die a death ;-] [ I look
forward to being proved wrong on this one of course. ].

I'd love to see the feature (which incidentally re-factors some nice cut/paste
code out from IsVers20OrNewer) get in - disabled by default (fair enough) - but
at least having the capability to read 'new-format' files, and get deployed.
Then perhaps after a few minor releases your fears about co-existence with older
versions will be moot anyway [ pragmatically they will ~all support the format ].

Comment 11 thomas.lange 2006-03-02 09:56:25 UTC

Setting target to OOo 2.0.3.

Comment 12 thomas.lange 2006-03-08 09:37:58 UTC

Fixed in CWS tl18.

Files changed:
- linguistic/source/dicimp.cxx
- linguistic/source/dicimp.hxx
- linguistic/source/dlistimp.cxx

I have made three minor changes compared to the patch though:
1) I fixed the 'getTag' function
2) I replaced the magic string 'OOoDict1' with 'OOoUserDict1' in the hope of making
   it clear to any user who tries to modify the file that it is a user-dictionary
   and not one of the OOo spellchecker's dictionaries (which may reside in the same
   directory and have the same extension since downloaded dictionaries nowadays
   unfortunately get placed in the directory of the user-dictionaries :-(  ).
3) The default file format version used for new user-dictionaries created via
   the UI is the one OOo currently uses. This is because if a dictionary created
   with OOo 2.0.3 (where this patch should be included) gets copied to e.g. an
   OOo 2.0.0 installation we would like the dictionary to be functional with
   OOo 2.0.0 as well.
   If we were to create new dictionaries with the new tagged file format this
   wouldn't be possible.
   (Even though I think this scenario is unlikely we shouldn't create
   incompatibilities within minor releases.)

Thus from OOo 2.0.3 user-dictionaries can also make use of the more user
editable tagged file format and everything else remains as it was.
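
For anyone writing an external tool against the tagged format, a minimal Python sketch of a reader follows. It assumes the layout shown in the description (magic line, "key: value" header lines, a "---" separator, then one word per line) with the magic string 'OOoUserDict1' from change 2); the function name `parse_tagged_dict` and the handling of unknown or malformed header lines are assumptions, not a description of what dicimp.cxx does.

```python
def parse_tagged_dict(text: str):
    """Parse the tagged user-dictionary format into (tags, words)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "OOoUserDict1":
        raise ValueError("not a tagged user dictionary")

    tags = {}
    body_start = len(lines)  # no separator found means an empty body
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body_start = index + 1
            break
        key, sep, value = line.partition(":")
        if sep:  # header lines without a ':' are skipped
            tags[key.strip()] = value.strip()

    words = [word for word in lines[body_start:] if word.strip()]
    return tags, words


sample = """OOoUserDict1
lang: <none>
type: positive
---
furtivelyhutch
sjkdhkjshfddksj
"""
tags, words = parse_tagged_dict(sample)
```

Keeping unknown tags in the returned dictionary rather than rejecting them is what lets the header stay extensible.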

Comment 13 thomas.lange 2006-03-08 09:41:23 UTC

Created attachment 34666 [details]
Archive of small sample dictionaries using different user-dictionary file formats.

Comment 14 mmeeks 2006-03-08 09:49:51 UTC

Hi tl ! :-)

   Thanks so much for getting this committed.

> 2) I replaced the magic string 'OOoDict1' with 'OOoUserDict1' 

    Sounds great - of course, then it's necessary to read more bytes from the
beginning of the file to sniff it but that's no problem.

> 3) The default file format version used for new user-dictionaries created via
> the UI is the one OOo currently uses.

Interesting - the cws still has the // save new dictionaries with in 7.0 Format
fragment with: nDicVersion = 7; surely you mean that to stay as 6 ? [ also we
should s/with in/in/ in the comment ;-]

Anyhow - my only worry is that we make sure that if a file is in ver 7 (plain
text) format - that we leave it in that version/format unmodified - so my dict.
manipulation tools carry on working :-)

Anyhow - thanks again - much appreciated etc.

Comment 15 thomas.lange 2006-03-08 10:08:31 UTC

About the sample dictionaries in the archive attached above:

a40neg.dic  is a negative dictionary created with SO 4.x
a40pos.dic  is a positive dictionary created with SO 4.x
a52neg.dic  is a negative dictionary created with SO 5.2
a52pos.dic  is a positive dictionary created with SO 5.2

a7neg.dic  is a negative dictionary using the new tagged file format
a7pos.dic  is a positive dictionary using the new tagged file format

I forgot to add dictionaries in the current format (used since SO 6 and similar
to SO 5.2 but using UTF-8) but those can easily be created using a current
office version.


TL->QA: What needs to be checked is that all of the above dictionaries can be
properly "read->modified->written->read again" with this CWS. And that without
changing their file format version.

The file format of a dictionary can be determined by opening the file in a text
editor and looking for the 'magic string'. The possible values are:
- "WBSWG2"
  Allowing for positive and negative dictionaries. Negative dictionaries without
  suggestion text though. Used in SO 4 for positive and negative dictionaries.
- "WBSWG5"
  Similar to the above but allowing for a suggestion text in negative 
  dictionaries. Used in SO 5.2 but only for negative dictionaries, the positive
  ones still use "WBSWG2" as magic string.
- "WBSWG6"
  Depicts the switch to UTF-8 encoding (the default currently in use for 
  positive and negative dictionaries).
- "OOoUserDict1"
  The new (optional) tagged file format.
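
The same check can be scripted. The Python sketch below returns the magic string of a dictionary given its raw bytes; it assumes that the "WBSWG" formats store the magic behind a little-endian 16-bit length, as the hexdump in the description suggests, and that the tagged format starts directly with its magic as text. The name `sniff_format` and the "unknown" fallback are illustrative.

```python
import struct

LEGACY_MAGICS = {"WBSWG2", "WBSWG5", "WBSWG6"}
TAGGED_MAGIC = b"OOoUserDict1"


def sniff_format(data: bytes) -> str:
    """Return the magic string found at the start of `data`, or "unknown"."""
    if data.startswith(TAGGED_MAGIC):
        return TAGGED_MAGIC.decode("ascii")
    if len(data) >= 2:
        # Assumed legacy layout: 16-bit length, then the magic itself.
        (length,) = struct.unpack_from("<H", data, 0)
        magic = data[2:2 + length].decode("ascii", errors="replace")
        if magic in LEGACY_MAGICS:
            return magic
    return "unknown"
```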

When modifying a dictionary it should always be saved in the same file format
in which it was found. There is one exception though:
negative dictionaries in "WBSWG2" file format get saved in "WBSWG5" file format
since the older one does not allow for suggestion text and it would be odd to
now have a UI that allows specifying suggestions but does not save them.
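
Expressed as code, the rule QA should see is a one-line exception on top of "keep what you found". This is a Python sketch of the rule as stated in this comment, not of the actual implementation; the name `save_magic` is hypothetical.

```python
def save_magic(loaded_magic: str, negative: bool) -> str:
    """Return the magic string a modified dictionary should be saved with."""
    # Only exception: old negative dictionaries are upgraded so that
    # suggestion texts entered in the UI can be stored.
    if loaded_magic == "WBSWG2" and negative:
        return "WBSWG5"
    return loaded_magic
```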

Comment 16 thomas.lange 2006-03-08 10:17:41 UTC

>> 3) The default file format version used for new user-dictionaries created via
>> the UI is the one OOo currently uses.
>
>Interesting - the cws still has the // save new dictionaries with in 7.0 Format

Well, that is because I wrote the comments in the issue before I committed
the changes (which I did right now).
If you check again now the new versions of the files should be available. ^_~

>Anyhow - my only worry is that we make sure that if a file is in ver 7 (plain
>text) format - that we leave it in that version/format unmodified - so my dict.
>manipulation tools carry on working :-)

Now that I've committed the changes you can easily check this yourself by
building the DLL. Or you simply wait until the CWS is handed over to QA.

Comment 17 thomas.lange 2006-03-08 10:21:17 UTC

TL->mmeeks: Too much praise since it is your patch to begin with. And compared
to the patches I've seen until now this also was the longest. ^_^

Comment 18 thomas.lange 2006-04-26 11:41:47 UTC

.

re-open issue and reassign to sba@openoffice.org

Comment 19 thomas.lange 2006-04-26 11:41:52 UTC

reassign to sba@openoffice.org

Comment 20 thomas.lange 2006-04-26 11:41:56 UTC

reset resolution to FIXED

Comment 21 stefan.baltzer 2006-05-02 13:03:54 UTC

SBA: Verified in CWS tl18.

Comment 22 kendy 2006-05-18 12:06:44 UTC

In m168, closing.

Comment 23 imd 2009-10-13 05:03:39 UTC

Any plans to make this default?  No one did for the 3.0 release, but it
shouldn't break anything at this point to do so.

I'd like to suggest adding an option for OOo to use Hunspell's dictionary
format.  This way, OOo, Firefox, Emacs, and any other app that can use Hunspell
can use the same dictionary (via links or symlinks).  One way I can imagine
doing this is another file format:

    OOoUserDICT2
    lang: <none>
    hunspelldic: foo.dic

There would be no type key because Hunspell dictionaries can include both
positive and negative entries.

Comment 24 thomas.lange 2009-10-13 08:23:32 UTC

tl->imd: please don't make suggestion in an already fixed and closed issue.
If you want something changed or something new implemented please file a new issue
or feature request.

About Hunspell dictionaries: that would be nice to have, and I have thought the
same for a long time.
But at this point we cannot use Hunspell dictionaries right away because
current OOo dictionaries have two features that are unsupported by Hunspell
dictionaries. These are:
- the possibility of setting hyphenation points in dictionary entries
- the concept of negative/exception dictionaries
To make use of hunspell dictionaries as implementation for user-dictionaries
either hunspell needs to provide support for that (at least exception
dictionaries would be nice) or we need new separate dictionary formats just for
those two tasks...