Apache OpenOffice (AOO) Bugzilla – Issue 60698
user-dict format re-work ...
Last modified: 2013-08-07 14:40:21 UTC
So - the user-dictionary format looks (to me) like a text file written in a random binary file format for no good reason :-) So - wanting to be able to manipulate these files much more easily - I re-wrote the read/write pass to use a simple but extensible tagged file header + string list body, ie. my dicts now look like this:

    OOoDICT1
    lang: <none>
    type: positive
    ---
    furtivelyhutch
    sjkdhkjshfddksj

Instead of this:

    michael@linux:/opt/OOInstall/share/wordbook/en-US> hexdump -C oldsun.dic
    00000000  06 00 57 42 53 57 47 32  ff 00 00 07 00 41 64 61  |..WBSWG2.....Ada|
    00000010  62 61 73 3d 0b 00 43 6f  6d 70 61 63 74 50 43 49  |bas=..CompactPCI|
    00000020  3d 08 00 48 6f 74 4a 61  76 61 3d 0a 00 4a 61 76  |=..HotJava=..Jav|
    00000030  61 42 65 61 6e 73 3d 08  00 4a 61 76 61 43 68 69  |aBeans=..JavaChi|
    00000040  70 07 00 4a 61 76 61 4f  53 3d 06 00 4a 61 76 61  |p..JavaOS=..Java|
    00000050  53 4d 0b 00 4a 61 76 61  53 74 61 74 69 6f 6e 05  |SM..JavaStation.|
    ...

Which - incidentally - manages to share a 'magic' file number with:

    $ file oldsun.dic
    oldsun.dic: DBase 3 index file

Presumably that's a mistake ? or is this really a DBase 3 index file ? :-) cf.

    $ file ~/.ooo-2.0/user/wordbook/standard.dic
    /home/michael/.ooo-2.0/user/wordbook/standard.dic: ASCII text

Anyhow - I hope you'll concur that this is a pleasant improvement. Of course, the code is still backwards compatible etc.
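For readers who want to see how little code the tagged format needs, here is a minimal Python sketch of a reader for the layout shown in the comment above. It is an illustration only, not the code from the attached patch: the function name `parse_tagged_dict` is hypothetical, and it assumes one entry per line after the `---` separator and `key: value` header lines.

```python
def parse_tagged_dict(text, magic="OOoDICT1"):
    """Parse a tagged user-dictionary: magic line, 'key: value' header
    lines, a '---' separator, then one entry per line (assumed layout)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != magic:
        raise ValueError("not a tagged user-dictionary")

    header = {}
    words = []
    in_body = False
    for line in lines[1:]:
        stripped = line.strip()
        if in_body:
            if stripped:                      # skip blank lines in the body
                words.append(stripped)
        elif stripped == "---":               # separator ends the header
            in_body = True
        elif ":" in line:
            key, value = line.split(":", 1)   # split on the first colon only
            header[key.strip()] = value.strip()
    return header, words


sample = "OOoDICT1\nlang: <none>\ntype: positive\n---\nfurtivelyhutch\nsjkdhkjshfddksj\n"
header, words = parse_tagged_dict(sample)
print(header, words)
```

The `magic` parameter is there because the magic string is a design choice of the format; a different string can be passed without changing the parser.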
Reassigned to SBA.
Created attachment 33302 [details] patch
TL: taking ownership.
Adding Lacy (owner of hunspell, OOo 2.0.2 spellchecking engine) to CC.
Adding maccy as well to CC list.

TL->mmeeks: Thanks for providing a patch! But I need to add some thoughts here:
- The magic header lines removed in the patch are there to distinguish between the different file format versions used for user-dictionaries since StarWriter 2.0. Even though it is quite unlikely that dictionaries this old are still around, the file format changed at least twice later on, and we should still be able to read those old formats.
- Also, applying a modified version of the patch right now would mean that we need to take care of another new file format for those files. Thus it seems something like this should rather be done for a major release like OOo 3.0, when proper migration tools would be required anyway. (Migration tools are not available in minor releases.)

TL->mmeeks, Lacy, maccy: And if it is all about making the user-dictionaries more easily editable, I would suggest a completely different way to go that would have other benefits as well: if a new file format is to be used, we should use the very same file format hunspell uses for the spellcheck dictionaries! The advantages are obvious:
- The hunspell dictionary format is easily readable and editable.
- If user-dictionaries use the same format it will be easy to send them to the maintainer of the main dictionary for incorporation. A bonus that would be very handy!

Of course, because of the UI that allows editing the user-dictionaries, we would need some API extension that allows modifying the files. That is, we need a component (or an extension to hunspell) that implements the XDictionary and maybe the XDictionaryList interface for OOo spellchecker dictionaries. This probably requires beefing up the OOo dictionary format a bit, because AFAIK there is currently no dictionary for "Language ALL" and no support for negative dictionaries (i.e. a list of words that should be reported as incorrect for a defined language, with the option to provide one or more suggestions).

It would also be required to let the user switch usage of those dictionaries on and off at runtime (but not for the main dictionaries that get pre-installed or downloaded), as is currently possible in the UI for user-dictionaries.

AFAIR Kevin Hendricks once mentioned that he wanted to allow "add-ons" to the main dictionary, e.g. en_US_medical, en_US_finacial, ... as positive dictionaries and something like en_US_bad_words as a negative dictionary. The finally accepted words would then consist of the words of the main dictionary plus those from en_US_medical and en_US_finacial, minus all those from en_US_bad_words (i.e. those get flagged as incorrect). If that kind of feature were available and could be configured at runtime by the user, we could simply replace the current format of user-dictionaries with the one that the OOo dictionaries are using. But I don't know how far Kevin got with this, or even whether he started on it.

TL->all: What do you think about the above? Would it be possible? And is it something we would like to go for?

Maybe the above idea could be extended even a little further by defining a main dictionary for all languages and a "changes to be applied" dictionary for the country variants. For example: we may have a "de" dic and aff file, and de_AT, de_CH and de_DE dic and aff files that define the changes (entries to be added/removed) to the main dictionary. If the main dictionary is to hold only entries common to all variants, then there would also be no need for "entries to be removed" in the variant dictionaries.

Note: One other thing that just came to my mind is that the current format for user-dictionaries allows specifying hyphenation points (no alternative spellings though). Could we have this working somehow as well?
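The "main dictionary plus positive add-ons minus negative dictionaries" rule described above can be sketched in a few lines of Python. This is only an illustration of the set logic as stated in the comment: the function name `is_accepted` and the data shapes (sets of words for positive dictionaries, word-to-suggestions mappings for negative ones) are assumptions, not anything hunspell or OOo provides.

```python
def is_accepted(word, main_words, positive_dicts, negative_dicts):
    """Return True if `word` counts as correctly spelled.

    main_words     -- set of words from the main dictionary
    positive_dicts -- iterable of word sets (e.g. a medical add-on)
    negative_dicts -- iterable of {word: [suggestions]} mappings
    A hit in any negative dictionary wins over everything else.
    """
    if any(word in neg for neg in negative_dicts):
        return False
    if word in main_words:
        return True
    return any(word in pos for pos in positive_dicts)


main = {"doctor", "darn"}
medical = {"tachycardia"}                # hypothetical positive add-on
bad_words = {"darn": ["dear me"]}        # hypothetical negative dictionary

for w in ("doctor", "tachycardia", "darn", "xyzzy"):
    print(w, is_accepted(w, main, [medical], [bad_words]))
```

Checking the negative dictionaries first is what makes "removing" a word from the main dictionary possible without editing the main dictionary itself.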
nemeth->mmeeks, tl: Hi, I'm very fond of this patch. I would like to improve it with the following morphological extension:

    OOoDICT1
    lang: <none>
    type: positive
    ---
    furtivelyhutch
    sjkdhkjshfddksj/word

Where "word" is a dictionary item in the spell checking (and morphological) dictionary of "lang". For example:

    Tesco/John
    xexexe/change

means that Tesco can combine with the affixes of John (Tesco's), and that xexexe can be suffixed like change (xexexes, xexexed etc.). Or, better still, we can specify the right homonym with Hunspell's morphological item:

    xexexe/change VERB
    yeyeye/change NOUN

I have made an issue for this: http://lingucomponent.openoffice.org/issues/show_bug.cgi?id=61525

Thomas, Michael? What do you think, how can we solve the user dictionary and spell checker integration? For example: the spell checker loads the relevant user dictionaries, and we need a new function in the spell checker API, reload_user_dictionary(), which is called when the user adds new words to the user dictionary.
Hi tl:

> - The magic header lines removed in the patch are to distinguish between

So - if you read the patch carefully, you'll notice that this shouldn't affect backwards compatibility at all. Yes I move some code around, so a quick glance looks like it breaks stuff - but it does not. ie. this is an incremental change :-)

> - Also if applying a modified version of the patch right now would mean
> that we need to take care of another new file format for those files.

Take care ? as in maintain ? sure - of course the only really controversial bit in here is the format used for re-writing the files in :-) ideally we'd use the same format we loaded them in rather than silently upgrading them; that would prolly be a better move.

> Thus it seems sth like this should better be done for a major release like
> OOo 3.0 when proper migration tools would be required anyway.
> (Migration tools are not available in minor releases)

ie. no migration tools necessary - this just adds support for a cleaner format.

> - The hunspell dictionary format is easily readable and editable
> - If user-dictionaries use the same format it will be easy to send those
> to the maintainer of the main dictionary to incorporate them.
> A bonus that would be very handy!

Sounds sensible to me. Of course - getting this data into hunspell where it can be interpreted sensibly is more difficult. Also - I'd quite like to see this 1st cut go up-stream. What I suggest is we leave enough syntactic room to compatibly add this stuff later; ie. break on '/' and ignore after that etc. ?

> This probably requires to beef up the OOo dictionary format a bit beacause

And a huge amount of work which I personally am not that interested in. This was a quick hack to make an ugly file format less ugly quickly, while not really changing what it does in some structural way - so as to let us manage user-dicts sensibly.

> But I don't know how far Kevin got with this, not even if he had
> started with this.

Kevin seems inactive these days.

> TL->all: What do you think about the above?
> Would it be possible? And is it something we like to go for?

My desire is to shrink my outstanding patch set; so can we not conflate some nice feature / wish-list stuff with the simple format re-work :-) of course, now people can see the format no doubt they'll want that too but ... Either way - thanks for the summary of potential places to hack here :-)
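The "break on '/' and ignore after that" idea from the reply above is easy to illustrate. The Python sketch below splits an entry at the first '/', so a reader that does not understand the affix/morphology part can simply drop it; the function name `split_entry` is hypothetical and this is not the code in the patch.

```python
def split_entry(line):
    """Split a dictionary entry at the first '/'.

    Returns (word, model) where `model` is whatever follows the slash
    (e.g. 'John' or 'change VERB'), or None if there is no slash.
    A reader without morphology support can ignore `model` entirely.
    """
    word, _sep, model = line.strip().partition("/")
    return word, (model or None)


for entry in ("furtivelyhutch", "Tesco/John", "xexexe/change VERB"):
    print(split_entry(entry))
```

Because only the first '/' is significant, later extensions can put anything after it without breaking readers written against the simple format.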
Two questions concerning compatibility:

(1) User has an "old" dictionary and loads it with the patched OOo version, makes some changes - in which format (old or new) will the dictionary be written?

(2) User has a dictionary in the "new" format (however he got it), but accesses the profile with an older, unpatched version of OOo (because e.g. he has two versions of OOo2 installed, but uses the same profile for both). What will happen?

I think that file format changes should be done in major releases, where user data migration is possible in the installation/configuration step. Of course, if a change in a file format fixes a major user problem we should do it even in a micro or minor release, but I fail to see a major user problem in this case. Yes, the file format is not useful for direct access by a text editor - but please explain why you think that it is necessary to be able to edit the user dictionary with an external program. That's still unclear to us. There are a lot of other options to change the dictionary: use the API (e.g. by a Basic Macro) or change the myspell/hunspell dictionary.
adding me to cc
Hi Mathias:

> (1) User has "old" dictionary and loads it with the patched Ooo version, makes
> some changes - in which format (old or new) will the dictionary be written?

Trivial to ensure it's written in the same format - as I say; the only really controversial piece here is that it's silently 'upgraded' to the new version: trivial to fix.

> (2) User has a dictionary in "new" format (however he got it), but accesses
> the profile with an older, unpatched version of OOo (because e.g. he has
> two versions of OOo2 installed, but uses the same profile for both). What
> will happen?

Magic numbers will not match & he'll silently lose those user-dict words. However - argument 2) is the same argument that is always used against almost *any* incremental change. Sacrificing 100% forward compatibility at all times is part and parcel of incremental feature addition; the same argument can be deployed against adding any feature that serializes any state. This of course has really minimal impact since a) we can make it not the default until the next 'major' release; and b) it doesn't need to 'break' existing dicts either.

> I fail to see a major user problem in this case. Yes, the file format is not
> useful for direct access by a text editor - but please explain why you think
> that it is necessary to be able to edit the user dictionary with an external
> program. That's still unclear to us.

Lock-down. I want a small perl-script to be able to edit this file; without having to resort to strange and nasty binary bit bashing. Just such a perl script can be found here (not perfect yet but ;-): http://go-oo.org/ooo-build/bin/ootool.in

> There are a lot of other options to change the dictionary: use the API
> (e.g. by a Basic Macro) or change the myspell/hunspell dictionary.

A basic macro cannot be run by the super-user at package install time, or (if you want any degree of security) at all by the super-user :-)

wrt. hacking up the system myspell/hunspell dictionary - sure; that could work - wrt. appending strings to the global system dictionary; it's possibly even a better solution - but the problem is that these are not managed as config files & when you upgrade the underlying dictionary your changes will be gone. Perhaps there is a better way of doing this with myspell dicts ? didn't look into it; ideas appreciated.

Either way - the change is clearly an improvement over the existing twisted binary file format :-) allows some degree of extensibility [ yes is missing a spec. - and this is one of the rare cases where IMHO writing a spec. is worthwhile ], and yet you don't want it :-) [ and indeed, AFAICS if it doesn't get in in its current state - I'm optimistic that it will never do so - it will sit here in the issue, ignored for months, slowly gathering other requests for enhancement & desires to extend the capabilities beyond what is necessary/interesting for me ;-) to the point that no-one will either want to, nor be able to implement all the suggestions & it will die a death ;-] [ I look forward to being proved wrong on this one of course. ].

I'd love to see the feature (which incidentally re-factors some nice cut/paste code out from IsVers20OrNewer) get in - disabled by default (fair enough) - but at least having the capability to read 'new-format' files, and get deployed. Then perhaps after a few minor releases your fears about co-existence with older versions will be moot anyway [ pragmatically they will ~all support the format ].
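To make the lock-down use case concrete, here is a minimal Python sketch of the kind of edit a small script could make to a tagged-format dictionary: append a word after the `---` separator while leaving the header untouched. It is not the ootool perl script linked above; the function name `add_word`, the one-entry-per-line body, and the use of "\n" line endings are all assumptions.

```python
def add_word(text, word):
    """Return `text` (a tagged-format dictionary) with `word` appended.

    The header and the '---' separator are preserved unchanged; the word
    is added only if it is not already present in the body.
    """
    lines = text.splitlines()
    if "---" not in lines:
        raise ValueError("no '---' separator found; not the tagged format?")
    body_start = lines.index("---") + 1
    if word not in lines[body_start:]:
        lines.append(word)
    return "\n".join(lines) + "\n"


original = "OOoDICT1\nlang: <none>\ntype: positive\n---\nfurtivelyhutch\n"
print(add_word(original, "CompactPCI"), end="")
```

A package post-install script could apply such a function to a dictionary file without starting the office suite at all, which is the point made in the comment above.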
Setting target to OOo 2.0.3.
Fixed in CWS tl18.

Files changed:
- linguistic/source/dicimp.cxx
- linguistic/source/dicimp.hxx
- linguistic/source/dlistimp.cxx

I have made three minor changes compared to the patch though:

1) I fixed the 'getTag' function.

2) I replaced the magic string 'OOoDict1' with 'OOoUserDict1' in the hope of making it clear to any user who tries to modify the file that it is a user-dictionary and not one of the OOo spellchecker's dictionaries (which may reside in the same directory and have the same extension, since downloaded dictionaries nowadays unfortunately get placed in the directory of the user-dictionaries :-( ).

3) The default file format version used for new user-dictionaries created via the UI is the one OOo currently uses. This is because if a dictionary created with OOo 2.0.3 (where this patch should be included) gets copied to e.g. an OOo 2.0.0 installation, we would like the dictionary to be functional with OOo 2.0.0 as well. If we were to create new dictionaries with the new tagged file format this would not be possible. (Even though I think this scenario is unlikely, we shouldn't create incompatibilities within minor releases.)

Thus from OOo 2.0.3 on, user-dictionaries can also make use of the more user-editable tagged file format, and everything else remains as it was.
Created attachment 34666 [details] Archive of small sample dictionaries using different user-dictionary file formats.
Hi tl ! :-) Thanks so much for getting this committed.

> 2) I replaced the magic string 'OOoDict1' with 'OOoUserDict1'

Sounds great - of course, then it's necessary to read more bytes from the beginning of the file to sniff it but that's no problem.

> 3) The default file format version used for new user-dictionaries created via
> the UI is the one OOo currently uses.

Interesting - the cws still has the

    // save new dictionaries with in 7.0 Format

fragment with:

    nDicVersion = 7;

surely you mean that to stay as 6 ? [ also we should s/with in/in/ in the comment ;-]

Anyhow - my only worry is that we make sure that if a file is in ver 7 (plain text) format - that we leave it in that version/format unmodified - so my dict. manipulation tools carry on working :-)

Anyhow - thanks again - much appreciated etc.
About the sample dictionaries in the archive attached above:

- a40neg.dic is a negative dictionary created with SO 4.x
- a40pos.dic is a positive dictionary created with SO 4.x
- a52neg.dic is a negative dictionary created with SO 5.2
- a52pos.dic is a positive dictionary created with SO 5.2
- a7neg.dic is a negative dictionary using the new tagged file format
- a7pos.dic is a positive dictionary using the new tagged file format

I forgot to add dictionaries in the current format (used since SO 6 and similar to SO 5.2 but using UTF-8), but those can easily be created using a current office version.

TL->QA: What needs to be checked is that all of the above dictionaries can be properly "read->modified->written->read again" with this CWS, and that this happens without changing their file format version.

The file format of a dictionary can be determined by opening the file in a text editor and looking for the 'magic string'. Those are:

- "WBSWG2": Allows for positive and negative dictionaries, but negative dictionaries have no suggestion text. Used in SO 4 for positive and negative dictionaries.
- "WBSWG5": Similar to the above but allows for a suggestion text in negative dictionaries. Used in SO 5.2 but only for negative dictionaries; the positive ones still use "WBSWG2" as magic string.
- "WBSWG6": Marks the switch to UTF-8 encoding (the default currently in use for positive and negative dictionaries).
- "OOoUserDict1": The new (optional) tagged file format.

When modifying a dictionary it should always be saved in the same file format it was found in. There is one exception though: negative dictionaries in "WBSWG2" file format get saved in "WBSWG5" file format, since the older one does not allow for suggestion text and it would be odd to have a UI that allows specifying suggestions but does not save them.
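The format detection and the "save in the format it was found in" rule from the comment above can be sketched as follows. This Python sketch is an illustration, not the logic in dicimp.cxx. It assumes that all three binary formats store the magic string after a 2-byte little-endian length, as the hexdump of the "WBSWG2" file in the original report suggests, and that the tagged format starts directly with "OOoUserDict1". The version numbers 2, 5 and 6 are illustrative labels; only "ver 7" for the tagged format is mentioned in this thread.

```python
# Magic strings listed in the comment above, mapped to illustrative versions.
BINARY_MAGICS = {b"WBSWG2": 2, b"WBSWG5": 5, b"WBSWG6": 6}
TAGGED_MAGIC = b"OOoUserDict1"
TAGGED_VERSION = 7


def sniff_version(data):
    """Guess the user-dictionary format version from the first bytes."""
    if data.startswith(TAGGED_MAGIC):
        return TAGGED_VERSION
    if len(data) >= 2:
        # Assumed layout: 2-byte little-endian length, then the magic string.
        length = int.from_bytes(data[:2], "little")
        return BINARY_MAGICS.get(data[2:2 + length])
    return None


def save_version(loaded_version, is_negative):
    """Keep the loaded format, except that negative "WBSWG2" dictionaries
    are written as "WBSWG5" so that suggestion text can be stored."""
    if loaded_version == 2 and is_negative:
        return 5
    return loaded_version


old_header = bytes.fromhex("0600574253574732ff0000")   # start of oldsun.dic
print(sniff_version(old_header))
print(sniff_version(b"OOoUserDict1\nlang: <none>\n"))
print(save_version(2, is_negative=True))
```

Keeping detection and the save rule in two small functions makes the single exception (negative "WBSWG2" to "WBSWG5") easy to spot and to test.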
>> 3) The default file format version used for new user-dictionaries created via
>> the UI is the one OOo currently uses.
>
> Interesting - the cws still has the // save new dictionaries with in 7.0 Format

Well, that is because I wrote the comments in the issue before I committed the changes (which I did right now). If you check again now, the new versions of the files should be available. ^_~

> Anyhow - my only worry is that we make sure that if a file is in ver 7 (plain
> text) format - that we leave it in that version/format unmodified - so my dict.
> manipulation tools carry on working :-)

Now that I've committed the changes you can easily check this yourself by building the DLL. Or you simply wait until the CWS is handed over to QA.
TL->mmeeks: Too much praise since it is your patch to begin with. And compared to the patches I've seen until now this also was the longest. ^_^
re-open issue and reassign to sba@openoffice.org
reassign to sba@openoffice.org
reset resolution to FIXED
SBA: Verified in CWS tl18.
In m168, closing.
Any plans to make this default? No one did for the 3.0 release, but it shouldn't break anything at this point to do so.

I'd like to suggest adding an option for OOo to use Hunspell's dictionary format. This way, OOo, Firefox, Emacs, and any other app that can use Hunspell can use the same dictionary (via links or symlinks). One way I can imagine doing this is another file format:

    OOoUserDICT2
    lang: <none>
    hunspelldic: foo.dic

There would be no type key because Hunspell dictionaries can include both positive and negative entries.
tl->imd: please don't make suggestions in an already fixed and closed issue. If you want something changed or something new implemented, please file a new issue or feature request.

About Hunspell dictionaries: that would be nice to have, and I have thought the same for a long time. But at this point we cannot use Hunspell dictionaries right away, because current OOo dictionaries have two features that are unsupported by Hunspell dictionaries. These are:
- the possibility of setting hyphenation points in dictionary entries
- the concept of negative/exception dictionaries

To make use of hunspell dictionaries as the implementation for user-dictionaries, either hunspell needs to provide support for that (at least exception dictionaries would be nice) or we need new separate dictionary formats just for those two tasks...