SA Bugzilla – Bug 6364
Russian UTF-8 in TextCat
Last modified: 2021-04-18 13:03:19 UTC
Please add UTF-8 support for the Russian language to TextCat. Without it, detection of Russian does not work reliably, and neither does "normalize_charset". To solve the problem, a ru.utf-8.lm file needs to be added to the source. Such a file, named ru-utf8.lm, is available at: http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/textcat (Updated files for some other languages are in the same place.) P.S. This problem with Russian is very old and unpleasant; I think the whole Russian-speaking community will be grateful for this small file :)
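A minimal sketch of why a separate UTF-8 model matters, assuming TextCat-style models are built from byte n-gram statistics (the report does not describe the internals, so treat that as an assumption). The sample sentence and the helper names `byte_ngrams` and `non_ascii` are hypothetical and chosen only for illustration. The sketch encodes the same Russian text as KOI8-R and as UTF-8 and checks how many non-ASCII 2- and 3-byte n-grams the two encodings share.

```python
from collections import Counter

def byte_ngrams(data: bytes, min_n: int = 2, max_n: int = 3) -> Counter:
    """Count every byte n-gram of length min_n..max_n in data."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

def non_ascii(counts: Counter) -> set:
    """Keep only n-grams that contain at least one byte >= 0x80."""
    return {gram for gram in counts if any(b >= 0x80 for b in gram)}

# Hypothetical sample text, not taken from the attached message.
sample = "привет, как дела"

koi8_bytes = sample.encode("koi8_r")
utf8_bytes = sample.encode("utf-8")

koi8_grams = non_ascii(byte_ngrams(koi8_bytes))
utf8_grams = non_ascii(byte_ngrams(utf8_bytes))
shared = koi8_grams & utf8_grams

print("KOI8-R bytes:", len(koi8_bytes), "UTF-8 bytes:", len(utf8_bytes))
print("non-ASCII n-grams shared by both encodings:", len(shared))
```

For this sample the two encodings have no non-ASCII n-grams in common, so statistics collected from KOI8-R or ISO-8859-5 text would say very little about the same text in UTF-8.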
UTF-8 TextCat models for Russian, French, Spanish, Italian and Chinese have been added to the repository mentioned above.
Also note my bug: https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229 TextCat is currently broken both case-wise and encoding-wise. It should be completely revamped.
Created attachment 5702
Another Russian UTF-8 spam

Take the attached, obviously Russian, message "a". No matter whether normalize_charset is 0 or 1 (it's UTF-8 anyway), we get the same X-Spam-Textcatresults:

zh.gb2312:149533(1.00) zh.big5:151313(1.01) ko:152101(1.02) ja.shift-jis:152161(1.02) th:152504(1.02) ja.euc-jp:152931(1.02) hy:152988(1.02) ar.iso-8859-6:153918(1.03) am.utf-8:154133(1.03) ta:154349(1.03) mr:155147(1.04) hi:155343(1.04) ru.iso-8859-5:156383(1.05) uk.koi8-r:156736(1.05) vi:156983(1.05) ka:157040(1.05) bg.iso-8859-5:157425(1.05) ru.koi8-r:157592(1.05) ar.windows-1256:157731(1.05) pl:158285(1.06)

Why is ar.windows-1256 even tied with Russian?
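The header quoted above can be picked apart with a short script. This is only a sketch: the meaning of the number in parentheses is not stated in this report, and the script assumes it is each model's score divided by the best (lowest) score, rounded to two decimals. The function name `parse_textcat_results` is hypothetical; the header value is copied from the comment.

```python
import re

# Header value copied verbatim from the comment above.
HEADER = (
    "zh.gb2312:149533(1.00) zh.big5:151313(1.01) ko:152101(1.02) "
    "ja.shift-jis:152161(1.02) th:152504(1.02) ja.euc-jp:152931(1.02) "
    "hy:152988(1.02) ar.iso-8859-6:153918(1.03) am.utf-8:154133(1.03) "
    "ta:154349(1.03) mr:155147(1.04) hi:155343(1.04) "
    "ru.iso-8859-5:156383(1.05) uk.koi8-r:156736(1.05) vi:156983(1.05) "
    "ka:157040(1.05) bg.iso-8859-5:157425(1.05) ru.koi8-r:157592(1.05) "
    "ar.windows-1256:157731(1.05) pl:158285(1.06)"
)

def parse_textcat_results(value: str):
    """Return (model, score, printed_ratio) tuples from the header value."""
    pattern = re.compile(r"(\S+?):(\d+)\(([\d.]+)\)")
    return [(m, int(s), float(r)) for m, s, r in pattern.findall(value)]

entries = parse_textcat_results(HEADER)
best = min(score for _, score, _ in entries)

# Compare the printed two-decimal ratio with the unrounded one
# for the Russian models and the Arabic model from the question.
for model, score, printed in entries:
    if model.startswith("ru.") or model == "ar.windows-1256":
        print(f"{model:16} score={score} ratio={score / best:.4f} printed={printed:.2f}")
```

Under that assumption the raw scores for ru.koi8-r (157592) and ar.windows-1256 (157731) are not equal; they only look tied because both ratios round to 1.05. All twenty candidates fall within 6% of the best score, which suggests none of the available models fits the UTF-8 input well.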
Added new languages es.utf8 fr.utf8 it.utf8 ru.utf8 zh.utf8

Sending trunk/rules/languages
Transmitting file data .done
Committing transaction...
Committed revision 1888898.