SA Bugzilla – Bug 6452
__HK_LOTTO_BALLOT too broad
Last modified: 2011-10-31 19:19:24 UTC
HK_LOTTO currently scores 3.599 2.755 2.993 3.599.
One component is __HK_LOTTO_BALLOT, which matches "on-line ballot" in the body.

This is way too high a score for such a broad pattern.
It falsely scored an election announcement from the American Physical Society.
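To illustrate why a bare phrase match is too broad, here is a minimal sketch. The regex below is a hypothetical stand-in based only on the description above ("matches on-line ballot in the body"); the actual rule definition is not quoted in this report, and both sample sentences are invented for illustration.

```python
import re

# Hypothetical stand-in for the body pattern described in this report.
# The real __HK_LOTTO_BALLOT definition is not quoted here.
BROAD_BALLOT = re.compile(r"on-line ballot", re.IGNORECASE)

# Invented sample sentences, one ham-like and one spam-like.
HAM_TEXT = "Members may vote in the society election using the on-line ballot."
SPAM_TEXT = "Your e-mail address won our on-line ballot lottery draw."


def matches_ballot(text: str) -> bool:
    """Return True if the broad phrase appears anywhere in the text."""
    return BROAD_BALLOT.search(text) is not None


if __name__ == "__main__":
    # Both lines report a hit, so the phrase alone cannot separate ham from spam.
    for label, text in (("ham", HAM_TEXT), ("spam", SPAM_TEXT)):
        print(label, matches_ballot(text))
```

Since the phrase hits both kinds of text, it seems better suited as a low-weight component of a meta than as something that contributes several points.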
(In reply to comment #0)
> HK_LOTTO currently scores 3.599 2.755 2.993 3.599
> One component is __HK_LOTTO_BALLOT
> which matches "on-line ballot" in the body
>
> This is way too high a score for such a broad pattern.
> It falsely scored an election announcement from American Physical Society

Pls attach a raw sample message including full SA report header (munge rcpt address)
Created attachment 4778
Election notice from APS
Just FYI, that rule may also overlap some things I'm doing in LOTSA_MONEY... Andrew, do you have any objections to putting that sample into my ham corpus?
(In reply to comment #3)
> Just FYI, that rule may also overlap some things I'm doing in LOTSA_MONEY...
>
> Andrew, do you have any objections to putting that sample into my ham corpus?

For myself, no, though perhaps further obfuscate triumf.ca to example.com or xxxx. Perhaps one should ask Ken Cole at APS. I am unfamiliar with standard practice in these cases. It's obviously not a personal message but one widely distributed to APS members, and I can't see that it contains anything sensitive, except perhaps the contact address.
(In reply to comment #3)
> Andrew, do you have any objections to putting that sample into my ham corpus?

Attaching the sample to this bug report already made it public. What objection could there possibly be to use it in a local, non-published corpus?
(In reply to comment #5)
> (In reply to comment #3)
> > Andrew, do you have any objections to putting that sample into my ham corpus?
>
> Attaching the sample to this bug report already made it public. What objection
> could there possibly be to use it in a local, non-published corpus?

It would actually go into my uploaded corpus. I currently don't do local checks and upload just the results. That's why I asked.
(In reply to comment #6)
> It would actually go into my uploaded corpus. I currently don't do local checks
> and upload just the results. That's why I asked.

That still is not public (unlike this bug report and its attachments!), and access is strictly limited to SA devs.
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #3)
> > > Andrew, do you have any objections to putting that sample into my ham corpus?
> >
> > Attaching the sample to this bug report already made it public. What objection
> > could there possibly be to use it in a local, non-published corpus?
>
> It would actually go into my uploaded corpus. I currently don't do local checks
> and upload just the results. That's why I asked.

As has been pointed out, I have already made it public (or more public than on a wide mailout), so I'd say go ahead, if debating the issue will hold things up. The only reason (other than carelessness) that I had not redacted more PII is that some tools like the Razor plugin depend on an unmodified message body.
(In reply to comment #4)
> (In reply to comment #3)
> > Just FYI, that rule may also overlap some things I'm doing in LOTSA_MONEY...
> >
> > Andrew, do you have any objections to putting that sample into my ham corpus?
>
> For myself, no, though perhaps further obfuscate triumf.ca to example.com or
> xxxx.

I note you've replaced the member IDs and codes with descriptive text or "xxxxx". Those sorts of things in a message can also be rule fodder. Could I ask you to re-sanitize the original message and, instead of what you did, just change the codes while retaining their format? For example, if the APS member ID is a string of numbers and letters, change the numbers to different numbers and the letters to different letters. If you're willing to do that, thanks; if not, I understand.
Created attachment 4781
Election notice from APS

Sanitised as follows, keeping length of original fields:
Replace recipient uid with "userxx"
Replace APS member ID with different digits
Replace APS PI code with different letters, digits
Replace APS executive personal name with "John Doe"
Replace APS executive personal address with "someuserxx"
Replace APS executive phone number with 555-1234
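For anyone sanitising samples the same way, the format-preserving replacement of IDs and codes could be sketched roughly as below. This is only an illustration, not the procedure actually used for this attachment; the function name `scramble_keep_format` and the sample value are hypothetical.

```python
import random
import string


def scramble_keep_format(value: str, rng: random.Random) -> str:
    """Swap each ASCII digit for a different digit and each ASCII letter for a
    different letter of the same case. Everything else is left untouched, so
    the length and shape of the field are preserved."""
    out = []
    for ch in value:
        if ch in string.digits:
            pool = string.digits
        elif ch in string.ascii_lowercase:
            pool = string.ascii_lowercase
        elif ch in string.ascii_uppercase:
            pool = string.ascii_uppercase
        else:
            # Punctuation, spaces, and non-ASCII characters pass through.
            out.append(ch)
            continue
        # Exclude the original character so every position really changes.
        out.append(rng.choice([c for c in pool if c != ch]))
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random()  # seed it if reproducible output is needed
    # Made-up value standing in for a member ID or code.
    print(scramble_keep_format("AB-1234x", rng))
```

Keeping the character classes and lengths intact means rules that key on the shape of an ID still see something realistic, while the real values are gone.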
Thanks. Uploaded to nightly masscheck ham corpora.
Scores on the overall meta are now coming out much lower, so this can be considered resolved.

72_scores.cf:score HK_LOTTO_NAME 0.999 0.042 0.999 0.042
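For readers unfamiliar with the four numbers on a score line: SpamAssassin reads them as one score per score set, in the order no Bayes and no network tests, network tests only, Bayes only, and Bayes plus network tests. A minimal sketch of splitting such a line follows, assuming the four-value form only; `parse_score_line` and the set labels are hypothetical names chosen for illustration.

```python
# Labels for score sets 0..3, in the order the values appear on a score line.
SCORE_SETS = ("no-bayes/no-net", "no-bayes/net", "bayes/no-net", "bayes/net")


def parse_score_line(line: str):
    """Parse 'score NAME v0 v1 v2 v3' into (NAME, {score set label: value}).

    A grep-style 'filename:' prefix, as in the line quoted above, is dropped.
    Only the four-value form is handled in this sketch.
    """
    # rpartition leaves the line unchanged when it contains no colon.
    _, _, rest = line.rpartition(":")
    parts = rest.split()
    if len(parts) != 6 or parts[0] != "score":
        raise ValueError("expected 'score NAME' followed by four values")
    values = [float(v) for v in parts[2:]]
    return parts[1], dict(zip(SCORE_SETS, values))


if __name__ == "__main__":
    name, scores = parse_score_line(
        "72_scores.cf:score HK_LOTTO_NAME 0.999 0.042 0.999 0.042"
    )
    print(name)
    for label, value in scores.items():
        print(f"  {label}: {value}")
```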