Bug 6793 - PATCH reduce sa-awl memory usage
Summary: PATCH reduce sa-awl memory usage
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Tools
Version: 3.3.2
Hardware: PC All
Importance: P2 normal
Target Milestone: 3.4.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-23 19:32 UTC by Vitaly V. Bursov
Modified: 2012-05-14 15:54 UTC
CC List: 3 users



Attachments (all of type "patch"), with submitter/CLA status:
  patch (Vitaly V. Bursov [NoCLA])
  the promised patch (Mark Martinec [HasCLA])
  do it in chunks if huge (Mark Martinec [HasCLA])
  oops (Mark Martinec [HasCLA])
  oops2, hope this diff file is the correct one (Mark Martinec [HasCLA])
  how about this then (Mark Martinec [HasCLA])

Description Vitaly V. Bursov 2012-04-23 19:32:57 UTC
Created attachment 5050 [details]
patch

Hello,

The current version of sa-awl loads the full database key list into memory before showing any stats or performing maintenance. I believe it's obvious that this behavior is undesirable and makes large databases impossible to handle.

The patch below improves sa-awl scaling and responsiveness by scanning the database row by row instead of loading all keys into memory first.

Tested cleaning db with over 8 million rows.

For a cached DB with 850K rows, memory usage drops from 1 GB to 6 MB, though execution time is around 12% slower (29 vs. 26 sec).

I'm not a perl expert, please review.

Thanks.
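The change described above amounts to replacing an eager key dump with lazy iteration. Here is a minimal Python sketch of the two shapes; it is illustrative only, not the actual sa-awl Perl code, and the "totscore" key suffix merely stands in for the AWL key layout discussed later in this thread:

```python
# Old shape: materialize every key up front; memory grows with the DB size.
def count_entries_eager(db):
    keys = list(db.keys())              # O(n) memory before any output
    return sum(1 for k in keys if k.endswith("totscore"))

# New shape: walk the store one record at a time, the way Perl's each()
# walks a tied hash; memory stays roughly flat regardless of DB size.
def count_entries_lazy(db):
    count = 0
    for k in db:
        if k.endswith("totscore"):
            count += 1
    return count
```

Both produce the same statistics; only the peak memory differs.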
Comment 1 Mark Martinec 2012-04-23 23:12:10 UTC
> current version of sa-awl loads full database key list to memory before
> showing any stats or performing maintenance. I believe it's obvious that
> this behavior is undesirable and makes large databases impossible to handle

I fully agree. Operations should not depend on being able to load a
complete database into memory if the size of a database is potentially
unlimited.

> The patch below improves sa-awl scaling and responsiveness by scanning
> database row-by-row basis instead of loading all keys to memory first.
> Tested cleaning db with over 8 million rows.
> For a cached db with 850K rows memory usage lowers from 1G to 6M,

Great, thanks!

> execution time is around 12% slower (26 vs. 29 sec), though.

Despite Kevin's concern on the ML, I don't think this is worth
introducing yet another option (and risking people using it
without fully understanding the implications).

One can make a program arbitrarily fast
if correct and reliable operation is not mandated  :)

> I'm not a perl expert, please review.

My only concern is deleting a key while the each() operator is active.

Please see my alternative patch.
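The concern about each() translates directly to other languages. As an illustration (a Python analogue, not the Perl patch itself), mutating a mapping while iterating over it breaks the iterator, which is why the alternative patch collects the doomed keys first and deletes them after the scan:

```python
# Deleting while iterating breaks the iterator (Perl's each() has a
# similar caveat for keys other than the current one).
h = {"a": 1, "b": 2, "c": 3}
caught = False
try:
    for k in h:
        if h[k] > 1:
            del h[k]            # dict changed size mid-iteration
except RuntimeError:
    caught = True               # CPython raises RuntimeError here

# Safe pattern: collect first, delete after the scan.
h = {"a": 1, "b": 2, "c": 3}
delete_keys = [k for k, v in h.items() if v > 1]
for k in delete_keys:
    del h[k]
print(h)                        # only {"a": 1} remains
```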
Comment 2 Mark Martinec 2012-04-23 23:13:42 UTC
Created attachment 5051 [details]
the promised patch
Comment 3 Henrik Krohns 2012-04-24 06:22:26 UTC
+1 for mark

The old way makes no sense even as an option.
Comment 4 Vitaly V. Bursov 2012-04-24 07:47:11 UTC
The new patch has the same root issue as the original sa-awl: it potentially stores a lot of keys in memory. I had to reduce a DB from 8M keys to 800K, and the proposed algorithm would keep around 3M keys in memory; if 800K keys eat 1 GB of RAM on x86-64, that is not much better than the original.

Looks like there are a few options then.

1. Create a new DB file and replace the old one with it. Probably it's hard to do correctly if SA is running.

2. Leave missed or duplicate keys as is - probably duplicates are harmless and missed keys will be handled on next runs. Ugly.

3. Modify the algorithm. Few options here as well:

a) check the size of @delete_keys, stop the iteration if over the limit, delete the keys, and start all over again (very slow);

b) if we got a 'totscore' key, also fetch and check the corresponding 'count' key (strip the /totscore$/ suffix from the key name); if it's a 'count' key, proceed as usual. When a pair should be deleted, delete the current key and remember the other one in @delete_keys. The trick is that the remembered keys should not all be deleted at the end; instead, on every iteration the current key name should be checked against @delete_keys, and if it is there, it must be deleted and removed from @delete_keys. The size of @delete_keys should also be checked, to keep it from consuming too much memory.

c) store keys that should be deleted in another on-disk DB.

A few more thoughts.

It's hard to predict how efficient method 3b would be compared to 3a, as keys are randomly ordered (I think). Probably it's best to have two implementations and make it an option: a 'fast' one that is not entirely correct for huge DBs (like 2), and a default, correct but slower one (like 3b) for small to medium sized DBs.

Hope this helps,
Thanks.
Comment 5 Henrik Krohns 2012-04-24 08:57:00 UTC
(In reply to comment #4)
> New patch has the same root issue as original sa-awl has - potentially stores a
> lot of keys in memory.

From what I understand, the number of deleted keys would normally be very small compared to the previous code, which arrayed _all_ keys. So there really isn't an issue here?
Comment 6 Henrik Krohns 2012-04-24 09:08:38 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > New patch has the same root issue as original sa-awl has - potentially stores a
> > lot of keys in memory.
> 
> From what I understand, the amount of deleted keys normally would be very small
> compared to previous code which arrayed _all_ keys. So there really isn't an
> issue here?

Nothing against making it safe for the worst-case scenario, though. I would go for the KISS method and just restart when @delete_keys grows beyond 100k-1M keys. I doubt anyone would hit that limit in normal usage, and checking the array size in the loop is practically a zero-cost operation. If someone is deleting that many keys, they are probably deleting most of them, so restarting pretty much picks up where it left off.
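A hypothetical Python sketch of that KISS scheme (the chunk limit and deletion predicate are made up for illustration; the real change is Perl operating on a tied hash):

```python
def clean_in_chunks(h, doomed, max_chunk=10000):
    """Scan h repeatedly, deleting at most max_chunk keys per pass,
    until a pass finishes without hitting the chunk limit."""
    while True:
        delete_keys = []            # fresh list each pass
        for k in h:
            if doomed(k, h[k]):
                delete_keys.append(k)
                if len(delete_keys) >= max_chunk:
                    break           # bound memory: delete now, rescan later
        for k in delete_keys:
            del h[k]
        if len(delete_keys) < max_chunk:
            return                  # the last pass covered the whole DB
```

A restart rescans the surviving keys, so the worst case pays extra reads, but memory stays bounded by max_chunk.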
Comment 7 Vitaly V. Bursov 2012-04-24 09:22:12 UTC
(In reply to comment #5)
> From what I understand, the amount of deleted keys normally would be very small
> compared to previous code which arrayed _all_ keys. So there really isn't an
> issue here?

Yes, makes sense, I completely missed that.

So, to avoid memory exhaustion in the rare case when a lot of keys must be deleted, either 3a from comment #4 will do, or the iteration can simply be aborted once 10000-100000 keys are collected in @delete_keys, with a warning message printed and a non-zero exit code returned. Just in case.
Comment 8 Mark Martinec 2012-04-24 10:22:27 UTC
Created attachment 5053 [details]
do it in chunks if huge

> Nothing against making it safe for the worst scenario though.. I would go
> for KISS method and just restart when @deleted_keys is over 100k-1M keys.

Fully agree. I was considering this, but the hour was late...

A new patch attached.
Comment 9 Mark Martinec 2012-04-24 10:38:29 UTC
Created attachment 5054 [details]
oops

(the previous patch included an unrelated patch, sorry)
Comment 10 Kevin A. McGrail 2012-04-24 12:48:28 UTC
+1 on the 'oops' patch, but I only reviewed the logic. And with the potential memory usage being discussed, I'm firmly in the 'slow and steady wins the race' camp rather than the optimization camp, so no worries there.

Vitaly, can you test Mark's patch by any chance?

regards,
KAM
Comment 11 Vitaly V. Bursov 2012-04-24 13:12:25 UTC
Tested. Progress nearly stops after the first 10000 removed keys; I guess the @delete_keys contents should be pruned.

I hope I got the correct version of the patch, as 'oops' looks unrelated too.
Comment 12 Kevin A. McGrail 2012-04-24 13:19:50 UTC
(In reply to comment #11)
> Tested. Progress nearly stops after first 10000 removed keys, I guess
> @delete_keys contents should be pruned.
> 
> I hope I got the correct version of the patch, as 'oops' looks unrelated too.

It looks like I tested 5053 and 5054.  Mark, can you clarify?
Comment 13 Mark Martinec 2012-04-24 14:33:47 UTC
> It looks like I tested 5053 and 5054.  Mark, can you clarify?

5053 and 5054 are the same regarding sa-awl
(the 5053 contains an additional unrelated patch, no harm)

> Tested. Progress nearly stops after first 10000 removed keys,
> I guess @delete_keys contents should be pruned.

Can you clarify what you mean?

> Progress nearly stops after first 10000 removed keys.

You are probably observing the deletion phase for the
first collected set of 10000 keys; there are no printouts
during this phase. Or is it something else?
Comment 14 Mark Martinec 2012-04-24 14:38:18 UTC
> 5053 and 5054 are the same regarding sa-awl
> (the 5053 contains an additional unrelated patch, no harm)

My mistake again, sorry, will fix that...
Just use the 5053.
Comment 15 Mark Martinec 2012-04-24 14:40:09 UTC
Created attachment 5055 [details]
oops2, hope this diff file is the correct one
Comment 16 Vitaly V. Bursov 2012-04-24 14:58:44 UTC
(In reply to comment #13)
> Can you clarify what you mean?

after
  delete $h{$_}  for @delete_keys;
the @delete_keys contents are not needed on the next for(;;) loop iteration, so something like

-  my(@delete_keys);
   for (;;) {
+    my(@delete_keys);

but I haven't tested it.
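The effect of that scoping change can be shown with a small Python toy (illustrative only, not the sa-awl code): declaring the collection once outside the loop means every later pass re-processes all previously collected keys, while a fresh list per pass keeps the work proportional to the current batch.

```python
# Buggy shape: delete_keys declared once and never emptied, so each
# pass re-handles every previously collected key.
def passes_buggy(batches):
    delete_keys = []
    work = []
    for batch in batches:
        delete_keys.extend(batch)
        work.append(len(delete_keys))   # work per pass keeps growing
    return work

# Fixed shape: a fresh list per pass, matching the my(@delete_keys)
# moved inside the for(;;) loop in the Perl diff above.
def passes_fixed(batches):
    work = []
    for batch in batches:
        delete_keys = []
        delete_keys.extend(batch)
        work.append(len(delete_keys))
    return work

print(passes_buggy([[1] * 3, [2] * 3, [3] * 3]))  # [3, 6, 9]
print(passes_fixed([[1] * 3, [2] * 3, [3] * 3]))  # [3, 3, 3]
```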
Comment 17 Mark Martinec 2012-04-24 15:20:01 UTC
> after
>   delete $h{$_}  for @delete_keys;
> @delete_keys contents is not needed on next for(;;) loop iteration, something
> like
> 
> -  my(@delete_keys);
>    for (;;) {
> +    my(@delete_keys);
> 
> but I haven't tested it.

Indeed, good catch. Too much hurry is never a good thing.
Comment 18 Mark Martinec 2012-04-24 15:27:18 UTC
Created attachment 5056 [details]
how about this then
Comment 19 Kevin A. McGrail 2012-04-24 15:37:23 UTC
(In reply to comment #18)
> Created attachment 5056 [details]
> how about this then

The 3rd one burned, fell over into the swamp but the 4th one stood. ;-)

The logic is sound and the scoping change on delete_keys makes sense to me. Again, I only reviewed the logic and did not try the actual patch.
Comment 20 Vitaly V. Bursov 2012-04-24 16:37:39 UTC
Well, I haven't found any more issues.

Performance is surprisingly good too: processed DB with >8M entries (two database keys per entry), 7.5M of which were deleted in less than 40 minutes.

BerkeleyDB 4.8 if it matters.
Comment 21 Mark Martinec 2012-04-24 17:35:51 UTC
> Well, I haven't found any more issues. 
> Performance is surprisingly good too: processed DB with >8M entries (two
> database keys per entry), 7.5M of which were deleted in less than 40 minutes.
> BerkeleyDB 4.8 if it matters.

Very good, thanks for testing!

trunk (3.4):
  Bug 6793: reduce sa-awl memory usage
Sending sa-awl.raw
Committed revision 1329877.

Can we close this now?
Comment 22 Henrik Krohns 2012-04-24 17:39:07 UTC
First of all, +1, though I didn't look at it so in-depth this time either. It's enough for me that it's already pretty well tested. :-)


(In reply to comment #20)
> Well, I haven't found any more issues.
> 
> Performance is surprisingly good too: processed DB with >8M entries (two
> database keys per entry), 7.5M of which were deleted in less than 40 minutes.
> 
> BerkeleyDB 4.8 if it matters.

A bit off topic, but have you ever tried DB_TXN_NOSYNC? It should improve performance considerably, and the "risks" are more than acceptable for this use case.

You could try adding it in this line in BDB.pm:
my $flags = DB_INIT_LOCK|DB_INIT_LOG|DB_INIT_MPOOL|DB_INIT_TXN;

In fact, I think the flags should be user-configurable to some degree.
Comment 23 Vitaly V. Bursov 2012-04-24 18:03:23 UTC
(In reply to comment #22)

> A bit off topic, but have you ever tried DB_TXN_NOSYNC ?

No, the tests were performed on a hardware 2-disk SATA RAID with write caching enabled, so NOSYNC shouldn't make a big difference. Most likely it was read-bound, not write-bound.
Comment 24 Mark Martinec 2012-05-14 14:21:59 UTC
Close?
Comment 25 Kevin A. McGrail 2012-05-14 15:54:51 UTC
Thanks Vitaly and Mark!