SA Bugzilla – Bug 6793
PATCH reduce sa-awl memory usage
Last modified: 2012-05-14 15:54:51 UTC
Created attachment 5050 [details] patch

Hello, the current version of sa-awl loads the full database key list into memory before showing any stats or performing maintenance. I believe it's obvious that this behavior is undesirable and makes large databases impossible to handle. The patch below improves sa-awl's scaling and responsiveness by scanning the database row by row instead of loading all keys into memory first. Tested cleaning a DB with over 8 million rows. For a cached DB with 850K rows, memory usage drops from 1G to 6M; execution time is around 12% slower (26 vs. 29 sec), though. I'm not a Perl expert, please review. Thanks.
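The row-by-row idea can be modeled in Python (the real sa-awl is a Perl script iterating a tied Berkeley DB with each(); the dict and function names below are illustrative, not from the patch):

```python
# Illustrative model of the change: a plain dict stands in for the
# on-disk AWL database.

def stats_all_in_memory(db):
    keys = list(db)            # old way: materialize every key first
    return len(keys)

def stats_streaming(db):
    count = 0
    for _key in db:            # new way: one row at a time, O(1) extra memory
        count += 1
    return count

db = {"user%d|totscore" % i: i for i in range(1000)}
print(stats_streaming(db) == stats_all_in_memory(db))  # True
```

Both walks visit every row, so the runtime is comparable; only the peak memory differs, which matches the reported numbers (1G down to 6M, ~12% slower).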
> current version of sa-awl loads full database key list to memory before
> showing any stats or performing maintenance. I believe it's obvious that
> this behavior is undesirable and makes large databases impossible to handle

I fully agree. Operations should not depend on being able to load a complete database into memory if the size of a database is potentially unlimited.

> The patch below improves sa-awl scaling and responsiveness by scanning
> database row-by-row basis instead of loading all keys to memory first.
> Tested cleaning db with over 8 million rows.
> For a cached db with 850K rows memory usage lowers from 1G to 6M,

Great, thanks!

> execution time is around 12% slower (26 vs. 29 sec), though.

Despite Kevin's concern on the ML, I don't think this is worth introducing yet another option (and risking people using it without fully understanding the implications). One can make a program arbitrarily fast if correct and reliable operation is not mandated :)

> I'm not a perl expert, please review.

My only concern is deleting a key while the each() operator is active. Please see my alternative patch.
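The hazard with deleting during an active each() can be illustrated in Python, where the analogous mutation-during-iteration is an outright error rather than undefined behavior (the dict and names below are illustrative, not from either patch):

```python
# Deleting while iterating: Perl documents this as unsafe for each()
# (except for the key just returned); Python makes it an explicit error.
db = {"k%d" % i: i for i in range(6)}
try:
    for key in db:
        if db[key] % 2 == 0:
            del db[key]              # raises RuntimeError
except RuntimeError:
    print("unsafe")

# Safe alternative: collect the doomed keys, delete after the scan.
db = {"k%d" % i: i for i in range(6)}
delete_keys = [k for k in db if db[k] % 2 == 0]
for k in delete_keys:
    del db[k]
print(sorted(db))  # ['k1', 'k3', 'k5']
```

Collecting first is exactly what reintroduces the memory question discussed below: the collection list itself can grow large.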
Created attachment 5051 [details] the promised patch
+1 for Mark. The old way makes no sense even as an option.
The new patch has the same root issue as the original sa-awl: it can potentially store a lot of keys in memory. I had to reduce a DB from 8M keys to 800K, and the proposed algorithm would have kept around 3M keys in memory; if 800K keys eat 1G of RAM on x86-64... well, not much better than the original.

Looks like there are a few options then:

1. Create a new DB file and replace the old one with it. Probably hard to do correctly if SA is running.

2. Leave missed or duplicate keys as is - duplicates are probably harmless and missed keys will be handled on the next runs. Ugly.

3. Modify the algorithm. A few options here as well:

   a) Check the size of @delete_keys, stop the iteration if it's over a limit, delete the collected keys, then start all over again (very slow).

   b) When we get a 'totscore' key, also fetch and check its 'count' counterpart (strip the /totscore$/ part from the key name); if it's a 'count' key, proceed as usual. If the pair should be deleted, delete the current key and remember the other one in @delete_keys. The trick is that the remembered keys should not be deleted all at once afterwards; instead, on every iteration the current key name should be checked against @delete_keys, and if present, the key must be deleted and removed from @delete_keys. The size of @delete_keys should also be capped to keep it from consuming too much memory.

   c) Store the keys that should be deleted in another on-disk DB.

A few more thoughts. It's hard to predict how efficient method 3b would be compared to 3a, as the keys are randomly ordered (I think). It's probably best to have two implementations and make it an option: a 'fast' one that is not entirely correct for huge DBs (like 2), and a default, correct but slower one (like 3b) for small-to-medium sized DBs.

Hope this helps, thanks.
(In reply to comment #4)
> New patch has the same root issue as original sa-awl has - potentially stores a
> lot of keys in memory.

From what I understand, the amount of deleted keys normally would be very small compared to the previous code, which arrayed _all_ keys. So there really isn't an issue here?
(In reply to comment #5)
> (In reply to comment #4)
> > New patch has the same root issue as original sa-awl has - potentially stores a
> > lot of keys in memory.
>
> From what I understand, the amount of deleted keys normally would be very small
> compared to previous code which arrayed _all_ keys. So there really isn't an
> issue here?

Nothing against making it safe for the worst scenario though. I would go for the KISS method and just restart when @deleted_keys goes over 100K-1M keys. I doubt anyone would hit the limit in normal usage, and checking the array size in the loop is pretty much a zero-CPU operation. If someone is deleting that many keys, they're probably deleting most of them, so restarting pretty much picks up where it left off.
(In reply to comment #5)
> From what I understand, the amount of deleted keys normally would be very small
> compared to previous code which arrayed _all_ keys. So there really isn't an
> issue here?

Yes, makes sense, I completely missed that. So, to avoid memory exhaustion in the rare case where a lot of keys should be deleted, either 3a from comment #4 will do, or the iteration can simply be aborted once 10000-100000 keys have been collected in @delete_keys, with a warning message printed and a non-zero exit code returned. Just in case.
Created attachment 5053 [details] do it in chunks if huge

> Nothing against making it safe for the worst scenario though.. I would go
> for KISS method and just restart when @deleted_keys is over 100k-1M keys.

Fully agree. I was considering this, but the hour was late... A new patch attached.
Created attachment 5054 [details] oops (the previous patch included an unrelated patch, sorry)
+1 on the 'oops' patch, but I only reviewed the logic. And I'm definitely in the "slow and steady wins the race" camp over optimization, given the potential memory usage being discussed, so no worries there.

Vitaly, can you test Mark's patch by any chance?

regards,
KAM
Tested. Progress nearly stops after the first 10000 removed keys; I guess the @delete_keys contents should be pruned.

I hope I got the correct version of the patch, as 'oops' looks unrelated too.
(In reply to comment #11)
> Tested. Progress nearly stops after first 10000 removed keys, I guess
> @delete_keys contents should be pruned.
>
> I hope I got the correct version of the patch, as 'oops' looks unrelated too.

It looks like I tested 5053 and 5054. Mark, can you clarify?
> It looks like I tested 5053 and 5054. Mark, can you clarify?

5053 and 5054 are the same regarding sa-awl (5053 contains an additional unrelated patch, no harm).

> Tested. Progress nearly stops after first 10000 removed keys,
> I guess @delete_keys contents should be pruned.

Can you clarify what you mean by "progress nearly stops after first 10000 removed keys"? You are probably observing the deletion phase for the first collected set of 10000 keys; there are no printouts during this phase. Or is it something else?
> 5053 and 5054 are the same regarding sa-awl
> (the 5053 contains an additional unrelated patch, no harm)

My mistake again, sorry, will fix that... Just use 5053.
Created attachment 5055 [details] oops2, hope this diff file is the correct one
(In reply to comment #13)
> Can you clarify what you mean?

After

  delete $h{$_} for @delete_keys;

the @delete_keys contents are not needed on the next for(;;) loop iteration, so something like:

  - my(@delete_keys);
    for (;;) {
  +   my(@delete_keys);

but I haven't tested it.
> after
>   delete $h{$_} for @delete_keys;
> @delete_keys contents is not needed on next for(;;) loop iteration, something
> like
>
> - my(@delete_keys);
>   for (;;) {
> + my(@delete_keys);
>
> but I haven't tested it.

Indeed, good catch. Too much hurry is never a good thing.
Created attachment 5056 [details] how about this then
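For reference, the chunked logic that this final patch converges on can be modeled in Python (a sketch only; the actual patch is Perl against a tied DB, and CHUNK_SIZE and should_expire are illustrative names, not identifiers from the patch):

```python
CHUNK_SIZE = 3   # the real discussion uses a much larger cap, e.g. 10000

def expire(db, should_expire):
    """Collect at most CHUNK_SIZE doomed keys per pass, delete them
    after the scan, then restart until a pass finds nothing."""
    removed = 0
    while True:
        delete_keys = []     # reset per pass: the my(@delete_keys) scoping fix
        for key in db:
            if should_expire(key):
                delete_keys.append(key)
                if len(delete_keys) >= CHUNK_SIZE:
                    break    # cap memory; finish this pass early
        if not delete_keys:
            return removed
        for k in delete_keys:
            del db[k]
        removed += len(delete_keys)

db = {"k%d" % i: i for i in range(10)}
print(expire(db, lambda k: int(k[1:]) % 2 == 0), len(db))  # 5 5
```

If the reset were hoisted out of the while loop, the list would hit the cap on the first pass and every later pass would collect nothing new, which matches the "progress nearly stops after first 10000 removed keys" symptom above.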
(In reply to comment #18)
> Created attachment 5056 [details]
> how about this then

The 3rd one burned, fell over into the swamp, but the 4th one stood. ;-)

The logic is sound and the scoping change on @delete_keys makes sense to me. Again, I just reviewed the logic without trying the actual patch.
Well, I haven't found any more issues.

Performance is surprisingly good too: processed a DB with >8M entries (two database keys per entry), 7.5M of which were deleted, in less than 40 minutes.

BerkeleyDB 4.8, if it matters.
> Well, I haven't found any more issues.
> Performance is surprisingly good too: processed DB with >8M entries (two
> database keys per entry), 7.5M of which were deleted in less than 40 minutes.
> BerkeleyDB 4.8 if it matters.

Very good, thanks for testing!

trunk (3.4): Bug 6793: reduce sa-awl memory usage
  Sending sa-awl.raw
Committed revision 1329877.

Can we close this now?
First of all, +1, but I didn't look so in-depth this time either. It's enough for me that it's pretty well tested already. :-)

(In reply to comment #20)
> Well, I haven't found any more issues.
>
> Performance is surprisingly good too: processed DB with >8M entries (two
> database keys per entry), 7.5M of which were deleted in less than 40 minutes.
>
> BerkeleyDB 4.8 if it matters.

A bit off topic, but have you ever tried DB_TXN_NOSYNC? It should add considerable performance, and the "risks" are more than acceptable for this use. You could try adding it to this line in BDB.pm:

  my $flags = DB_INIT_LOCK|DB_INIT_LOG|DB_INIT_MPOOL|DB_INIT_TXN;

In fact, I think the flags should be user-configurable to some degree.
(In reply to comment #22)
> A bit off topic, but have you ever tried DB_TXN_NOSYNC ?

No. The tests were performed on a hardware 2-disk SATA RAID with write caching enabled, so NOSYNC shouldn't make a big difference. Most likely it was read-congested, not write-congested.
Close?
Thanks Vitaly and Mark!