The following *is* true with Apache 2.0.47 on Windows. It *may* well be true on other platforms as well -- I've not done sufficient testing to say for certain. Apache crashes when the number of distinct users authenticating against LDAP exceeds the setting used for LDAPCacheEntries. This does not always occur on first exceeding this cache size, but in my experience it will invariably occur after a few occurrences of exceeding the cache size. A little debugging strongly suggests that there is an issue with the code which removes old entries from the cache in this case. The workaround is either to use a value of 0 for LDAPCacheEntries, i.e. disable the cache, or to use a value that is larger than your user population plus some safety factor. The safety factor is necessary because it appears to be possible to have more than one entry for a given user in the cache. This appears to occur when one request is using the user entry while another request to authenticate the same user comes in. This issue is masked by bug #24800 and cannot be reached until you work around that bug.
P.S. I believe this issue may still be masked by an undersized shared memory block even though bug #24800 appears to be fixed in 2.0.49. For instance, with:

    LDAPCacheEntries 2150
    # Next line was necessary last I checked as 0 caused issues with active cache
    LDAPOpCacheEntries 1
    LDAPSharedCacheSize 865000
    LDAPSharedCacheFile logs/mod_ldap_cache

I get a child process crash once I get to somewhere between 2151 and 2155 distinct users. Finally, I'm pretty sure I verified that this issue exists on Solaris and AIX as well -- but I clearly forgot to note it here.
Trying to look at this now, although I'm not that familiar with the cache code. Do you have an example of a stacktrace where the crash is occurring? I'm trying to work out why the problem would be in cache cleanup rather than in adding to the cache - maybe it's an edge case somewhere in the cleanup?
It's a long-standing bug that the shared memory caching code does not check for the apr_rmm_*alloc functions returning NULL, so it will of course die horribly if the rmm segment fills up and the code tries to allocate more:

    return (void *)apr_rmm_addr_get(cache->rmm_addr, apr_rmm_calloc(cache->rmm_addr, size));
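To make the missing check concrete, here is a minimal, self-contained sketch of the pattern. It does not use APR: `arena_t`, `arena_calloc`, `arena_addr_get` and `checked_alloc` are hypothetical stand-ins for the rmm segment and its allocator, and the sketch assumes a failed allocation is reported as a zero offset. The point is only that the offset gets tested before it is turned into an address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 64

/* Hypothetical stand-in for a shared memory segment (not an APR type). */
typedef struct {
    unsigned char base[ARENA_SIZE];
    size_t used;  /* offset 0 is reserved so that 0 can mean "no block" */
} arena_t;

static void arena_init(arena_t *a)
{
    a->used = 1;
}

/* Returns the offset of a zeroed block, or 0 if the arena cannot fit it. */
static size_t arena_calloc(arena_t *a, size_t size)
{
    size_t off;

    if (size == 0 || size > ARENA_SIZE - a->used) {
        return 0;
    }
    off = a->used;
    memset(a->base + off, 0, size);
    a->used += size;
    return off;
}

/* Converts a valid offset into an address inside the arena. */
static void *arena_addr_get(arena_t *a, size_t off)
{
    assert(off > 0 && off < ARENA_SIZE);
    return a->base + off;
}

/* The checked pattern: only convert the offset when the allocation worked. */
static void *checked_alloc(arena_t *a, size_t size)
{
    size_t off = arena_calloc(a, size);

    return off ? arena_addr_get(a, off) : NULL;
}
```

With this shape the caller sees NULL when the segment is full and can decline to cache the entry, rather than writing through an address derived from a failed allocation.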
That is a separate bug -- which I believe has been fixed in/by 2.0.49 -- at least my test case for it no longer failed there. This bug is about the case where the physical shared memory bytes are sufficient but the specified logical cache size (i.e. # of entries) is not. In this case, the cache should simply purge older entries. Instead it crashes (attempting to do this). I've been meaning to generate a stack trace, but have not managed yet.
Created attachment 11633 [details] Add checking for NULL in *_rmm_* functions
Does this patch make any difference for you?
In util_ald_cache_insert(), it attempts to add an item to the cache. There is no check for whether the cache is full, because it is assumed that on the edge case (of the very last cache entry being allocated) util_ald_cache_purge() will run, which again is assumed to bring down the cache size. So in this case, it looks like util_ald_cache_purge() is not bringing down the cache size, so on the next entry we overflow. Try this patch and see if it makes a difference - it checks for overflow before we add, not after. The purge code is probably still broken, but at least we won't segfault.
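As an illustration of "check before we add, not after", here is a toy sketch. It is not the mod_ldap code or the attached patch: `toy_cache_t`, `toy_purge` and `toy_insert` are made-up names, and the `purge_broken` flag exists only to simulate a purge that fails to bring the entry count down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 4

/* Toy cache: a fixed array of keys, oldest first (hypothetical type). */
typedef struct {
    int keys[TOY_MAX];
    size_t numentries;
    int purge_broken;  /* non-zero simulates a purge that frees nothing */
} toy_cache_t;

/* Drops the oldest half of the entries, unless the purge is "broken". */
static void toy_purge(toy_cache_t *c)
{
    size_t drop;

    if (c->purge_broken) {
        return;
    }
    drop = c->numentries / 2;
    memmove(c->keys, c->keys + drop,
            (c->numentries - drop) * sizeof c->keys[0]);
    c->numentries -= drop;
}

/* Returns 1 if the key was cached, 0 if the cache is full and stays full. */
static int toy_insert(toy_cache_t *c, int key)
{
    if (c->numentries >= TOY_MAX) {
        toy_purge(c);
        /* Sanity check before adding: never trust the purge blindly. */
        if (c->numentries >= TOY_MAX) {
            return 0;
        }
    }
    assert(c->numentries < TOY_MAX);
    c->keys[c->numentries++] = key;
    return 1;
}
```

If the purge misbehaves, the insert is refused and the request simply goes uncached; nothing is written past the end of the table.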
Created attachment 11634 [details] Add sanity check so that we don't overflow if purge fails for any reason
Just committed the above patches to the v2.1.0-dev tree, as they stomp on the segfaults. The cache problem remains, however: if the cache sizes are set to 1, mod_auth_ldap starts returning auth failures.
I applied the patch provided to 2.0.49 sources (the latest I had readily available) and get a crash with the following traceback (on Windows). Note this was for user 2161 with a cache size of 2150. Note also that this executable includes the latest patches for util_ldap.c [for authenticated LDAP server access] and mod_auth_ldap.c [for avoiding double-escaping with Microsoft's LDAP SDK].

    util_ldap_dn_compare_node_compare(void * 0x00815b98, void * 0x04d4de80) line 91 + 12 bytes
    util_ald_cache_fetch(util_ald_cache * 0x00d8008c, void * 0x04d4de80) line 351 + 17 bytes
    util_ldap_cache_checkuserid(request_rec * 0x6fb51341, util_ldap_connection_t * 0x007dd1e8, const char * 0x0078ced0, const char * 0x007799c8, int 7991832, char * * 0x00000002, const char * 0x00000000, const char * 0x04d4def0, const char * * 0x007dee59, const char * * * 0x04d4dee4) line 766 + 22 bytes
    mod_auth_ldap_check_user_id(request_rec * 0x6ff10e5f) line 334
    ap_run_check_user_id(request_rec * 0x007dd1e8) line 69 + 31 bytes
    ap_process_request_internal(request_rec * 0x6ff0d6f8) line 193 + 6 bytes
    ap_process_request(request_rec * 0x007dd1e8) line 245
    ap_process_http_connection(conn_rec * 0x6ff0423f) line 250 + 6 bytes
    ap_run_process_connection(conn_rec * 0x007c8ab8) line 42 + 31 bytes
    ap_process_connection(conn_rec * 0x007c8ab8, void * 0x007c89e8) line 175 + 6 bytes
    worker_main(long 2013300156) line 718
    MSVCRT! 780085bc()
    KERNEL32! 7c581af6()

Once I let this process die a new child process is created and the test set (of 2500 users) works fine.

For testing this sort of thing, I recommend just exporting a single user (with password) from LDAP and using this export as a template to programmatically create many users, all with the same attributes except for the user name. You can then use a simple program, script, or even Ant to attempt to fetch an authenticated resource on behalf of each user in turn.
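For the template step, something along these lines would do. This is only a sketch under assumptions: the `@USER@` placeholder, the `testuserNNNN` naming scheme and the function name `expand_template` are all hypothetical, and the template text itself would be whatever your LDAP export looks like with the user name replaced by the placeholder.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical marker put into the exported entry where the user name was. */
#define PLACEHOLDER "@USER@"

/*
 * Copies tmpl into out, replacing every PLACEHOLDER with "testuser<n>"
 * (n zero-padded to four digits). Returns 0 on success, -1 if out is
 * too small to hold the expanded text plus its terminator.
 */
static int expand_template(char *out, size_t outlen, const char *tmpl,
                           unsigned n)
{
    char name[32];
    size_t namelen;
    size_t used = 0;
    const size_t marklen = strlen(PLACEHOLDER);
    const char *p = tmpl;

    assert(out != NULL && tmpl != NULL);
    if (outlen == 0) {
        return -1;
    }
    snprintf(name, sizeof name, "testuser%04u", n);
    namelen = strlen(name);

    while (*p != '\0') {
        if (strncmp(p, PLACEHOLDER, marklen) == 0) {
            if (used + namelen >= outlen) {
                return -1;
            }
            memcpy(out + used, name, namelen);
            used += namelen;
            p += marklen;
        }
        else {
            if (used + 1 >= outlen) {
                return -1;
            }
            out[used++] = *p++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

Calling this in a loop for n = 1 to 2500 and writing each result to a file gives an import file with one entry per generated user; the same names can then drive the client that requests the protected resource.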
Patches to fix segfaults in the cache code were applied to v2.1.0-dev and v2.0.50-dev. Testing this by reducing the cache sizes to 1 shows that the segfaults are gone, but the mod_auth_ldap module is returning an auth failure when it shouldn't, and the cache gets full and stays full. I have created a new bug report for this: 29207.

*** This bug has been marked as a duplicate of 29207 ***
> Note that the last time I tested the cache entry overflow it still
> crashed when I threw 2500 unique user login attempts at a 2150
> entry cache. This is more representative of our real use cases
> than 5 unique users against a single user entry cache or the like
> and I've not had a chance to (or much interest in) testing this
> particular case.

I've built an Apache 2.0.50 from sources for Windows (to get HTTPS support, of course, plus tiny extensions to mod_deflate and sockopt -- which is missing send-buffer-size configurability on Windows) and re-ran the test noted above. I get a 100% repeatable crash at around user 2160, i.e. the buffer overflow is *not* fixed, at least not on Windows. [I can test Solaris and AIX when I get those binaries built.] In short, this bug is *not* fixed in 2.0.50.
Created attachment 12817 [details] Fix to util_ald_cache_purge() to relink lists properly
As per the last comment, I have found the problem behind this bug: util_ald_cache_purge() simply never relinked the linked list entries during cache purge. Instead it freed various elements in the linked list without updating any linked list pointers, thus begging for trouble as the memory is reused, etc. Also, I know this has been resolved as "duplicate", but the fix I have found shows that the problem was not limited to the "duplicate" bug 29207. I am thus reopening this until someone commits my patch.
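For anyone following along, here is a small standalone sketch of what "relinking during purge" means. It is not the attached patch and does not use the mod_ldap types: `toy_node_t`, `toy_push` and `toy_purge_older_than` are illustrative names, and a plain malloc'd singly linked list stands in for one hash chain of the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One entry in a toy hash chain (hypothetical type). */
typedef struct toy_node {
    long add_time;
    struct toy_node *next;
} toy_node_t;

/* Pushes a new node at the head; returns the new head (or the old one
 * unchanged if malloc fails). */
static toy_node_t *toy_push(toy_node_t *head, long add_time)
{
    toy_node_t *n = malloc(sizeof *n);

    if (n == NULL) {
        return head;
    }
    n->add_time = add_time;
    n->next = head;
    return n;
}

/*
 * Frees every node older than marktime and returns how many were freed.
 * pp always points at the link that leads to the current node, so a
 * removed node is unlinked (*pp = p->next) before it is freed and no
 * surviving link is left pointing at freed memory.
 */
static size_t toy_purge_older_than(toy_node_t **head, long marktime)
{
    toy_node_t **pp = head;
    size_t purged = 0;

    assert(head != NULL);
    while (*pp != NULL) {
        toy_node_t *p = *pp;

        if (p->add_time < marktime) {
            *pp = p->next;  /* relink first */
            free(p);        /* then release */
            purged++;
        }
        else {
            pp = &p->next;  /* keep the node, advance the link */
        }
    }
    return purged;
}
```

Freeing a node without the `*pp = p->next` step leaves its predecessor pointing into released memory, which is consistent with a later fetch crashing while walking the chain once that memory has been reused.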
The final patch for this bug, which fixes the util_ald_cache_purge() relink problem, has been backported and posted. See dist/httpd/patches/apply_to_2.0.52.
*** Bug 29207 has been marked as a duplicate of this bug. ***