Bug 27748 - ldap auth periodically fails, requires restart
Summary: ldap auth periodically fails, requires restart
Status: CLOSED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_auth_ldap
Version: 2.0-HEAD
Hardware: Sun Solaris
Importance: P3 major
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: PatchAvailable
Duplicates: 17599 18661 21787 24595 24683 25764 27134
Depends on:
Blocks:
 
Reported: 2004-03-17 16:36 UTC by William Leumas
Modified: 2004-11-16 19:05 UTC
CC List: 8 users



Attachments
fix for ldap rebinding failures (491 bytes, patch)
2004-03-31 18:36 UTC, William Leumas
Rollup of LDAP fixes to v2.1.0 against v2.0.49 (21.21 KB, patch)
2004-05-20 22:47 UTC, Graham Leggett

Description William Leumas 2004-03-17 16:36:13 UTC
Over a few days (averaging 350k hits, of which 25k are authenticated through
auth_ldap), the server stops authenticating random users and logs this error:

[Wed Mar 17 08:40:51 2004] [warn] [client 147.178.68.203] [26904] auth_ldap
authenticate: user {username} authentication failed; URI {path} [User not
found][No such object]

It does this in the middle of a working session (i.e. the user was logged in,
clicking around, and suddenly lost access).  The only 'fix' is to restart the
web server.  I presume this is a caching issue.  We are running 2.0.49rc1.
Comment 1 William Leumas 2004-03-25 20:23:21 UTC
This is linked against OpenLDAP stable 2.1.25.  Is there a way to turn off the
cache, to see whether that is what is causing the problem?  I have written a
Perl script that watches the log for this error and immediately cross-checks
LDAP for the user; if the user exists, it restarts Apache.  We are seeing about
30 restarts a day.
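
For anyone who wants a similar watchdog, the log-matching half of that approach
might look roughly like the C sketch below. It is not the Perl script mentioned
above. It assumes the warning appears on a single line in error_log (the
wrapping in the description comes from the bug tracker), and the function name
match_ldap_lookup_failure is hypothetical. The LDAP cross-check and the restart
are left out.

```c
#include <stddef.h>
#include <string.h>

/* Markers copied from the warning quoted in the description. */
static const char *const MARK_USER   = "auth_ldap authenticate: user ";
static const char *const MARK_END    = " authentication failed";
static const char *const MARK_REASON = "[User not found][No such object]";

/*
 * Hypothetical helper, not part of httpd.
 * Returns 1 and copies the username into `user` when `line` is the
 * "User not found" auth_ldap warning; returns 0 for anything else,
 * or when the username does not fit in `user`.
 */
int match_ldap_lookup_failure(const char *line, char *user, size_t user_len)
{
    const char *start, *end;
    size_t n;

    if (line == NULL || user == NULL || user_len == 0)
        return 0;
    if (strstr(line, MARK_REASON) == NULL)
        return 0;

    start = strstr(line, MARK_USER);
    if (start == NULL)
        return 0;
    start += strlen(MARK_USER);

    /* The username runs up to " authentication failed". */
    end = strstr(start, MARK_END);
    if (end == NULL)
        return 0;

    n = (size_t)(end - start);
    if (n == 0 || n >= user_len)
        return 0;

    memcpy(user, start, n);
    user[n] = '\0';
    return 1;
}
```

A caller would feed each new log line to the function and, on a match, look the
returned username up in LDAP before deciding whether a restart is warranted.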
Comment 2 William Leumas 2004-03-31 18:36:52 UTC
Created attachment 11078 [details]
fix for ldap rebinding failures
Comment 3 William Leumas 2004-03-31 18:39:31 UTC
The problem lies in how the LDAP session is managed.  The current approach
could cause other severe problems when individual users cannot browse the tree,
and it should be reconsidered.  Kurt Olsen found this problem and came up with
a quick fix (see patch).  Note: this also relates to bug 17274.  Kurt's
description:

--------------
In the file util_ldap.c, in the function util_ldap_cache_checkuserid, the
module takes these steps when a user tries to authenticate:

1) check the cache, returning success or failure if the result is cached.
2) open a connection via the function util_ldap_connection_open, using the ldc
struct.
   if ldc->bound is 1, util_ldap_connection_open does nothing.
3) do a search to validate the username provided and locate its dn.
4) verify that the search in #3 returned exactly 1 result.
5) verify that the password is non-empty.
6) rebind with the dn found in step 3 and the password provided, using the ldc
struct.
   on failure, return failure status.
   on success, update the cache and return success status.

The problem is that the ldc used in #6 is the same ldc used to look up a user's
dn in the tree.  So if the password is incorrect, the ldap_simple_bind_s call
used to verify the password will have broken the ldc->ldap binding.
The next time this ldc struct is used, ldc->bound is still 1, but the actual
bind is no longer valid. One simple fix is to add an "ldc->bound = 0;" to the
two failure tests after the ldap_simple_bind_s call. This causes
util_ldap_connection_open to re-bind with the proper DN before looking
up users.

Even when users log in correctly, there is still a problem: once user A
authenticates, the ldc->ldap connection is bound with user A's username and
password. If user A doesn't have rights to search the tree, then the search
for user B's dn will fail when user B comes along later. The correct fix would
be to create a separate util_ldap_connection_t *foo; used only for testing the
provided passwords, so that it has no impact on the ldc struct used for
searching.

Kurt Olsen
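
To make the failure mode concrete, here is a small self-contained C model of
the stale flag. It is a toy, not the code in util_ldap.c: struct toy_conn, the
toy_* functions and every DN and password are made up for illustration. It
assumes that a failed bind leaves the session unbound and that an unbound
session cannot search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Made-up directory contents: one service account and one end user. */
#define SERVICE_DN "cn=apache,dc=example,dc=com"
#define SERVICE_PW "service-secret"
#define USER_DN    "uid=alice,dc=example,dc=com"
#define USER_PW    "alice-secret"

enum auth_result { AUTH_OK, AUTH_BAD_PASSWORD, AUTH_USER_NOT_FOUND };

/* Keeps only the two facts that matter for this bug. */
struct toy_conn {
    int bound;              /* what the module believes (like ldc->bound) */
    const char *server_dn;  /* identity the server really holds, or NULL  */
};

/* Simulated simple bind: a failed bind leaves the session unbound. */
static int toy_bind(struct toy_conn *c, const char *dn, const char *pw)
{
    int ok = (strcmp(dn, SERVICE_DN) == 0 && strcmp(pw, SERVICE_PW) == 0) ||
             (strcmp(dn, USER_DN) == 0 && strcmp(pw, USER_PW) == 0);
    c->server_dn = ok ? dn : NULL;
    return ok;
}

/* Like step 2: skip the service bind whenever the flag says "bound". */
static void toy_connection_open(struct toy_conn *c)
{
    if (c->bound)
        return;
    c->bound = toy_bind(c, SERVICE_DN, SERVICE_PW);
}

/* Like step 3: in this model only the service identity may search. */
static int toy_search(const struct toy_conn *c)
{
    return c->server_dn != NULL && strcmp(c->server_dn, SERVICE_DN) == 0;
}

/* Steps 2 to 6 for the single user "alice".
 * reset_on_failure != 0 applies the quick fix described above. */
enum auth_result toy_checkuserid(struct toy_conn *c, const char *pw,
                                 int reset_on_failure)
{
    assert(c != NULL && pw != NULL);

    toy_connection_open(c);
    if (!toy_search(c))
        return AUTH_USER_NOT_FOUND;

    if (!toy_bind(c, USER_DN, pw)) {
        if (reset_on_failure)
            c->bound = 0;   /* force a fresh service bind next time */
        return AUTH_BAD_PASSWORD;
    }
    return AUTH_OK;
}
```

Starting from a zeroed toy_conn with reset_on_failure set to 0, a wrong
password followed by the correct one yields AUTH_BAD_PASSWORD and then
AUTH_USER_NOT_FOUND, which mirrors the "[User not found]" error in the
description. With reset_on_failure set to 1, the second call returns AUTH_OK.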
Comment 4 André Malo 2004-04-03 22:41:57 UTC
Not fixed in the code yet.
Adding the PatchAvailable keyword.
Comment 5 Kurt Olsen 2004-04-15 21:36:46 UTC
Additional bugs that report this issue, some of which also include fixes:

17274
17599
18661
21787
24595
24683 (probably, commentary is old)
27134
27271

And

28413 may be the same thing, but it's not really clear except that they
experience failures against AD.

I think that the comment that a connection should be marked as unbound after any
user bind is the proper solution. The patch included in this report only marks
unbound upon auth failures. Adding an ldc->bound = 0; at line 847 in util_ldap.c
(release 2.0.49) should fix both issues I have addressed in my re-explanation of
the problem.
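
The difference between resetting only on failure and resetting after any user
bind can be sketched with a toy model in C. This is not the util_ldap.c code:
the types and the toy_auth function are invented for illustration, the
password check is reduced to a flag, and the model assumes that only the
service identity may search the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Who the server-side session is currently bound as. */
enum identity { ID_NONE, ID_SERVICE, ID_USER };

struct toy_conn {
    int bound;           /* module-side flag, like ldc->bound */
    enum identity who;   /* server-side truth                 */
};

/* The service bind happens only when the flag is clear. */
static void toy_open(struct toy_conn *c)
{
    if (!c->bound) {
        c->who = ID_SERVICE;
        c->bound = 1;
    }
}

/*
 * One authentication attempt.
 * Returns 1 on success, 0 on a bad password, -1 when the dn search fails.
 * password_ok stands in for the real credential check.
 * always_reset == 0: clear the flag only after a failed user bind.
 * always_reset != 0: clear the flag after every user bind.
 */
int toy_auth(struct toy_conn *c, int password_ok, int always_reset)
{
    assert(c != NULL);

    toy_open(c);
    if (c->who != ID_SERVICE)   /* assumption: users cannot search */
        return -1;

    /* The user bind replaces the session identity either way. */
    c->who = password_ok ? ID_USER : ID_NONE;

    if (always_reset || !password_ok)
        c->bound = 0;           /* next request re-binds as the service */

    return password_ok ? 1 : 0;
}
```

With always_reset set to 0, a successful login is followed by a search failure
(-1) on the next request, because the session is still bound as the previous
user. With always_reset set to 1, every request starts from a fresh service
bind, at the cost of one extra bind per authentication.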
Comment 6 Graham Leggett 2004-05-20 22:37:01 UTC
The attached patch has been committed to v2.1.0-dev, and the attachment is a
version of it against v2.0.49.

Please test and tell me whether this fixes the problem.
Comment 7 Graham Leggett 2004-05-20 22:47:43 UTC
Created attachment 11618 [details]
Rollup of LDAP fixes to v2.1.0 against v2.0.49
Comment 8 Graham Leggett 2004-05-20 22:49:08 UTC
The attachment includes bnicholes' fix:

    *) mod_ldap calls ldap_simple_bind_s() to validate the user
       credentials.  If the bind fails, the connection is left
       in an unbound state.  Make sure that the ldap connection
       record is updated to show that the connection is no longer
       bound.
Comment 9 Graham Leggett 2004-05-21 01:17:11 UTC
*** Bug 25764 has been marked as a duplicate of this bug. ***
Comment 10 Graham Leggett 2004-05-21 14:20:10 UTC
*** Bug 17599 has been marked as a duplicate of this bug. ***
Comment 11 Graham Leggett 2004-05-21 14:49:38 UTC
*** Bug 21787 has been marked as a duplicate of this bug. ***
Comment 12 Graham Leggett 2004-05-21 14:50:08 UTC
*** Bug 24595 has been marked as a duplicate of this bug. ***
Comment 13 Graham Leggett 2004-05-21 15:30:29 UTC
*** Bug 27134 has been marked as a duplicate of this bug. ***
Comment 14 Graham Leggett 2004-05-21 15:51:29 UTC
*** Bug 24683 has been marked as a duplicate of this bug. ***
Comment 15 Graham Leggett 2004-05-21 16:58:11 UTC
*** Bug 18661 has been marked as a duplicate of this bug. ***
Comment 16 Graham Leggett 2004-05-21 23:18:46 UTC
Fixed in v2.0.50-dev.
Comment 17 Albert Lunde 2004-05-24 20:43:36 UTC
I repeated the test set-up I had been using under bug 27134, this time with the
roll-up patch 11618 from bug 27748. This was on Red Hat Linux 9.0, building
Apache from patched 2.0.49 sources (not Red Hat sources).

This uses two test data sets with 11 valid username/password pairs and some
pseudo-random failures. One data set walks through the usernames in nearly
serial order (because this will tend to show the worst-case usage of the
connection pool). This makes 103 requests. The other data set uses a more random
series of usernames. This makes 804 requests.

The results look good.

I'm now getting no unexpected authentication results, and socket usage looks
similar to what I saw with Denis Gervalle's previous patch.

I still have the warning "LDAP cache: Unable to init Shared Cache: no file",
but I suppose that's a different issue.

I did the tests first with the default settings of

StartServers         5
MinSpareServers      5
MaxSpareServers     10
MaxClients         150
MaxRequestsPerChild  0

For comparison, I set up a low process number test with:

StartServers         1
MinSpareServers      1
MaxSpareServers      1
MaxClients         150
MaxRequestsPerChild  0

and high process number test with:

StartServers         10
MinSpareServers      10
MaxSpareServers     20
MaxClients         150
MaxRequestsPerChild  0

All the tests give correct results (authentication works or fails as expected).

I looked at sockets in use with "netstat -an" on the LDAP server.

With the default prefork process config:
the serial data set left 9 sockets to the LDAP server in use at the end;
the random data set left 4 sockets in use at the end

With the "low process" config:
the serial data set left 1 socket in use at the end;
the random data set left 0 sockets in use at the end

With the "high process" config:
the serial data set left 14 sockets in use at the end;
the random data set left 11 sockets in use at the end

I'm guessing that if I could get rid of the "Unable to init Shared Cache"
warning I'd get results more like the "low process" config. Can anyone suggest
another fix/bug that applies to that issue?