Bug 34332 - apr_socket_send() gives up too easily
Summary: apr_socket_send() gives up too easily
Status: CLOSED FIXED
Alias: None
Product: APR
Classification: Unclassified
Component: APR (show other bugs)
Version: 0.9.6
Hardware: Other other
: P2 major (vote)
Target Milestone: ---
Assignee: Apache Portable Runtime bugs mailinglist
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-04-06 18:15 UTC by Al Begley
Modified: 2014-02-17 13:55 UTC (History)
1 user (show)



Attachments
try again (1.32 KB, patch)
2005-05-24 02:16 UTC, Wilfredo Sanchez
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Al Begley 2005-04-06 18:15:02 UTC
The apr_socket_send() function in the UNIX version of sendrecv.c does a write() loop on the socket. If the 
write fails with EAGAIN, it does an apr_wait_for_io_or_timeout() and tries the write loop once more. If that 
write also fails with EAGAIN, it returns to the caller, instead of continuing to try.

The result is that an HTTP GET may abort prematurely, and only return the first 64K of a file.
Comment 1 Paul Querna 2005-04-06 18:18:58 UTC
Moving to the APR project.
Comment 2 Paul Querna 2005-04-06 18:22:13 UTC
I believe that apr_socket_send is documented to return EAGAIN and that the
caller is responsible for calling it again.

An alternative would be to add a apr_socket_sendfull() that would handle EAGAIN
and keep trying to send in all cases.
Comment 3 Joe Orton 2005-04-13 16:04:47 UTC
Per dev@httpd discussion, that is by design; if select/poll indicates
writability and the subsequence write() call fails with EAGAIN there is some
other problem.  Paul mentioned it may be a Darwin issue....

(I'd also check it's not another issue with the
O_NONBLOCK-inheritance-over-accept-detection too)
Comment 4 Al Begley 2005-05-06 15:10:37 UTC
To help explore the possibility of a Darwin issue ... Current probes indicate that no code is waiting in 
select or poll, or that poll/select/etc are returning right away indicating that the socket is ready for 
writing. What mechanism does apache2 use to determine when a socket is writable? 
Comment 5 Joe Orton 2005-05-06 15:44:06 UTC
It uses poll() if available or select() otherwise.  I don't know which it is on
Darwin.

Jeff commented on dev@httpd:

> Another thing to check is if apr_poll() is telling the I/O routine
> that data is ready when in fact it is not.  I recall some recent
> complaints about APR using poll() on OS X 10.3, where poll() has some
> negative attributes (I don't recall details).

You need to answer some of these question before we can help you any more on
this.  Specifically have you checked whether this is an O_NONBLOCK-inheritance
issue?  What's the output from running:

http://people.apache.org/~jorton/nonblock.c
Comment 6 Al Begley 2005-05-06 16:19:43 UTC
The output is:

bound to 0.0.0.0:53696
Comment 7 Joe Orton 2005-05-09 16:14:13 UTC
That means that APR has detected that O_NONBLOCK is inherited across accept(),
which is expected for a BSDish platform.

So, the other question: which is used by APR on Darwin, poll or select?  Have
you tried forcing the decision one way or the other?
Comment 8 Al Begley 2005-05-09 20:05:33 UTC
It appears APR prefers poll() on Mac OS X 10.4 (note difference from Mac OS X 10.3, where bug 29985 
was encountered). Summary of poll-related log messages configuring Apache 2.0.53 on Mac OS X 10.4 
Server:

    checking for poll... yes
    checking poll.h usability... yes
    checking poll.h presence... yes
    checking for poll.h... yes
    checking sys/poll.h usability... yes
    checking sys/poll.h presence... yes
    checking for sys/poll.h... yes
    checking for POLLIN in poll.h sys/poll.h... yes

I'm not certain this is the best way to force it not to use poll(), but after running ./configure I set the 
following in httpd-2.0.53/srclib/apr/include/arch/unix/apr_private.h:

    #define HAVE_POLL 0
    #define HAVE_POLLIN 0
    #define HAVE_POLL_H 0

Built this way, apache2 still behaves the same way:

y:/tmp al$ curl -o 1.2m.html http://xx.apple.com:8080/1.2M.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  4 1280k    4 65536    0     0   4339      0  0:05:02  0:00:15  0:04:47     0
curl: (18) transfer closed with 1245184 bytes remaining to read
Comment 9 Wilfredo Sanchez 2005-05-23 21:42:20 UTC
From a kernel developer:

Select does not guarantee writability; merely that at the time that the select wakeup was issued (not the 
time the caller runs) the socket was writable.  There are many reasons that the socket may become 
unwritable again (resource shortages, competing access), and the caller should simply accept this and re-
select.

This implies we aren't doing the right thing in APR.
Comment 10 Jeff Trawick 2005-05-23 22:17:39 UTC
>There are many reasons that the socket may become
>unwritable again (resource shortages, competing access),

Competing access isn't an issue here.  There is one thread checking for
writability, and no other threads will consume the writeability characteristic.

It seems very odd that an obscure problem such as a resource shortage (lack of
mbufs ?) is the explanation for what is going on, since since this problem is so
easily recreatable for you.

Is there any way to get a syscall trace of the httpd process during the failure
scenario so we can see this explicitly?
Comment 11 Wilfredo Sanchez 2005-05-24 02:09:20 UTC
I'm not sure that resource contention is in fact a problem; the point was that getting EWOULDBLOCK 
(EAGAIN) on write after poll() tells you that it's writeable is valid behavior and that we should probably deal 
with that case by continuing to retry rather than giving up and sending the errno up to the caller.

In this case, httpd isn't aware that it's dealing with a non-blocking socket, so that it can get the EAGAIN 
seems like a bug in APR.  I worked around this (in r160348 on httpd HEAD) by having httpd retry the write 
if it gets EAGAIN, but I think that the fix probably should move down to APR, since httpd shouldn't have to 
anticipate an EAGAIN from APR.
Comment 12 Wilfredo Sanchez 2005-05-24 02:16:40 UTC
Created attachment 15135 [details]
try again

Attaching Al's patch to APR for reference.  Rather than trying one, and then if
failure, try one more time, just keep trying.
Comment 13 Wilfredo Sanchez 2005-05-24 02:21:36 UTC
The above patch would have to be applied to the other functions as well, if it's appropriate.  The question 
is whether apr_socket_send() and friends are supposed to yiled EAGAIN to the caller if the caller is doing 
blocking I/O.
Comment 14 Joe Orton 2005-05-24 23:09:22 UTC
1) using "#define HAVE_POLL 0" in apr_private.h will not force use of select;
it's a defined-or-not constant, not a defined-to-1-or-0 constant.

I'd really like to see conclusive results of using select() over poll() first. 
There are many independent reports that poll() is broken (not to mention
unsupported) in Darwin, e.g. http://curl.haxx.se/mail/lib-2005-05/0122.html

Preferable way to force use of select is:

export ac_cv_func_poll=no
./configure ...

(and check apr_private.h does *not* define HAVE_POLL at all after that)

2) echoing other comment: our kernel guys here agree that this kind of situation
*can* occur in rare circumstance (e.g. memory pressure) etc with other kernels
also.  That does not seem to be what is happening here; but it seems APR should
indeed cope with it in _recv() and _send() (and ...) at least.  A loop would be
better than a goto, of course!
Comment 15 Al Begley 2005-05-25 01:24:18 UTC
OK, I configured Apache 2.0.54 as Joe suggested, and verified that the unix version of apr_private.h does 
not define HAVE_POLL. Built that way, Apache no longer encounters the problem on Mac OS X 10.4. I was 
able to download not only the 1.2M file but also a roughly 2.3G file without incident:

[y:/tmp] albegley% curl -o 2.3G.html http://xx.apple.com:8080/2.3G.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2465M  100 2465M    0     0  10.0M      0  0:04:04  0:04:04 --:--:-- 10.4M
Comment 16 Joe Orton 2005-05-25 08:28:49 UTC
OK, I checked in the fix to prevent use of poll() on Darwin too.  Thanks for
investigating this.

http://svn.apache.org/viewcvs?rev=178386&view=rev
Comment 17 Wilfredo Sanchez 2005-05-25 22:42:34 UTC
Joe, I fixed APR's usage of poll/select earlier yesterday (http://svn.apache.org/viewcvs.cgi?
view=rev&rev=178340) which removes the need to disable poll() on Darwin.  Anyway, it at least makes Al's 
problem go away.

The POLLIN/POLLHUP issue mentioned in the curl list message doesn't seem to be related...

The upshot being that disabling poll() may not be necessary.