Bug 42753 - Race condition when using available() or reading in CometProcessor.event()
Summary: Race condition when using available() or reading in CometProcessor.event()
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 6
Classification: Unclassified
Component: Catalina (show other bugs)
Version: unspecified
Hardware: All other
: P1 normal (vote)
Target Milestone: default
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-06-27 05:21 UTC by Sebastiaan van Erk
Modified: 2007-06-29 06:31 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Sebastiaan van Erk 2007-06-27 05:21:53 UTC
Tomcat version: 6.0.13 (no 6.0.13 available in the version field)

When trying to process data in CometProcessor.event() either during the BEGIN
event or during the READ event, it is impossible to guarantee that all data has
been read before returning (since new data may arrive between the last statement
before the return, and the return statement itself). However, the
CoyoteProcessor does this check:

 } else if (!error && read && request.getAvailable()) {
    // If this was a read and not all bytes have been read, or if no data
    // was read from the connector, then it is an error
    error = true;
    log.error(sm.getString("coyoteAdapter.read"));
 } 

causing a severe error and the Comet request to break when this race condition
manifests itself.

I am using the following code according to the aio.html description to read data
in my comet event handler:

  while (request.getInputStream().available() > 0) {
     // read some data
  }
  // ***** NO MORE DATA AVAILABLE
  return;

And I am experiencing the error as described in certain conditions (rapid
arrival of fragmented data):

SEVERE: The servlet did not read all available bytes during the processing of
the read event
Comment 1 Remy Maucherat 2007-06-27 11:00:55 UTC
available() does not make any socket access to see if bytes have arrived, so the
scenario you describe cannot happen (if available returns 0, it will continue
returning 0 until the next read event). If the message you saw is logged,
CometAdapter.event will return false, which will lead to closing the connection,
not causing a loop.

My only theory is that you're doing asynchronous reads, which is not allowed,
and you'll need to look into this further.

What does "(no 6.0.13 available in the version field)" mean ?
Comment 2 Sebastiaan van Erk 2007-06-27 12:35:36 UTC
I have very explicit logging and I'm 100% certain that I'm not doing any async
reads. In fact, from the log I sent to the user list before reporting this as a
bug you can see that everything is happening in 1 thread, namely http-8080-exec-4:

2007-06-26 14:37:08,427 DEBUG [http-8080-exec-4]
com.sebster.myservlet.TomcatCometServlet - BEGIN event for request
org.apache.catalina.connector.RequestFacade@d47c99
2007-06-26 14:37:08,427 DEBUG [http-8080-exec-4]
com.sebster.myservlet.TomcatCometServlet - 127.0.0.1:60578 POST /mycometservlet
2007-06-26 14:37:08,432 DEBUG [http-8080-exec-4]
com.sebster.myservlet.TomcatCometServlet -
[24224039-a37e-40d0-a076-89d1df363390] read loop in BEGIN event, input stream
data available: 1
2007-06-26 14:37:08,438 DEBUG [http-8080-exec-4]
com.sebster.myservlet.TomcatCometServlet -
[24224039-a37e-40d0-a076-89d1df363390] read loop done, input stream data
available: 0
2007-06-26 2:37:08.MD org.apache.catalina.connector.CoyoteAdapter event
SEVERE: The servlet did not read all available bytes during the processing of
the read event

Note that I'm not implying that available() does a socket access; what I saw was
that when the loop I described in the original post ended the condition
"available() == 0" was true, and when it got to CatalinaAdapter the condition
Request.getAvailable() was true.

Futhermore, I really am seeing the loop without the connection being closed,
both on the windows platform and on the Linux platform. I'm happy to help debug
this further, but I'm not making it up: I have a breakpoint in my
CometProcessor.event() method, I see the Poller thread spin like crazy, and when
I set a break point in the Poller the CPU load goes away. I can step through it
and see that the connection stays open, it sets the event type to END on the
comet event, but it never reaches my CometProcessor.event() method.

Finally, what I mean with "no 6.0.13 available in the version field" is the
dropdown box in which to mark the Tomcat version in the bugzilla bug form. It
goes up to and including 6.0.11.

Regards,
Sebastiaan
Comment 3 Remy Maucherat 2007-06-27 14:47:18 UTC
Since you're apparently not willing to write things that make sense, I will
ignore your report.
Comment 4 Sebastiaan van Erk 2007-06-27 15:33:55 UTC
(In reply to comment #3)
> Since you're apparently not willing to write things that make sense, I will
> ignore your report.

I am getting really exasperated by your replies. You give me no hint at what it
is you want or what you think is not making sense. I will test your theory more
by writing a wrapper around the input stream to log the thread of every read
that happens just to make sure tomorrow.

In either case, after I got the error the Poller went into a busy loop. I'm not
smoking pot and I saw what I saw. Ok, maybe there's a bug somewhere in my code
and I'm doing something wrong, I certainly don't want to exclude that
possibility. But I saw the Poller go into an infinite loop (not caused by the
NIO bug I was talking about previously (which is Linux only)). It did NOT close
the channel. It did NOT call my event method with a READ, ERROR or END event
(which I know because I have a log.debug on enter, AND a break point set). This
happened directly AFTER I got the SEVERE log message. I got this on Windows and
Linux. Note that I'm not theorizing now, these are the raw observations.
Comment 5 Filip Hanik 2007-06-27 16:04:07 UTC
If you follow the code logic, if the error happens, where available() returns >0
after a READ has been invoked, the connection is never marked as a comet, so you
will not get the ERROR/END events invoked. so the connection goes into the
poller for the next request.
Could you see if you can work up a reproducible scenario, and we can take it
from there.

Comment 6 Sebastiaan van Erk 2007-06-28 02:47:50 UTC
(In reply to comment #5)
> If you follow the code logic, if the error happens, where available() returns >0
> after a READ has been invoked, the connection is never marked as a comet, so you
> will not get the ERROR/END events invoked. so the connection goes into the
> poller for the next request.
> Could you see if you can work up a reproducible scenario, and we can take it
> from there.

When I look in the code I see the following:

                if (response.isClosed() || !request.isComet()) {
                    res.action(ActionCode.ACTION_COMET_END, null);
                } else if (!error && read && request.getAvailable()) {
                    // If this was a read and not all bytes have been read, or
if no data
                    // was read from the connector, then it is an error
                    error = true;
                    log.error(sm.getString("coyoteAdapter.read"));
                }
                return (!error);

So when you get the error, the condition response.isClosed() ||
!request.isComet() must evaluate to false, which means that !response.isClosed()
&& request.isComet(). So for me to get the error it seems the request must
already be marked as a comet request.

I am not able to give you a small test case. The problem is that I have two
different products that start an embedded tomcat using exactly the same class,
and in one it consistently works (no errors whatsoever), and in the other it
consistently fails. I have not seen this issue before (and I have already done a
lot of testing in different situations), and I have no idea what the
precondition is for this issue to manifest itself.

However, I added extra debugging information, and I did find something strange.
I do the following logging FIRST thing in my Comet processor and I wrapped the
input stream to log which thread does the reading:

  final EventType eventType = event.getEventType();
  if (Debug.ENABLED && logger.isDebugEnabled()) {
    try {
      final Field field = request.getClass().getDeclaredField("request");
//$NON-NLS-1$
      field.setAccessible(true);
      final Request internalRequest = (Request) field.get(request);
      logger.debug("{} event for request facade {}, request {}", new Object[] {
eventType, request, internalRequest }); //$NON-NLS-1$
    } catch (final Exception e) {
      logger.debug("{} event for request facade {}", eventType, request);
//$NON-NLS-1$
      logger.debug("failed to get request", e); //$NON-NLS-1$
    }
  }

I set a breakpoint in the CoyoteAdapter where it gives the SEVERE error, looked
at the id of the request object, and checked my log what events I got for that
object.

My results where as follows:

2007-06-28 11:16:00,021 DEBUG [http-8080-exec-4]
com.sebster.myservlet.MyCometServlet - BEGIN event for request facade
org.apache.catalina.connector.RequestFacade@1dc696e, request
org.apache.catalina.connector.Request@18dbef1

This was the ONLY line in the log with Request@18dbef1. All the reads in this
BEGIN event happen on the http-8080-exec-4 thread. After returning from the read
(due to available() == 0), it immediately stops in my breakpoint in
CoyoteAdapter to log the SEVERE error. Note that I first set the break point to
find the id of the Request object to look for in the log.

The strange thing is, that the breakpoint happens in the http-8080-exec-5
thread. The comet field in the Request@1dc696e object is set to true, and
request.getAvailable() also returns true.

These are my observations so far. Note that I am no expert on Tomcat internals,
but I am very willing to spend time on this issue to resolve it, so any tips,
suggestions, questions, or assignments :-) are welcome!

To summarize:
1) I get a BEGIN event for Request@1dc696e in http-8080-exec-4
2) I read until available() == 0 in http-8080-exec-4
3) I return from the event() method in http-8080-exec-4
4) Tomcat logs a SEVERE error stating that I have not read all data
     * The request object is Request@1dc696e
     * request.isComet() == true
     * request.getAvailable() == true
     * the thread is http-8080-exec-5
     * at this point there are zero log lines for thread http-8080-exec-5

Regards,
Sebastiaan






Comment 7 Sebastiaan van Erk 2007-06-28 03:20:08 UTC
I did some additional debugging on the Poller loop that occurs after the SEVERE
error. This is what happens:

1) the CoyoteAdapter.event() method gets called.
2) request.read() returns false
3) the condition status == SocketStatus.STOP evaluates to true
4) the following statements are executed
    request.getEvent().setEventType(CometEvent.EventType.END);
    request.getEvent().setEventSubType(CometEvent.EventSubType.SERVER_SHUTDOWN);
5) in the following statement getFirst() returns the "basic" field, a
StandardEngineValve:
    connector.getContainer().getPipeline().getFirst().event(request, response,
request.getEvent());
6) my CometProcessor.event() method is NOT called
7) error = false, and the event() method returns true

This loop repeats ad infinitum, and consumes 100% CPU. Note that no code of my
own is called in this sequence.

Regards,
Sebastiaan
Comment 8 Filip Hanik 2007-06-28 11:34:50 UTC
I just wrote a simple test client that doesn't read the contents and I
experienced no loop. END nor ERROR got called, so there is a problem here, the
container should call one of those methods.
are you able to test your solution against the latest
tc6.0.x branch as well as the trunk branch?
Comment 9 Filip Hanik 2007-06-28 11:46:06 UTC
(In reply to comment #8)
> I just wrote a simple test client that doesn't read the contents and I
> experienced no loop. END nor ERROR got called, so there is a problem here, the
> container should call one of those methods.
> are you able to test your solution against the latest
> tc6.0.x branch as well as the trunk branch?

A patch for the END/ERROR never being called has been submitted to both 6.0.x
and trunk. This should at least give you a chance to cleanup, 

your original problem of available()==0 but there yet being data, causing an
infinite loop is to me non reproducible.

I'll wait for your feedback before marking this fixed.
Comment 10 Sebastiaan van Erk 2007-06-28 12:11:47 UTC
(In reply to comment #9)

> A patch for the END/ERROR never being called has been submitted to both 6.0.x
> and trunk. This should at least give you a chance to cleanup, 

Thanks :-), I will test this ASAP.

> your original problem of available()==0 but there yet being data, causing an
> infinite loop is to me non reproducible.

The main clue (as far as I can tell) of something going wrong is the fact that
the BEGIN event + read loop is in 1 thread, and the error in happens another
(see the summary of comment 6). Do you not agree that this should not happen? If
you agree that this is wrong (whatever the cause, Tomcat or my code), I can try
and debug what causes this to happen.

Regards,
Sebastiaan

Comment 11 Filip Hanik 2007-06-28 12:32:50 UTC
Once the socket goes back to the poller, and the poller dispatches it again,
then that can be on a different thread, as tomcat uses a thread pool, so there
is no guarantee what thread the poller dispatches too
Comment 12 Sebastiaan van Erk 2007-06-28 12:56:51 UTC
(In reply to comment #11)
> Once the socket goes back to the poller, and the poller dispatches it again,
> then that can be on a different thread, as tomcat uses a thread pool, so there
> is no guarantee what thread the poller dispatches too

Yes I understand that, but for the specific Request object instance I get only 1
comet event, in thread http-8080-exec-4 and the check of Request.getAvailable()
to see if I read everything happens in http-8080-exec-5 (with no other comet
events in between - I ONLY get the BEGIN event for this Request instance). Is
this not strange? Should the check not be performed in the same thread before
the socket goes back to the poller?
Comment 13 Filip Hanik 2007-06-28 13:37:57 UTC
not really, getAvailable() is only called if read() returned true.
in your case, the following might happen
1. HTTP request comes in, no body
2. You get BEGIN event, no additional data on the socket
3. read() returns false, so you dont get an immediate READ event,



Comment 14 Sebastiaan van Erk 2007-06-29 03:13:28 UTC
I have found the cause of the problem.

The code place where it goes wrong is ApplicationFilterFactory line 126:

            Request req = (Request) request;
            if (Globals.IS_SECURITY_ENABLED) {
                // Security: Do not recycle
                filterChain = new ApplicationFilterChain();
            } else {
                filterChain = (ApplicationFilterChain) req.getFilterChain();
                if (filterChain == null) {
                    filterChain = new ApplicationFilterChain();
                    req.setFilterChain(filterChain);
                }
            }
            comet = req.isComet();

When a security manager is installed the filter chain is not set on the request.

On subsequent Comet events, since the filter chain is null, the event is not
handled properly and the poller goes into a busy loop.



Comment 15 Filip Hanik 2007-06-29 06:05:37 UTC
Excellent, we will fix this!
Comment 16 Remy Maucherat 2007-06-29 06:10:40 UTC
The "issue" in this bug report is that the user claims that the value of
"available()" randomly changes. This minor glitch is unrelated.
Comment 17 Sebastiaan van Erk 2007-06-29 06:25:25 UTC
(In reply to comment #16)
> The "issue" in this bug report is that the user claims that the value of
> "available()" randomly changes. This minor glitch is unrelated.

The issue was that Tomcat reported that I had not read all available bytes even
though I did. The reason turns out to be because Tomcat was not calling the
event method of my comet processor, which entails that indeed bytes were not
being read.

My original issue is thereby resolved and the summary of this bug turns out to
be wrong. It just looked to me like there was a race condition, since I had a
breakpoint in event() and it was not being called anymore, so I thought
(erroneously) that it was complaining at the end of the BEGIN event.

So, as far as I'm concerned this issue can be closed. I don't care if you mark
it INVALID if that's what you want because the summary turns out to be
incorrect; that's fine by me.

I'm happy that the issue is resolved and the bug is fixed. :-)
Comment 18 Sebastiaan van Erk 2007-06-29 06:27:07 UTC
(In reply to comment #15)
> Excellent, we will fix this!

Thanks a lot. :-) Sorry for my initially incorrect analysis of the problem. I
should stick to reporting observed behavior instead of theorizing.