Bug 50957 - Blocking IO can serve wrong response data
Summary: Blocking IO can serve wrong response data
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 7
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 7.0.8
Hardware: PC All
: P2 major with 1 vote (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-22 18:09 UTC by Brad Plies
Modified: 2014-02-17 13:47 UTC (History)
1 user (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Brad Plies 2011-03-22 18:09:35 UTC
Tomcat: 7.0.8
OS:  Windows 2008 Server (x64)

Compression enabled on both HTTP and HTTPS connectors.

When using standard blocking IO HTTP/1.1 over a period of 2+ days under moderate traffic, Tomcat 7 will begin to occasionally serve the incorrect responses for a request.

Incorrect responses span a variety behaviors
* Serving a javascript (.js) file instead of HTML or Image
* Serving the wrong image
* Serving a .css file instead of HTML, javascript file, or Image

This behavior has been observed in both HTTP & HTTPS.

Another observation is that if the wrong response is served, the response is at least for something that was also requested and not something random.


Once the behavior was detected, I used WebScarab as a proxy to monitor the entire Browser <--> Tomcat conversation.  I was able to confirm a scenario like the following:
--------
ImageA.gif
ImageB.gif

Each has different file size, ETAG, etc.

GET ImageB.gif returned ImageA.gif (with ImageA.gif's ETAG, headers, binary content, and content size)
-------

This report corroborates another report discovered in the wild:
http://www.apacheserver.net/q11460/Tomcat-serving-wrong-content

In the above report, the person requested a page and received an image.

No reproduction steps have been identified to trigger this condition.  The only thing the above reporter & I noticed is that Tomcat needs to be up for a while and serve traffic throughout that period.  Once the condition has been established it is still seemingly random when it occurs.

Whenever it begins to happen, however, disable caching & keep reloading an image-heavy page until you see obvious misplaced images or other things. Using a debugging proxy will allow you to verify that incorrect content is served.

Switching to NIO appears to bypass the problem or at least NIO does not exhibit the same behavior.  As the other reporter had noted, restarting Tomcat will clear the condition.  

My hypothesis is that there is some condition within Tomcat & Blocking IO where a series of (pipelined?) requests (images, css, js) spawned from a page can become confused resulting in swapped responses (A <--> B) or duplicated responses (A & A).  Maybe some output buffer is getting clobbered.  Based upon Http11Processor, pipelined requests are constrained to the same Socket so I do not believe there is a risk of swapping responses across users.  It is unknown if this bug could be deliberately exploited in some way.


Needless to say, a server that is serving incorrect content justifies a very serious severity rating. I would not recommend using BIO in any Production environment for the time being.
Comment 1 Mark Thomas 2011-03-22 18:45:08 UTC
Experience has shown that most instances of this type of error are triggered by application bugs rather than Tomcat bugs - usually in the form of retaining and re-using a reference to the request or response object. One way to test this is to set the system property org.apache.catalina.connector.RECYCLE_FACADES to true. If you see NPEs then that is indicative of an application bug.

Something else that would be very useful since you are able to reproduce this (albeit rarely) is a Wireshark trace of a connection where this occurred. If it is a Tomcat bug it is likely to be triggered by an error condition in a previous request on the connection.

It is worth updating to the latest 7.0.x in case you are seeing a variation of bug 50189.

Ultimately, without more information that points to a Tomcat bug this is going to get resolved as WORKSFORME.
Comment 2 Brad Plies 2011-03-22 18:51:58 UTC
Found a few other somewhat related threads

"Response Swapping"
http://marc.info/?l=tomcat-user&m=130048233431588&w=2
mod_jk & Tomcat 5.5.30


http://marc.info/?l=tomcat-user&m=129796706820107&w=4
Tomcat 7 BIO Connector sockets are queued
Implies a difference in BIO between Tomcat 6 & 7 and may explain why I've never seen this bug in Tomcat 6.
Comment 3 Brad Plies 2011-03-22 19:17:07 UTC
(In reply to comment #1)
> Experience has shown that most instances of this type of error are triggered by
> application bugs rather than Tomcat bugs - usually in the form of retaining and
> re-using a reference to the request or response object. One way to test this is
> to set the system property org.apache.catalina.connector.RECYCLE_FACADES to
> true. If you see NPEs then that is indicative of an application bug.
 
If it were true that this could be caused by application references to request & response objects, that may not explain why changing to NIO would have any different behavior.  Also, why would it take an amount of time before exhibiting the behavior?  If an application did in fact do this, one would expect a higher occurance rate.

In case I am unable to locate examples of these past instances you describe, could you provide a few that you know of?  That way I can do a better job matching characteristics and symptoms.

I will have to evaluate some code to see if any references to request or response object are being held anywhere.  I would like to try the RECYCLE_FACADES recommendation but will not be able to put BIO back into the environment where it was detected.


> It is worth updating to the latest 7.0.x in case you are seeing a variation of
> bug 50189.

50189 differs from this submission in that it relied on AJP and possessed zero-length messages as symptoms.  By contrast, this submission is using BIO and not using AJP and positive-length messages are delivered just with the wrong payloads.  

I cannot yet say at this point if application code is reading from a request after the response outputstream is closed.  I really doubt it but I'll look anyway.
Comment 4 Brad Plies 2011-03-22 19:25:29 UTC
(In reply to comment #1)
> Experience has shown that most instances of this type of error are triggered by
> application bugs rather than Tomcat bugs - usually in the form of retaining and
> re-using a reference to the request or response object. One way to test this is
> to set the system property org.apache.catalina.connector.RECYCLE_FACADES to
> true. If you see NPEs then that is indicative of an application bug.

I forgot to mention that this same application has run on Tomcat 6 BIO for years without this problem.  It only surfaced once we ran the app in Tomcat 7 BIO.
Comment 5 Mark Thomas 2011-03-22 19:31:20 UTC
That BIO worked in 6 but has the error in 7 does point to this being a Tomcat bug.

I've checked the request queuing code and that looks OK. However, a larger change was the async support. I have found one area where I have some suspicions but I need to do more research to see if there is a bug there or not.

While bug 50189 was reported against AJP, the fix was made for all connectors. There may also have been issues with HTTP although I don't recall any reports. However, the code was the same in 6.0.x so that tends to rule that out as a possibility.

Getting ahead of myself I know, but if my suspicions prove to be correct, are you able to test BIO and a potentially fixed 7.0.x in the environment where you saw this error?
Comment 6 Brad Plies 2011-03-22 20:19:48 UTC
(In reply to comment #5)
> Getting ahead of myself I know, but if my suspicions prove to be correct, are
> you able to test BIO and a potentially fixed 7.0.x in the environment where you
> saw this error?

Thanks for the reply.  As much as I'd love to continue to help the ASF & everyone else with this I will unfortunately be unable to test the potential fix on that environment.  We appear to have a functioning workaround (NIO) and it must remain that way.  

The best I can offer is to try some simulations of the potential fix using the same application in another environment which will not match the same uptime or traffic levels - and may not trigger the condition at all.

As for resolution of this issue either you or someone else will manage to find more suspicious code, other users will report "me too", or it will be fixed in relation to some other issue.
Comment 7 Konstantin Kolinko 2011-03-22 20:55:21 UTC
(In reply to comment #0)
> Tomcat: 7.0.8
> OS:  Windows 2008 Server (x64)
> 
> Compression enabled on both HTTP and HTTPS connectors.
> 
> (...)
> Once the behavior was detected, I used WebScarab as a proxy to monitor the
> entire Browser <--> Tomcat conversation.  I was able to confirm a scenario like
> the following:
> --------
> ImageA.gif
> ImageB.gif
> 
> Each has different file size, ETAG, etc.
> 
> GET ImageB.gif returned ImageA.gif (with ImageA.gif's ETAG, headers, binary
> content, and content size)
> -------

Do you know/remember whether GET ImageA.gif returned ImageA.gif as well, whether either of them was compressed, and whether the requests were from the same client? Do you know the size of those files?

(gif files are not in AbstractHttp11Processor .compressableMimeTypes by default, so I think that they should have not be compressed)

What were your compression settings? Just compression="on"?
Comment 8 Brad Plies 2011-03-23 01:01:58 UTC
(In reply to comment #7)
> (In reply to comment #0)
> > Tomcat: 7.0.8
> > OS:  Windows 2008 Server (x64)
> > 
> > Compression enabled on both HTTP and HTTPS connectors.
> > 
> > (...)
> > Once the behavior was detected, I used WebScarab as a proxy to monitor the
> > entire Browser <--> Tomcat conversation.  I was able to confirm a scenario like
> > the following:
> > --------
> > ImageA.gif
> > ImageB.gif
> > 
> > Each has different file size, ETAG, etc.
> > 
> > GET ImageB.gif returned ImageA.gif (with ImageA.gif's ETAG, headers, binary
> > content, and content size)
> > -------
> 
> Do you know/remember whether GET ImageA.gif returned ImageA.gif as well,
> whether either of them was compressed, and whether the requests were from the
> same client? Do you know the size of those files?
> 
> (gif files are not in AbstractHttp11Processor .compressableMimeTypes by
> default, so I think that they should have not be compressed)
> 
> What were your compression settings? Just compression="on"?

I recall a case where ImageA and ImageB were swapped and because the images were very different dimensions, the page looked really awkward.  I'm fairly sure I had also witnessed ImageA as both ImageA and ImageB.  Yes all observations were performed as the same client.

Even though compression was enabled on the connectors the image MIME types are not included (as you suspected).

Other scenarios occurred where compressable text resources (.js, .css, .xml, .html) were also served with the incorrect response:  HTML -> IMG, HTML -> JS, HTML -> CSS, and so on.

I had only mentioned compression="on" just in case it happens to be a contributory factory.  I would not suspect that it would be related...

I have not and will not have the opportunity to try BIO with compression="off"
Comment 9 Brad Plies 2011-03-28 14:44:59 UTC
Brainstorming out loud after examining some Tomcat code...

In the witnessed case of receiving ImageA.gif as ImageB.gif, this suggests that the same InternalOutputBuffer was re-used to respond to both requests.  This could hypothetically be possible if the output buffer were not properly reset between requests.

Http11Processor.process(socket) 
AbstractOutputBuffer.nextRequest() will call response.recycle() that will reset the contentType, status, headers, encoding, etc.  Most of these recycling/resetting operations are guarded by !isAsync() (AsyncStateMachine)


I wonder if there is a concurrency hazard surrounding isAsync()? Namely that isAsync accesses shared state and is, itself, not synchronized.  So couldn't you have, due to unlucky timing, a dirty read where: 

isAsync() returns true but almost immediately becomes false or
isAsync() returns false but almost immediately becomes true

While all of the async state mutating methods are synchronized, an unsynchronized read is governing control flow of external code.  It should be possible for the state to be written concurrently with the unsync read and perhaps cause an unanticipated code path to execute.

Http11Processor.process(socket) calls isAsync() 4 times (swallowInput(), endRequest(), buffers.nextRequest(), & SocketState.LONG) and could all conceivably return different answers from call to call.
Comment 10 Mark Thomas 2011-03-28 15:04:59 UTC
I have been looking at this as well.

To address the isAsync() question I suspect that this is not the cause since:
- the issue is with static content where async state is not changing
- it doesn't explain why BIO is affected but NIO is not

In the same way a failure to recycled an output buffer could cause this, so could not recycling an input buffer. I have been reviewing the code and for a single request there are multiple points where recycle is called. While I can find a error conditions where I can skip one of the recycle calls, I can't find a code path that would allow all of them to be skipped.

I'll tighten up the areas I can find for 7.0.12 but without a clear idea of exactly how this issue is triggered I can't be sure that the issue will be fixed. If no further reports appear post 7.0.12 then that would suggest that the issue was fixed. I'm not exactly comfortable with that plan but barring a reproducible test case or an explanation of how the error occurs I don't have a better plan right now.
Comment 11 Brad Plies 2011-03-28 15:33:18 UTC
(In reply to comment #10)
> To address the isAsync() question I suspect that this is not the cause since:
> - the issue is with static content where async state is not changing
Yes and no.  Even though it is static content being swapped & duplicated, static content is not the only resource in the request stream.  For instance ImageA.gif & ImageB.gif are included in a page that contains a number of dynamic JSP elements as well.

In post #1 I listed a scenario where a JS file was served instead of HTML.  In that case, the page was a dynamic JSP.


> - it doesn't explain why BIO is affected but NIO is not
If indeed BIO & NIO share nearly identical isAsync() logic and if the real problem is a concurrency hazard within that logic then the unlucky timing would just happen to be more likely in BIO.  NIO Protocol has notably less isAsync() calls.  So for all I know NIO is affected also, we just haven't noticed yet.  Maybe it is a Bohrbug or Heisenbug :)

Maybe someone with exceptionally strong concurrency skills can review isAsync() and its usage for safety.  Either it is a hazard or it isn't.

 
> In the same way a failure to recycled an output buffer could cause this, so
> could not recycling an input buffer. I have been reviewing the code and for a
> single request there are multiple points where recycle is called. While I can
> find a error conditions where I can skip one of the recycle calls, I can't find
> a code path that would allow all of them to be skipped.
Good point on input v.s. output.  Yet aren't those input recycle calls governed by the same suspect guard?  Perhaps it is not necessary to skip all recycle calls in order to cause the behavior.

 
> I'll tighten up the areas I can find for 7.0.12 but without a clear idea of
> exactly how this issue is triggered I can't be sure that the issue will be
> fixed. If no further reports appear post 7.0.12 then that would suggest that
> the issue was fixed. I'm not exactly comfortable with that plan but barring a
> reproducible test case or an explanation of how the error occurs I don't have a
> better plan right now.

I can respect that.  Given how long Tomcat 7.0 has been GA before this report was submitted, I would not expect additional reports to be as forthcoming even if it still is a bug.  At least you and others are aware of this report and might be able to match it to future reports.  Hopefully someone else will be able to do a better job defining a reproducible test case.
Comment 12 Mark Thomas 2011-03-28 16:03:53 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > To address the isAsync() question I suspect that this is not the cause since:
> > - the issue is with static content where async state is not changing
> Yes and no.  Even though it is static content being swapped & duplicated,
> static content is not the only resource in the request stream.  For instance
> ImageA.gif & ImageB.gif are included in a page that contains a number of
> dynamic JSP elements as well.
> 
> In post #1 I listed a scenario where a JS file was served instead of HTML.  In
> that case, the page was a dynamic JSP.

None of those requests use async processing. Unless you explicitly use Servlet 3.0 async APIs then isAsync() will always return false.

> > - it doesn't explain why BIO is affected but NIO is not
> If indeed BIO & NIO share nearly identical isAsync() logic and if the real
> problem is a concurrency hazard within that logic then the unlucky timing would
> just happen to be more likely in BIO.  NIO Protocol has notably less isAsync()
> calls.  So for all I know NIO is affected also, we just haven't noticed yet. 
> Maybe it is a Bohrbug or Heisenbug :)
> 
> Maybe someone with exceptionally strong concurrency skills can review isAsync()
> and its usage for safety.  Either it is a hazard or it isn't.

At the moment, I'm happy isAsync() is safe. There were issues in the past (bug 49884) but the re-factoring that fixed that bug means that it should not be possible for one thread to be reading isAsync() whilst another is changing the value it returns.
Comment 13 Brad Plies 2011-03-28 16:42:53 UTC
(In reply to comment #12)
> None of those requests use async processing. Unless you explicitly use Servlet
> 3.0 async APIs then isAsync() will always return false.

In our application we are not explicitly using Servlet 3.0 async so the hypothesis that async() is suspicious in this case is invalid.  So as far as deduction goes it one must backtrack to examining how buffers are used and pipelined requests are handled.


> At the moment, I'm happy isAsync() is safe. There were issues in the past (bug
> 49884) but the re-factoring that fixed that bug means that it should not be
> possible for one thread to be reading isAsync() whilst another is changing the
> value it returns.

isAsync(), I see, uses volatile so maybe it is alright afterall.

At least we've managed to deduce a few things that aren't the problem.  Thanks for helping.  I'll take a look at anything that gets committed and will at least try to run a recent build w/ BIO in an alternate environment.
Comment 14 Tim Whittington 2011-04-01 02:32:40 UTC
It may be unrelated, but I'm trying to track down a (very) reproducible error on one of our products where using Firefox with HTTP pipelining enabled results in garbled responses (and a smattering of 401 and 505 responses).

The user reports of this error went along the lines of seeing JavaScript content where HTML responses were expected, images switched etc., which sounds similar to this.
Comment 15 Brad Plies 2011-04-01 12:16:03 UTC
(In reply to comment #14)
> It may be unrelated, but I'm trying to track down a (very) reproducible error
> on one of our products where using Firefox with HTTP pipelining enabled results
> in garbled responses (and a smattering of 401 and 505 responses).
> 
> The user reports of this error went along the lines of seeing JavaScript
> content where HTML responses were expected, images switched etc., which sounds
> similar to this.

Thank you for the report.  

This report certainly matches the primary characteristic of swapped content.  During my tests I do not recall observing those response codes, but that wasn't what I was primarily paying attention to anyway.  I was triggering the condition via rapid browser reloads so perhaps some responses could have had unusual codes like that and I just missed them.

Can you confirm the Tomcat version and that BIO is used?  Have you tried NIO?  If it is very reproducible what are the exact conditions that you have established to trigger it?  Would you be able to provide a "Wireshark trace" as Mark recommended?
Comment 16 Brad Plies 2011-04-01 12:18:46 UTC
I cannot remember now, but I'm sure I observed this under IE as well during my proxy tests.  So I'm not certain this problem is confined to Firefox
Comment 17 Christopher Schultz 2011-04-01 17:47:27 UTC
If all the bug takes is load to reproduce, it should be easy to set up a test that uses wget or something similar to just download a list of files a whole bunch of times. If you have filenames which match the expected byte count (4096.gif or whatever), the client can detect that an error has occurred and log a message including a timestamp.

If keepalive is required to trigger this bug, then a test might be a bit more difficult to rig... I'm not even sure how to get HttpURLConnection to let me make multiple requests via a single connection.

Any idea if keepalives either allow this bug to occur or increase the chances of it occurring?
Comment 18 Sebb 2011-04-01 18:05:02 UTC
Sounds like a job for Apache JMeter ... that can issue a large load, and supports Keep-Alive and assertions to check for failures.

The HttpClient version of the HTTP Sampler has better control over connection re-use.
Comment 19 Tim Whittington 2011-04-02 23:07:22 UTC
My testing was against 7.0.8 with the BIO connector on CentOS 5.4/x64.
Client was Firefox 4.0 with network.http.pipelining=true enabled in about:config.
Web app is a fairly complex mixed static/JSP/other dynamic resources app.
To replicate all I have to do is log in and do 30-60 seconds of random 'refresh 'o' doom' - once errors start appearing, many resource loads in Firefox break (e.g. favicon.ico with wrong image, JS resources where HTML should be etc.)

On the HTTP side of things I was seeing 401 (Bad Request) responses from Tomcat, and the occasional 505 (Version Not Supported). Looking at SOCKS proxy and Wireshark traces of the HTTP, I couldn't see anything that Firefox was doing wrong.

I've tested with r1088179 (post 7.0.12) of trunk, and things appear to be better - I can't get the UI to break anymore in Firefox - but I'm now seeing no 401 responses and many more 505 responses.

I'm not able to spend much time on this in the next couple of weeks, but I'll try to drop in a build with some additional tracing into our test system at some point.
Comment 20 Tim Whittington 2011-04-03 19:26:46 UTC
Correction: the 401 responses are actually 400 Bad Requests.
Comment 21 Mark Thomas 2011-04-06 07:11:34 UTC
Tim's comments got me thinking along the right lines.

There is an issue with how pipe-lined requests are handled. Fortunately, the fix in 7.0.12 will stop any chance of request mix-up but you will see 505 responses.

I have a fix in mind but it needs some tests. The workaround is to switch to NIO or APR/native, neither of which suffer from this problem.

I am reducing the severity since the response mix-ups are no longer possible with 7.0.12.

The investigation of this issue has highlighted a number of areas that need a wider discussion. Tim has already started a thread on some of these on the dev list and I will be adding my observations there.
Comment 22 Mark Thomas 2011-04-06 12:40:31 UTC
The fix fir pipe-lining with HTTP BIO has been applied to trunk and will be included in 7.0.13 onwards.
Comment 23 Tim Whittington 2011-04-06 20:30:34 UTC
I'm unable to reproduce the 505 errors now with trunk.