
Key: INFRA-697
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: Justin Erenkrantz
Reporter: Yonik Seeley
Votes: 1
Watchers: 0
Infrastructure

JIRA output is corrupted

Created: 27/Jan/06 06:25 AM   Updated: 25/Sep/06 03:49 AM
Component/s: JIRA
Security Level: public (Regular issues)

Time Tracking:
Not Specified

File Attachments:
ajp.patch, Text File, 0.7 kB, 2006-02-28 07:28 AM, Justin Erenkrantz (licensed for inclusion in ASF works)
INFRA-697-tpe.zip, Zip Archive, 510 kB, 2006-02-27 10:41 PM, Tim Ellison (licensed for inclusion in ASF works)
INFRA-738-2.htm, HTML File, 24 kB, 2006-02-23 01:45 AM, Remy Maucherat
INFRA-738.htm, HTML File, 38 kB, 2006-02-23 01:45 AM, Remy Maucherat
server.zip, Zip Archive, 17 kB, 2006-02-17 03:22 AM, Remy Maucherat (licensed for inclusion in ASF works)
server2.zip, Zip Archive, 12 kB, 2006-02-28 04:05 PM, Remy Maucherat
ssl-flush.patch, Text File, 1.0 kB, 2006-02-25 07:18 PM, Justin Erenkrantz (licensed for inclusion in ASF works)
Image Attachments:

1. corrupt pages.png
(147 kB)


Description
Many (almost all) of the JIRA front pages are whacked out, with bad HTML. Examples:

http://issues.apache.org/jira/browse/GERONIMO
http://issues.apache.org/jira/browse/NUTCH
http://issues.apache.org/jira/browse/LUCENE

Doug Cutting added a comment - 27/Jan/06 06:33 AM
I find this to be intermittent: reloading a page will sometimes fix it.

Yonik Seeley added a comment - 27/Jan/06 06:39 AM
Yesterday, I was downloading a zip file attachment from a JIRA bug.
The first two times, it was corrupt (but the correct size). The 3rd time I finally got it.
At the time, I figured it was a browser problem on my end... now I'm not so sure.

Hoss Man added a comment - 27/Jan/06 07:00 AM
I've noticed this problem happen intermittently with a variety of pages: mainly search results and individual issue pages, but also when I was customizing my dashboard. The problem manifests for me as sections of the page which are duplicated, or JavaScript appearing in the middle of the page, or sections of the page not being displayed at all... all of which could be a symptom of the underlying page outputting the same sections of raw HTML more than once.

Reloading the affected pages (even 1 second later) tends to solve the problem, suggesting to me that it may be load or cache related.

Jeff Turner added a comment - 27/Jan/06 08:32 AM
I've fixed this temporarily by restarting Jira. It looks like a mod_proxy_ajp problem, because when I port-forwarded directly to Tomcat on ajax, the pages appeared fine.

Can we interest someone from dev@httpd in this issue?

Perhaps we should switch back to mod_proxy for now.

Noel J. Bergman added a comment - 27/Jan/06 01:40 PM
Let's try to get some attention from the mod_proxy_ajp and coyote folks. Remember that we have SSL in the picture now.

Yonik Seeley added a comment - 14/Feb/06 07:28 AM
It's happening again right now...

Remy Maucherat added a comment - 17/Feb/06 03:22 AM
I have identified a rare situation which can trigger that crash, and which corresponds to the traces from the server logs. I recommend patching the Tomcat instance with the attached patch (extract in the Tomcat folder).

Remy Maucherat added a comment - 17/Feb/06 03:25 AM
Sorry for the above, I commented on the wrong case ...
The corruption described in this report may be caused by a bug which is fixed in updated tcnative library versions, although this is still under investigation.

Noel J. Bergman added a comment - 17/Feb/06 07:53 AM
I applied Remy's patch and restarted JIRA. See also: INFRA-716.

Remy Maucherat added a comment - 17/Feb/06 07:58 AM
This issue is different (I made a mistake when I attached the patch to this case, as it's for the crash that was reported). This one may (or may not) be fixed by an update of the tcnative library, and we'll be investigating it further.

Remy Maucherat added a comment - 17/Feb/06 10:03 PM
To better investigate this issue, I recommend planning an upgrade of tcnative (a similar problem was fixed in the HTTP connector, and, although I think it should not affect AJP, it is hard to be certain), which would allow identifying whether mod_proxy_ajp is at fault (or not). The latest source is available here: http://www.apache.org/dist/tomcat/tomcat-connectors/native/tomcat-native-current.tar.gz

Noel J. Bergman added a comment - 22/Feb/06 09:18 AM
As per Remy's request, above, I have installed tomcat-native-1.1.2 (current as of today).

Remy Maucherat added a comment - 23/Feb/06 01:44 AM
I continued testing for this bug, and I am (I think) experiencing it again. I tried to look at the generated code to see what the problem looked like (I am attaching them). The main problem is that, while there is something (a repetition of a significant portion) which could be caused by bad buffering, there's also the odd behavior that the HTML code generated when the page is corrupted is significantly different. So I am puzzled right now, but will continue to investigate.

Remy Maucherat added a comment - 23/Feb/06 01:45 AM
One corrupted, one not corrupted.

Mladen Turk added a comment - 23/Feb/06 03:10 AM
I'm not sure which exact version of Apache httpd is running. I suppose it's the released 2.2.0, according to the headers.
Can I have a peek at the error.log? Does it contain something like '[error] ajp_unmarshal_response: Null header value'?

Noel J. Bergman added a comment - 23/Feb/06 04:19 AM
The total number of lines in today's error_log with "ajp" in them is 88.

What I find in the logs are a lot of:

[Wed Feb 22 00:30:52 2006] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Wed Feb 22 00:30:52 2006] [error] ajp_read_header: ajp_ilink_receive failed

And a few contiguous blocks of:

[Wed Feb 22 01:08:35 2006] [error] (104)Connection reset by peer: ajp_ilink_receive() can't receive header
[Wed Feb 22 01:08:35 2006] [error] ajp_read_header: ajp_ilink_receive failed
[Wed Feb 22 01:09:08 2006] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:18009 (localhost) failed
[Wed Feb 22 01:09:08 2006] [error] proxy: AJP: failed to make connection to backend: localhost
[Wed Feb 22 01:09:22 2006] [error] proxy: AJP: disabled connection for (localhost)
[Wed Feb 22 01:09:31 2006] [error] proxy: AJP: disabled connection for (localhost)
[Wed Feb 22 01:09:49 2006] [error] proxy: AJP: disabled connection for (localhost)

I don't know if the latter is when someone bounced Tomcat.
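
A minimal sketch, assuming the error_log line format shown in the excerpts above, of how such lines could be tallied by error reason. The `summarize` helper and the `LINE` pattern are hypothetical names introduced for illustration; they are not part of httpd.

```python
import re
from collections import Counter

# Matches lines such as:
# [Wed Feb 22 00:30:52 2006] [error] (70007)The timeout specified has expired: ...
# The "(errno)reason: " part is optional, as in the excerpts above.
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[error\] "
    r"(?:\((?P<errno>\d+)\)(?P<reason>[^:]+): )?(?P<msg>.*)$"
)


def summarize(lines):
    """Count error lines mentioning 'ajp' (case-insensitive), keyed by reason."""
    counts = Counter()
    for line in lines:
        if "ajp" not in line.lower():
            continue
        match = LINE.match(line)
        if match is None:
            continue
        counts[match.group("reason") or "no errno"] += 1
    return counts


sample = [
    "[Wed Feb 22 00:30:52 2006] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header",
    "[Wed Feb 22 00:30:52 2006] [error] ajp_read_header: ajp_ilink_receive failed",
    "[Wed Feb 22 01:09:08 2006] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:18009 (localhost) failed",
]
for reason, count in summarize(sample).items():
    print(count, reason)
```

Grouping by reason separates the timeout errors from the connection-refused block, which makes it easier to check whether the latter lines up with a Tomcat restart.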

Mladen Turk added a comment - 23/Feb/06 07:46 AM
[Wed Feb 22 00:30:52 2006] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header

This looks like Tomcat is too busy for the current timeout.
Tuning ProxyPass ajp://.... timeout=xxx should fix the problems.

It seems that in that case the internal buffers don't get cleared, so the subsequent request contains garbage, or simply reads the late data from the previous request.
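
The hypothesis above can be illustrated with a toy model. This is only a sketch of the general failure mode (a reused backend connection that still holds a late reply), not mod_proxy_ajp's actual code; the `Backend` class, `fetch` function, and `reset_on_timeout` flag are all hypothetical.

```python
from collections import deque


class Backend:
    """Toy backend connection: replies are queued and may arrive late."""

    def __init__(self):
        self._pending = deque()

    def send(self, request, ready=True):
        # Every request eventually produces a reply; `ready` says whether
        # it arrives before the caller's timeout expires.
        self._pending.append((f"reply to {request}", ready))

    def receive(self):
        reply, ready = self._pending[0]
        if not ready:
            # The reply shows up after the timeout and stays buffered.
            self._pending[0] = (reply, True)
            raise TimeoutError("can't receive header")
        self._pending.popleft()
        return reply

    def discard(self):
        # Drop anything still buffered on the connection.
        self._pending.clear()


def fetch(conn, request, ready=True, reset_on_timeout=False):
    conn.send(request, ready)
    try:
        return conn.receive()
    except TimeoutError:
        if reset_on_timeout:
            conn.discard()
        return None


conn = Backend()
print(fetch(conn, "A", ready=False))   # None: request A timed out
print(fetch(conn, "B"))                # reply to A: stale data from A

clean = Backend()
fetch(clean, "A", ready=False, reset_on_timeout=True)
print(fetch(clean, "B"))               # reply to B
```

In this model, discarding buffered data (or dropping the connection) after a timeout is what keeps the next request from reading the previous request's late reply.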


Noel J. Bergman added a comment - 23/Feb/06 08:47 AM
According to http://httpd.apache.org/docs/2.2/mod/mod_proxy.html, the timeout should be 300 seconds, although it isn't clear if that applies to mod_proxy_ajp, as well.

And it seems wrong that mod_proxy_ajp isn't listed in the "See also" section for mod_proxy, whereas the others are so listed.

Noel J. Bergman added a comment - 23/Feb/06 09:08 AM
Could it be related to the FLUSH_WAIT defined in http://svn.apache.org/repos/asf/httpd/httpd/tags/2.2.0/modules/proxy/mod_proxy_ajp.c ? That's only a 0.01 second, hardcoded, timeout.

Mladen Turk added a comment - 23/Feb/06 07:57 PM
>Could it be related to the FLUSH_WAIT

I cannot tell for sure. This is something added to my original design, and I doubt it has any practical purpose, because it breaks the AJP protocol spec.
I suppose you can rebuild mod_proxy after removing the
#define FLUSHING_BANDAID 1 from mod_proxy_ajp.

I think that the author of that patch was trying to accomplish something like mod_jk's
JkOption +JkFlushPackets. At least that's what I'm reading from the comments.

Tim Ellison added a comment - 23/Feb/06 11:53 PM
This is very scary -- even the download of attached files appears to get corrupted. I have downloaded the same attachment from a JIRA issue multiple times and gotten different results, either truncation or I've even seen a 200 OK in the middle of a diff. Beware your downloads.
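
A minimal sketch of one way to check for this, assuming the same attachment has been fetched several times and each copy is available as bytes: compare digests rather than sizes, since a corrupted copy can have the correct size. The `compare_downloads` helper is a hypothetical name, and the sample data is made up for illustration.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def compare_downloads(downloads):
    """Return (all_identical, Counter of digests) for an iterable of byte strings."""
    digests = Counter(sha256_hex(d) for d in downloads)
    return len(digests) == 1, digests


# Illustrative data: the second copy has a stray status line spliced into
# the middle but is exactly the same length as the good copy.
good = b"diff --git a b\n" * 3
bad = good[:10] + b"HTTP/1.1 200 OK" + good[25:]

consistent, digests = compare_downloads([good, bad, good])
print(len(good) == len(bad), consistent, len(digests))
```

Matching digests only show that the copies agree with each other; if every download is corrupted in the same way, this check will not notice.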

Remy Maucherat added a comment - 24/Feb/06 12:42 AM
I think it could be worth installing mod_jk and testing for a little bit to determine for sure if mod_proxy_ajp is really at fault.

Noel J. Bergman added a comment - 25/Feb/06 02:10 PM
We are in the process of updating httpd to trunk, which has quite a few fixes to the proxy code. If that does not fix it, we will try a change to the ajp proxy suggested by Mladen. Failing that, we'll try mod_jk to see if we can conclusively pin the problem on one side or the other of the AJP connection.

Justin Erenkrantz added a comment - 25/Feb/06 07:18 PM
We've captured a network trace (at the AJP layer and having the HTML the user saw as well) and confirmed that Tomcat is sending the correct data back but that httpd was corrupting the data on the return. It usually happens on an AJP packet boundary: the 8-byte SEND_BODY_CHUNK is most likely to be corrupted towards the end of the response.

After playing with various fixes, my best hunch so far is that it's a bug in mod_ssl's output filter with regards to FLUSH buckets. I've attached the current patch that we're running on ajax right now. In my limited tests so far, I haven't been able to reproduce it with this patch applied. (We haven't found a sure-fire way to reproduce it yet.)

If anyone can find the corruption at this point, please also indicate if you were connecting via https or http. I'd be really interested in any corruption bugs over http as we can eliminate mod_ssl from the problem space.
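
A minimal sketch of how a captured container-to-server AJP stream could be checked for framing damage. It assumes the AJP13 response layout as I understand it from the protocol documentation: each packet starts with the bytes 'A' 'B', followed by a 2-byte big-endian payload length, and the first payload byte is a prefix code (3 for SEND_BODY_CHUNK, 5 for END_RESPONSE). All function names are hypothetical, and `make_chunk` exists only to build sample data.

```python
import struct

AJP_MAGIC = b"AB"      # container -> web server packets start with 'A', 'B'
SEND_BODY_CHUNK = 3
END_RESPONSE = 5


def split_packets(stream: bytes):
    """Yield (prefix_code, payload) per packet; raise ValueError on bad framing."""
    pos = 0
    while pos < len(stream):
        if len(stream) - pos < 4:
            raise ValueError(f"truncated header at offset {pos}")
        if stream[pos:pos + 2] != AJP_MAGIC:
            raise ValueError(f"bad magic at offset {pos}")
        (length,) = struct.unpack(">H", stream[pos + 2:pos + 4])
        payload = stream[pos + 4:pos + 4 + length]
        if length == 0 or len(payload) != length:
            raise ValueError(f"truncated packet at offset {pos}")
        yield payload[0], payload
        pos += 4 + length


def body_bytes(stream: bytes) -> bytes:
    """Concatenate the data carried by all SEND_BODY_CHUNK packets."""
    body = bytearray()
    for code, payload in split_packets(stream):
        if code != SEND_BODY_CHUNK:
            continue
        if len(payload) < 3:
            raise ValueError("body chunk packet too short")
        (chunk_len,) = struct.unpack(">H", payload[1:3])
        if chunk_len > len(payload) - 3:
            raise ValueError("chunk length exceeds packet payload")
        body += payload[3:3 + chunk_len]
    return bytes(body)


def make_chunk(data: bytes) -> bytes:
    """Build a SEND_BODY_CHUNK packet for sample data (hypothetical helper)."""
    payload = bytes([SEND_BODY_CHUNK]) + struct.pack(">H", len(data)) + data + b"\x00"
    return AJP_MAGIC + struct.pack(">H", len(payload)) + payload


end = AJP_MAGIC + struct.pack(">H", 2) + bytes([END_RESPONSE, 1])
stream = make_chunk(b"<html>") + make_chunk(b"</html>") + end
print(body_bytes(stream))
```

Running the same check on both sides of the AJP connection (what Tomcat sent versus what httpd relayed) is one way to narrow down which side damaged the body.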

Remy Maucherat added a comment - 25/Feb/06 07:26 PM
For me, the issue was occurring for both HTTP and HTTPS.

Noel J. Bergman added a comment - 27/Feb/06 12:40 AM
Based upon Justin's observations (above), we have updated httpd for issues.apache.org to the current trunk, which has fixes for the proxy code. Although we are seeing a new problem (httpd incorrectly returns 408 errors to the browser, appearing to affect IE more than Firefox, although both see them), we are not seeing corruption problems. Justin is working on the 408 problem.

You will want to clear your browser cache to remove cached corrupted data. After that, if you receive a corrupted page, please save the HTML source, and let us know.

Remy Maucherat added a comment - 27/Feb/06 03:16 AM
Has the fix for mod_proxy_ajp been identified? It seems best to backport it and continue using 2.2, since that's what the users will be doing too.

Justin Erenkrantz added a comment - 27/Feb/06 05:42 PM
My change wasn't to mod_proxy_ajp. Even with trunk, the AJP body response would still get corrupted. So, it's not fixed just yet. Even with my SSL fix, I believe Jeff Turner has reported corruption even since that patch (albeit very rarely).

So, until we've confirmed that the problem has completely gone away on httpd trunk, we're staying on that. Once we've identified the right 'fix', we'll ensure it gets backported to 2.2.x.

Posts have been sent to dev@httpd to try to recruit some more developer eyes on the issues we're seeing so far.

Remy Maucherat added a comment - 27/Feb/06 09:09 PM
I can confirm I have seen corruption too (yesterday), but it doesn't seem to happen nearly as often right now.

Tim Ellison added a comment - 27/Feb/06 10:41 PM
Here's a zip containing corruption seen today by me (I'm sat in the UK). I've highlighted the "200 OK" message on the first screenshot (the red ellipse wasn't really there<g>) and I've included the page source as reported by the browser. HTH.

Justin Erenkrantz added a comment - 28/Feb/06 07:28 AM
After seeing the network traces, Rüdiger Plüm has identified the root cause of the problem. It is Tomcat's AJP connector that corrupts the body response. The bug is at line 1265, in the doWrite() function of AjpAprProcessor.java:

http://svn.apache.org/repos/asf/tomcat/connectors/trunk/jk/java/org/apache/coyote/ajp/AjpAprProcessor.java

We believe the attached patch should resolve the problem.

The final post and dev@httpd thread (with our collective analysis) are at:

http://mail-archives.apache.org/mod_mbox/httpd-dev/200602.mbox/%3c440376C9.9000606@apache.org%3e

So, can someone please take the lead on committing the fix to Tomcat and also updating JIRA's Tomcat instance?

Thanks!

Jim Jagielski added a comment - 28/Feb/06 09:14 AM
Jim will commit patch Feb 28th.


Jeff Turner added a comment - 28/Feb/06 04:05 PM
I have applied the patch. So far so good.

Remy Maucherat added a comment - 28/Feb/06 04:05 PM
Thanks for the fix, and sorry for the issue. Here is another zip containing the binary fix (extract in the Tomcat folder).

Justin Erenkrantz added a comment - 28/Feb/06 04:17 PM
Marking as resolved.

Remy Maucherat added a comment - 07/Mar/06 05:39 PM
Tomcat 5.5.16, including this fix, has been released.