Doug Cutting added a comment - 27/Jan/06 06:33 AM
I find this to be intermittent: reloading a page will sometimes fix it.
Yesterday, I was downloading a zip file attachment from a JIRA bug.
The first two times, it was corrupt (but the correct size). The 3rd time I finally got it. At the time, I figured it was a browser problem on my end... now I'm not so sure. I've noticed this problem happen intermittently with a variety of pages: mainly search results and individual issue pages, but also when I was customizing my dashboard. The problem manifests for me as sections of the page being duplicated, JavaScript appearing in the middle of the page, or sections of the page not being displayed at all... all of which could be a symptom of the underlying page outputting the same sections of raw HTML more than once.
Reloading the affected pages (even 1 second later) tends to solve the problem, suggesting to me that it may be load or cache related. I've fixed this temporarily by restarting Jira. It looks like a mod_proxy_ajp problem, because when I port-forwarded directly to Tomcat on ajax, the pages appeared fine.
Can we interest someone from dev@httpd in this issue? Perhaps we should switch back to mod_proxy for now. Let's try to get some attention from the mod_proxy_ajp and coyote folks. Remember that we have SSL in the picture now.
It's happening again right now...
I have identified a rare situation which can trigger that crash, and which corresponds to the traces from the server logs. I recommend the Tomcat instance to be patched using the attached patch (extract in the Tomcat folder).
Sorry for the above, I commented on the wrong case ...
The corruption described in this report may be caused by a bug which is fixed in updated tcnative library versions, although this is still under investigation. I applied Remy's patch and restarted JIRA. See also:
This issue is different (I made a mistake when I attached the patch to this case, as it's for the crash that was reported). This one may (or may not) be fixed by an update of the tcnative library, and we'll be investigating it further.
To better investigate this issue, I recommend planning an upgrade of tcnative (a similar problem was fixed in the HTTP connector, and, although I think it should not affect AJP, it is hard to be certain), which would allow identifying whether mod_proxy_ajp is at fault (or not). The latest source is available here: http://www.apache.org/dist/tomcat/tomcat-connectors/native/tomcat-native-current.tar.gz
As per Remy's request, above, I have installed tomcat-native-1.1.2 (current as of today).
I continued testing for this bug, and I am (I think) experiencing it again. I tried to look at the generated code to see what the problem looked like (I am attaching them). The main problem is that, while there is something (a repetition of a significant portion) which could be caused by bad buffering, there's also the odd behavior that the HTML code generated when the page is corrupted is significantly different. So I am puzzled right now, but will continue to investigate.
One corrupted, one not corrupted.
Not sure what exact version of Apache httpd is running. I suppose it's the released 2.2.0, according to the headers.
Can I have a peek in the error.log? Does it contain something like '[error] ajp_unmarshal_response: Null header value'? The total number of lines in today's error_log with "ajp" in them is 88.
What I find in the logs are a lot of:
[Wed Feb 22 00:30:52 2006] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Wed Feb 22 00:30:52 2006] [error] ajp_read_header: ajp_ilink_receive failed
And a few contiguous blocks of:
[Wed Feb 22 01:08:35 2006] [error] (104)Connection reset by peer: ajp_ilink_receive() can't receive header
[Wed Feb 22 01:08:35 2006] [error] ajp_read_header: ajp_ilink_receive failed
[Wed Feb 22 01:09:08 2006] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:18009 (localhost) failed
[Wed Feb 22 01:09:08 2006] [error] proxy: AJP: failed to make connection to backend: localhost
[Wed Feb 22 01:09:22 2006] [error] proxy: AJP: disabled connection for (localhost)
[Wed Feb 22 01:09:31 2006] [error] proxy: AJP: disabled connection for (localhost)
[Wed Feb 22 01:09:49 2006] [error] proxy: AJP: disabled connection for (localhost)
I don't know if the latter is when someone bounced Tomcat.
[Wed Feb 22 00:30:52 2006] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
This looks like Tomcat is too busy for the current Timeout. Tuning ProxyPass ajp://.... timeout=xxx should fix the problems. It seems that in that case the internal buffers don't get cleared, so the consecutive request contains garbage, or simply reads the late data from the previous request. According to http://httpd.apache.org/docs/2.2/mod/mod_proxy.html, the timeout should be 300 seconds, although it isn't clear if that applies to mod_proxy_ajp as well.
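The error_log excerpts quoted above fall into a few recurring kinds (timeout, connection reset, connection refused, disabled connection). As a minimal sketch for tallying them, assuming the log line layout shown in this thread, something like the following could be used. The category names and the `classify`/`tally` helpers are hypothetical, not part of httpd.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpts in this thread:
#   [Wed Feb 22 00:30:52 2006] [error] <message>
LINE_RE = re.compile(r"\[(?P<ts>[^\]]+)\] \[error\] (?P<msg>.*)")

# Hypothetical category names mapped to substrings seen in the quoted log lines.
CATEGORIES = [
    ("timeout", "The timeout specified has expired"),
    ("reset", "Connection reset by peer"),
    ("refused", "Connection refused"),
    ("disabled", "disabled connection"),
    ("backend_failed", "failed to make connection to backend"),
    ("read_header_failed", "ajp_read_header: ajp_ilink_receive failed"),
]


def classify(line):
    """Return a category for an AJP-related error line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    msg = match.group("msg")
    if "ajp" not in msg.lower():
        return None  # not an AJP-related entry
    for name, needle in CATEGORIES:
        if needle in msg:
            return name
    return "other"


def tally(lines):
    """Count AJP-related error lines per category."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

Feeding it the lines of error_log (for example `tally(open("error_log"))`, where the path is an assumption) gives a per-category count, which makes it easier to see whether timeouts dominate or whether the entries cluster around a Tomcat restart.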
And it seems wrong that mod_proxy_ajp isn't listed in the "See also" section for mod_proxy, whereas the others are so listed. Could it be related to the FLUSH_WAIT defined in http://svn.apache.org/repos/asf/httpd/httpd/tags/2.2.0/modules/proxy/mod_proxy_ajp.c ? That's only a 0.01 second, hardcoded, timeout.
>Could it be related to the FLUSH_WAIT
I cannot tell for sure. This is something added to my original design that I doubt has any practical reason, because it breaks the AJP protocol spec. I suppose you can rebuild mod_proxy after removing the #define FLUSHING_BANDAID 1 from mod_proxy_ajp. I think that the author of that patch was trying to accomplish something like mod_jk's JkOption +JkFlushPackets. At least that's what I'm reading from the comments. This is very scary -- even the download of attached files appears to get corrupted. I have downloaded the same attachment from a JIRA issue multiple times and gotten different results, either truncation or I've even seen a 200 OK in the middle of a diff. Beware your downloads.
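Since several reports here involve fetching the same attachment repeatedly and getting different bytes (sometimes with the correct size), one way to check a set of downloads is to compare their digests and look for a stray "200 OK" inside the body. This is only a sketch; the helper names are hypothetical and the payloads are assumed to have been fetched and read into memory already.

```python
import hashlib
from collections import Counter


def digest_report(payloads):
    """Summarise repeated downloads of what should be the same file."""
    counts = Counter(hashlib.sha256(p).hexdigest() for p in payloads)
    sizes = {len(p) for p in payloads}
    return {
        "distinct_digests": len(counts),  # 1 means every download matched
        "distinct_sizes": len(sizes),     # can be 1 even when content differs
        "consistent": len(counts) == 1,
        "counts": dict(counts),
    }


def find_marker_offsets(body, marker=b"200 OK"):
    """Return every byte offset at which `marker` occurs inside `body`."""
    offsets = []
    start = body.find(marker)
    while start != -1:
        offsets.append(start)
        start = body.find(marker, start + 1)
    return offsets
```

A matching size with differing digests corresponds to the "corrupt, but the correct size" case described earlier in this thread. A marker hit is only a hint, since a legitimate file may contain that text.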
I think it could be worth installing mod_jk and testing for a little bit to determine for sure if mod_proxy_ajp is really at fault.
We are in the process of updating httpd to trunk, which has quite a few fixes to the proxy code. If that does not fix it, we will try a change to the ajp proxy suggested by Mladen. Failing that, we'll try mod_jk to see if we can conclusively pin the problem on one side or the other of the AJP connection.
We've captured a network trace (at the AJP layer and having the HTML the user saw as well) and confirmed that Tomcat is sending the correct data back but that httpd was corrupting the data on the return. It usually happens on an AJP packet boundary: the 8-byte SEND_BODY_CHUNK is most likely to be corrupted towards the end of the response.
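To compare a captured AJP stream against what the browser received, it can help to walk the stream packet by packet. The sketch below assumes the AJP13 framing for container-to-server packets as I understand it from the protocol description: the two bytes "AB", a 2-byte big-endian payload length, then the payload, whose first byte is the message type (3 for SEND_BODY_CHUNK, 5 for END_RESPONSE). The helper names are hypothetical, and this is neither httpd's nor Tomcat's code.

```python
import struct

AJP_MAGIC = b"AB"      # assumed container-to-server packet signature
SEND_BODY_CHUNK = 3    # assumed prefix code for a body chunk


def make_packet(payload):
    """Frame a payload as one packet (helper for building test streams)."""
    return AJP_MAGIC + struct.pack(">H", len(payload)) + payload


def iter_packets(stream):
    """Yield (offset, payload) per packet; raise ValueError on bad framing."""
    pos = 0
    while pos < len(stream):
        if len(stream) - pos < 4:
            raise ValueError(f"truncated packet header at offset {pos}")
        if stream[pos:pos + 2] != AJP_MAGIC:
            raise ValueError(f"bad magic at offset {pos}")
        (length,) = struct.unpack(">H", stream[pos + 2:pos + 4])
        payload = stream[pos + 4:pos + 4 + length]
        if len(payload) != length:
            raise ValueError(f"truncated payload at offset {pos}")
        yield pos, payload
        pos += 4 + length


def body_bytes(stream):
    """Concatenate the data carried by every SEND_BODY_CHUNK packet."""
    body = bytearray()
    for _, payload in iter_packets(stream):
        if len(payload) >= 3 and payload[0] == SEND_BODY_CHUNK:
            (chunk_len,) = struct.unpack(">H", payload[1:3])
            # Anything after the chunk data (e.g. a terminator byte) is ignored.
            body += payload[3:3 + chunk_len]
    return bytes(body)
```

Comparing `body_bytes(trace)` with the page source saved from the browser shows whether the bytes diverge, and the offsets from `iter_packets` show whether the divergence starts at a packet boundary.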
After playing with various fixes, my best hunch so far is that it's a bug in mod_ssl's output filter with regards to FLUSH buckets. I've attached the current patch that we're running on ajax right now. In my limited tests so far, I haven't been able to reproduce it with this patch applied. (We haven't found a sure-fire way to reproduce it yet.) If anyone can find the corruption at this point, please also indicate if you were connecting via https or http. I'd be really interested in any corruption bugs over http as we can eliminate mod_ssl from the problem space. For me, the issue was occurring for both HTTP and HTTPS.
Based upon Justin's observations (above), we have updated httpd for issues.apache.org to the current trunk, which has fixes for the proxy code. Although we are seeing a new problem (httpd incorrectly returns 408 errors to the browser, appearing to affect IE more than Firefox, although both see them), we are not seeing corruption problems. Justin is working on the 408 problem.
You will want to clear your browser cache to remove cached corrupted data. After that, if you receive a corrupted page, please save the HTML source, and let us know. Has the fix for mod_proxy_ajp been identified? It seems the best would be to backport it and continue using 2.2, since that's what the users will be doing too.
My change wasn't to mod_proxy_ajp. Even with trunk, the AJP body response would still get corrupted. So, it's not fixed just yet. Even with my SSL fix, I believe Jeff Turner has reported corruption even since that patch (albeit very rarely).
So, until we've confirmed that the problem has completely gone away on httpd trunk, we're staying on that. Once we've identified the right 'fix', we'll ensure it gets backported to 2.2.x. Posts have been sent to dev@httpd to try to recruit some more developer eyes on the issues we're seeing so far. I can confirm I have seen corruption too (yesterday), but it doesn't seem to happen nearly as often right now.
Here's a zip containing corruption seen today by me (I'm sat in the UK). I've highlighted the "200 OK" message on the first screenshot (the red ellipse wasn't really there<g>) and I've included the page source as reported by the browser. HTH.
After seeing the network traces, Rüdiger Plüm has identified the root cause of the problem. It is Tomcat's AJP connector that corrupts the body response. The bug is in line 1265, doWrite() function, of AjpAprProcessor.java:
http://svn.apache.org/repos/asf/tomcat/connectors/trunk/jk/java/org/apache/coyote/ajp/AjpAprProcessor.java
We believe the attached patch should resolve the problem. The final post and dev@httpd thread (with our collective analysis) are at:
http://mail-archives.apache.org/mod_mbox/httpd-dev/200602.mbox/%3c440376C9.9000606@apache.org%3e
So, can someone please take the lead on committing the fix to Tomcat and also updating JIRA's Tomcat instance? Thanks! Jim will commit the patch Feb 28th.
I have applied the patch. So far so good.
Thanks for the fix, and sorry for the issue. Here is another zip containing the binary fix (extract in the Tomcat folder).
Tomcat 5.5.16, including this fix, has been released.