Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: ManifoldCF 1.7.2
Description
Hello.
I am using the ManifoldCF web crawler to crawl a web site and index it into Solr.
For most websites everything works fine.
For some, however, ManifoldCF is unable to crawl the site, i.e. nothing is pushed to Solr, and the log shows entries like
Cancelling request execution
Please see below for more detail.
At this point I am not sure what is causing this. Could it have to do with the gzip encoding or the Keep-Alive header sent by the server?
DEBUG org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:122) 2014-11-24 02:15:51,710 (Thread-5783) - CookieSpec selected: compatibility
DEBUG org.apache.http.client.protocol.RequestAuthCache.process(RequestAuthCache.java:75) 2014-11-24 02:15:51,712 (Thread-5783) - Auth cache not set in the context
DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:217) 2014-11-24 02:15:51,714 (Thread-5783) - Opening connection {}->http://mysite.co.uk:80
DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:120) 2014-11-24 02:15:51,746 (Thread-5783) - Connecting to mysite.co.uk/11.11.11.11:80
DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:127) 2014-11-24 02:15:51,762 (Thread-5783) - Connection established 192.168.1.5:42919<->11.11.11.11:80
DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:238) 2014-11-24 02:15:51,763 (Thread-5783) - Executing request GET /hot/search/ HTTP/1.1
DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:243) 2014-11-24 02:15:51,763 (Thread-5783) - Target auth state: UNCHALLENGED
DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:249) 2014-11-24 02:15:51,764 (Thread-5783) - Proxy auth state: UNCHALLENGED
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:124) 2014-11-24 02:15:51,764 (Thread-5783) - http-outgoing-1 >> GET /hot/search/ HTTP/1.1
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; webbot@crawler.net)
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> From: webbot@crawler.net
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> Accept: */*
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Accept-Encoding: gzip,deflate
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Host: mysite.co.uk:80
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Connection: Keep-Alive
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> "GET /hot/search/ HTTP/1.1[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,767 (Thread-5783) - http-outgoing-1 >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; webbot@crawler.net)[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,768 (Thread-5783) - http-outgoing-1 >> "From: webbot@crawler.net[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Accept: */*[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Accept-Encoding: gzip,deflate[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Host: mysite.co.uk:80[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Connection: Keep-Alive[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,841 (Thread-5783) - http-outgoing-1 << "HTTP/1.1 200 OK[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783) - http-outgoing-1 << "Date: Mon, 24 Nov 2014 02:17:06 GMT[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783) - http-outgoing-1 << "Server: Apache[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Vary: Accept-Encoding[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Content-Encoding: gzip[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Content-Length: 20[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Keep-Alive: timeout=5, max=99[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Connection: Keep-Alive[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783) - http-outgoing-1 << "Content-Type: text/html[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783) - http-outgoing-1 << "[\r][\n]"
DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:86) 2014-11-24 02:15:51,848 (Thread-5783) - http-outgoing-1 << "[0x1f][0x8b][0x8][0x0][0x0][0x0][0x0][0x0][0x0][0x3][0x3][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0]"
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:113) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << HTTP/1.1 200 OK
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Date: Mon, 24 Nov 2014 02:17:06 GMT
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Server: Apache
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Vary: Accept-Encoding
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Encoding: gzip
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Length: 20
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Keep-Alive: timeout=5, max=99
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Connection: Keep-Alive
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Content-Type: text/html
DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:267) 2014-11-24 02:15:51,853 (Thread-5783) - Connection can be kept alive for 5000 MILLISECONDS
DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117) 2014-11-24 02:15:51,856 (Thread-5783) - Cookie accepted [ci_session="a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...", version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117) 2014-11-24 02:15:51,860 (Thread-5783) - Cookie accepted [ci_session="a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...", version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
DEBUG org.apache.http.impl.execchain.ConnectionHolder.cancel(ConnectionHolder.java:140) 2014-11-24 02:15:51,866 (Thread-5783) - Cancelling request execution
DEBUG org.apache.http.impl.conn.CPoolEntry.isExpired(CPoolEntry.java:81) 2014-11-24 02:15:57,017 (Idle cleanup thread) - Connection [id:1][route:{}->http://mysite.co.uk:80][state:null] expired @ Mon Nov 24 02:15:56 GMT 2014
DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.close(LoggingManagedHttpClientConnection.java:79) 2014-11-24 02:15:57,019 (Idle cleanup thread) - http-outgoing-1: Close connection
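
One thing that can be checked directly from the wire log above is what the 20-byte gzip body (the Wire.java:86 entry) actually contains. The following is a minimal sketch in Python using only the standard library; `wire_to_bytes` is a hypothetical helper written for this log line and is not part of ManifoldCF or HttpClient. It converts the bracketed hex notation back into bytes and decompresses them.

```python
import gzip
import re

# Response body exactly as printed by the Wire.java:86 log entry above.
WIRE_BODY = (
    "[0x1f][0x8b][0x8][0x0][0x0][0x0][0x0][0x0][0x0][0x3]"
    "[0x3][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0]"
)


def wire_to_bytes(wire: str) -> bytes:
    """Turn wire-log notation such as "[0x1f][0x8b]" into raw bytes.

    Hypothetical helper: it only handles the bracketed hex form seen in
    this body line, not printable characters or line-ending markers.
    """
    return bytes(int(h, 16) for h in re.findall(r"\[0x([0-9a-fA-F]{1,2})\]", wire))


raw = wire_to_bytes(WIRE_BODY)   # the bytes the server sent as the body
body = gzip.decompress(raw)      # undo Content-Encoding: gzip

print(len(raw))  # 20, matching Content-Length: 20
print(body)      # b'' (the decompressed body is empty)
```

Under these assumptions the body is a well-formed gzip stream whose decompressed content is empty, so for this request the server answered 200 OK with no page content. That would be consistent with nothing being pushed to Solr, although it does not by itself show why the server returns an empty page to the crawler.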