ManifoldCF / CONNECTORS-1113

Web connection being dropped while still in use?

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: ManifoldCF 1.7.2
    • Fix Version/s: None
    • Component/s: Web connector
    • Labels: None

    Description

      Hello.
      I am using the ManifoldCF web crawler to crawl a web site and index it into Solr.

      For most websites everything works fine.
      However, for some sites ManifoldCF is unable to crawl, i.e. nothing is pushed to Solr, and the log shows entries like
      Cancelling request execution

      Please see below for more detail.
      At this point, I am not sure what is causing this. Could it be related to the gzip encoding or the Keep-Alive header sent by the server?
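
On the Keep-Alive half of that question: in the log that follows, the server sends "Keep-Alive: timeout=5, max=99" and HttpClient then reports "Connection can be kept alive for 5000 MILLISECONDS". A minimal sketch of that conversion in plain Java is given here; it is not HttpClient's own code, and the class and method names are hypothetical.

```java
import java.util.Locale;

final class KeepAliveSketch {
    /**
     * Returns the "timeout" parameter of a Keep-Alive header value in
     * milliseconds, or -1 if the parameter is absent or not a number.
     * Hypothetical helper for illustration only.
     */
    static long keepAliveMillis(String headerValue) {
        for (String part : headerValue.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().toLowerCase(Locale.ROOT).equals("timeout")) {
                try {
                    // The header carries seconds; the log reports milliseconds.
                    return Long.parseLong(kv[1].trim()) * 1000L;
                } catch (NumberFormatException e) {
                    return -1L;
                }
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Header value copied from the response in the log.
        System.out.println(keepAliveMillis("timeout=5, max=99"));
    }
}
```

Five seconds after the response at 02:15:51 lands at 02:15:56, which matches the expiry time reported by the idle cleanup thread. The later "Close connection" entry therefore looks consistent with ordinary cleanup of an idle pooled connection, though that is a reading of the log rather than a confirmed diagnosis.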

      DEBUG org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:122) 2014-11-24 02:15:51,710 (Thread-5783) - CookieSpec selected: compatibility
      DEBUG org.apache.http.client.protocol.RequestAuthCache.process(RequestAuthCache.java:75) 2014-11-24 02:15:51,712 (Thread-5783) - Auth cache not set in the context
      DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:217) 2014-11-24 02:15:51,714 (Thread-5783) - Opening connection {}->http://mysite.co.uk:80
      DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:120) 2014-11-24 02:15:51,746 (Thread-5783) - Connecting to mysite.co.uk/11.11.11.11:80
      DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:127) 2014-11-24 02:15:51,762 (Thread-5783) - Connection established 192.168.1.5:42919<->11.11.11.11:80
      DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:238) 2014-11-24 02:15:51,763 (Thread-5783) - Executing request GET /hot/search/ HTTP/1.1
      DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:243) 2014-11-24 02:15:51,763 (Thread-5783) - Target auth state: UNCHALLENGED
      DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:249) 2014-11-24 02:15:51,764 (Thread-5783) - Proxy auth state: UNCHALLENGED
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:124) 2014-11-24 02:15:51,764 (Thread-5783) - http-outgoing-1 >> GET /hot/search/ HTTP/1.1
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; webbot@crawler.net)
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> From: webbot@crawler.net
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> Accept: */*
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Accept-Encoding: gzip,deflate
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Host: mysite.co.uk:80
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Connection: Keep-Alive
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> "GET /hot/search/ HTTP/1.1[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,767 (Thread-5783) - http-outgoing-1 >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; webbot@crawler.net)[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,768 (Thread-5783) - http-outgoing-1 >> "From: webbot@crawler.net[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Accept: */*[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Accept-Encoding: gzip,deflate[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Host: mysite.co.uk:80[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Connection: Keep-Alive[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,841 (Thread-5783) - http-outgoing-1 << "HTTP/1.1 200 OK[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783) - http-outgoing-1 << "Date: Mon, 24 Nov 2014 02:17:06 GMT[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783) - http-outgoing-1 << "Server: Apache[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Vary: Accept-Encoding[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Content-Encoding: gzip[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Content-Length: 20[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Keep-Alive: timeout=5, max=99[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Connection: Keep-Alive[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783) - http-outgoing-1 << "Content-Type: text/html[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783) - http-outgoing-1 << "[\r][\n]"
      DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:86) 2014-11-24 02:15:51,848 (Thread-5783) - http-outgoing-1 << "[0x1f][0x8b][0x8][0x0][0x0][0x0][0x0][0x0][0x0][0x3][0x3][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0]"
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:113) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << HTTP/1.1 200 OK
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Date: Mon, 24 Nov 2014 02:17:06 GMT
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Server: Apache
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Vary: Accept-Encoding
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Encoding: gzip
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Length: 20
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Keep-Alive: timeout=5, max=99
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Connection: Keep-Alive
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Content-Type: text/html
      DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:267) 2014-11-24 02:15:51,853 (Thread-5783) - Connection can be kept alive for 5000 MILLISECONDS
      DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117) 2014-11-24 02:15:51,856 (Thread-5783) - Cookie accepted [ci_session="a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...", version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
      DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117) 2014-11-24 02:15:51,860 (Thread-5783) - Cookie accepted [ci_session="a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...", version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
      DEBUG org.apache.http.impl.execchain.ConnectionHolder.cancel(ConnectionHolder.java:140) 2014-11-24 02:15:51,866 (Thread-5783) - Cancelling request execution
      DEBUG org.apache.http.impl.conn.CPoolEntry.isExpired(CPoolEntry.java:81) 2014-11-24 02:15:57,017 (Idle cleanup thread) - Connection [id:1][route:{}->http://mysite.co.uk:80][state:null] expired @ Mon Nov 24 02:15:56 GMT 2014
      DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.close(LoggingManagedHttpClientConnection.java:79) 2014-11-24 02:15:57,019 (Idle cleanup thread) - http-outgoing-1: Close connection
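
On the gzip half of the question: the response above declares "Content-Length: 20" and the wire log shows exactly 20 body bytes. A minimal sketch for checking what those bytes decode to is given here, assuming the bytes logged at Wire.java:86 are the complete body. The class and method names are hypothetical; only standard java.util.zip calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

final class GzipBodyCheck {
    // The 20 body bytes copied from the wire log entry above.
    static final byte[] BODY = {
        (byte) 0x1f, (byte) 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, // gzip header
        0x03, 0x00,                                                                // deflate data
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00                             // CRC32 and size trailer
    };

    /** Decompresses a gzip byte array fully into memory. */
    static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("decoded length: " + gunzip(BODY).length);
    }
}
```

If the decoded length is zero, as it would be for a valid gzip stream wrapping an empty payload, then the server returned an empty page for GET /hot/search/. That would be consistent with nothing reaching Solr for this URL, independent of how the connection was released afterwards.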

People

    • Assignee: Unassigned
    • Reporter: Arcadius Ahouansou