Nutch / NUTCH-1613

Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: protocol
    • Labels:
    • Patch Info: Patch Available

      Description

      1.) When using protocol-httpclient to crawl a single website (that is, a single host), I consistently get timeout errors during fetching, and the affected pages are never fetched. For example:

      2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www.... failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
      2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www.... (queue crawl delay=0ms)
      2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following error:
      org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
      at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
      at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
      at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
      at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
      at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
      at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
      at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
      at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
      at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)

      This happens because, by default, the connection pool manager allows only 2 connections per host, so if more than 2 fetcher threads hit the same host, the extra threads tend to time out while waiting for a connection. The existing code set the maximum total connections correctly but did not set the maximum connections per host.
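
The starvation described above can be reproduced without any HTTP library. The following is only a minimal model, not the code from the patch: a `Semaphore` stands in for the per-host connection pool, and the class name, method name, and numbers are made up for illustration. In commons-httpclient 3.x the real limits are set through `HttpConnectionManagerParams.setDefaultMaxConnectionsPerHost(int)` and `setMaxTotalConnections(int)`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a per-host connection pool; not Nutch or httpclient code.
final class PerHostPoolDemo {

    // Returns how many of `threads` fetchers give up waiting for a connection
    // when the host pool holds only `maxPerHost` connections.
    static int countTimeouts(int maxPerHost, int threads, long waitMillis) {
        Semaphore hostPool = new Semaphore(maxPerHost);
        CountDownLatch allAttempted = new CountDownLatch(threads);
        AtomicInteger timeouts = new AtomicInteger();
        ExecutorService fetchers = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            fetchers.execute(() -> {
                boolean acquired = false;
                try {
                    // Wait for a free connection, like getConnectionWithTimeout.
                    acquired = hostPool.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
                    if (!acquired) {
                        timeouts.incrementAndGet();
                    }
                    allAttempted.countDown();
                    if (acquired) {
                        // Simulate a slow fetch: keep the connection until every
                        // other fetcher has either acquired one or timed out.
                        allAttempted.await();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    if (acquired) {
                        hostPool.release();
                    }
                }
            });
        }

        fetchers.shutdown();
        try {
            if (!fetchers.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("fetcher threads did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return timeouts.get();
    }

    public static void main(String[] args) {
        System.out.println("2 per host, 6 threads: " + countTimeouts(2, 6, 100) + " timeouts");
        System.out.println("6 per host, 6 threads: " + countTimeouts(6, 6, 100) + " timeouts");
    }
}
```

In this model, a limit of 2 with 6 threads leaves 4 threads timing out, while raising the per-host limit to the thread count leaves none. Real fetches finish at different times, so the real failure rate is lower and less regular, but the mechanism is the same.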

      2.) At the same time, I made simple modifications to both protocol-http and protocol-httpclient so that a cookie string can be specified in the conf file and included in the request headers.

      I use this to crawl site content that requires authentication. For me it is easier to specify the authentication cookie string than to go through the whole authentication process and specify login info.

      The nutch-site.xml property is the following:

      <property>
      <name>http.cookie_string</name>
      <value>XX_AL=authorization_value_goes_here</value>
      <description>String to use as the cookie value for HTTP requests</description>
      </property>

      Although I use it for authentication, it can be used to specify any single cookie string for the crawl (httpclient does support different cookies for different hosts, but I did not get into that).
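
As a rough illustration of what "include the cookie string in the request headers" means, here is a minimal sketch. It is not the code from the patch: `java.util.Properties` stands in for the Nutch configuration object, and the class and method names are hypothetical. Only the property name `http.cookie_string` comes from the description above.

```java
import java.util.Properties;

// Hypothetical sketch; the real plugins read the value from the Nutch configuration.
final class CookieHeaderSketch {

    // Builds a bare-bones GET request and adds a Cookie header
    // only when http.cookie_string is set to a non-blank value.
    static String buildRequest(String host, String path, Properties conf) {
        StringBuilder request = new StringBuilder();
        request.append("GET ").append(path).append(" HTTP/1.0\r\n");
        request.append("Host: ").append(host).append("\r\n");

        String cookie = conf.getProperty("http.cookie_string");
        if (cookie != null && !cookie.trim().isEmpty()) {
            request.append("Cookie: ").append(cookie.trim()).append("\r\n");
        }

        request.append("\r\n"); // blank line ends the header section
        return request.toString();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("http.cookie_string", "XX_AL=authorization_value_goes_here");
        System.out.print(buildRequest("www.example.com", "/", conf));
    }
}
```

Because the same string is sent with every request, this approach suits a crawl of one site (or sites sharing one cookie) rather than a mixed crawl where each host needs its own cookies.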

        Attachments

        1. NUTCH-1613.patch
          4 kB
          Brian

              People

              • Assignee: Unassigned
              • Reporter: Brian (brian44)