Nutch / NUTCH-1836

Timeouts in protocol-httpclient when crawling same host with >2 threads: NUTCH-1613 is not a complete solution


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: protocol
    • Labels: None

    Description

      NUTCH-1613 fixed the hardcoded limitation of 2 threads for protocol-httpclient. However, simply raising the limit to the hardwired maximum of 10 threads and allowing all of them to be allocated to a single host is only a partial solution. It is still possible to exhaust the thread pool and observe timeouts, depending on the settings of:

      • fetcher.threads.per.host (nutch-site.xml)
      • mapred.tasktracker.map.tasks.maximum (mapred-site.xml)
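
      As a rough illustration of how a fixed-size pool produces timeouts, the sketch below models the pool with a java.util.concurrent.Semaphore. This is a simplified stand-in, not the httpclient connection manager itself: the class name PoolExhaustionSketch, the method countTimedOut, and the numbers used are hypothetical and chosen only to show the effect.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Simplified model only: a Semaphore stands in for a fixed-size connection pool.
// Nothing here is Nutch or httpclient code.
class PoolExhaustionSketch {

    // Tries to lease `requests` connections from a pool of `poolSize` without ever
    // releasing one, and returns how many requests gave up after `waitMillis`.
    static int countTimedOut(int poolSize, int requests, long waitMillis) {
        Semaphore pool = new Semaphore(poolSize);
        int timedOut = 0;
        try {
            for (int i = 0; i < requests; i++) {
                // A request that cannot get a permit within the wait time counts as a timeout.
                if (!pool.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                    timedOut++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
        return timedOut;
    }

    public static void main(String[] args) {
        // Assumed numbers: a pool capped at 10 and 12 simultaneous requests.
        System.out.println("timed out: " + countTimedOut(10, 12, 20));
    }
}
```

      With a pool of 10 and 12 outstanding requests, the last two requests in this model wait and then give up, which is the kind of behaviour the settings above can trigger when they ask for more concurrency than the pool allows.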

      It would perhaps be more robust to derive the httpclient pool limits from these two configuration parameters, as below:

          params.setMaxTotalConnections(maxThreadsTotal);

          // Add the following lines ...

          // --------------------------------------------------------------------------------
          // Modification to increase the number of available connections for
          // multi-threaded crawls.
          // --------------------------------------------------------------------------------
          connectionManager.setMaxConnectionsPerHost(conf.getInt("fetcher.threads.per.host", 10));
          connectionManager.setMaxTotalConnections(conf.getInt("mapred.tasktracker.map.tasks.maximum", 5) * conf.getInt("fetcher.threads.per.host", 10));
          LOG.debug("setMaxConnectionsPerHost: " + connectionManager.getMaxConnectionsPerHost());
          LOG.debug("setMaxTotalConnections  : " + connectionManager.getMaxTotalConnections());
          // --------------------------------------------------------------------------------
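
      The arithmetic behind the proposed limits can be checked in isolation. The following is a minimal sketch, assuming java.util.Properties as a stand-in for the Hadoop configuration object; the class PoolSizingSketch and its methods are hypothetical and not part of Nutch. The property names and the defaults (10 and 5) are the ones used in the snippet above.

```java
import java.util.Properties;

// Hypothetical helper, not part of Nutch. Properties stands in for the
// Hadoop configuration object so the sketch runs without extra libraries.
class PoolSizingSketch {
    static final String PER_HOST_KEY = "fetcher.threads.per.host";
    static final String MAP_TASKS_KEY = "mapred.tasktracker.map.tasks.maximum";

    // Reads an integer property, falling back to the default when it is unset.
    static int getInt(Properties conf, String key, int defaultValue) {
        String raw = conf.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    // Per-host limit: one connection per fetcher thread allowed on a host.
    static int maxConnectionsPerHost(Properties conf) {
        return getInt(conf, PER_HOST_KEY, 10);
    }

    // Total limit: map task slots multiplied by the per-host limit.
    static int maxTotalConnections(Properties conf) {
        return getInt(conf, MAP_TASKS_KEY, 5) * maxConnectionsPerHost(conf);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(PER_HOST_KEY, "4");   // illustrative value
        conf.setProperty(MAP_TASKS_KEY, "3");  // illustrative value
        System.out.println("per host: " + maxConnectionsPerHost(conf));
        System.out.println("total   : " + maxTotalConnections(conf));
    }
}
```

      With neither property set, the defaults give 10 connections per host and 5 * 10 = 50 in total; with the illustrative values 4 and 3, the limits become 4 and 12.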
      

          People

            Assignee: Unassigned
            Reporter: Adrian Newby (acanewby)
            Votes: 0
            Watchers: 2