Details
- Improvement
- Status: Open
- Minor
- Resolution: Unresolved
- 1.9
- None
- None
Description
NUTCH-1613 fixed the hardcoded limit of 2 threads in protocol-httpclient. However, just extending the hardwired maximum of 10 threads and allocating them all to a single host is only a partial solution. It is still possible to exhaust the thread pool and observe timeouts, depending on the settings of:
- fetcher.threads.per.host (nutch-site.xml)
- mapred.tasktracker.map.tasks.maximum (mapred-site.xml)
It would perhaps be more robust to derive the size of the httpclient thread pool from these two configuration parameters, as below:
params.setMaxTotalConnections(maxThreadsTotal);

// Add the following lines ...
// --------------------------------------------------------------------------------
// Modification to increase the number of available connections for
// multi-threaded crawls.
// --------------------------------------------------------------------------------
connectionManager.setMaxConnectionsPerHost(conf.getInt("fetcher.threads.per.host", 10));
connectionManager.setMaxTotalConnections(conf.getInt("mapred.tasktracker.map.tasks.maximum", 5)
    * conf.getInt("fetcher.threads.per.host", 10));
LOG.debug("setMaxConnectionsPerHost: " + connectionManager.getMaxConnectionsPerHost());
LOG.debug("setMaxTotalConnections : " + connectionManager.getMaxTotalConnections());
// --------------------------------------------------------------------------------
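
To make the arithmetic behind the proposal concrete, here is a minimal standalone sketch of the sizing rule. It is not the Nutch code: the class `PoolSizing` and its helper methods are hypothetical, a plain `Map` stands in for the Hadoop `Configuration` object, and clamping values to at least 1 and falling back to the default on an unparsable value are assumptions added for illustration. Only the two property names and the defaults (10 and 5) come from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Nutch.
final class PoolSizing {
    static final String THREADS_PER_HOST = "fetcher.threads.per.host";
    static final String MAP_TASKS_MAX = "mapred.tasktracker.map.tasks.maximum";

    // Stand-in for a getInt(key, default) lookup on a plain Map.
    // Assumption: unset or unparsable values fall back to the default.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String raw = conf.get(key);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // Per-host limit: one connection per fetcher thread allowed on a host.
    static int maxConnectionsPerHost(Map<String, String> conf) {
        return Math.max(1, getInt(conf, THREADS_PER_HOST, 10));
    }

    // Total limit: map task slots multiplied by the per-host limit.
    static int maxTotalConnections(Map<String, String> conf) {
        int mapTasks = Math.max(1, getInt(conf, MAP_TASKS_MAX, 5));
        return Math.multiplyExact(mapTasks, maxConnectionsPerHost(conf));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(THREADS_PER_HOST, "20");
        conf.put(MAP_TASKS_MAX, "4");
        System.out.println("per host: " + maxConnectionsPerHost(conf)); // 20
        System.out.println("total   : " + maxTotalConnections(conf));   // 80
    }
}
```

With neither property set, the sketch yields 10 connections per host and 50 in total, matching the defaults in the proposed change.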