Description
In some cases we need to use host-specific characteristics to determine crawl speed and bulk sizes, because with our (Openindex) settings we can only just recrawl hosts with up to 800k URLs.
This patch solves the problem by introducing the HostDB to the Generator and providing powerful Jexl expressions. Check these two expressions added to the Generator:
-Dgenerate.max.count.expr=' if (unfetched + fetched > 800000) { return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1) } else { return conf.getDouble("generate.max.count", 300); }'
-Dgenerate.fetch.delay.expr=' if (unfetched + fetched > 800000) { return (pct95._rs_ + 500); } else { return conf.getDouble("fetcher.server.delay", 1000) }'
For each large host (more than 800k URLs, fetched and unfetched combined): select as many records as can be fetched within the fetch time limit, based on the number of threads and the 95th percentile response time. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
The second expression follows up on that by setting the crawlDelay of the fetch queue to the same padded response time (95th percentile plus 500).
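To make the arithmetic concrete, here is a minimal Python sketch of what the two expressions compute. It is not Nutch code: the function names and the constant are hypothetical, the response time is assumed to be in milliseconds (inferred from the division by 1000), and it uses plain floating-point arithmetic, which may differ from how Jexl evaluates the same expression.

```python
# Hypothetical helpers mirroring the two Jexl expressions; not part of Nutch.
LARGE_HOST_THRESHOLD = 800_000  # unfetched + fetched URLs, as in the expressions


def generate_max_count(unfetched, fetched, pct95_rs_ms,
                       timelimit_mins=12, threads_per_queue=1,
                       default_max_count=300):
    """queueMaxCount = (timelimit / responsetime) * numThreads for large hosts."""
    if unfetched + fetched > LARGE_HOST_THRESHOLD:
        # 95th percentile response time plus a 500 ms margin, in seconds
        secs_per_fetch = (pct95_rs_ms + 500) / 1000
        return (timelimit_mins * 60) / secs_per_fetch * threads_per_queue
    return default_max_count


def generate_fetch_delay(unfetched, fetched, pct95_rs_ms, default_delay=1000):
    """Delay for the fetch queue: padded response time for large hosts."""
    if unfetched + fetched > LARGE_HOST_THRESHOLD:
        return pct95_rs_ms + 500
    return default_delay


# A host with 900k URLs and a 1500 ms 95th percentile response time:
print(generate_max_count(600_000, 300_000, 1500))    # 360.0
print(generate_fetch_delay(600_000, 300_000, 1500))  # 2000
```

With a 12 minute time limit (720 seconds) and 2 seconds per fetch, one thread can fetch 360 records, so that is the per-host maximum selected; hosts at or below the threshold keep the configured defaults.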
Issue Links
- is blocked by
  - NUTCH-2461 Generate passes the data to when maxCount == 0 (Closed)
- is duplicated by
  - NUTCH-2402 Fetcher variable missing for generate.max.count.expr and fetcher.server.delay.expr (Closed)
- Parent Feature
  - NUTCH-2481 HostDatum deltas(previous step statistics) and Metadata expressions (Open)
- requires
  - NUTCH-2454 REST API fix for usage of hostdb in generator (Closed)
  - NUTCH-2455 Speed up the merging of HostDb entries for variable fetch delay (Open)