  1. Nutch
  2. NUTCH-2368

Variable generate.max.count and fetcher.server.delay

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.12
    • Fix Version/s: 1.14
    • Component/s: generator
    • Labels: None
    • Patch Info: Patch Available

      Description

      In some cases we need to use host-specific characteristics to determine crawl speed and batch sizes, because with our (Openindex) settings we can only just recrawl hosts with up to 800k URLs.

      This patch solves the problem by introducing the HostDB to the Generator and providing powerful Jexl expressions. Check these two expressions added to the Generator:

      -Dgenerate.max.count.expr='
      if (unfetched + fetched > 800000) {
        return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1);
      } else {
        return conf.getDouble("generate.max.count", 300);
      }'
      
      -Dgenerate.fetch.delay.expr='
      if (unfetched + fetched > 800000) {
        return (pct95._rs_ + 500);
      } else {
        return conf.getDouble("fetcher.server.delay", 1000);
      }'
      

      For each large host: select as many records as can be fetched within the fetch time limit, based on the number of threads and the 95th percentile response time. Or: queueMaxCount = (timelimit / responsetime) * numThreads.

      The second expression follows up on that by setting the crawlDelay of the fetch queue.
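
      The arithmetic of the two expressions can be checked outside Nutch with a small stand-alone sketch. This is only an illustration, not code from the patch: the class name, method names and parameters are made up here, the HostDB fields and configuration values are passed in as plain arguments, and everything is computed in floating point, so JEXL's own numeric handling may give slightly different results.

```java
public class VariableGenerateSketch {
    // Threshold used in both expressions: hosts with more URLs count as "large".
    static final long LARGE_HOST_URLS = 800_000L;

    /** Mirrors generate.max.count.expr: (timelimit / responsetime) * numThreads for large hosts. */
    static double maxCount(long unfetched, long fetched, double pct95ResponseMs,
                           int timeLimitMins, int threadsPerQueue, double defaultMaxCount) {
        if (unfetched + fetched > LARGE_HOST_URLS) {
            // Assumed cost of one fetch in seconds: 95th percentile response time plus 500 ms.
            double secondsPerFetch = (pct95ResponseMs + 500.0) / 1000.0;
            return (timeLimitMins * 60) / secondsPerFetch * threadsPerQueue;
        }
        return defaultMaxCount;
    }

    /** Mirrors generate.fetch.delay.expr: the delay in milliseconds for the host's fetch queue. */
    static double fetchDelayMs(long unfetched, long fetched, double pct95ResponseMs,
                               double defaultDelayMs) {
        return (unfetched + fetched > LARGE_HOST_URLS) ? pct95ResponseMs + 500.0 : defaultDelayMs;
    }

    public static void main(String[] args) {
        // Hypothetical large host: 900k URLs, pct95 response time 1500 ms,
        // 12 minute time limit, 2 threads per queue.
        System.out.println(maxCount(600_000, 300_000, 1500.0, 12, 2, 300.0));
        System.out.println(fetchDelayMs(600_000, 300_000, 1500.0, 1000.0));
    }
}
```

      For this hypothetical host one fetch is assumed to take 2 seconds, so 720 seconds of fetch time with 2 threads gives a max count of 720 and a delay of 2000 ms. A host at or below the 800k threshold falls back to the configured defaults.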

      1. NUTCH-2368.patch
        12 kB
        Markus Jelsma
      2. NUTCH-2368.patch
        11 kB
        Markus Jelsma
      3. NUTCH-2368.patch
        11 kB
        Markus Jelsma
      4. NUTCH-2368.patch
        11 kB
        Markus Jelsma

        Activity

        markus17 Markus Jelsma added a comment -

        Patch for trunk!

        markus17 Markus Jelsma added a comment -

        New patch. Removed System.out.

        markus17 Markus Jelsma added a comment -

        Now this is odd, had to make this change but had it running with it:

        - crawlDelay = it.datum.getMetaData().get("variableFetchDelay").get();
        + crawlDelay = ((LongWritable)(it.datum.getMetaData().get("variableFetchDelay"))).get();

        Anyway, updated patch!
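
        A likely reason the cast is required: in Hadoop, MapWritable.get() returns the Writable interface, which declares no get() method, so the value has to be narrowed to LongWritable first. The sketch below reproduces that situation with simplified stand-in types (they are not the real Hadoop classes, and the readDelay helper is hypothetical) so it compiles without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaDataCastSketch {
    // Stand-in for Hadoop's Writable interface: note that it has no get() method.
    interface Writable {}

    // Stand-in for Hadoop's LongWritable: the concrete type that carries the value.
    static final class LongWritable implements Writable {
        private final long value;
        LongWritable(long value) { this.value = value; }
        long get() { return value; }
    }

    /** Reads the delay from the metadata map; throws if the key is missing or not a LongWritable. */
    static long readDelay(Map<String, Writable> metaData) {
        Writable raw = metaData.get("variableFetchDelay");
        // raw.get() would not compile here, because Writable declares no get().
        return ((LongWritable) raw).get();
    }

    public static void main(String[] args) {
        Map<String, Writable> metaData = new HashMap<>();
        metaData.put("variableFetchDelay", new LongWritable(2000L));
        System.out.println(readDelay(metaData));
    }
}
```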

        markus17 Markus Jelsma added a comment -

        Any thoughts on this patch?

        markus17 Markus Jelsma added a comment -

        Updated patch. The delay is now also set on minCrawlDelay to make it work if more than one thread works on the queue. The key is also temporarily set on every CrawlDatum but removed when passed to the fetch queue.


          People

          • Assignee:
            markus17 Markus Jelsma
            Reporter:
            markus17 Markus Jelsma
          • Votes: 0
          • Watchers: 1
