  1. Nutch
  2. NUTCH-2368

Variable generate.max.count and fetcher.server.delay


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 1.14
    • Component/s: generator
    • Labels: None
    • Patch Info: Patch Available

    Description

      In some cases we need to use host-specific characteristics to determine crawl speed and batch sizes, because with our (Openindex) settings we can just recrawl hosts with up to 800k URLs.

      This patch solves the problem by introducing the HostDB to the Generator and providing powerful Jexl expressions. Check these two expressions added to the Generator:

      -Dgenerate.max.count.expr='
      if (unfetched + fetched > 800000) {
        return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
      } else {
        return conf.getDouble("generate.max.count", 300);
      }'
      
      -Dgenerate.fetch.delay.expr='
      if (unfetched + fetched > 800000) {
        return (pct95._rs_ + 500);
      } else {
        return conf.getDouble("fetcher.server.delay", 1000)
      }'
      

      For each large host, select as many records as can be fetched within the fetch time limit, based on the number of threads and the 95th percentile response time. Or: queueMaxCount = (timelimit / responsetime) * numThreads.

      The second expression follows from that, setting the crawlDelay of the fetch queue to the same 95th percentile response time plus 500.
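
      The arithmetic of the two expressions can be sketched in plain Python. This is only an illustration of the formulas, not code from the patch: the function names, the constant names and the sample host numbers are hypothetical, and it assumes the response time is in milliseconds (as the division by 1000 in the first expression suggests).

```python
# Illustrative restatement of the two JEXL expressions from the description.
# Function names, constant names and sample numbers are hypothetical;
# only the formulas and default values are taken from the expressions.

LARGE_HOST_THRESHOLD = 800_000  # compared against unfetched + fetched
EXTRA_MS = 500                  # added to the 95th percentile response time


def fetch_delay(unfetched, fetched, pct95_rs_ms, server_delay=1000):
    """Mirror of generate.fetch.delay.expr."""
    if unfetched + fetched > LARGE_HOST_THRESHOLD:
        return pct95_rs_ms + EXTRA_MS
    return server_delay


def max_count(unfetched, fetched, pct95_rs_ms,
              timelimit_mins=12, threads_per_queue=1, default_max_count=300):
    """Mirror of generate.max.count.expr."""
    if unfetched + fetched > LARGE_HOST_THRESHOLD:
        # Seconds spent per fetch, assuming pct95_rs_ms is in milliseconds.
        seconds_per_fetch = (pct95_rs_ms + EXTRA_MS) / 1000
        return (timelimit_mins * 60) / seconds_per_fetch * threads_per_queue
    return default_max_count


# Hypothetical large host: 900k known URLs, 700 ms 95th percentile response.
large_count = max_count(unfetched=500_000, fetched=400_000, pct95_rs_ms=700)
large_delay = fetch_delay(unfetched=500_000, fetched=400_000, pct95_rs_ms=700)

# Hypothetical small host: falls back to the configured defaults.
small_count = max_count(unfetched=1_000, fetched=2_000, pct95_rs_ms=700)
small_delay = fetch_delay(unfetched=1_000, fetched=2_000, pct95_rs_ms=700)
```

      With these sample numbers the large host gets roughly 720 / 1.2 = 600 records per queue and a delay of 1200, while the small host keeps the fallback values of 300 and 1000.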

      Attachments

        1. NUTCH-2368.patch
          11 kB
          Markus Jelsma
        2. NUTCH-2368.patch
          11 kB
          Markus Jelsma
        3. NUTCH-2368.patch
          11 kB
          Markus Jelsma
        4. NUTCH-2368.patch
          12 kB
          Markus Jelsma
        5. NUTCH-2368.patch
          17 kB
          Markus Jelsma
        6. NUTCH-2368.patch
          13 kB
          Markus Jelsma
        7. NUTCH-2368.patch
          13 kB
          Markus Jelsma
        8. NUTCH-2368.patch
          14 kB
          Markus Jelsma
        9. NUTCH-2368.patch
          14 kB
          Markus Jelsma
        10. NUTCH-2368_RESTAPI_Fix.patch
          1 kB
          Semyon Semyonov

          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Markus Jelsma (markus17)
            Votes: 0
            Watchers: 4
