Nutch / NUTCH-2368

Variable generate.max.count and fetcher.server.delay

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 1.14
    • Component/s: generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      In some cases we need to use host-specific characteristics to determine crawl speed and bulk sizes, because with our (Openindex) settings we can just recrawl hosts with up to 800k URLs.

      This patch solves the problem by introducing the HostDB to the Generator and providing powerful Jexl expressions. Check these two expressions added to the Generator:

      -Dgenerate.max.count.expr='
      if (unfetched + fetched > 800000) {
        return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
      } else {
        return conf.getDouble("generate.max.count", 300);
      }'
      
      -Dgenerate.fetch.delay.expr='
      if (unfetched + fetched > 800000) {
        return (pct95._rs_ + 500);
      } else {
        return conf.getDouble("fetcher.server.delay", 1000)
      }'
      

      For each large host: select as many records as can be fetched, based on the number of threads, the 95th percentile response time, and the fetch time limit. Or: queueMaxCount = (timelimit / responsetime) * numThreads.

      The second expression just follows up on that, setting the crawlDelay of the fetch queue.
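
As a worked illustration of the two expressions above, here is a minimal Java sketch of the same arithmetic. The class name, method names, and sample host figures are hypothetical and not part of the patch; the 800000 threshold, the 500 ms padding, and the defaults (12 minutes, 1 thread, 300 records, 1000 ms) are copied from the expressions. It uses doubles throughout, which may not match JEXL's own numeric handling exactly.

```java
final class HostBudget {
    // Hosts with more known URLs than this get a computed budget (threshold from the expressions).
    static final long LARGE_HOST_URLS = 800_000L;

    // queueMaxCount = (timelimit / responsetime) * numThreads, with 500 ms added to the response time.
    static double maxCount(long unfetched, long fetched, double pct95Millis,
                           int timeLimitMins, int threadsPerQueue, double defaultMaxCount) {
        if (unfetched + fetched > LARGE_HOST_URLS) {
            double secondsPerFetch = (pct95Millis + 500.0) / 1000.0;
            return (timeLimitMins * 60) / secondsPerFetch * threadsPerQueue;
        }
        return defaultMaxCount;
    }

    // Crawl delay in milliseconds: padded 95th percentile for large hosts, the default otherwise.
    static double fetchDelayMillis(long unfetched, long fetched, double pct95Millis,
                                   double defaultDelayMillis) {
        if (unfetched + fetched > LARGE_HOST_URLS) {
            return pct95Millis + 500.0;
        }
        return defaultDelayMillis;
    }

    public static void main(String[] args) {
        // Hypothetical large host: 900k known URLs, 95th percentile response time of 1500 ms.
        System.out.println(maxCount(600_000, 300_000, 1500.0, 12, 1, 300.0));   // 720 s / 2 s * 1 thread
        System.out.println(fetchDelayMillis(600_000, 300_000, 1500.0, 1000.0));
        // Hypothetical small host: falls back to the configured defaults.
        System.out.println(maxCount(1_000, 2_000, 1500.0, 12, 1, 300.0));
        System.out.println(fetchDelayMillis(1_000, 2_000, 1500.0, 1000.0));
    }
}
```

With these sample figures the large host gets a budget of 360 records (720 seconds divided by 2 seconds per fetch) and a delay of 2000 ms, while the small host keeps 300 records and 1000 ms.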

      1. NUTCH-2368.patch
        14 kB
        Markus Jelsma
      2. NUTCH-2368.patch
        14 kB
        Markus Jelsma
      3. NUTCH-2368.patch
        13 kB
        Markus Jelsma
      4. NUTCH-2368.patch
        13 kB
        Markus Jelsma
      5. NUTCH-2368.patch
        17 kB
        Markus Jelsma
      6. NUTCH-2368.patch
        12 kB
        Markus Jelsma
      7. NUTCH-2368.patch
        11 kB
        Markus Jelsma
      8. NUTCH-2368.patch
        11 kB
        Markus Jelsma
      9. NUTCH-2368.patch
        11 kB
        Markus Jelsma

          Activity

          markus17 Markus Jelsma added a comment -

          Patch for trunk!

          markus17 Markus Jelsma added a comment -

          New patch. Removed system.out

          markus17 Markus Jelsma added a comment -

          Now this is odd, had to make this change but had it running with it:

          - crawlDelay = it.datum.getMetaData().get("variableFetchDelay").get();
          + crawlDelay = ((LongWritable)(it.datum.getMetaData().get("variableFetchDelay"))).get();

          Anyway, updated patch!
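
For readers wondering why the cast in the change above is needed: Hadoop's MapWritable.get returns a value of static type Writable, and that interface declares no get() method. The sketch below reproduces the situation with stand-in types. Writable, LongWritable, and the String-keyed map here are simplified, hypothetical substitutes for the Hadoop classes, not the real ones.

```java
import java.util.HashMap;
import java.util.Map;

final class CastSketch {
    // Stand-in for Hadoop's Writable; like the real interface, it declares no get().
    interface Writable {}

    // Stand-in for Hadoop's LongWritable, which does expose get().
    static final class LongWritable implements Writable {
        private final long value;
        LongWritable(long value) { this.value = value; }
        long get() { return value; }
    }

    static long crawlDelay(Map<String, Writable> metaData) {
        // metaData.get(...) has static type Writable, so calling .get() on it directly
        // would not compile; the cast exposes LongWritable.get().
        return ((LongWritable) metaData.get("variableFetchDelay")).get();
    }

    public static void main(String[] args) {
        Map<String, Writable> metaData = new HashMap<>();
        metaData.put("variableFetchDelay", new LongWritable(2000L));
        System.out.println(crawlDelay(metaData));
    }
}
```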

          markus17 Markus Jelsma added a comment -

          Any thoughts on this patch?

          markus17 Markus Jelsma added a comment -

          Updated patch. Delay is now also set on minCrawlDelay to make it work if more than one thread works on the queue. The key is also temporarily set on every crawldatum but removed when passed to the fetch queue.

          markus17 Markus Jelsma added a comment -

          Updated patch to fix NUTCH-2404.

          markus17 Markus Jelsma added a comment -

          Removed some files that didn't belong in the patch.

          wastl-nagel Sebastian Nagel added a comment -

          +1 A powerful feature! A few remarks:

          • the instance variable maxCount is overwritten in the reduce method for a given host: this leads to unpredictable behavior if some hosts are missing in the hostdb. It's probably safer to keep maxCount untouched and use a local variable which holds the per-host count, falling back to the value of maxCount/generate.max.count.
          • ignored catch blocks: should log something, esp. if errors are severe, such as a failure reading the hostdb
          • git complained about trailing white space while applying the patch
          • avoid string concatenations / use {} placeholders when calling LOG.xxx(...)
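
The first remark can be illustrated with a minimal sketch, assuming a simple map as a stand-in for the limits computed from the hostdb. PerHostLimit, hostLimits, and limitFor are hypothetical names, not the Generator's actual code. The global maxCount stays untouched, and a host that is missing from the map falls back to it no matter which hosts were processed earlier.

```java
import java.util.HashMap;
import java.util.Map;

final class PerHostLimit {
    // Global generate.max.count value; never overwritten per host.
    private final int maxCount;
    // Hypothetical stand-in for per-host limits derived from the hostdb.
    private final Map<String, Integer> hostLimits;

    PerHostLimit(int maxCount, Map<String, Integer> hostLimits) {
        this.maxCount = maxCount;
        this.hostLimits = hostLimits;
    }

    // Returns the per-host limit as a local value, falling back to the global maxCount.
    int limitFor(String host) {
        Integer perHost = hostLimits.get(host);
        return perHost != null ? perHost : maxCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> limits = new HashMap<>();
        limits.put("big.example", 360);
        PerHostLimit generator = new PerHostLimit(300, limits);
        System.out.println(generator.limitFor("big.example"));      // per-host value
        System.out.println(generator.limitFor("missing.example"));  // global fall-back
    }
}
```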
          markus17 Markus Jelsma added a comment -

          Good points! Updated patch!

          wastl-nagel Sebastian Nagel added a comment -

          +1 A last point: would be good to have the new properties documented in nutch-default.xml

          markus17 Markus Jelsma added a comment -

          Good point! Updated patch.

          wastl-nagel Sebastian Nagel added a comment -

          Don't forget generate.hostdb (without it, nothing will happen). +1 otherwise!

          markus17 Markus Jelsma added a comment -

          Of course, good point. Final patch; if something is still wrong, let's delete the entire issue. Will commit shortly.

          wastl-nagel Sebastian Nagel added a comment -

          +1

          markus17 Markus Jelsma added a comment -

          Hahahaha sure! Thanks Sebastian!

          markus17 Markus Jelsma added a comment -

          Committed to master in 2de30d2e..44f7ad97 master -> master

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Nutch-trunk #3446 (See https://builds.apache.org/job/Nutch-trunk/3446/)
          NUTCH-2368 Variable generate.max.count and fetcher.server.delay (markus: https://github.com/apache/nutch/commit/44f7ad973f2017bacde2bf5277f846179eafc6dd)

          • (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java
          • (edit) conf/nutch-default.xml
          • (edit) src/java/org/apache/nutch/crawl/Generator.java

            People

            • Assignee:
              markus17 Markus Jelsma
              Reporter:
              markus17 Markus Jelsma