NUTCH-293

support for Crawl-delay in Robots.txt

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: fetcher
    • Labels:
      None

      Description

      Nutch needs to support the Crawl-delay directive in robots.txt. It is not part of the robots.txt standard, but it is a de facto standard.
      See:
      http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
      Webmasters have started blocking Nutch because we do not support it.
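      For illustration only, here is a minimal sketch of how a Crawl-delay value could be read from a robots.txt body. It is not the code from the attached patch or from Nutch's robots parser: the class name `CrawlDelaySketch` and the method `parseCrawlDelayMillis` are hypothetical, the delay is assumed to be given in seconds, and agent matching is simplified to a case-insensitive substring test.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: extracts the Crawl-delay that
// applies to a given agent name from the text of a robots.txt file.
final class CrawlDelaySketch {

    /**
     * Returns the Crawl-delay in milliseconds for agentName, falling back to
     * the "*" group, or -1 if no usable Crawl-delay line is found.
     * Assumption: the value is expressed in seconds (decimals accepted).
     */
    static long parseCrawlDelayMillis(String robotsTxt, String agentName) {
        String agent = agentName.toLowerCase(Locale.ROOT);
        long agentDelay = -1;     // delay from a group naming our agent
        long wildcardDelay = -1;  // delay from the "User-agent: *" group
        boolean matchesAgent = false;
        boolean matchesWildcard = false;
        boolean inAgentLines = false; // true while reading consecutive User-agent lines

        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Strip comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    // A User-agent line after a rule line starts a new group.
                    matchesAgent = false;
                    matchesWildcard = false;
                    inAgentLines = true;
                }
                String token = value.toLowerCase(Locale.ROOT);
                if (token.equals("*")) {
                    matchesWildcard = true;
                } else if (!token.isEmpty() && agent.contains(token)) {
                    matchesAgent = true;
                }
            } else {
                inAgentLines = false;
                if (field.equals("crawl-delay")) {
                    try {
                        long millis = (long) (Double.parseDouble(value) * 1000);
                        if (millis < 0) {
                            continue; // ignore negative delays
                        }
                        if (matchesAgent) {
                            agentDelay = millis;
                        }
                        if (matchesWildcard) {
                            wildcardDelay = millis;
                        }
                    } catch (NumberFormatException e) {
                        // Unparseable value: treat as if the line were absent.
                    }
                }
            }
        }
        return agentDelay >= 0 ? agentDelay : wildcardDelay;
    }
}
```

      A group that names the agent explicitly takes precedence over the wildcard group, so a site can ask one particular crawler to slow down without affecting the others.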

      1. crawlDelayv1.patch
        5 kB
        Stefan Groschupf

        Activity

        Stefan Groschupf added a comment -

        A first draft of Crawl-delay support for Nutch. The problem I see is that when IP-based delay is configured, we can end up applying the crawl delay of one host to another host running on the same IP.
        Feedback is welcome.

        Stefan Groschupf added a comment -

        Any comments? There was already a posting on the nutch-agent mailing list from someone who had banned Nutch because it does not support Crawl-delay.
        Because Nutch tries to be polite, from my point of view this is a small but important change.
        If there are no suggestions for improvement, could one of the committers take care of this, please?

        Andrzej Bialecki added a comment -

        Stefan, as you remember, we had a discussion on modifying the fetcher, and specifically on changing the thread spin-waiting mechanism into a worker queue. As it is now, this is a can of worms that I'd rather not touch: there are many subtle conditions here that would be further complicated by this patch. E.g. the number of spin-waiting threads vs. the number of free threads is normally affected by only five factors: total number of threads, non-uniqueness rate in the current fetchlist, sites' bandwidth, configured delay between requests, and allowed number of threads per host. This patch adds a sixth factor, variable per site, which makes it much harder to predict how many threads you need to avoid dead-locking all of them.

        I'm not strongly opposed to this change, quite the contrary: this is useful functionality. It's just that I'm concerned that it adds yet more functionality to messy code that needs to be rewritten from scratch.

        OTOH, it's a non-intrusive quick hack. If we have to have it now, it's definitely better than waiting for some distant future when we rewrite the fetcher ...

        Stefan Groschupf added a comment -

        Hi Andrzej,
        I agree, but writing a queue-based fetcher is a big step. I already have some basic code (NIO-based).
        Also, I don't think a new fetcher will be stable enough to put into a 0.8 release. Since we plan to have a 0.8 release, I think it is a good idea to add this functionality for now. Maybe we make it configurable and switch it off by default?

        In any case I suggest that we solve NUTCH-289 first and then get the fetcher done.

        Sami Siren added a comment -

        Perhaps instead of
            delay = crawlDelay > 0 ? crawlDelay : serverDelay;

        we could do
            delay = Math.max(crawlDelay, serverDelay);

        Also, the delay could be calculated only once and passed as a parameter to blockAddr and unblockAddr.

        Andrzej Bialecki added a comment -

        I'm working on this patch in order to commit it. Just a quick note to Sami: Math.max() is not optimal, because it always picks the longest wait period. We are interested in getting the right period: it may be longer, but it may also be shorter than the serverDelay. If it's shorter then we win, because we are allowed to crawl this site faster.
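        To make the difference between the two expressions concrete, here is a small sketch. The class and method names are hypothetical and the values are assumed to be in milliseconds; only the two expressions themselves come from this discussion, and the committed code may differ since the patch was applied with changes.

```java
// Hypothetical comparison of the two delay-selection strategies discussed above.
final class DelayChoiceSketch {

    /** Patch logic: use the site's Crawl-delay whenever one is declared (> 0). */
    static long preferCrawlDelay(long crawlDelay, long serverDelay) {
        return crawlDelay > 0 ? crawlDelay : serverDelay;
    }

    /** Suggested alternative: always wait for the longer of the two delays. */
    static long maxDelay(long crawlDelay, long serverDelay) {
        return Math.max(crawlDelay, serverDelay);
    }
}
```

        With a configured serverDelay of 5000 and a site that declares a Crawl-delay of 1000, `preferCrawlDelay` yields 1000 while `maxDelay` yields 5000, so the Math.max() variant never lets a site ask to be crawled faster than the configured default. When no Crawl-delay is declared (0), both fall back to serverDelay.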

        Andrzej Bialecki added a comment -

        Patch applied with minor changes. Thank you!


          People

          • Assignee: Unassigned
          • Reporter: Stefan Groschupf
          • Votes: 1
          • Watchers: 0