Nutch / NUTCH-962

max. redirects not handled correctly: fetcher stops at max-1 redirects

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3, nutchgora
    • Fix Version/s: 1.3, nutchgora
    • Component/s: fetcher
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The fetcher stops following redirects one redirect before the configured maximum (http.redirect.max) is reached.

      The description of http.redirect.max
      > The maximum number of redirects the fetcher will follow when
      > trying to fetch a page. If set to negative or 0, fetcher won't immediately
      > follow redirected URLs, instead it will record them for later fetching.
      suggests that if it is set to 1, one redirect will be followed.
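
      Under those documented semantics, following a single redirect would be enabled by overriding the property in conf/nutch-site.xml, for example:

      ```xml
      <!-- override of the http.redirect.max property described above;
           with the expected semantics, a value of 1 should allow the
           fetcher to follow one redirect immediately -->
      <property>
        <name>http.redirect.max</name>
        <value>1</value>
      </property>
      ```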

      I tried to crawl two documents, the first redirecting to the second by
      <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
      with http.redirect.max = 1.
      The second document is not fetched, and the URL has state GONE in the CrawlDb.

      fetching file:/test/redirects/meta_refresh.html
      redirectCount=0
      -finishing thread FetcherThread, activeThreads=1

      The attached patch fixes this: if http.redirect.max is 1, one redirect is followed.
      Of course, this means there is no way to skip redirects entirely, since 0
      (as well as negative values) means "treat redirects as ordinary links".
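
      The off-by-one can be illustrated with a minimal Java sketch (hypothetical names and structure; not the actual Fetcher code): the buggy guard effectively limits the chain to max - 1 redirects, so with http.redirect.max = 1 no redirect is followed at all and the target ends up unfetched.

      ```java
      /**
       * Minimal sketch of the off-by-one (hypothetical; not the actual
       * Nutch Fetcher code). followed() returns how many redirects of a
       * chain of chainLength redirects are actually followed.
       */
      public class RedirectCount {

          static int followed(int chainLength, int max, boolean buggy) {
              // With the buggy guard the effective limit is max - 1.
              int limit = buggy ? max - 1 : max;
              int count = 0;
              while (count < chainLength && count < limit) {
                  count++; // follow one more redirect
              }
              return count;
          }

          public static void main(String[] args) {
              // http.redirect.max = 1, chain of one meta-refresh redirect:
              System.out.println(followed(1, 1, true));  // buggy: 0 redirects followed
              System.out.println(followed(1, 1, false)); // fixed: 1 redirect followed
          }
      }
      ```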

            People

            • Assignee: ab (Andrzej Bialecki)
            • Reporter: wastl-nagel (Sebastian Nagel)
