Nutch / NUTCH-962

max. redirects not handled correctly: fetcher stops at max-1 redirects


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3, nutchgora
    • Fix Version/s: 1.3, nutchgora
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

    Description

      The fetcher stops following redirects one redirect before the configured maximum (http.redirect.max) is reached.

      The description of http.redirect.max
      > The maximum number of redirects the fetcher will follow when
      > trying to fetch a page. If set to negative or 0, fetcher won't immediately
      > follow redirected URLs, instead it will record them for later fetching.
      suggests that if it is set to 1, one redirect will be followed.
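
      For reference, a minimal sketch of how that property would be set for a crawl, in the standard Hadoop configuration format used by conf/nutch-site.xml (the value 1 matches the test case below):

      ```xml
      <!-- conf/nutch-site.xml: follow at most one redirect per fetch -->
      <property>
        <name>http.redirect.max</name>
        <value>1</value>
      </property>
      ```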

      I tried to crawl two documents, the first redirecting to the second via
      <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
      with http.redirect.max = 1.
      The second document is not fetched, and its URL has state GONE in the CrawlDb.

      fetching file:/test/redirects/meta_refresh.html
      redirectCount=0
      -finishing thread FetcherThread, activeThreads=1

      The attached patch fixes this: if http.redirect.max is 1, one redirect is followed.
      Of course, this means there is no longer any way to skip redirects entirely, since 0
      (as well as negative values) means "treat redirects as ordinary links".
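
      The off-by-one can be illustrated with a simplified, hypothetical sketch of a redirect-following loop (this is not the actual Fetcher source; the method names and counting logic are illustrative only, with maxRedirect standing in for http.redirect.max):

      ```java
      // Hypothetical sketch: why a pre-increment check against maxRedirect - 1
      // follows one redirect too few.
      public class RedirectLoop {

          // Buggy variant: with maxRedirect = 1 the check 0 >= 0 trips
          // immediately, so no redirect is followed at all.
          static int followedBuggy(int chainLength, int maxRedirect) {
              int redirectCount = 0;
              for (int i = 0; i < chainLength; i++) {
                  if (redirectCount >= maxRedirect - 1) break;  // off-by-one
                  redirectCount++;                              // follow redirect
              }
              return redirectCount;
          }

          // Fixed variant: compare against maxRedirect itself, so
          // maxRedirect = 1 follows exactly one redirect, as the
          // property description promises.
          static int followedFixed(int chainLength, int maxRedirect) {
              int redirectCount = 0;
              for (int i = 0; i < chainLength; i++) {
                  if (redirectCount >= maxRedirect) break;
                  redirectCount++;                              // follow redirect
              }
              return redirectCount;
          }

          public static void main(String[] args) {
              // Redirect chain of length 5, max 1 redirect allowed:
              System.out.println(followedBuggy(5, 1));  // prints 0
              System.out.println(followedFixed(5, 1));  // prints 1
          }
      }
      ```

      This matches the log above: with http.redirect.max = 1 the fetcher reports redirectCount=0 and never fetches the redirect target.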

      Attachments

        1. Fetcher_redir.patch
          1 kB
          Sebastian Nagel


    People

      Assignee: Andrzej Bialecki (ab)
      Reporter: Sebastian Nagel (snagel)
      Votes: 0
      Watchers: 0

    Dates

      Created:
      Updated:
      Resolved: