Nutch / NUTCH-962

max. redirects not handled correctly: fetcher stops at max-1 redirects

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3, nutchgora
    • Fix Version/s: 1.3, nutchgora
    • Component/s: fetcher
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The fetcher stops following redirects one redirect before the limit set by http.redirect.max is reached.

      The description of http.redirect.max
      > The maximum number of redirects the fetcher will follow when
      > trying to fetch a page. If set to negative or 0, fetcher won't immediately
      > follow redirected URLs, instead it will record them for later fetching.
      suggests that, if it is set to 1, one redirect will be followed.

      I tried to crawl two documents, the first redirecting to the second via
      <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
      with http.redirect.max = 1.
      The second document is not fetched, and the URL has state GONE in CrawlDb.

      fetching file:/test/redirects/meta_refresh.html
      redirectCount=0
      -finishing thread FetcherThread, activeThreads=1
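
For readers reproducing the test case, the following is a minimal sketch of how the redirect target follows from the meta refresh tag and the fetched URL shown in the log. It is not Nutch's parser: the class `MetaRefreshSketch` and the method `refreshTarget` are hypothetical names, and only the simple `delay; URL=target` form of the content attribute is handled.

```java
import java.net.URI;

// Hypothetical helper, for illustration only (not part of Nutch).
final class MetaRefreshSketch {

    /**
     * Extracts the target from a meta refresh content value such as
     * "0; URL=./to/page.html" and resolves it against the base URL.
     * Returns null if the value carries no "URL=" part.
     */
    static URI refreshTarget(URI base, String content) {
        int semi = content.indexOf(';');
        if (semi < 0) {
            return null; // only a delay, no redirect target
        }
        String rest = content.substring(semi + 1).trim();
        // Case-insensitive check for the "URL=" prefix.
        if (!rest.regionMatches(true, 0, "URL=", 0, 4)) {
            return null;
        }
        // Relative targets are resolved against the redirecting document.
        return base.resolve(rest.substring(4).trim());
    }

    public static void main(String[] args) {
        URI base = URI.create("file:/test/redirects/meta_refresh.html");
        System.out.println(
            refreshTarget(base, "0; URL=./to/meta_refresh_target.html"));
    }
}
```

Under these assumptions, the second document of the test case sits in the `to/` subdirectory next to the redirecting page.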

      The attached patch would fix this: with http.redirect.max set to 1, one redirect is followed.
      Of course, this means there is no way to skip redirects entirely, since 0
      (as well as negative values) means "treat redirects as ordinary links".
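
The difference between the reported behavior and the literal counting proposed here can be illustrated with a small simulation. This is a sketch only and not the Fetcher code or the attached patch: the class `RedirectLimitSketch`, the method `followedRedirects`, and the `literal` flag are hypothetical, and the "reported" branch merely models the observed off-by-one.

```java
// Hypothetical simulation, for illustration only (not the Nutch Fetcher).
final class RedirectLimitSketch {

    /**
     * Counts how many redirects of a chain are followed immediately.
     *
     * @param maxRedirect value of http.redirect.max
     * @param chainLength number of consecutive redirects encountered
     * @param literal     true: the limit is counted literally;
     *                    false: following stops one redirect early (as reported)
     */
    static int followedRedirects(int maxRedirect, int chainLength, boolean literal) {
        if (maxRedirect <= 0) {
            // Redirects are not followed immediately; they are treated
            // as ordinary links and recorded for later fetching.
            return 0;
        }
        int redirectCount = 0; // redirects followed so far
        while (redirectCount < chainLength) {
            int next = redirectCount + 1; // ordinal of the redirect to follow
            boolean allowed = literal ? next <= maxRedirect : next < maxRedirect;
            if (!allowed) {
                break;
            }
            redirectCount = next;
        }
        return redirectCount;
    }

    public static void main(String[] args) {
        for (int max = 0; max <= 3; max++) {
            System.out.println("http.redirect.max=" + max
                + " reported=" + followedRedirects(max, 5, false)
                + " literal=" + followedRedirects(max, 5, true));
        }
    }
}
```

With `maxRedirect = 1` the reported variant follows no redirect at all, which matches the `redirectCount=0` line in the log above, while the literal variant follows exactly one.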

      1. Fetcher_redir.patch
        1 kB
        Sebastian Nagel

        Activity

        Markus Jelsma added a comment -

        Bulk close of resolved issues for 1.3.

        Andrzej Bialecki added a comment -

        Committed in 1079764 (trunk) and 1079765 (1.3). Thank you!

        Sebastian Nagel added a comment -

        Patch for 1.3 that respects the redirect count literally:
        http.redirect.max = 0 (or negative) :: treat redirects as ordinary links
        http.redirect.max = 1 :: follow max. 1 redirect
        http.redirect.max = 2 :: follow max. 2 redirects, etc.


          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Sebastian Nagel
          • Votes:
            0
            Watchers:
            0
