Nutch / NUTCH-962

max. redirects not handled correctly: fetcher stops at max-1 redirects

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3, nutchgora
    • Fix Version/s: 1.3, nutchgora
    • Component/s: fetcher
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The fetcher stops following redirects one redirect before the configured maximum (http.redirect.max) is reached.

      The description of http.redirect.max:
      > The maximum number of redirects the fetcher will follow when
      > trying to fetch a page. If set to negative or 0, fetcher won't immediately
      > follow redirected URLs, instead it will record them for later fetching.
      suggests that, if the property is set to 1, one redirect will be followed.

      I tried to crawl two documents, the first redirecting to the second via
      <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
      with http.redirect.max = 1.
      The second document is not fetched, and the URL has state GONE in the CrawlDb.

      fetching file:/test/redirects/meta_refresh.html
      redirectCount=0
      -finishing thread FetcherThread, activeThreads=1
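
The two-document test case described above can be recreated on the local file system. The following is only a sketch: the class name RedirectFixture and its create method are hypothetical helpers, not part of Nutch, and the placeholder body of the target page is an assumption. Only the file names and the meta refresh tag come from this report.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Nutch): writes the two test documents.
class RedirectFixture {

    // Creates meta_refresh.html and to/meta_refresh_target.html under root
    // and returns the path of the redirecting document.
    static Path create(Path root) throws IOException {
        Path target = root.resolve("to").resolve("meta_refresh_target.html");
        Files.createDirectories(target.getParent());
        // Placeholder content for the redirect target (assumption).
        Files.write(target,
            "<html><body>target</body></html>\n".getBytes(StandardCharsets.UTF_8));

        // The redirecting document uses the meta refresh tag from the report.
        String html = "<html><head>\n"
            + "<meta http-equiv=\"refresh\" content=\"0; URL=./to/meta_refresh_target.html\">\n"
            + "</head><body></body></html>\n";
        Path source = root.resolve("meta_refresh.html");
        Files.write(source, html.getBytes(StandardCharsets.UTF_8));
        return source;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("redirects");
        // The printed file: URL can serve as the only seed of a test crawl.
        System.out.println("seed URL: " + create(root).toUri());
    }
}
```

Crawling the printed URL with http.redirect.max = 1 should then show whether the target document is fetched.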

      The attached patch fixes this: with http.redirect.max set to 1, one redirect is followed.
      Of course, this means there is no way to skip redirects altogether, since 0
      (like any negative value) means "treat redirects as ordinary links".
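
The off-by-one behaviour can be illustrated with a simplified model. This is not the actual Fetcher code and not the attached patch. It assumes a counter that is incremented once per fetch attempt (including the first) and compares an exclusive limit check with an inclusive one. The names RedirectLimitSketch, redirectsFollowed and inclusiveLimit are hypothetical.

```java
// Simplified model of a redirect-following loop (assumption, not Nutch code).
class RedirectLimitSketch {

    // Returns how many redirects are followed for a chain of chainLength
    // consecutive redirects, given the configured maximum.
    static int redirectsFollowed(int chainLength, int maxRedirects, boolean inclusiveLimit) {
        int redirectCount = 0;       // incremented once per fetch attempt
        int followed = 0;            // redirects actually followed
        int remaining = chainLength; // redirects still ahead of the current URL
        boolean redirecting;
        do {
            redirectCount++;
            redirecting = remaining > 0; // the current document redirects
            boolean withinLimit = inclusiveLimit
                ? redirectCount <= maxRedirects  // inclusive check
                : redirectCount < maxRedirects;  // exclusive check stops one early
            if (redirecting && withinLimit) {
                followed++;
                remaining--;
            } else {
                redirecting = false; // stop: no redirect, or limit reached
            }
        } while (redirecting);
        return followed;
    }

    public static void main(String[] args) {
        // One redirect in the chain, maximum set to 1.
        System.out.println("exclusive: " + redirectsFollowed(1, 1, false));
        System.out.println("inclusive: " + redirectsFollowed(1, 1, true));
    }
}
```

In this model the exclusive check follows max-1 redirects, while the inclusive check follows exactly max. With a maximum of 0 or less, neither variant follows a redirect, which matches the remark that redirects cannot be skipped entirely.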

      1. Fetcher_redir.patch
        1 kB
        Sebastian Nagel

        Activity

        Sebastian Nagel created issue
        Sebastian Nagel made changes
          Field            Original Value    New Value
          Attachment                         Fetcher_redir.patch [ 12469427 ]
        Andrzej Bialecki made changes
          Field            Original Value    New Value
          Status           Open [ 1 ]        Resolved [ 5 ]
          Assignee                           Andrzej Bialecki [ ab ]
          Fix Version/s                      1.3 [ 12315470 ]
          Fix Version/s                      2.0 [ 12314893 ]
          Resolution                         Fixed [ 1 ]
        Markus Jelsma made changes
          Field            Original Value    New Value
          Status           Resolved [ 5 ]    Closed [ 6 ]

          People

          • Assignee: Andrzej Bialecki
          • Reporter: Sebastian Nagel
          • Votes: 0
          • Watchers: 0
